Job Description :
- The Team Lead provides leadership and management for the Alert Response Team (ART), which covers the Monitoring/Alert Service and L1 Application Support. The role is intensely operational and requires the incumbent to be conversant with technologies such as ERP/CRM, SQL, 3-tier architecture, and Java; with enterprise service management tools such as Service Desk, ManageEngine, Salesforce, Remedy, and ServiceNow; and with the tools used in a modern NOC/Enterprise Command Centre.
- The Team Lead also provides input into the evolution of the Service Delivery model, driving the adoption of the ISO/IEC 20000 Service Management standard. The ART performs 24x7 monitoring of application availability and performance, and initial triage of Product-related incidents.
- Actively monitor the Hosted/Cloud Customer Application & Infrastructure Environment.
- Reactively resolve any incidents that occur and ensure they are responded to promptly, professionally, and within agreed SLAs.
- Validate scheduled changes within specified maintenance or change windows, ensuring that the desired application behavior and functionality are thoroughly tested.
- Complete regular operational activities to maintain environments with minimal disruption and higher availability.
- Proactively monitor Infrastructure and Application Components to improve customer experience.
- Drive automation and develop monitoring tools that make monitoring more effective by eliminating non-value-added (NVA) work.
- Core to this role is driving the following:
- Strict SLA compliance at all stages (first response, updates, service restore, service repair, etc.)
- Process adherence across the team
- Training, coaching, mentoring and knowledge capture & exploitation to attain and maintain the highest possible levels of technical competency and capability
- Continual performance improvement, including increasing the proportion of incidents resolved at L1 (fixed on first contact), driving down time-per-incident and touches-per-incident, and automation.
Responsibilities :
- Ensure adequate coverage for the 24/7/365 environment so that the Co's Cloud & Hosted environments run effectively at all times. This includes continuous monitoring of all services, response to and resolution of incidents, and making every reasonable effort to restore service as quickly as possible in the event of an outage.
- Coach and mentor senior and junior technicians to ensure personnel are working issues as efficiently and accurately as possible in a team-oriented professional culture.
- Meet very aggressive environment availability, MTTR, and ticket handling objectives while providing hands-on leadership during application events.
- Develop, refine, and document monitoring policies, processes, procedures, and associated systems requirements, and drive their implementation and use strictly in line with the ITIL v3 framework and the Co's processes.
- Work with various customer-facing support functions to provide operational and application validation services, thereby improving CSAT.
- Develop and report on metrics for the performance of the Product and individual employees, including but not limited to MTTR, number of escalations, and number of tickets.
- Drive technical staff to monitor and resolve customer issues and accomplish objectives by acting as a role model: encourage staff to be self-motivated, to problem-solve, learn, and create their own mechanisms for resolving issues and, most importantly, to communicate effectively and work well in a team environment.
- Willing to take ownership and take on other duties as assigned.
- Experience working as an Incident Manager for Cloud & Managed Service environments.
- Work with ART Manager and Leads to assign responsibilities and ensure that all duties are completed in a timely and professional manner
- Build fundamental scalability into the ART through automation of repetitive tasks and elimination of unnecessary work cycles.
- Provide day-to-day support of direct reports, including frequent 1:1 meetings, half-year and full-year performance reviews, monthly KPI feedback, and quarterly R&R recommendations, escalating wherever necessary.
- Drive behavior and practices around structured troubleshooting methodologies, with the clear imperative to restore service first.
- Drive technical and professional competency development in the ART so that it can handle additional proactive services and create a value differentiator for customers.
- Escalation point for internal and external Customers for all team matters
- Manage the team rota and cover any shortfalls from the available resources.
- Develop positive relationships across the Team and Company to ensure communication is open and transparent, facilitating early identification of issues and risks and agile exploitation of business opportunities
- Implement processes to measure, manage, standardize and improve the key activities in the ART (Alert Response Team).
- Strong experience working with Incident, Problem, Change and Service Request ticketing systems
- Experience with Structured Troubleshooting methodologies
Qualifications :
- Extensive experience managing an Enterprise command center or NOC for application and infrastructure environments with a geographically distributed support model, including 3+ years of people management experience.
- Must have managed 24x7 infrastructure operations for a US-based organization.
- Hands-on experience in Incident handling and Operations: SLA, Escalation and Notification, Workload tracking.
- Experience in overseeing Global Command Center or NOC/SOC operations to support Customer Hosted Infrastructure and/or server administration
- A strong background in enterprise management tools such as ManageEngine, ServiceNow, Tivoli, CA Unicenter, BMC, or HP OpenView/Operations Center would be an advantage.
- Experience managing help desk workflows, processes and SLA matrices using Service Desk, Remedy or similar tools.
- Experience working with a geographically distributed team with different cultures is a plus.
- Well-versed in Windows, ERP/CRM, SQL, 3-tier architecture, Java, and JBoss.
- Proven strong interpersonal skills. Strong verbal and written communication skills. Collaborates with colleagues across the organization to get things done.
- Must be able to demonstrate experience in making final decisions on administrative or functional activities of a Co's Cloud System Operation Center.
- Experience working in and with cross-functional teams.
- Certifications in ITIL (IT Service Management), MS SQL, MCP, or PMP preferred.
- Strong problem solving and troubleshooting skills required
- Ability to identify, isolate and analyze Application, Infrastructure (Server, Database) & Network related incidents and operational processes and then drive corrective/preventative action plans working with required stakeholders (internally/externally).
- Strong working knowledge of ITIL (Incident / Problem / Change / Availability Management) and of managing true SaaS, Cloud, and Hosted Operations environments.
- Broad knowledge of labor management, ERP, or similar domains and products/systems.
- Exhibit leadership qualities and earn the respect this empowered position requires.
- Ability to multi-task, prioritize projects, manage time, and apply detail-oriented organizational skills.
- Experience preparing and writing demonstrations, proposals, policies, procedures, job descriptions, and schedules.
- Self-motivated; ability to maintain excellence in service with minimum supervision.
- Use of good judgment and a sense of urgency in the decision-making process when assessing problems/situations.
- Prior experience supporting customer applications in Windows and Red Hat Linux environments, including JBoss and Java.
- Experience in scheduling, preparing presentations and status reports.