As a service practitioner, you design services that are resilient and available when required. An important discipline is to set up a monitoring regime that alerts you to anything untoward.

Even so, services may slow down or fail because of incidents caused by component failures or by factors beyond the scope of your monitoring.

When this happens, you need to be sure that you can investigate and diagnose root causes as quickly as possible.

To be fully prepared, you will obviously need the appropriate technical monitoring and debugging tools in place, but experience has taught us that services are restored quickly as a result of human power.

Human power relates to having the right roles in place and allocating responsibilities to them.

When services start to degrade or suddenly fail and your team needs to establish the cause immediately, undertake the following:

Introduce a Major Incident Triage of Responsibilities, which consists of the following manager roles:

1. Lead Manager role: This person will coordinate all staff and resources to ensure service is restored, and is accountable for the service restoration. This is a high-profile role that will protect, and require support and information from, the Communications Manager and Technical Manager roles. This role will ensure that communications to key stakeholders are maintained and that progress is made in establishing the technical solution. This role will be able to draw on resources, staff and money. Once service is restored, this role will also prepare and chair a Service Outage Analysis meeting to establish how to prevent this loss from happening again and to improve the processes to follow in the event of future outages.

2. Communications Manager role: This role will gather all intelligence, then prepare and draft communications and seek approval from the Lead Manager role before communicating the issue, progress and service restoration notice. This role can recruit a team to assist and maintains the authority to do so during a major incident. To assist the complete restoration of service, this role will also start, update and maintain an Incident Log, which records what is happening, who is involved, what outcome is expected and what actually happened.

3. Technical Manager role: This role will be responsible for undertaking all technical diagnostics, investigations and restoration of services. This role can recruit a team (of geeks or techies) to assist and maintains the authority to do so during a major incident.
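
If you keep the Incident Log in a tool rather than on paper, one possible shape for it is sketched below in Python. This is only an illustration under stated assumptions: the class and field names (`IncidentLog`, `LogEntry`, `record`, `open_entries`) are hypothetical and simply mirror the four questions the Communications Manager's log is meant to answer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class LogEntry:
    """One line in the Incident Log (hypothetical structure)."""
    what_is_happening: str
    who_is_involved: List[str]
    expected_outcome: str
    # Filled in later, once the result of the action is known.
    actual_outcome: Optional[str] = None
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class IncidentLog:
    """Log started, updated and maintained by the Communications Manager."""
    title: str
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, what: str, who: List[str], expected: str) -> LogEntry:
        """Add a new entry and return it so the outcome can be set later."""
        entry = LogEntry(what, list(who), expected)
        self.entries.append(entry)
        return entry

    def open_entries(self) -> List[LogEntry]:
        """Entries whose actual outcome has not been recorded yet."""
        return [e for e in self.entries if e.actual_outcome is None]


# Example usage with made-up incident details.
log = IncidentLog("Service outage")
entry = log.record(
    what="Technical team restarting the failed component",
    who=["Technical Manager"],
    expected="Service responds normally again",
)
entry.actual_outcome = "Service restored after restart"
```

Keeping the expected and actual outcomes side by side gives the Lead Manager ready material for the Service Outage Analysis meeting.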


Photo by volkan akyüz from FreeImages