Thursday, August 07, 2014

SaaS Incident Management Best Practices - Act I (The STORM™ Series)

"It is not stress that kills us; it is our reaction to it" (Hans Selye)

This is the second post in the Incident Management Best Practices series. The first part covered the Prologue - the preparations that will improve your survival rate for the next incident that is just waiting to happen.
(This post first appeared on SaaS Addict earlier this week)
The main story contains two acts. This first act deals with the reactive part of the incident; the next post, Act II, will deal with the proactive part of managing an incident.

How Soon?
  • How soon did you find out there was a problem?
  • How soon did you inform your customers?
  • How soon did you know the extent of the problem?
  • How soon did you know the impact?
  • How soon did you start handling the incident?
  • How soon did you get a workaround functioning?
  • How soon did you resolve it?
  • How soon were your customers informed about the resolution and causes?

Incident management is all about minimizing the damage, doing so in the shortest time possible and taking steps to improve the service going forward.
If you followed the practices recommended in the previous post, you will be much better prepared to deal with whatever hits you, be it as small as a weekly report not being produced or as bad as a full crash of the system.

ITIL Incident Management
This post does not aim to replace an ITIL certification course (STORM™ was inspired by ITIL and many of the terms are borrowed from ITIL/ITSM), but it follows, to an extent, the activities as they appear in much of the ITSM literature.
The idea behind this approach is to keep a level head, to ensure that all details are captured and to recover ASAP.

The stages are:
  • Detect
  • Record
  • Classify
  • Notify
  • Escalate
  • Investigate
  • Restore
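
These stages can be treated as a simple state machine per incident ticket, which makes it easy to see at a glance where each incident stands. Here is a minimal sketch in Python (illustrative only; the class and function names are mine, not STORM or ITIL terminology) of how that progression could be modeled:

    # Minimal sketch: the incident stages as an ordered state machine.
    from enum import Enum, auto
    from typing import Optional

    class IncidentStage(Enum):
        DETECT = auto()
        RECORD = auto()
        CLASSIFY = auto()
        NOTIFY = auto()
        ESCALATE = auto()
        INVESTIGATE = auto()
        RESTORE = auto()

    STAGE_ORDER = list(IncidentStage)  # members iterate in definition order

    def next_stage(current: IncidentStage) -> Optional[IncidentStage]:
        """Return the stage that follows the current one, or None once restored."""
        idx = STAGE_ORDER.index(current)
        return STAGE_ORDER[idx + 1] if idx + 1 < len(STAGE_ORDER) else None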

Detection - Initiation of the process
In the best case, an incident is detected by a monitoring system that alerts the staff just as it happens, and the problem is resolved before most customers are even aware of it. Alas, the incidents we are discussing usually don’t have this “lived happily ever after” ending.

In reality, many of the really bad problems are those you did not anticipate, and you are made aware of them by a customer calling in, or perhaps a member of the staff noticing something wrong while doing routine work or demoing the product. Often your monitoring system will alert you to an issue that is derived from the real problem, without detecting the problem itself. These cases are misleading and will result in a longer time to resolution.

Regardless of how the issue was detected, the process must begin at the same single starting point - the helpdesk or its equivalent in the company. It may be a dispatcher, or the support person who is on call at home.
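
One way to keep that single starting point honest is to funnel every detection channel, whatever its origin, through the same intake routine. The sketch below is purely illustrative (the open_helpdesk_ticket function and the source names are hypothetical), but it shows the idea:

    # Illustrative only: every detection channel opens its ticket through one function,
    # so the helpdesk remains the single starting point of the process.
    from datetime import datetime, timezone

    DETECTION_SOURCES = {"monitoring", "customer_call", "staff_report", "vendor_call"}

    def open_helpdesk_ticket(source: str, description: str) -> dict:
        """Create the initial helpdesk record, regardless of how the issue was detected."""
        if source not in DETECTION_SOURCES:
            raise ValueError(f"Unknown detection source: {source}")
        return {
            "detected_at": datetime.now(timezone.utc).isoformat(),
            "source": source,
            "description": description,
            "stage": "DETECT",
        }

    # A monitoring alert and a customer call both land in the same queue:
    ticket = open_helpdesk_ticket("monitoring", "Disk latency alarm on the EU database server")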

Recording - Keeping track of events

When a nasty problem arises, clerical work will be very low on your priority list, and you may be tempted to postpone it to a later stage.
That information will become crucial later, so it is important that you pause for a minute and record it. Whether you scribble it on a piece of paper to enter into the system later or enter it directly, capture the following:
  • Date and time of incident first recorded
  • Means of detection (customer call, internal monitoring alert, external monitoring, vendor call, internal staff, etc.)
  • Manifestation – how the problem first appeared
Recording should continue throughout the incident. The Status Page allows you to capture some of this information.
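
As a rough illustration, the record can be as simple as the structure below (the names are mine; any ticketing system, status page, or even a shared document serves the same purpose), with a running timeline that keeps being appended to:

    # Illustrative sketch of the minimum information to capture when recording an incident.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class IncidentRecord:
        first_recorded: datetime   # date and time the incident was first recorded
        means_of_detection: str    # customer call, internal/external monitoring, vendor, staff, etc.
        manifestation: str         # how the problem appeared at first
        timeline: list = field(default_factory=list)  # grows throughout the incident

        def log(self, note: str) -> None:
            """Append a timestamped note; recording continues for the whole incident."""
            self.timeline.append((datetime.now(timezone.utc), note))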

Classification – Determine the impact on customers/components and the urgency
When your single datacenter is down, it is rather simple to determine impact and urgency, but very often only some customers are affected, or only some functionality is missing.

It is highly important to determine the impact of the incident to allow a proper reaction. Say your affected production system is in the UK, your product is mainly a 9-to-5 solution, and it is evening in Europe. The urgency will not be as high as it would be if your East Coast production were down at 11:00 AM EST.
Perhaps the synchronized reporting database is not accessible; that is not as bad as the transaction database being out of commission, which would basically shut down your operation.
The classification will determine many factors in how you manage the incident and therefore it is paramount that you don’t get it wrong.
Remember the ‘customer/component mapping’ from the Prologue post? This is a good time to use it to determine the affected components or customers.
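
One common way to make the classification repeatable is an ITIL-style priority matrix that derives a priority from impact and urgency, with the customer/component mapping feeding the impact score. The sketch below is a generic illustration, not a STORM prescription:

    # Illustrative ITIL-style priority matrix: priority = f(impact, urgency).
    # Impact and urgency are scored 1 (high) to 3 (low); a lower priority number means more severe.
    PRIORITY_MATRIX = {
        (1, 1): 1, (1, 2): 2, (1, 3): 3,
        (2, 1): 2, (2, 2): 3, (2, 3): 4,
        (3, 1): 3, (3, 2): 4, (3, 3): 5,
    }

    def classify(impact: int, urgency: int) -> int:
        """Derive a priority level from the impact/urgency pair."""
        return PRIORITY_MATRIX[(impact, urgency)]

    # Example: the reporting database is down outside business hours - moderate impact, low urgency.
    print(classify(impact=2, urgency=3))  # prints 4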

The next post, Act II, will deal with the more proactive aspects, namely Notification, Escalation, Investigation, and Restoration.
