Saturday, July 19, 2014

SaaS Incident Management Best Practices (The STORM™ Series)

"Remember that failure is an event, not a person" (Zig Ziglar)

(this post was first published on July 14, in SaaS Addict)

Shift Happens
No matter how much you prepare, how many resources you pour into your operation, or how many experts you hire, something will eventually break in your production system. (This is not a new subject - see the Black Swan Event in SaaS Ops and STORM™ for Dummies.)

Incidents can manifest themselves as slow response time, loss of a function (e.g. email not being sent, backup data not FTPed to a client), a whole module malfunctioning, or full-scale downtime for some or all of your customers.
I have witnessed too many service disasters in my professional career and I have the battle scars to show for it. On the bright side, the STORM™ methodology grew out of the smoldering ruins.

Following is a three-part series discussing how to prepare for, manage and survive your next incident (and it is out there, lurking in the dark right now, rubbing its palms together, waiting to strike).
As in any good apocalyptic novel, the series is divided into the Prologue, the Main Story and the Epilogue. This first article covers Act I – Before the world came to an end.

Prologue - ACT I
Plan for failure. Expect things to break. With that in mind you should try to anticipate what might break first.
Keep in mind, however, that no matter how much you plan, there is a good chance of something else coming out of a dark corner and biting you in the behind, since you don’t know what you don’t know. Still, for everything you can anticipate, the following is a good plan.

SPOF Analysis
Single Point Of Failure Analysis is an exercise that should be practiced once a quarter. Without going into too many details, the following outlines the principles:
  • Map all your IT assets (and by ALL I mean not just the important stuff like databases and web servers, but also the phone connections to corporate HQ)
  • Perform an Impact Analysis on each item – what would be the effect of that component's failure on service levels?
  • Assess the Risk of losing each component – what factors might lead to failure?
  • Mitigate the problem by taking proactive measures, e.g. invest in a better storage solution, or duplicate the component and load balance between the two.
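
One way to keep the quarterly exercise honest is to put numbers on it. The following is a minimal sketch, not part of STORM™ itself: the 1 to 5 scales, the "halve the score if redundant" rule and the sample inventory are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    impact: int       # hypothetical scale: 1 (negligible) to 5 (full outage)
    likelihood: int   # hypothetical scale: 1 (rare) to 5 (frequent)
    redundant: bool   # True if a duplicate or failover already exists


def risk_score(asset: Asset) -> int:
    """Impact times likelihood, halved when the asset is already duplicated."""
    score = asset.impact * asset.likelihood
    return score // 2 if asset.redundant else score


def rank_by_risk(assets: list[Asset]) -> list[Asset]:
    """Highest-risk assets first, so mitigation effort goes there first."""
    return sorted(assets, key=risk_score, reverse=True)


# Illustrative inventory only; real entries come from your own asset map.
inventory = [
    Asset("HQ phone line", impact=2, likelihood=2, redundant=False),
    Asset("web server pool", impact=4, likelihood=3, redundant=True),
    Asset("primary database", impact=5, likelihood=2, redundant=False),
]

for asset in rank_by_risk(inventory):
    print(f"{risk_score(asset):>3}  {asset.name}")
```

The exact formula matters less than applying the same one to every asset each quarter, so that the ranking is comparable over time.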

Monitoring & Alerting
These practices are explored in detail in the STORM™ Event Management chapters. Suffice it to say that:
  • Monitoring allows you to detect the problem so that you can fix it
  • Detecting a problematic component allows you to react early on (before your customer calls in, pissed off)
  • Therefore, monitor EVERYTHING – components, processes, communications, workflows, response time, etc.

Escalation Path and Hunt Group
  • Define who gets alerted under what circumstances. E.g. if there is a slowdown in database response time, the first person to be alerted might be the DBA. Map out the various scenarios, who has ownership of each, and whether the event merits escalation.
  • Define the escalation path while you are calmly sipping tea before an event occurs, so those decisions are not made under duress.
  • Define the Hunt Group, policy and call sequence. Most PBXs and telephony services provide this feature.
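
An escalation path decided in advance can be written down as plain data. The sketch below assumes hypothetical event types, role names and a 15-minute escalation interval; none of these come from a real paging product, and your own scenarios will differ.

```python
# Hypothetical escalation paths: first entry is alerted first.
ESCALATION_PATHS = {
    "database_slow": ["dba_on_call", "ops_lead", "vp_engineering"],
    "site_down": ["ops_on_call", "ops_lead", "cto"],
}
DEFAULT_PATH = ["ops_on_call", "ops_lead"]

# Assumed policy: move one step up the path for every 15 minutes
# an alert stays unacknowledged.
ESCALATE_AFTER_MINUTES = 15


def who_to_alert(event_type: str, minutes_unacknowledged: int) -> str:
    """Return the role to page, given how long the alert has gone unanswered."""
    path = ESCALATION_PATHS.get(event_type, DEFAULT_PATH)
    level = min(minutes_unacknowledged // ESCALATE_AFTER_MINUTES, len(path) - 1)
    return path[level]


print(who_to_alert("database_slow", 0))    # first responder
print(who_to_alert("database_slow", 20))   # one level up
```

Keeping the table in version control means the tea-sipping decisions are reviewed like any other change, and nobody has to improvise them at 3 a.m.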

Document Changes
As part of the STORM™ Change Management practice, all changes to production should be documented and easily accessible, so that one can query changes (across location, function, affected customers, etc.) immediately when an incident occurs. Remember that some change, somewhere in your production universe, will always be the cause of an incident. It may be as obvious as installing a new Web server or as obscure as a new browser version on one of your customer’s laptops.
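
To make "query changes immediately" concrete, here is a minimal sketch of a change log kept in SQLite. The table layout, column names and sample rows are assumptions for illustration only; any ticketing or change management tool that can filter by time, location and customer serves the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE changes (
           changed_at  TEXT,  -- ISO 8601, so string order equals time order
           location    TEXT,
           function    TEXT,
           customer    TEXT,
           description TEXT)"""
)
# Made-up sample entries.
conn.executemany(
    "INSERT INTO changes VALUES (?, ?, ?, ?, ?)",
    [
        ("2014-07-10T09:00", "us-east", "web", "acme", "Installed new web server"),
        ("2014-07-12T22:30", "melbourne", "database", "globex", "Upgraded storage firmware"),
        ("2014-07-13T08:15", "us-east", "email", "acme", "Changed SMTP relay config"),
    ],
)


def recent_changes(conn, since, location=None, customer=None):
    """Return (timestamp, description) rows since a given time, newest first."""
    query = "SELECT changed_at, description FROM changes WHERE changed_at >= ?"
    params = [since]
    if location is not None:
        query += " AND location = ?"
        params.append(location)
    if customer is not None:
        query += " AND customer = ?"
        params.append(customer)
    query += " ORDER BY changed_at DESC"
    return conn.execute(query, params).fetchall()


# "What changed for acme since the start of the month?"
for row in recent_changes(conn, "2014-07-01T00:00", customer="acme"):
    print(row)
```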

Customer-Asset Mapping
Create a relationship chart of which assets (or group of assets) can be mapped to a customer (or group of customers).  An asset may be the AsiaPac data center in Melbourne or a config file on one of the VMs in AWS.
When the time comes, this will be very handy in both locating the problem and assessing the impact of service degradation. Therefore, saving this relationship in a DB table will prove valuable.
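
As a rough illustration of such a table, the sketch below stores the relationship as a simple many-to-many mapping in SQLite. The customer names and asset identifiers are invented, and a real schema would likely reference proper customer and asset tables instead of free text.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE customer_asset ("
    "customer TEXT, asset TEXT, PRIMARY KEY (customer, asset))"
)
# Hypothetical mappings: an asset can be a whole data center or a single file.
db.executemany(
    "INSERT INTO customer_asset VALUES (?, ?)",
    [
        ("acme", "melbourne-dc"),
        ("acme", "vm-17:/etc/app.conf"),
        ("globex", "melbourne-dc"),
        ("initech", "us-east-dc"),
    ],
)


def customers_affected_by(asset):
    """Impact assessment: which customers depend on a failing asset?"""
    rows = db.execute(
        "SELECT customer FROM customer_asset WHERE asset = ? ORDER BY customer",
        (asset,),
    )
    return [row[0] for row in rows]


def assets_used_by(customer):
    """Problem location: which assets should be checked for a complaining customer?"""
    rows = db.execute(
        "SELECT asset FROM customer_asset WHERE customer = ? ORDER BY asset",
        (customer,),
    )
    return [row[0] for row in rows]


print(customers_affected_by("melbourne-dc"))
print(assets_used_by("acme"))
```

The two lookups correspond to the two uses named above: one direction assesses impact, the other narrows down where to look.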

NOC
As part of the STORM™ Event Management practice, a Network Operations Center is always a great advantage. A NOC can be as elaborate as the NASA Mission Control Center, or as simple as a loosely coupled web-based dashboard that presents monitoring data from various sources. A 24x7 manned NOC with 20 giant flat screens usually takes time and continuous investment to reach, but even a small SaaS startup can adopt the concept and start with small steps.

War Room
When the stuff that happens hits the fan, there is a tendency to lose one’s cool. I have witnessed executives (CEO, VP Sales, major account managers) running around and bombarding us with questions in the best-case scenario, and shouting instructions and blasphemies in the worst case. In order to calmly collect information, assess the situation and act to remediate, there needs to be a place where all those actually involved in resolving the issue can sit together, isolated from the pressure.
  • Physical War Room – In one of the companies I worked at, I secured a separate room and equipped it with a couple of large screens and communications equipment. We prepared a list of exactly who was allowed to enter during an incident. I had the key and we actually locked the door during an incident. We had reps from Support, Ops, and Engineering and did not leave until the situation was under control.
  • Virtual War Room – If you have not built a physical War Room, or an event occurs after hours, you can create a Virtual War Room using a Web-based NOC and a dedicated communication group such as IM or WhatsApp.

Status Page
Create (on the corporate site – not the production site) a page that depicts the current state of your production systems with a time stamp, a color or image code, and a line of text describing the current state. This is somewhat similar to the Salesforce Trust site.


There is a new service out there - StatusPage.io - that provides much of the needed functionality.


As will be discussed in the next article, a status page will significantly reduce the number of angry customer calls, facilitate communication and help with the Closure phase (third article).
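
For those rolling their own page, the three ingredients mentioned above (time stamp, color code, one line of text) fit in a very small data structure. This is only a sketch: the state names, the color mapping and the output format are assumptions, and it does not reflect how StatusPage.io or the Salesforce Trust site work.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumed state-to-color mapping.
COLORS = {"operational": "green", "degraded": "yellow", "outage": "red"}


@dataclass
class StatusEntry:
    state: str           # one of the keys in COLORS
    message: str         # one line describing the current state
    timestamp: datetime  # when this state was posted (UTC)

    def render(self) -> str:
        """Format the entry as a single line for the status page."""
        color = COLORS[self.state]
        stamp = self.timestamp.strftime("%Y-%m-%d %H:%M UTC")
        return f"[{color.upper()}] {stamp}: {self.message}"


entry = StatusEntry(
    state="degraded",
    message="Email delivery is delayed; we are investigating.",
    timestamp=datetime(2014, 7, 14, 9, 30, tzinfo=timezone.utc),
)
print(entry.render())
# [YELLOW] 2014-07-14 09:30 UTC: Email delivery is delayed; we are investigating.
```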

Knowledge Base
Another subject that merits a chapter of its own. However you implement the KB (home-grown tools, a bunch of articles in your ITSM Help tool, Salesforce Knowledge or a dedicated tool), you must start early, collecting how-tos, tips, known problems, etc. in a central repository with easy search and a category index. This KB could also contain admin-level scripts to allow quick resolution of known issues. With a full implementation of STORM™ Incident Management, alerts will already carry pointers to the relevant articles in the knowledge base.
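
The idea of alerts pointing into the knowledge base can be sketched as a simple lookup. The article IDs, titles and alert types below are hypothetical, and a real setup would pull them from whichever KB tool you use.

```python
# Hypothetical KB index: article ID -> title.
KB_ARTICLES = {
    "KB-101": "Restarting the email queue worker",
    "KB-204": "Diagnosing slow database queries",
}

# Hypothetical mapping from alert type to the articles that help resolve it.
ALERT_TO_ARTICLES = {
    "email_not_sent": ["KB-101"],
    "database_slow": ["KB-204"],
}


def enrich_alert(alert_type: str, details: str) -> dict:
    """Attach pointers to known KB articles before the alert is sent out."""
    article_ids = ALERT_TO_ARTICLES.get(alert_type, [])
    return {
        "type": alert_type,
        "details": details,
        "kb_articles": [(aid, KB_ARTICLES[aid]) for aid in article_ids],
    }


alert = enrich_alert("database_slow", "p95 query time above threshold")
print(alert["kb_articles"])
```

An alert with no known article comes back with an empty list, which is itself a useful signal: once the incident is resolved, that is a KB entry waiting to be written.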

Prayer
Now that you have done your preparation work, you are in much better shape to reduce the likelihood of a catastrophic event and to respond and remediate faster when one occurs. All that is left for you is to pray that it doesn’t happen at night, or at least not on a weekend night, or at least not on a holiday weekend night, or at least not when you are heading out of the house, suitcases packed, on your way to your annual vacation.


The next post discusses The Main Story - best practices during an incident.