Wednesday, February 05, 2014

STORM™ For Dummies

This is the first of a two-post article. As I promised in my previous post, Taking the Cloud by STORM™, I will elaborate here on the various practices.

Change Management
Curiosity killed the cat? No, Change is what killed the cat. I’m talking about Schrödinger’s cat, and if you don’t know what I’m referring to, just keep in mind that a tiny change in a subatomic particle is what caused the cat’s permanent downtime.
In a broad sense, Change is what causes all service degradation, whether full downtime or bad response times. In other words, “if it ain’t broke, don’t fix it”. But changes are always needed in your SaaS environment, whether adding another web server to the load balancer, changing a configuration file or upgrading the software overnight. Add to that the fascinating fact that between 60% and 70% of service disruptions are caused by the human factor.
So, Change must be done very carefully. So many things can go wrong that one needs a disciplined routine that takes as many factors into consideration as possible and is carried out rigorously, following precise instructions. It includes the Change Request, the Change Calendar and Window, the CAB (Change Advisory Board), the Operations Maintenance Plan, the SOPs (Standard Operating Procedures) and the Change recording. This is what Change Management is all about.
In my many years in the field of SaaS Service Operations I have witnessed all the possible horror scenarios at SaaS companies where Change Management was at best very meager, and usually nonexistent.
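For readers who think in code, here is a minimal Python sketch of one piece of that routine: checking a Change Request against the Change Window before anyone touches production. The class names, fields and the three checks are illustrative assumptions, not part of any particular Change Management product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ChangeWindow:
    # Hypothetical approved maintenance window from the Change Calendar.
    start: datetime
    end: datetime


@dataclass
class ChangeRequest:
    # Hypothetical fields; a real Change Request would carry many more.
    summary: str
    asset: str
    planned_start: datetime
    planned_end: datetime
    cab_approved: bool = False
    sop: str = ""  # reference to the SOP that will be followed


def blocking_issues(request, window):
    """Return the reasons this change should not go ahead (empty list = go)."""
    issues = []
    if not request.cab_approved:
        issues.append("not approved by the CAB")
    if not request.sop:
        issues.append("no SOP attached")
    if request.planned_start < window.start or request.planned_end > window.end:
        issues.append("outside the change window")
    return issues
```

An empty list means the change may proceed; anything else is a reason to stop and go back to the CAB.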

Asset Management
Changes are done to Assets. An Asset could be as large as a datacenter (a new router that affects all boxes), it could be a database cluster (an upgrade of the Oracle RAC) or as small as a JPEG file embedded in an HTML document. In an ideal world, a SaaS company would deploy a CMDB (Configuration Management Database) with a team of administrators who do nothing but maintain it. But in the real world, as SaaS companies grow from two to four to ten boxes and beyond, the assets are maintained, in a best case scenario, on an Excel worksheet on the sys-admin’s laptop.
When a change is performed on an asset there is a need to understand the possible impact on other assets, or on customers, and it needs to be documented. An Asset Management solution records each asset with relevant information (name, IP address, location, function, make etc.), records the relationship to other assets and maps that asset to a customer (or group of customers). When a change is performed on an asset, the Change Management system will point to the relevant asset and record the history of modifications to that asset.
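To make the idea of relationships and customer mapping concrete, here is a small Python sketch of an asset inventory with an impact query. It assumes each asset simply lists the assets it depends on; the `Asset` fields and the `impacted_by` function are hypothetical names chosen for illustration, not a real CMDB schema.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    function: str
    ip_address: str = ""
    location: str = ""
    depends_on: list = field(default_factory=list)  # names of assets this one relies on
    customers: set = field(default_factory=set)     # customers served directly by this asset
    history: list = field(default_factory=list)     # recorded changes to this asset


def impacted_by(inventory, changed):
    """Return (asset names, customers) that a change to `changed` could affect.

    `inventory` maps asset name -> Asset, and `changed` must be one of its keys.
    """
    # Invert the depends_on edges: for each asset, who relies on it?
    dependents = {}
    for asset in inventory.values():
        for dep in asset.depends_on:
            dependents.setdefault(dep, []).append(asset.name)

    # Breadth-first walk from the changed asset through everything that relies on it.
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for name in dependents.get(current, []):
            if name not in seen:
                seen.add(name)
                queue.append(name)

    customers = set()
    for name in seen:
        customers |= inventory[name].customers
    return seen, customers
```

Even a worksheet-sized inventory answers the key question this way: if I touch this box, which other boxes and which customers should I worry about?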

Event Management
With the understanding that changes to assets cause bad vibes, we need to look out for changes and understand their effects. That is done mainly by monitoring and responding to reported changes. Look at as many attributes of your assets as you can and detect when things have changed: high CPU, low disk space, an OS service that has stopped functioning, long response times, etc. But detecting the change is not enough. How do you respond to the changes? When do you determine that a change is cause for trouble? How much automation should you employ? (Short answer: a lot.) What do you do when a threshold is reached? How do you control Alert Overloading? How do you build Escalation into the alert mechanism? This is what Event Management deals with.
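As a rough illustration of thresholds, Alert Overloading and Escalation working together, here is a minimal Python sketch. It assumes a simple policy that I made up for the example: alert on the first breach, stay quiet on repeats, escalate once after a fixed number of consecutive breaches, and report when the metric recovers. It only handles "too high" metrics such as CPU; a "too low" metric such as free disk space would need the comparison flipped.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str
    threshold: float
    escalate_after: int = 3  # consecutive breaches before escalating (assumed policy)


class EventManager:
    def __init__(self, rules):
        self.rules = {rule.metric: rule for rule in rules}
        self.breaches = {}  # (asset, metric) -> consecutive breach count

    def observe(self, asset, metric, value):
        """Process one sample; return 'alert', 'escalate', 'clear' or None."""
        rule = self.rules.get(metric)
        if rule is None:
            return None  # nobody asked us to watch this metric

        key = (asset, metric)
        count = self.breaches.get(key, 0)

        if value < rule.threshold:
            # Back to normal: announce recovery only if we had alerted before.
            self.breaches.pop(key, None)
            return "clear" if count else None

        count += 1
        self.breaches[key] = count
        if count == 1:
            return "alert"
        if count == rule.escalate_after:
            return "escalate"
        return None  # repeat breach: suppressed to avoid alert overload
```

The point of the suppression branch is that a box stuck at 98% CPU produces one alert and one escalation, not a page every minute.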

Incident Management
OK, so you are now practicing a robust Change Management regime, with all your Assets accounted for and monitored, and with a good understanding of the impact of changes on your service. You are in a much better state than before, but will service disruptions still happen? Of course they will! They will bite you when and where you least expect them to. What to do! What to do?
One option is running around hysterically screaming “the sky is falling” and shooting in all directions, while making sure you cover your behind in anticipation of the blame game that will surely take place after the crisis is over.
Another option is to implement Incident Management. IM covers all aspects from Detection, through Recording, Classification, Notification, Escalation, Diagnosis and Resolution, to Closure. All of it is done in a cool-headed manner, following a well-rehearsed routine that will allow the company to analyze the incident and apply the lessons learned, so that you will not face this particular crisis again.
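The stages above can be sketched as a tiny state machine in Python. This is only an illustration under a simplifying assumption: every incident passes through all eight stages in strict order, which real incidents often do not (a minor one may never need Escalation). The `Incident` class and its methods are hypothetical names.

```python
from datetime import datetime, timezone

# Stages as listed in the text, in the order an incident moves through them.
STAGES = [
    "Detection", "Recording", "Classification", "Notification",
    "Escalation", "Diagnosis", "Resolution", "Closure",
]


class Incident:
    def __init__(self, summary):
        self.summary = summary
        self.log = []  # (stage, UTC timestamp, note) entries, oldest first
        self.advance("incident detected")

    @property
    def stage(self):
        return self.log[-1][0]

    def advance(self, note=""):
        """Move to the next stage, recording when and why; return the new stage."""
        next_index = len(self.log)
        if next_index >= len(STAGES):
            raise ValueError("incident is already closed")
        self.log.append((STAGES[next_index], datetime.now(timezone.utc), note))
        return self.stage
```

The timestamped log is the useful part: it is what lets you analyze the incident afterwards instead of reconstructing it from memory.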

SLA Management
And then, some (or all) of your customers will require that you adhere to your SLA; but who knows if it was breached? Sometimes it is very clear: the system was down for an hour, everybody was affected and the SLAs were breached. But many times it could be degradation of the service, or partial downtime. Who knows which customer was affected by what, and for how long? This is where SLA Management comes in, and why it is so important to have Incident Management in place, so that one can compare the actual damage with the particular SLA. The other option is going through a bunch of printed documents (if you can even find them), trying to figure out for each customer what their agreement covered and, if it was breached, whether they should be compensated and by how much.
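Here is a minimal Python sketch of that comparison, assuming each customer's SLA boils down to a single availability percentage over a reporting period and that every recorded incident counts fully as downtime for the customers it touched. Real agreements are rarely that simple (how degradation counts is contract-specific), and the names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IncidentRecord:
    customers: set           # customers affected by this incident
    downtime_minutes: float  # how long they were affected


def sla_report(incidents, sla_targets, period_minutes):
    """Map each customer to (availability %, breached?) for the period.

    `sla_targets` maps customer -> promised availability in percent.
    """
    report = {}
    for customer, target_pct in sla_targets.items():
        down = sum(
            incident.downtime_minutes
            for incident in incidents
            if customer in incident.customers
        )
        availability = 100.0 * (period_minutes - down) / period_minutes
        report[customer] = (round(availability, 3), availability < target_pct)
    return report
```

For example, over a 30-day month (43,200 minutes) a 99.9% target leaves room for 43.2 minutes of downtime, so a single one-hour outage breaches it, while a customer on a 99.5% target (216 minutes) is still within the agreement.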



In the next post I will discuss Communication Management, Release Management (DevOps), Operations Intelligence and The Dummies.
