Thursday, March 18, 2010

Change Management and the Sanctity of Production

“Most people are afraid of change. We love it!” (Sign of a beggar on a street in San Francisco, 2004)

(Note: This article is part of the STORM™ methodology)

Change is the greatest cause of service interruption in any IT operation. Period. Stop. Exclamation mark.
In the extreme, one might argue that change is the cause of every service interruption, if one counts hardware and software malfunctions as changes as well.
SaaS operations tend to suffer more from a lack of proper change management for two reasons. First, the consequences of a service outage are dire for a company whose entire existence depends on its service. Second, SaaS engineers, as I have written in numerous posts, lack the discipline that is more inherent in IT departments.

And yet, my experience has been that in most SaaS companies, changes are unsupervised, undocumented, unauthorized, unplanned, sometimes unnecessary, underestimated and un_____ (fill in the blank).

The importance of Change Management cannot be overstated, and it is the first practice that I have implemented at the companies I have worked at (or for). I wince when I recall the casualness I have witnessed at various SaaS companies about making changes to the production system. I can quote my former boss, Mansur Salame, CEO of Contactual, saying that “production should be treated as sacred, with utmost respect appropriate to holy places” (or something of the sort). And, boy, were we sacrilegious back in those days!

In the chapter on Change Management in the upcoming book, I will present my STORM™ adaptation in detail.

In this post I will outline some guidelines for sane change management.

Sixty Seconds on Change Management
Below are listed the objects that comprise a comprehensive Change Management practice.

RFC – Request for Change document. It must initiate the process, regardless of how minor or major the change is. It should include the what, the why, the when, the risk, the potential impact (on customers or components), and a checklist of the notifications and tests that should or should not be done.
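To make the RFC fields concrete, here is a minimal sketch in Python. The class name RequestForChange, its field names and the helper missing_fields are my own illustrative assumptions, not part of any STORM™ template; the point is only that an incomplete RFC can be rejected mechanically before anyone reviews it.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime
from typing import List, Optional


@dataclass
class RequestForChange:
    """Hypothetical RFC record holding the items listed above."""
    what: str                       # the change itself
    why: str                        # the reason for making it
    when: Optional[datetime]        # planned start, None if not yet scheduled
    risk: str                       # assessed risk
    impact: str                     # potential impact on customers or components
    checklist: List[str] = field(default_factory=list)  # notifications and tests


def missing_fields(rfc: RequestForChange) -> List[str]:
    """Return the names of fields that are empty or unset, in declaration order."""
    return [f.name for f in fields(rfc) if not getattr(rfc, f.name)]


rfc = RequestForChange(
    what="Upgrade database engine",
    why="",
    when=None,
    risk="low",
    impact="none expected",
)
print(missing_fields(rfc))
```

An RFC for which missing_fields returns a non-empty list would simply not enter the approval process.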

A Change Window must be defined, clearly stating which types of changes to which subsystems are allowed on which days and during which hours.
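A Change Window defined this way can be checked automatically. The sketch below is one possible representation, under my own assumptions: the change types ("minor", "major"), the subsystem names and the example hours are hypothetical, and windows that cross midnight are not handled.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class ChangeWindow:
    """One rule: a type of change to a subsystem, on given days and hours."""
    change_type: str           # hypothetical categories, e.g. "minor" or "major"
    subsystem: str             # hypothetical names, e.g. "web" or "database"
    weekdays: FrozenSet[int]   # Monday is 0, Sunday is 6
    start: time                # inclusive
    end: time                  # exclusive; must be later than start on the same day

    def allows(self, change_type: str, subsystem: str, when: datetime) -> bool:
        return (
            change_type == self.change_type
            and subsystem == self.subsystem
            and when.weekday() in self.weekdays
            and self.start <= when.time() < self.end
        )


def is_allowed(windows: Iterable[ChangeWindow], change_type: str,
               subsystem: str, when: datetime) -> bool:
    """A change is allowed only if at least one window permits it."""
    return any(w.allows(change_type, subsystem, when) for w in windows)


# Example policy (invented for illustration): minor web changes on
# Tuesday and Thursday evenings, major database changes on Saturday nights.
WINDOWS = [
    ChangeWindow("minor", "web", frozenset({1, 3}), time(22, 0), time(23, 30)),
    ChangeWindow("major", "database", frozenset({5}), time(1, 0), time(5, 0)),
]

print(is_allowed(WINDOWS, "minor", "web", datetime(2010, 3, 18, 22, 30)))
```

Anything not explicitly permitted by a window is refused, which matches the spirit of treating production as sacred.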

Change Calendar – may be implemented in a number of static or automatic formats. It must represent the ‘Change Window’, depict all planned changes by the company, service providers and customers, and be part of the RFC process.

Change Advisory Board, or CAB, is the pre-determined group of people who scrutinize and approve the RFC. The CAB may be large or small, but it should include at least one person who is not involved in the request and planning process. The CAB may meet on a recurring schedule or as needed.
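The independence rule is easy to enforce in whatever tool holds the RFC. A minimal sketch, assuming people are identified by plain strings; the function names are hypothetical.

```python
from typing import Iterable, List


def independent_members(cab: Iterable[str], involved: Iterable[str]) -> List[str]:
    """CAB members who took no part in requesting or planning the change."""
    return sorted(set(cab) - set(involved))


def cab_can_review(cab: Iterable[str], involved: Iterable[str]) -> bool:
    """True if at least one CAB member is independent of the request."""
    return len(independent_members(cab, involved)) > 0


print(independent_members(["ana", "raj", "lee"], ["raj", "lee"]))
```

If cab_can_review returns False, the RFC needs an additional reviewer before it can be approved.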

A Change Record is a record describing a change that occurred in production or the eco-system. It could be implemented as a database, an Excel spreadsheet, or within a ticketing system. It should include the what, the when and the impact (on customers or components). Much important information that pertains to the Incident and Availability Management practices can be derived from this data store.

The Maintenance Plan is a detailed document defining the prerequisite tasks, the maintenance tasks, the rollback tasks and the post-maintenance tasks. Each task should have a description, an owner, a time and a duration. In most cases the plan must be scrutinized down to the lowest level of detail, and practiced in a Pre-Production environment that should mimic the production environment as much as possible.
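A plan structured like this lends itself to a simple completeness check. The sketch below is one way to do it, with assumptions of my own: the Task class, the phase labels and the plan_problems helper are illustrative names, and the check only verifies that every phase has at least one task and that every task has a description, an owner and a positive duration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# The four task groups named above (labels are illustrative).
PHASES = ("prerequisite", "maintenance", "rollback", "post-maintenance")


@dataclass
class Task:
    phase: str
    description: str
    owner: str
    start: datetime
    duration: timedelta


def plan_problems(tasks: List[Task]) -> List[str]:
    """Return human-readable problems; an empty list means the plan is complete."""
    problems = []
    phases_present = {task.phase for task in tasks}
    for phase in PHASES:
        if phase not in phases_present:
            problems.append(f"no {phase} tasks defined")
    for task in tasks:
        label = task.description or "(no description)"
        if task.phase not in PHASES:
            problems.append(f"{label}: unknown phase {task.phase!r}")
        if not task.description.strip():
            problems.append(f"{label}: missing description")
        if not task.owner.strip():
            problems.append(f"{label}: missing owner")
        if task.duration <= timedelta(0):
            problems.append(f"{label}: duration must be positive")
    return problems


start = datetime(2010, 3, 18, 22, 0)
plan = [
    Task("prerequisite", "Back up database", "dba", start, timedelta(minutes=20)),
    Task("maintenance", "Apply schema patch", "dba",
         start + timedelta(minutes=20), timedelta(minutes=30)),
    # No rollback or post-maintenance tasks yet, so the check reports them.
]
print(plan_problems(plan))
```

A plan with no rollback tasks is the most common gap in my experience, and it is exactly the kind of omission a check like this catches before the CAB ever sees the plan.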

The sixty seconds are over. This was just a teaser. Obviously there are templates, workflows and a slew of details that tie all of these objects into a well-oiled practice. The book will expand on the details and include the workflows, the templates and methods for automation.
To summarize: as the market matures and competition thrives, the big differentiator will be the second ‘S’ in SaaS, and customers will become less and less forgiving. A SaaS company that does not maintain a robust Change Management practice will end up paying in frustrated staff and customer churn.