Thursday, August 28, 2014

SaaS Incident Management Best Practices - Epilogue (The STORM™ Series)

“If you want a happy ending, that depends, of course, on where you stop your story.” (Orson Welles)

This is the final post, covering Incident Management in a SaaS Operational Environment.

The previous posts discussed the Prologue, Act I and Act II, which covered the preparations, how to react and then how to act in an incident.

As much as one would like to sit back and sip a cool Coke/beer once the incident is over and service is restored, “it ain’t over till it’s over”. There is much work to be done to get closure.

Notify - Update concerned parties of service restoration

All parties - internal, partners and customers - should be notified that the incident is over.
The first action is to post an all-clear message on the Service Status Page.
Use the same mailing lists (internal and external) that were used to notify of the incident, to spread the good word that service is restored.

Update “Top 10” customers. The concept of the Top 10 is used in other STORM™ practices and is used metaphorically, since the list may contain only seven or twenty seven customers. This is a small group of key functionaries at a select collection of customers with whom the company has developed special relations. These customers tend to be the larger ones or the most profitable ones, or strategic in one way or another. These key players should be handled with Tender Loving Care. Following some incidents, the executives of the SaaS provider would be making the calls if the problem was severe enough and the customer is important enough. The Top 10 must be updated ASAP, so that they won’t hear from other sources of the issues and the resolution. There is a good chance that if the incident was prolonged, some members of the Top 10 were already contacted at the early stages, in the Notify phase.

Record Incident 

Recording the incident has a number of goals/benefits.

  1. Allows the company to manage the SLAs in order to determine who was affected and for how long
  2. An important KPI – allows the company to measure progress across time
  3. Used in the Review and simplifies the creation of the RFO (below)
  4. Enables the company to improve across processes, components and Incident Management in the future.
  5. An important datum of OI – Operations Intelligence - to be used in analytics, prediction and cost cutting.

Each incident must be documented with as much detail as possible and should include the following:

  • Events leading to outage 
  • Time-line of events and actions
  • Components that were affected 
  • Customers, or customer groups that were affected 
  • Resolution - what was done to resolve the problem (may include technical data such as commands that were run) 
  • Indication of full/partial/no outage 
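The checklist above could be captured in a small record structure. This is only a minimal sketch: the class name `IncidentRecord`, its field names and the sample data are hypothetical and not part of STORM™. It shows how a recorded timeline also yields the outage duration needed for SLA reporting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class IncidentRecord:
    # Hypothetical field names that mirror the checklist above.
    leading_events: str
    timeline: List[Tuple[datetime, str]]  # (when, event or action)
    affected_components: List[str]
    affected_customers: List[str]
    resolution: str
    outage_level: str  # "full", "partial" or "none"

    def duration(self) -> timedelta:
        """Elapsed time between the first and last timeline entries."""
        times = [when for when, _ in self.timeline]
        return max(times) - min(times)


# Invented sample incident, for illustration only.
record = IncidentRecord(
    leading_events="Disk filled up on the primary database server",
    timeline=[
        (datetime(2014, 8, 28, 9, 0), "Monitoring alert received"),
        (datetime(2014, 8, 28, 9, 20), "Old log files purged"),
        (datetime(2014, 8, 28, 9, 45), "Service restored"),
    ],
    affected_components=["primary-db"],
    affected_customers=["Acme", "Globex"],
    resolution="Purged logs and added a disk-usage alert",
    outage_level="partial",
)
print(record.duration())  # elapsed time, usable for SLA reporting
```

Keeping the timeline as a list of timestamped entries means the same record can feed both the Post Mortem and the RFO.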

NOTE Any changes to components in the system must be reported to the Asset Management Database and Change Management tables

Review - Post Mortem

The value of the incident review cannot be overestimated. The staff attending the Post Mortem should include anyone that was involved in the incident, or at least a representative from that function. E.g. if multiple CSRs were involved, or multiple Ops engineers, it will suffice if one member of those groups attends, but that person needs to collect all the relevant information from that group’s perspective.
The review should take place as soon as the relevant information is available, and at most two days following the incident.
A successful Post Mortem should result in:

  • A clear view of the events, activity and timeline 
  • Understanding the root cause
  • Understanding the damages
  • Analyzing what worked out and what did not during the incident
  • Verifying that Known Problems and Workarounds are updated on the Knowledge Base
  • Verifying that all notifications and customer facing activity have been performed
  • Remediation steps
  • Lessons learned and what should be done differently next time
  • Agreement on the messaging and/or distribution of the RFO (below)


RFO – Reason for Outage

The final step in the closure is filling out and sending the RFO, using a pre-defined template.
The RFO is a document, expected by the customers, describing the incident, the cause and what was done to minimize future recurrences.
Sending out the RFO should be done with careful thought. First, the wording and messaging are important. One needs to be as transparent as possible without seeming like complete fools. Also, if only a portion of the customers were affected, perhaps it is wise not to advertise the service degradation to those who had no idea that anything was amiss. This decision is left to company policy or decided per incident at the Post Mortem.

The RFO should include these fields:

  • Meaningful title
  • Date of Outage
  • Time and duration
  • Incident description
  • Affected services
  • Root cause
  • Resolution
  • Next steps
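A pre-defined template can be as simple as a fixed list of fields plus a check that none is left empty before the document goes out. The sketch below is one hypothetical way to do that; `RFO_FIELDS`, `render_rfo` and the sample values are illustrative assumptions, not the template from the book.

```python
# Field names follow the list above; the structure itself is an assumption.
RFO_FIELDS = [
    "Title",
    "Date of outage",
    "Time and duration",
    "Incident description",
    "Affected services",
    "Root cause",
    "Resolution",
    "Next steps",
]


def render_rfo(values: dict) -> str:
    """Return the RFO as plain text, refusing to render an incomplete one."""
    missing = [name for name in RFO_FIELDS if not values.get(name)]
    if missing:
        raise ValueError("RFO is missing fields: " + ", ".join(missing))
    return "\n".join(f"{name}: {values[name]}" for name in RFO_FIELDS)


# Invented sample values, for illustration only.
sample = {
    "Title": "Delayed report generation",
    "Date of outage": "2014-08-28",
    "Time and duration": "09:00 UTC, 45 minutes",
    "Incident description": "Scheduled reports were not produced",
    "Affected services": "Reporting",
    "Root cause": "Disk full on the reporting database server",
    "Resolution": "Old log files purged, service restarted",
    "Next steps": "Add disk-usage alerting",
}
print(render_rfo(sample))
```

Failing loudly on a missing field keeps a half-finished RFO from reaching customers.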

This is the fourth and last article of the Incident Management series. The book, when published, will contain more details and useful templates.

May the STORM™ be with you.





Wednesday, August 13, 2014

SaaS Incident Management Best Practices - Act II (The STORM™ Series)

“A thought which does not result in an action is nothing much, and an action which does not proceed from a thought is nothing at all” (Georges Bernanos)


This post is the third, covering Incident Management in a SaaS Operational Environment.
(This post first appeared earlier this week on SaaS Addict)

The previous post, covering the initial activities of the incident, discussed the more reactive tasks, namely Detection, Recording and Classification. This post will discuss the proactive stages leading to resolution.

Notification – Inform everybody of the incident.

There are three groups that must be made aware of the incident as soon as it is classified:

  • Internal staff. A predefined list of who gets notified within the company must exist. Whether it is done via email, chat, WhatsApp, phone call or carrier pigeon should be determined (ahead of time) according to the classification (urgency and impact). You do NOT want a situation where a major customer informs the sales rep of a problem.
  • Customers. Sometimes the classification of the problem would determine that there are no impacted customers right now and that service could be restored shortly. In this case there is no advantage in creating mass hysteria. The Status-Page (as described in the first post) should be updated first. Now, depending on many circumstances, there are options of sending out an email to all customers, affected customers, highly valued customers, etc. Under a certain set of rules, account managers may call their customers to inform them personally. If the application (is not down and) has a notification box, this is a good opportunity to inform actual users of problems.
  • Partners / Channels. Don’t forget your partners. Sometimes in the heat of an incident they are not notified. It may affect them and their customers.
The points I am trying to nail are:
  1. Do not risk having customers discover on their own that there are problems – if they are likely to find out, make sure you are the one informing them.
  2. Try to determine all this activity prior to the incident, not while you’re in the middle of it.

Note: Status Page
This is part of the Notification process, but it merits its own section.
The first Status Page I implemented was at a SaaS provider whose service was business critical. Before we implemented it, each event, real or imaginary, would generate hundreds of calls to the support center. The lines would clog up and the customers would leave angry or frustrated messages. They would try again later and still get the ‘please leave your message’. After the event was over the exhausted CSRs would have to open a helpdesk ticket for every recorded message, and call back the users. This was not only wasted effort and time consuming, but we ended up with many frustrated customers.

Once the Status Page was implemented, it took a few weeks to get the customers used to checking it, and the number of calls we got during an incident was reduced by two orders of magnitude!
Keep in mind that the Status page should be updated regularly, with a timestamp attached. Any information that can be provided to the customers will boost their confidence and give them a sense of how soon the problem would get resolved.

Escalate - Get the relevant people working on the problem ASAP

Having planned the Escalation Path in advance, as recommended in the previous post, this should be a straightforward process. Some issues may be resolved by a level-1 operator, but assume that in major incidents everybody will be involved. It is important to stick with the escalation path so as not to hinder the Investigation process.

It is imperative that an Incident Manager be assigned to the particular event. It may be decided in advance or ad-hoc. The IM gathers the relevant staff in the War Room (below) and manages the whole process, assigns tasks, collects information and ensures that the whole process is recorded.

Investigate - Determine the root cause

At this point we should have the following:

  • An assigned Incident Manager
  • People of relevance gathered together in the ’War Room’, whether physical or virtual
  • Understanding of the problem – what is not functioning
  • Understanding of the impact – who is suffering from it and how urgent is it
  • Understanding of the affected component – sometimes it is obvious from the outset that a major component is down, via monitoring or a report from a service provider, but in complex systems this is not always possible. Sometimes a problem in one sub-system will manifest itself as a problem in another dependent sub-system. The Known Problems in the knowledgebase should be very helpful.
  • Using the Component-Customer mapping, as described in the previous post, could be helpful to determine the culprit.
  • Assuming you have been following the practices of the STORM™ Change Management, you would have at your fingertips a query of all changes to the system that were done in the last X hours. There is a very high correlation between changes and failure, so it is safe to assume that the problem will become obvious. Keep in mind that changes should include everything in your production domain, including your service providers and your customers.
  • Usage of the Knowledgebase, as described in the previous post, might point to similar cases that were encountered in the past.
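The "changes in the last X hours" query mentioned above might look like the following. It is a minimal sketch that assumes changes are available as records with an `applied_at` timestamp; the function name, record layout and sample data are hypothetical, since the actual STORM™ Change Management schema is not described here.

```python
from datetime import datetime, timedelta


def recent_changes(changes, now, hours):
    """Return changes applied within the last `hours`, newest first."""
    cutoff = now - timedelta(hours=hours)
    recent = [c for c in changes if cutoff <= c["applied_at"] <= now]
    return sorted(recent, key=lambda c: c["applied_at"], reverse=True)


# Invented change log, for illustration only.
CHANGES = [
    {"id": 101, "applied_at": datetime(2014, 8, 13, 2, 0),
     "summary": "New web server added"},
    {"id": 102, "applied_at": datetime(2014, 8, 13, 9, 30),
     "summary": "Database index rebuilt"},
    {"id": 103, "applied_at": datetime(2014, 8, 13, 11, 15),
     "summary": "Load balancer config updated"},
]

now = datetime(2014, 8, 13, 12, 0)
suspects = recent_changes(CHANGES, now, hours=4)
for change in suspects:
    print(change["id"], change["summary"])
```

Listing the newest change first reflects the working assumption that the most recent change is the first suspect.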

Note: War Room
As described in the Prologue, a quiet environment, where only people who might contribute to the process are present, is vital. The war room should include up-to-date information on all aspects, and allow open communication between all parties. There should be a single entity, the Incident Manager, running the show, gathering information and assigning tasks to the various participants.
It is important to keep out of the room any person who might add unnecessary pressure and the IM should feel confident enough to kick the CEO out of the room if it is deemed necessary.
Remember that a customer support representative is present as well. The CSRs’ job is to report on any new developments from the customers’ point of view and to communicate to the customer base any progress, preferably through the Status Page.

Restore Service - Allow your customers to continue working

While still in the War Room, the process of restoring the service is done. There are usually three options:
  1. Resolving the problem. Sometimes the issue is straightforward and can be resolved by firing up a backup server, restarting a Windows service, switching to the last reliable version, or even relaunching the application. If there is a high probability that taking such action could bring the service back within minutes (this is open to interpretation), that is obviously the preferable route. A knowledgebase of Known Solutions would be a great asset at this point. Predefined scripts, as part of the KB, would be even better.
  2. Workaround. When the problem is not well understood and there is no guarantee that any remedial action will bring the service up, or even if it does, there is no guarantee that the problem will not reoccur within a short time, there should be a workaround solution. Such a solution might be a temporary one (such as reverting to the last working version, or database) and may include reduced functionality, but it will at least allow the customers to get back to work, until resolving the problem.
  3. Failover. Assuming redundancy across production systems (locations?) or a DR site is available, there is always the option of failing over to the backup service. This is not an easy decision and not without its costs, but if a workaround is not available and resolving the problem at the production site is going to take long, restoring service to your customers is paramount. 

Throughout the whole process the Status Page should be updated, and obviously, when service has been restored to a satisfactory level, that should be communicated. It is up to the Incident Manager to verify that this is being done and up to the CSR to perform that. The Incident Manager should not be assigned tasks herself; her only responsibility is to make sure that everything is being documented and to calmly coordinate the activity in the War Room.

In the next post – the Epilogue – we will look at the events and activities that take place after the service was restored.

Thursday, August 07, 2014

SaaS Incident Management Best Practices - Act I (The STORM™ Series)

"It is not stress that kills us; it is our reaction to it" (Hans Selye)

This is the second posting of the Incident Management Best Practices. The first part covered the Prologue - the preparations that will improve your survival rate for the next incident that is just waiting to happen.
(This post first appeared on SaaS Addict earlier this week)
The main story contains two acts. This, the first one, deals with the reactive part of the incident. The next post, Act II, will deal with the proactive part of managing an incident.

How Soon?
  • How soon did you find out there was a problem?
  • How soon did you inform your customers?
  • How soon did you know the extent of the problem?
  • How soon did you know the impact?
  • How soon did you start handling the incident?
  • How soon did you get a workaround functioning?
  • How soon did you resolve it?
  • How soon were your customers informed about the resolution and causes?

Incident management is all about minimizing the damage, doing so in the shortest time possible and taking steps to improve the service going forward.
If you followed the practices recommended in the previous post, you will be much better prepared to deal with whatever hits you, be it as small as a weekly report not being produced  or as bad as a full crash of the system.

ITIL Incident Management
This post does not aim to replace an ITIL certification course (STORM™ was inspired by ITIL and many of the terms are borrowed from ITIL/ITSM), but it follows, to an extent, the activities as they appear in much of the ITSM literature.
The idea behind this approach is to keep a level head, to ensure that all details are captured and to recover ASAP.

The stages are:
  • Detect
  • Record
  • Classify
  • Notify
  • Escalate
  • Investigate
  • Restore

Detection - Initiation of the process
When an incident is detected in the organization, it could originate from a monitoring system that alerts the staff just as it happens. In the best case scenario, the problem could be resolved before most customers are even aware of a problem. Alas, the incidents we are discussing usually don’t have this “lived happily ever after” ending.

In reality, many of the really bad problems are those you did not anticipate, and you are made aware of them by a customer calling in, or perhaps a member of the staff noticing something wrong while doing routine work or demoing the product. Often your monitoring system will alert you to an issue that is derived from the real problem, without detecting the problem itself. These cases are misleading and will result in a longer time to resolution.

Regardless of how the issue was detected, the process must begin at the same single starting point - the helpdesk or its equivalent in the company.  It may be a dispatcher, or the support person who is on call at home.

Recording - Keeping track of events

When a nasty problem arises, it may be very low on your priority list to do clerical work, so one may be tempted to postpone this activity to a later stage.
It will become crucial later to have that information and therefore it is important that you pause for a minute and record the information. You might scribble it on a piece of paper to be later entered into the system or entered directly, but capture the following:
  • Date and time of incident first recorded
  • Means of detection (customer call, internal monitoring alert, external monitoring, vendor call, internal staff, etc.)
  • Manifestation – how did the problem appear at first
Recording should continue throughout the incident. The Status Page allows capturing some of the information.

Classification – Determine the impact on customers / components and the urgency.
When the single datacenter is down, it will be rather simple to determine impact and urgency, but many a time, only some customers are affected, or only some functionality is missing.

It is highly important to determine the impact of the incident to allow a proper reaction. Say, you have an affected production system in the UK and your product is mainly a 9-to-5 solution, and it is evening in Europe. The urgency will not be as high as if your East Coast production was down and it is 11:00AM EST right now.
Perhaps the synchronized reporting database is not accessible. It is not as bad as if the transaction database was out of commission, basically shutting down your operation.
The classification will determine many factors in how you manage the incident and therefore it is paramount that you don’t get it wrong.
Remember the ‘customer/component mapping’ from the Prologue posting? This is a good time to utilize it to determine affected component or customers.
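One common way to turn impact and urgency into a single classification is a small priority matrix. The sketch below is an assumption for illustration, not a STORM™ rule: the `P1` to `P4` labels, the scoring and the 9-to-5 business-hours test are all hypothetical.

```python
# Hypothetical three-step scale shared by impact and urgency.
LEVELS = {"low": 0, "medium": 1, "high": 2}


def urgency_for(local_hour: int) -> str:
    """Assume a 9-to-5 product: urgent only during local business hours."""
    return "high" if 9 <= local_hour < 17 else "low"


def classify(impact: str, urgency: str) -> str:
    """Combine impact and urgency into a priority label (P1 is most severe)."""
    score = LEVELS[impact] + LEVELS[urgency]
    return {4: "P1", 3: "P2", 2: "P3"}.get(score, "P4")


# East Coast production down at 11:00 local time.
print(classify("high", urgency_for(11)))
# UK production affected at 20:00 local time.
print(classify("high", urgency_for(20)))
```

The same outage is classified differently depending on the local time of the affected customers, which is the point of the UK versus East Coast example above.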

The next post, ACT II, will deal with the more proactive aspects, namely Notification, Escalation, Investigation and Restoration

Saturday, July 19, 2014

SaaS Incident Management Best Practices (The STORM™ Series)

"Remember that failure is an event, not a person" (Zig Ziglar)

(this post was first published on July 14, in SaaS Addict)

Shift Happens
No matter how much you prepare, the amount of resources you pour into your operation and the experts you hire, something will break in your production system eventually. (This is not a new subject  - see the Black Swan Event in SaaS Ops and STORM™ for Dummies)

Incidents could manifest themselves as slow response time, loss of a function (e.g. email not being sent, Backup data not FTPed to client), a whole module malfunctioning or a full scale downtime for some or all of your customers.
I have witnessed too many service disasters in my professional career and I have the battle scars to show for it. On the bright side, the STORM™ methodology grew out of the smoldering ruins.

Following is a three piece article discussing how to prepare for, manage and survive your next incident (and it is out there, lurking in the dark right now, rubbing its palms together, waiting to strike).
As in any good apocalyptic novel, this article is divided into the Prologue, the Main Story and the Epilogue. The first article will cover Act I – Before the world came to an end.

Prologue - ACT I
Plan for failure. Expect things to break. With that in mind you should try to anticipate what might break first.
Keep in mind however, that as much as you plan, there is a good chance of something else coming out of a dark corner and biting you in the behind, since you don’t know what you don’t know. Still, for all the other things that you could anticipate this is a good plan.

SPOF Analysis
Single Point Of Failure Analysis is an exercise that should be practiced once a quarter. Without going into too many details the following outlines the principles:
  • Map all your IT assets (and by ALL I mean not just the important stuff like databases and web servers, but also the phone connections to corporate HQ)
  • Perform on each item an Impact Analysis – What would be the Effect of Failure of that component on the service levels
  • Assess the Risk of losing each component. What factors might lead to failure
  • Mitigate the problem by taking proactive measures; e.g. invest in a better storage solution or duplicate the component and load balance between the two.
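The four steps above can be roughed out as a ranking exercise. This sketch assumes each asset is scored from 1 to 5 for impact and for likelihood of failure and that risk is their product; that scoring scheme, the field names and the sample assets are hypothetical illustrations rather than the STORM™ procedure.

```python
def rank_spofs(assets):
    """Rank non-redundant assets by impact x likelihood, highest risk first."""
    scored = [
        (asset["impact"] * asset["likelihood"], asset["name"])
        for asset in assets
        if not asset["redundant"]  # a duplicated component is not a SPOF
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Invented asset map, for illustration only.
ASSETS = [
    {"name": "database", "impact": 5, "likelihood": 3, "redundant": False},
    {"name": "web-server", "impact": 4, "likelihood": 2, "redundant": True},
    {"name": "hq-phone-line", "impact": 2, "likelihood": 4, "redundant": False},
]

print(rank_spofs(ASSETS))
```

Assets at the top of the list are the first candidates for mitigation, such as duplication and load balancing.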

Monitoring & Alerting
These practices are explored in detail in the STORM™ Event Management chapters. Suffice to say that:
  • Monitoring allows you to detect the problem so that you can fix it
  • Detecting a problematic component allows you to react early on (before your customer calls in, pissed off)
  • Therefore, monitor EVERYTHING – components, processes, communications, workflows, response time, etc.

Escalation Path and Hunt Group
  • Define who gets alerted under what circumstances. E.g. if there is a slowdown on response time for the database, the first person that should be alerted might be the DBA. Map out the various scenarios, who has ownership and if the event merits escalation.
  • Define the escalation path while you are calmly sipping tea before an event occurs, so those decisions are not made under duress.
  • Define the Hunt Group, policy and call sequence. Most PBXs and telephony services provide this feature.
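An escalation path defined ahead of time can be stored as plain data and looked up when an alert fires. The following is a minimal sketch; the event names, roles, the 15-minute step and the function `who_to_alert` are all hypothetical choices made for illustration.

```python
# Hypothetical paths: the first entry is alerted first, then the next, etc.
ESCALATION_PATHS = {
    "database_slowdown": ["DBA on call", "Ops manager", "VP Operations"],
    "site_down": ["Ops engineer on call", "Ops manager", "CTO"],
}
DEFAULT_PATH = ["Helpdesk", "Ops manager"]


def who_to_alert(event_type, minutes_unacknowledged, step_minutes=15):
    """Pick the person to alert, moving up one level per unanswered step."""
    path = ESCALATION_PATHS.get(event_type, DEFAULT_PATH)
    level = min(minutes_unacknowledged // step_minutes, len(path) - 1)
    return path[level]


print(who_to_alert("database_slowdown", 0))
print(who_to_alert("database_slowdown", 20))
```

Because the path is data rather than a decision made on the spot, nobody has to work out who to call while under duress.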

Document Changes
As part of the STORM™ Change Management practice, all changes to production should be documented and easily accessible, so that one can query changes immediately in case an incident occurs (across location, function, affected customers, etc.). Remember that some change, somewhere in your production universe, will always be the cause of an incident. It may be as obvious as installing a new Web server or as obscure as a new browser version on one of your customer’s laptops.

Customer-Asset Mapping
Create a relationship chart of which assets (or group of assets) can be mapped to a customer (or group of customers).  An asset may be the AsiaPac data center in Melbourne or a config file on one of the VMs in AWS.
When the time comes, this will be very handy in both locating the problem and assessing the impact of service degradation. Therefore, saving this relationship in a DB table will prove valuable.
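The relationship can be kept as simple (asset, customer) pairs and indexed in both directions. The sketch below uses an in-memory structure for brevity, whereas the text suggests a DB table; the asset names, customer names and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (asset, customer) pairs, as they might be stored in a table.
MAPPING = [
    ("asiapac-dc-melbourne", "Acme"),
    ("asiapac-dc-melbourne", "Globex"),
    ("aws-vm-7/config", "Globex"),
    ("aws-vm-7/config", "Initech"),
]


def index_by_asset(pairs):
    """asset -> set of customers that depend on it (assesses impact)."""
    by_asset = defaultdict(set)
    for asset, customer in pairs:
        by_asset[asset].add(customer)
    return by_asset


def index_by_customer(pairs):
    """customer -> set of assets serving them (helps locate the problem)."""
    by_customer = defaultdict(set)
    for asset, customer in pairs:
        by_customer[customer].add(asset)
    return by_customer


def affected_customers(by_asset, failed_assets):
    """All customers touched by any of the failed assets, sorted by name."""
    affected = set()
    for asset in failed_assets:
        affected |= by_asset.get(asset, set())
    return sorted(affected)


by_asset = index_by_asset(MAPPING)
by_customer = index_by_customer(MAPPING)
print(affected_customers(by_asset, ["aws-vm-7/config"]))
```

The asset-to-customer direction answers "who is affected?", while the customer-to-asset direction narrows down where to look when a specific customer reports a problem.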

NOC
As part of the STORM™ Event Management practice, a Network Operations Center is always a great advantage. NOCs can be as elaborate as the NASA Mission Control Center, or a loosely coupled web-based dashboard, that presents monitoring data from various sources. A 24X7 manned NOC with 20 giant flat screens usually takes time and continuous investment to reach, but even a small SaaS startup can adopt the concept and start with little steps.

War Room
When the stuff that happens hits the fan, there is a tendency to lose one’s cool. I have witnessed  executives - CEO, VP Sales, major account managers – start running around and bombarding us with questions in the best case scenario, and shouting instructions and blasphemies in the worst case. In order to be able to calmly collect information, assess the situation and act to remediate, there needs to be a place where all those involved in actually resolving the issue, can sit together, isolated from the pressure.
  • Physical War Room– In one of the companies I worked at, I secured a separate room, packed it with a couple of large screens and communications. We prepared a list of exactly who was allowed to enter during an incident. I had the key and we actually locked the door during an incident. We had reps from Support, Ops, and Engineering and did not leave until the situation was under control.
  • Virtual War Room – Whether you have not built the physical War Room or an event occurred after hours, one can create a Virtual War Room using a Web based NOC and a dedicated communication group such as IM or WhatsApp.

Status Page
Create (on the corporate site – not the production site) a page that depicts the current state of your production systems with a time stamp, a color or image code, and a line of text describing the current state. This is somewhat similar to the Salesforce Trust site.
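A single status entry needs only the three elements mentioned above: a timestamp, a color code and a line of text. The sketch below is a hypothetical illustration of that structure; the state names, color mapping and `status_line` function are assumptions and do not correspond to any particular status page product.

```python
from datetime import datetime, timezone

# Hypothetical mapping from system state to color code.
COLORS = {"operational": "green", "degraded": "yellow", "outage": "red"}


def status_line(state, message, when):
    """Format one status entry: color code, UTC timestamp and description."""
    color = COLORS[state]
    stamp = when.strftime("%Y-%m-%d %H:%M UTC")
    return f"[{color.upper()}] {stamp} - {message}"


entry = status_line(
    "degraded",
    "Reports are delayed",
    datetime(2014, 7, 19, 14, 30, tzinfo=timezone.utc),
)
print(entry)
```

Stamping every entry with the time it was posted lets customers judge how fresh the information is.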


There is a new service out there - StatusPage.io - that provides much of the needed functionality.


As will be discussed in the next article, this will significantly reduce the number of angry customer calls, facilitate communication and help with the Closure phase (third article)

Knowledge Base
Another subject that merits a chapter of its own. However you realize the KB, with home-grown tools, a bunch of articles on your ITSM Help tool, Salesforce Knowledge or a dedicated tool, one must start early on, collecting how-to, tips, known problems, etc. in a central repository, with an easy search and category index. This KB could also contain admin level scripts to allow quick resolution of known issues. With a full implementation of the STORM™ Incident Mgmt, alerts will already have pointers into the relevant articles in the knowledge base.

Prayer
Now that you have done your preparation work, you are in much better shape to reduce the occurrence of a catastrophic event and have the ability to respond and remediate faster. All that is left for you is to pray that it doesn’t happen at night, or at least not on a weekend night, or at least not on a holiday weekend night, or at least not when you are on your way out of the house, with the suitcases packed, on your way to your annual vacation.


The next post discusses The Main Story -  best practices during an incident.

Monday, May 19, 2014

Why ITIL is not a great fit for SaaS?


"Maturity is a high price to pay for growing up". (Tom Stoppard)


Last week I visited a customer who is a global leader in their market and that had been gradually, over the past three years, transitioning their software to the Cloud. They have an impressive operation and reached a maturity level that required them to take the next step - they are hiring an ITIL consultant to help them formalize their operations. I congratulated them on their initiative, but I raised an eyebrow as well.

There are many IT System Management (ITSM) tools out there; some have been around for decades. Most are ITIL based, and since that is what is available, SaaS Operations managers will sometimes use the ITIL-based tools that are available. This, of course, is the best case scenario, since many SaaS Ops are managed ad-hoc, with a mixture of documents, tools, some homemade scripts and open source solutions.
STORM™ was conceived when following regular ITIL methodology just didn’t do the job. It was either too cumbersome, an overkill, or not accurate enough to cater for the SaaS Ops needs.

Initially, SaaS Ops is similar to the IT department in many aspects:
Both teams are running the infrastructure, the applications and the service desk for ‘other’ users, writing scripts, configuring, monitoring, responding and in general, ‘keeping the lights on’.

But there are significant variations that make all the difference.

Quantity vs. Quality
The difference between the ratio "application/users" in the two entities is significant. In a typical enterprise there may be hundreds of applications being used simultaneously by hundreds or thousands of users. In a typical SaaS operation there is a single application, or a very small number of applications that are used by thousands, or tens, or hundreds of thousands of users. On top of that, there typically are a large number of environments and stacks that are being used in the enterprise – numerous kinds of databases, operating systems, middle-ware, network boxes, and functional servers. In the SaaS application environment there is usually a single kind of database, a single OS, a single type of application server, etc.

This distinction demands completely different types of staff, skill sets, tools etc. Whereas the Ops group of a SaaS company will know the application and environment intimately, there is no ability of the IT staff to become as familiar with the applications as they would like. They end up knowing a lot less about a lot more. The complexity of the enterprise environment is therefore a hundredfold less manageable.
And that does not even take into account the provisioning of a multitude of devices that IT is responsible for, including desktops, laptops, phones, printers, scanners, etc., taking even more attention from the application management.

The complexity of an enterprise IT environment might justify a CMDB, but that is an overkill for most SaaS operations that could be managed with a simpler Asset Management solution.

Know thy Customer
As mentioned, a SaaS company will deal with a user base that is one or two (or more) orders of magnitude larger than the IT department. Also, whereas a SaaS company will have hundreds or thousands of customers, IT deals with at most ten to fifteen different entities within the organization, and perhaps a number of partners as well.

Every SaaS company is bound by an SLA with all of its customers (some have more bite than others). The SLAs within the enterprise (if they exist at all) are much less binding than a legal document signed with an external entity, and there seems to be a lowered expectation as well as a more forgiving attitude towards the internal IT organization.  

Therefore, while it is becoming more common for SaaS companies to display a Status Page for the users to view the availability and performance of production systems, it is rather rare in IT organizations.

Centrality of the Ops group.
In a well managed SaaS company, the Operations group is the hub of the activity. They should be communicating with all the organization’s entities (R&D, QA, Support, Professional Services, Finance, Product and Sales) as delivery partners, not as customers. In most business organizations, IT is treated as a necessary evil; as an internal service provider that is always behind schedule and delivering at low performance; it is rare when IT is viewed as a delivery partner.  Whereas the Ops group is expected to participate in weekly customer success meetings, that practice is virtually nonexistent in an enterprise IT organization.

The operations staff at a SaaS company have the ability to affect the product behavior and “serviceability” (and in some companies they actually are part of the MRD/PRD process), while the IT staff at the enterprise can bitch about the problems at best, but have no influence on what features will be included in the next version. 

For example, the application may be spewing out hundreds of Warning or Error messages that are completely cryptic. The SaaS Ops manager has a responsibility to sit with Engineering and map those messages to actionable items.  In the IT world, this is practically impossible for shrink-wrapped applications. 

Service continuity
In ITIL parlance, Service Continuity deals with the ability to provide minimal service levels, and is designed to support Business Continuity.
In the STORM™ methodology, SaaS Service Continuity plays a completely different role. It is designed to provide the continuation of the service to the customers even when the SaaS provider is no longer an existing entity. This could occur under positive circumstances, when the company is acquired by a larger company or under a dire situation, when the company goes under.

Since most SaaS companies are small, many of them with little or zero profitability, and the consolidation of Cloud providers is a daily occurrence, customers need assurances that their services and data will continue to be available – at least for a substantial period.  Therefore, issues such as Financial Viability, Technical Complexity, Operational Complexity, Escrow and Data portability play a significant role and are rarely the concern of corporate IT departments.

In Short
There are many other day-to-day factors that differentiate IT Ops from SaaS Ops; the above discussion pointed out some of them. ITIL provides an important, structured framework for any IT operations, and SaaS Ops managers will benefit from familiarizing themselves with the principles and the vocabulary. But ITIL is too complex, cumbersome and high level to be of practical value to the average SaaS operation.

STORM™ practices flatten and simplify the ITIL structure, and provide a practical and specific guide for an efficient and effective SaaS Operation.

Wednesday, April 02, 2014

To be or not to be… in the Cloud. Examining the Motivation

“The undiscovered country from whose bourn no traveller returns” (William Shakespeare)

Yes, it is 2014 and there still are thousands of ISVs who are selling their software in the old, on-premise model.  Surely, many of them are doing quite well, with a large, happy customer base, which is paying for upgrades, new versions and the yearly 20% maintenance fees.
Nevertheless,  I doubt  there is even a single board of directors that is not considering going to the Cloud.
Everyone wants to be in the Cloud. Many ISVs have started offering their services as a Hosted solution (claiming that they provide SaaS) just to have a presence there, and have added a little cloud to their logos or landing page to enhance that impression.

The Motivation
I have been advising ISVs for over a decade on their move to the Cloud (or SaaS as we old-timers still call it). It is surprising, though, how many of the companies I consulted for couldn't articulate why they were doing it.
I always begin with a Motivation session where the C-level staff is gathered to discuss what their thoughts on the matter are, and I find out, time after time, that there is no consensus on this matter.
Since the transition from a Product company to a Service company is so profound, the paradigm change so deep and the costs not trivial (sometimes they could bring a company to its knees), it is crucial to understand why the ISV is willing to set out on this adventure.
The discussion below outlines the various reasons why an ISV should invest in this process.

Everyone is in the Cloud
Yes, it is fashionable, and I don’t belittle this motivation. Sometimes a company must have a presence in the Cloud for marketing purposes, just because it is expected and the company brands itself as a forward-looking endeavor. But is this a good enough reason? I would look into the next items to try to justify the move, which is by no means a simple one.

The employees are demanding it
As flimsy as it may sound, this is not dissimilar to the previous motivation. The workforce is getting younger (in age and in mentality). Employees who think that the company is not ‘with it’ might start looking elsewhere, to companies that are more on the bleeding edge of the technology and market. Although this should not be a major consideration, it still is a consideration and might tip the balance toward a decision.

The competition is offering it
As we all know, SaaS solutions are very attractive from many perspectives. If there are a number of new, pure-play SaaS companies that are starting to bite into your customer base, this is a serious consideration. But, it should not be the only consideration; one must examine and compare exactly what the competition is offering. Many times, SaaS companies will offer a small subset, or an over-simplified solution of what your product does. If the competition is up-to-par in most aspects of your solution, you may already be fighting a retreating battle. If not, it is possible that you may lose the smaller customers but could still have an advantage in the larger companies. If this is the case, you should consider the transition to SaaS, while maintaining your advantage, rather than compete with the smaller players.

Your customers are demanding it
That is a very compelling argument. No one knows better than your own customers. If customers are requesting a Cloud solution, the company should listen carefully. Still, it is important to engage with the customers and understand exactly what it is they are looking for in a SaaS solution. Perhaps it is the recurring payment option – that could be done with an on-premise solution as well. Perhaps they are looking for certain features. Maybe it would make sense to offer new options in an integrated version, rather than rewrite the whole application from scratch.

You want to expand into a new customer base
Oftentimes, the SaaS option will allow a company to reach out into a large customer base that was below the radar because of price or complexity. While you were selling to the high-end customers, there might be a vast group of smaller businesses that could benefit from your solution, but were always out of reach. If this is the great motivator, one should carefully examine what should be offered. Is it the same set of features? Is it a lightweight version, or a subset? Are your potential, future SaaS customers different from your existing customer base, not only in size but in their needs?

You want to expand into new territories
One of the advantages of SaaS is “Anytime-Anywhere”. While the old model necessitated local presence for sales and professional services, the SaaS model will allow the company to sell around the globe without the need to have ‘boots on the ground’. If your local market is saturated, SaaS is a great avenue to reach out to the other side of the globe without the expensive and complex logistics it used to take to make that happen. Here again, it is important to articulate what you will be selling to the customers in the new geographies; it is not necessarily the same as the local offering.

To be – Of course
The discussion above does not suggest that after examining the motivation, one would conclude NOT to offer a Cloud solution. Of course every ISV will have to be there sooner or later, otherwise it will become obsolete. But going through this thinking exercise will allow you to define your approach. Understanding the motivation will help identify the target market, the go-to-market approach, the timing and the offering itself.

As an example: A large British ISV I worked with wanted to switch to SaaS by taking its flagship product to the Cloud. After going through the process, we came to the conclusion that their product was a cash cow for the foreseeable future, and there was no reason to dive into such a large undertaking at this stage. Instead, the decision was to develop a LITE solution to offer to different customer types in new geographies. This allowed the ISV to experiment with a less demanding task, while learning the ropes of the new architecture and new market approach. I am happy to say that they are launching their SaaS solution these days in two new territories.