Thursday, November 12, 2009

SLA Management for SaaS

“God does not ask about our ability, but our availability.” (Source unknown)

(Yet another chapter in the book - keep the feedback coming!)

As the second ‘S’ of SaaS indicates, the on-demand company is all about providing a service, and therefore one would expect Service Level Agreements to be well defined and understood in this industry, but the facts tell another story. Few SaaS companies pay much attention to their SLAs, few really invest in them, and most customers are quite clueless about them as well.

SLAs are tricky. Every SaaS provider is supposed to adhere to its service level commitments but, on the whole, the SLA is a document that most providers tend to keep out of the limelight and out of the conversation with customers. Judging from my experience, many SaaS companies use a single, non-binding, standard SLA for all customers, keeping their commitments and consequences to a minimum.

An SLA, as its name suggests, is an agreement between the service provider and the consumers, consisting of sections regarding the various commitments to service levels that will be matched or exceeded.
Each section is defined as a Service Level Objective (SLO).

A typical SaaS SLA should have the following SLOs:
  • Service Availability – define the availability of the service represented in percentage (e.g. 99.95% uptime)
  • System Response Time – define response time of various transactions represented in seconds. (e.g. login should not take more than 9 seconds)
  • Customer Service Response Time – a response on customer enquiries should take no more than an allotted time for various services (e.g. enabling a service for a new group should take less than two business days)
  • Customer Service Availability – hours of availability of customer service, represented in an ‘hours by days’ notation. (e.g. 11X5 for regular customers, 24X7 for platinum customers)
  • Service Outage Resolution Time – the time it takes to restore a service after an outage has been reported. Represented in minutes and hours (e.g. 30 minutes for a full system outage)
  • Failover Window For Disaster Recovery - how long it will take to restore the service in a disaster recovery site, if disaster disables the main datacenter.
  • Reclaiming Customer Data – a commitment to transfer all (agreed) data in an agreed format in case the customer leaves the service.
  • Maintenance Notification – how far in advance the provider will notify customers of planned service outages, represented in days. (e.g. a planned downtime that will take more than one hour requires 10 business days’ notice)
  • Proactive Service Outage Notification - the time it takes for the provider to inform the customer that there are service issues, represented in minutes.
  • RFO (Reason for Outage) – a report to customers following a service outage explaining the circumstance, the incident and steps taken to remedy the problem. (For more information see the chapter on Incident Management). Some customers require an RFO automatically; in some SLAs it is written that an RFO will be generated only following a specific customer request. Usually the company commits to three business days following the service disruption.
Note the emphasis on ‘should’ when introducing the SLOs above. The SLA provided by most on-demand companies consists of two or three paragraphs at most, regarding uptime, customer service availability and perhaps another one of the items above.
Many providers have additional services such as daily reports, daily data aggregations, or FTP services. Each one of these services merits an SLO that should be part of the document.
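
To put the Service Availability SLO in perspective, here is a minimal sketch that converts an uptime percentage into the downtime it actually allows. The function name and the 30-day month are my own assumptions for illustration; a real SLA would spell out the measurement period and what counts as downtime.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period at the given availability target.

    Assumes a fixed-length period (30 days by default) and that every minute
    of the period counts toward availability, with no maintenance exclusions.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


# Compare a few common availability targets over a 30-day month.
for target in (99.0, 99.9, 99.95, 99.99):
    print(f"{target}% uptime -> {downtime_budget_minutes(target):.1f} min/month")
```

Under these assumptions, the 99.95% example above leaves roughly 21.6 minutes of downtime per month, which is less than the 30-minute resolution time quoted for a full system outage. The SLOs need to be checked against each other.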

Some SLOs override others. In the example of a service outage, the Availability SLO takes precedence over the Response Time SLO, as you would not expect the performance of the system to be up to par when the system is down. On the other hand, an outage will kick-start other SLOs such as Outage Notification, Resolution Time and Support Response Time.

Customer Expectations
Not all SaaS companies are created equal. They will vary by maturity, by the vertical they are serving, by the company size they cater for and, of course, by the type of application.
Some applications are core and some are peripheral. Some applications are used around the clock, like metering or call centers, and the customers have zero tolerance for downtime. Other applications are rarely used outside of office hours (e.g. payroll, talent management), and if the system is down, the price is a handful of irritated end-users who will need to take a coffee break earlier than they planned.
Larger customers tend to have more rigorous demands while lower paying customers will usually be more tolerant of the system’s performance and support availability.
Therefore, your SLA should reflect the relative position of your service along the following three vectors:
  1. Customer size (reflecting subscription [potential] size)
  2. Core vs. periphery
  3. Downtime tolerance
So if you are providing a mission critical application to a large customer, whose downtime will cost the customer real dollars, your SLA should be taken very seriously.

Service Level Breaches and Penalties
We have seen the promises that come with the SLAs, but many of these agreements fail to state the consequences to the provider of not meeting the terms.
Each SLO should also define the penalties for breaching the service level commitment.
Penalties are typically specified as a prorated credit for the following month’s subscription fees.
From the customers’ point of view, the penalties should not be flat rated but increase as the service deteriorates, so that the second outage will carry a heavier penalty than the first outage. It is rare that customers insist on this point but those that do will need to negotiate these terms separately.
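
As a rough sketch of what such an escalating penalty could look like, the function below prorates the monthly fee by the length of the outage and then multiplies the credit for each repeat outage in the same billing period. The 1.5 escalation factor and the parameter names are hypothetical, chosen only to illustrate the idea; actual terms are whatever the two parties negotiate.

```python
def outage_credit(monthly_fee: float, outage_minutes: float,
                  outage_number: int, period_days: int = 30,
                  escalation: float = 1.5) -> float:
    """Credit for one outage: a prorated share of the fee, scaled up for repeats.

    outage_number is 1 for the first outage in the billing period, 2 for the
    second, and so on. escalation is a hypothetical multiplier applied once
    for every prior outage in the same period.
    """
    if outage_number < 1:
        raise ValueError("outage_number starts at 1")
    period_minutes = period_days * 24 * 60
    prorated = monthly_fee * outage_minutes / period_minutes
    return prorated * escalation ** (outage_number - 1)


# Two one-hour outages in the same month on a 43,200-per-month subscription:
first = outage_credit(43200.0, 60, outage_number=1)
second = outage_credit(43200.0, 60, outage_number=2)
print(first, second)
```

With a flat rate, both outages would earn the same credit; with the multiplier, the second one costs the provider one and a half times as much.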

There is typically a maximum. It is unusual for accumulated penalties to exceed the monthly subscription costs. There is a catch here. As an extreme example, if your service was down for the whole month, the customer would merely be exempt from paying that month’s service fee – but this is ridiculous, of course. The damage to your customers is typically orders of magnitude higher than the subscription costs.
Many SaaS customers commit up front to a year or more of service, for a reduced subscription price. A good SLA will include a section that allows the customer to exit the extended commitment if the provider failed to adhere to the service levels for, say, three consecutive months.
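
The cap and the exit clause are both easy to express as simple checks. The sketch below is an illustration only: the function names are mine, and the three-month threshold is the ‘say, three consecutive months’ figure from the text rather than any industry standard.

```python
from typing import Sequence


def capped_credit(monthly_fee: float, accumulated_penalties: float) -> float:
    """Total credit for the month, never exceeding the monthly subscription fee."""
    return min(accumulated_penalties, monthly_fee)


def may_exit_commitment(monthly_slo_met: Sequence[bool], threshold: int = 3) -> bool:
    """True if the service levels were missed in `threshold` consecutive months.

    monthly_slo_met holds one flag per month, in chronological order:
    True if the provider met its service levels that month, False otherwise.
    """
    streak = 0
    for met in monthly_slo_met:
        streak = 0 if met else streak + 1
        if streak >= threshold:
            return True
    return False


print(capped_credit(1000.0, 1500.0))
print(may_exit_commitment([True, False, False, False]))
print(may_exit_commitment([False, False, True, False]))
```

Note that a single good month resets the count, so a provider that misses its service levels two months out of every three would never trigger the clause. Customers may want to negotiate a rolling window instead.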

The next chapter will outline what all of this means to the Service Operations group and why you should care about issues that initially seem to be in the domain of Sales, Legal and Finance.





Sunday, November 08, 2009

Inter-department Communications

(Yet another chapter in my upcoming book on SaaS Service Operations - Your feedback has been great so far; thanx and keep it coming.)

"The Problem with Communication is the illusion that it has been accomplished" - George Bernard Shaw

While it is true of any institution, communication between the various silos of the organization is particularly vital for the successful operation of a SaaS company.

The reasons are that things happen much faster in an on-demand company, customers are in constant contact with the company and expectations are high for a fast turnaround.

At a product company, when bad things happen to the application, nine times out of ten, the software company doesn’t even know about it, and the customer’s IT deals with it. The end user is rarely in touch with the product provider. The product salespeople tend to ‘shoot and forget’ once the commission has been paid. If things go bad, the customer can mostly blame itself for not deploying or maintaining the software correctly or for not doing its due diligence.

Multiple channel interaction
At a service company, on the other hand, a typical customer will interact through multiple channels continuously. The CIO may have a direct line to the SaaS CEO. The IT department may be in touch with professional services, and managers of the service on the customer end could be speaking with the Program Management group. Members of the Operations group will inevitably be in touch with supervisors or IT managers on the customer side, and Sales will have developed personal relationships with managers on the customer’s side, as they nurture the relationship to expand the sales in-house. And, of course, the end users might be in daily contact with Customer Support.

Customers, naturally, will be irritated when things aren’t going smoothly in any one of multiple scenarios. It may concern a delayed service initialization, an undelivered bug fix, an incomplete customization, an unsatisfactory report, or (ouch) a service outage. Part of the allure of on-demand service is a much faster turnaround time in every aspect. The customers believe it and expect it.
Imagine the customer’s frustration when they call any one of their contacts within the company to inquire about unresolved issues, and that person has no idea what they are talking about.

Disconnect between the groups
Typically, a SaaS company will be using a CRM that serves Sales and Customer Service. In many organizations the Sales view is radically different from the Support view and information available to one is not available to the other.
It is rare that other members have access to the CRM. Operations, Engineering, Professional Services and Program Management keep their own records in different systems for various reasons and are not trained in using a CRM.
Not surprisingly, the different silos do not have much knowledge of what each department is doing, and I have seen continuous tension between various groups and quite a lot of finger pointing when bad things happen.
It is also typical to see a startup company where everybody occupies a single open-space office, yet very little communication takes place between the groups and political affiliations begin to form.
Resources are always limited and the demands are constantly growing; how does one prioritize the tasks and attention to a particular customer?

Service Outage and Communication
To illustrate through an acute, but none too rare, example: Many a time I have experienced a service outage that, for obvious reasons, took everybody’s focus and energy. A couple of offices down the hall sat the Sales team and across the continent were various regional Sales reps. They were not informed of the outage since they play no role in detecting, classifying or resolving the issue, and all those that knew about it were busy trying to fix it, or taking customer calls. Often the customers, especially the senior members who have established a close relationship with the sales reps, would call Sales or Program Management immediately asking for updates. The uninitiated sales rep would answer that they were not aware of any outage and that perhaps the problem was local to the customer. (This would usually trigger a nasty remark about the incompetence of the provider.) The experienced sales rep would mutter something in embarrassment and then storm over to the Ops group demanding an explanation why, once again, Sales was not notified of the outage. Not only does the company look bad, but it also raises unnecessary tension between the groups.
(This issue will be addressed in the chapter on Incident Management)

Recurring Mandated Meetings
Inter-department communication is the answer. If the managers of the different departments talk to each other on a regular and formal basis, issues can be addressed before they get out of control, plans can be communicated and a deeper understanding of the challenges of each department can be gained.
Since Operations is at the center of it all at the end of the day, and since Operations will take the blame for whatever incident occurs, the VP of Operations should initiate these meetings. The initiative and the meetings will also serve as an important PR tool for the service operations group.
Following are the inter-department sessions that should be standard in a SaaS organization to improve communication and visibility and to help prioritize tasks and address issues before they boil over.

Name: Daily Operations Sync
Frequency: Daily (15-20 min)
Suggested Time: Late afternoon
Participants: Operations, Support, Program Mgmt
Agenda: Burning issues, Service outages, Planned maintenance, Delayed deliveries, Staffing

Name: Customer Success
Frequency: Weekly
Suggested Time: Monday
Participants: Sales, Program Mgmt, Support, Operations, Professional Services, R&D
Agenda: Customer Success Score sheet, Updates, Delays, Priorities. Address Red and Orange flags

Name: Operations-Engineering Sync
Frequency: Bi-Weekly
Suggested Time: Anytime
Participants: Operations, Engineering, QA
Agenda: Requirements, Releases, Known issues, Bugs, Dev/staging environment


Name: Company Fridays
Frequency: Bi-Weekly
Suggested Time: Friday Afternoon
Participants: All employees + food & beer
Agenda: Announcements, updates and department presentations


Name: SPOF Analysis
Frequency: Quarterly
Suggested Time: Anytime
Participants: Operations, Engineering, QA, Product, Support
Agenda: Single Point of Failure Analysis

(In the book, these meetings will be discussed in more detail.)

I cannot emphasize enough the importance of these meetings. Not only do they facilitate the smooth operations of the company, but they also foster better relations between the company’s groups.



Monday, November 02, 2009

Introduction to the book on SaaS Service Operations

"You can't handle the truth!" (Col. Jessep in 'A Few Good Men')

(Note: This article is part of the STORM™ methodology)

As I have mentioned in a previous post, I am working my way through writing a book on SaaS Service Operations. Using the web as a collaborative tool, I have decided to share my work, bit by bit (three chapters, so far) to test it within the community and get live feedback from those who matter, potentially those that would read and recommend it.
Following is the (draft) introduction chapter. I would dearly appreciate your feedback on content, style, typos, grammar and whether you might find such a book an interesting read.
My initial thoughts about the title are along the lines of 'Survival Guide' or 'A day in a SaaS Emergency Room'.
I am not fishing for compliments - it would defeat the purpose, and yes, I can handle the truth.
Many thanx,
Dani

Introduction – or Why am I Writing This Book.

Well, someone has to write it. Numerous words have been exhausted over the years on matters SaaS, but I have seen very little being written about SaaS Service Operations, and there are no books on this subject that I am aware of.

As SaaS is becoming mainstream, it has also become the most visible and mature service in the Cloud stack. Consumer expectations have risen: customers demand fast response times and a service that delivers on the availability slogan of ‘anytime-anywhere’. These expectations do not refer only to the application; they apply to customer service and professional services as well. SaaS companies often excel when it comes to the first ‘S’ of SaaS, i.e. Software, but fare quite poorly with regard to the second ‘S’ – Service.

What started as an experiment of the few and the brave, will soon become the major force in the software market, and what will differentiate one company from the rest is no longer the on-demand allure or the feature set, but the level of service it provides.

I am a war veteran in this respect and have many scars to parade. There are probably very few mistakes that I have not made. Being a descendant of Homo Sapiens Sapiens, I like to think of myself as one who has learned from his mistakes and taken steps to remedy them.

‘Operational Fatigue’ is a term I coined after the umpteenth time I was awoken in the wee hours of the morning to handle an outage that occurred yet again, after having seemingly fixed the problem two weeks prior. I could have just as well created this phrase after the two hour scheduled downtime to upgrade the service. The upgrade turned into a nine hour nightmare that was finally resolved (a couple of minutes before our major customers started their workday) by some engineering heroics. As always, these were followed by heart wrenching phone calls to the CEOs of our customers to explain what went wrong (again) and why it would not repeat.
No wonder I grind my teeth at night.

Throughout my years of practice in this space I have discovered a number of traits across the industry:
  • Most SaaS companies are structured and behave in a similar fashion
  • Most SaaS companies lack the discipline, the tools and the practices to provide an efficient and effective service operation
  • Most SaaS companies, therefore, end up paying the price of not meeting their SLAs, which leads to customer dissatisfaction, customer churn and ‘Operational Fatigue’
The intended audience for this book is whoever is responsible for the quality of customer service. That includes the CEO, the CTO, VP Engineering, VP-Director-Manager of Operations and VP-Director-Manager of customer service. All of these functions must work in unison to ensure a smooth operation both outwardly and internally.

This book is divided into four sections:
  1. The first section introduces concepts about SaaS, the evolution of the market and why the model is here to stay. Enough has been written about the subject so I will stick to some of my observations without going into a long dissertation.
  2. The second section contains insights on service operations in a SaaS company. It includes various posts published on my blog (‘Dani’s Perspective on SaaS’) over the past year. It discusses typical SaaS operations, discipline, transparency, outsourcing in the Cloud, metrics, inter-department communications, etc.
  3. The third section covers Operational Support Systems that might or might not be supported by the product. They include: Billing, On-Boarding, De-provisioning, Integration, Retention Policy, Communication and more.
  4. The final section is instructional and lays out the principles of my adaptation of ITIL for SaaS Service Operations™. It explains what ITIL is and why I chose ITIL as a basis for defining the practices of running an efficient and effective service operation. It covers six practices that I have developed and refined throughout the years at various companies with whom I worked either as an employee or as a consultant.
By following the practices, following the workflows and deploying the tools outlined in this book, SaaS companies can instill the discipline needed to reap the benefits in a surprisingly short time.

It is not complicated, it is not expensive, nor is there sorcery involved - it only requires awareness and leadership.










Sunday, October 25, 2009

Cloud IaaS: Sorry, not very Interesting

“There is an incessant influx of novelty into the world, and yet we tolerate incredible dullness” – Henry David Thoreau

Don’t get me wrong. Infrastructure-as-a-Service is a wonderful, useful and logical development. I do not need to sing the praise of it here. I believe in it and I am sure that it will provide a growing, significant percentage of computing needs around the globe.

But, it is just not very interesting, although it is the rage in all IT circles and hype generators. The technologies that enable it are basically: high speed bandwidth, virtualization and sophisticated management software. Now, I do not belittle these technologies. They are the product of years of development of ingenious engineers and some fast acting companies that had the ability to put one and one together and come up with the offering. And kudos to Amazon Web Services on leadership, ideas and execution.

Still, I believe that it is the domain of the few, and although every datacenter and ISP out there is starting to offer a ‘cloud’ solution, the end result will be a few very large companies that are big enough to invest in a model that makes economic sense and are sophisticated enough to pull it through.
So what does that say for technology companies that are thinking of providing IaaS-enabling software or hardware? Only a handful of those companies will survive, since they will be competing in such a small market.

So why is it such a hype, and why is it burning like a bushfire in the Kalahari savannah, while it took almost a decade for SaaS to become mainstream? Because the idea of IaaS is very simple and straightforward. IT gets it. Any old CIO can understand the concept, because hardware is a commodity and has been for a long time. Because many enterprises have been hosting in co-los for decades, acting as if their hardware is in their datacenter.
Once you get over the fear of losing control and get through the blah-blah of security, the idea of IaaS is very simple, and therefore, not interesting.

SaaS, on the other hand, is all about Applications. And applications are not perceived as a commodity (although many of the non-core applications are beginning to assume that role – and that’s a good thing). Therefore, once the hype has run its course and the dust Clouds have settled, IaaS will become mainstream. Every enterprise will choose how much of its infrastructure will lie outside of its firewalls and to what extent it will use the flexibility of the solution. SaaS will still be the interesting item, since every ISV will offer an on-demand solution, and the competition will continue to generate innovation and breakthroughs.

Tuesday, September 08, 2009

SaaS 70 – Nextgen Certification for On-demand companies

“A certified lunatic is certified nonetheless” (Dani, 2009).

I was asked by one of my readers (note the plural) to include a chapter on SAS 70 in my upcoming book on SaaS Service Operations. I must admit that I was not sure if he was advocating SAS 70 or he wanted me to discuss certifications for SaaS, since I am not a fan of the former but a promoter of the latter.

Confusion is defining the SaaS market when it comes to certification.
Enterprise IT personnel certainly do not know what questions to ask, so they generate these long RFPs that are very similar to the on-premise RFPs, and they slap on top of them security questions that make their CSO feel important, with a multitude of acronyms that may or may not be relevant. Most on-demand ISVs wouldn’t know how to define a ‘certified’ SaaS either.

The good news is that the customer base is demanding assurances. While a few years back the concerns were mostly about security and mostly in comparison to on-premise solutions, the market is maturing and now there are a myriad of on-demand solutions for every vertical or horizontal aspect of applications.
So how does an IT professional distinguish between the good and better solutions? How can she judge whether the SaaS provider will stand up to its SLAs, whether the data is secured and operational procedures exist and are followed?

SAS 70
The truth is, there are no authoritative answers to these questions nowadays. With a glaring lack of SaaS certification, the only default out there is SAS 70.

Statement on Auditing Standards No.70 (SAS 70) is an internationally recognized auditing standard developed by the American Institute of Certified Public Accountants (AICPA) in 1992. It is used to report on the "processing of transactions by service organizations", which can be done by completing either a Type I or a Type II audit. A SAS 70 Type I is known as "reporting on controls placed in operation", while a SAS 70 Type II is known as "reporting on controls placed in operation" and "tests of operating effectiveness" (http://www.sas70.us.com/what-is/definition-of-sas70.php)

(Disclosure: I have not undergone a SAS 70 audit in the companies I worked for. My knowledge is based on reading and sharing other companies’ experiences)

What’s good about SAS 70
The fact that SaaS companies want to take the extra (expensive) step to distinguish themselves from the rest of the pack, shows a level of maturity and seriousness about their business. SAS 70 requires that you have a set of practices and that you are following them.
This in itself is a big step forward for most SaaS companies – they actually have a set of defined practices.
Sorry, only two short paragraphs on the benefits.

The shortcomings of SAS 70
This audit was not defined for SaaS. It was developed in 1992, years before even ASPs were in vogue. It is a general audit for service organizations and covers a wide range of businesses, from credit processing, to medical insurance and data processing.
There are no specifics for an on-demand software company. Heck, there are no specifics for a software company either.

Please note the language “A SAS 70 audit helps companies meet regulatory compliance…”, and “a SAS 70 audit provides an additional layer of accountability…”
Nowhere does it state that it certifies the company at any level, other than the fact that the audit was done.
It reminds me of cosmetic advertising: “makes your skin feel younger” – how very scientific.
There are no recommendations, no standards to meet, no right or wrong. It merely states that you have practices (good or bad) in place, and that you are following them.

As mentioned, the mere fact that there are defined practices exhibits a level of maturity, so I do not belittle the exercise, but there are no provisions in SAS 70 to avoid documenting your bad practices and following them through.

SaaS 70
There is a dire need for a certification program for SaaS companies as the domain matures and SaaS becomes a major component of IT.
IT wants to know that you are a competent service operator, that you are running a tight shop and that the service will be around next Thanksgiving.
I am suggesting a certification program, currently named SaaS 70 (to demonstrate my famous wit), which includes three elements:
  • Service Operational Maturity – Has the company defined and implemented practices and procedures for running a robust operation, to ensure that SLAs are met? This would include Change Mgmt, Release Mgmt, Incident Mgmt, Event Mgmt, Availability Mgmt, on-boarding, de-provisioning, integration, data retention, etc.
  • Security – covering all aspects of password policies, data separation, vulnerability testing, virus protection, privacy, etc.
  • Service Continuity – examining the financial viability of the company and what plans are in place to continue providing the service even if the ISV goes belly-up.
Within each component the company will score a level of maturity beyond a pass/fail that comprises coverage, depth, documentation, and tools. And, of course, the report will include recommendations for improvement and scaling up the maturity ladder.

Only with such a specific, SaaS-centric, verifiable and accountable program, will the consumer of these on-demand services know that a company can or cannot meet their expectations.



Thursday, August 20, 2009

Discipline (or lack thereof) and Operational Fatigue

“Half of life is luck; the other half is discipline - and that’s the important half, for without discipline you wouldn’t know what to do with luck” - Carl Zuckmayer

Creative and nonconformists
SaaS companies are mostly composed of a group of highly capable software engineers. These techies are, by nature, creative, imaginative, out-of-the-box engineers, inventing new ideas or new ways of achieving better results. They tend to adopt the latest and greatest technologies and are always looking forward to the next best thing.
Naturally, these engineers are nonconformists and not inclined to follow rules or to stick to routine.
Almost always, they do not come from an enterprise IT environment, where rules and regulations are stricter and operational practices are followed almost religiously.
With the nascent state of SaaS, if the engineers have prior experience, it would mostly come from on-premise, product companies that emphasize features, versatility and usability.
They rarely had to deal with customers, and bugs that were found were handled according to their priority to be fixed in the next release (which could be months away).
Therefore, typical SaaS engineers lack the necessary discipline to run a 24X7 service, and are usually hostile to restrictions imposed on them.

What, me worry?
The lack of discipline manifests itself mainly in Change Management and consequently in Asset Management and, then, consequently in Incident Management.
This refers to which changes are allowed and when (‘hey, just to let you guys know, I installed the new patch during lunch break’), how they are approved and communicated (‘yeah, no prob, I tested the code on my laptop – it is foolproof, just a small change in the parsing engine’), and how they are recorded and rolled back if necessary (‘don’t worry, I keep all changes in a dedicated notepad on my machine’).
There usually are no rules about touching production. Typically, every engineer has full sudo access to all servers in the data center, using a single super-user login, so that activities cannot be traced to any specific person.
One-offs can be installed on a particular server and not be documented. Months later when a new version is installed or a server replaced, things fail to work and it may take hours for someone to remember that a special component is not functioning any more.
Lack of a fully functional staging environment may cause an engineer to ‘temporarily test’ some feature on a production machine that either causes service disruption or is forgotten until the fan turns brown.


Operational Fatigue
Operational Fatigue is a term I coined after years in the trenches, of waking up at 3:00 AM to deal with the same problem that hit us three weeks ago; of the stress of dealing with an incident at peak time when Management is hysterical, when Sales are complaining, when Support is overwhelmed with frustrated customers; of making the calls to the high profile customers, explaining, apologizing, promising; of having to explain to the Board why we lost so many customers this quarter.
It gets to you. You discover new gray hair and develop a fear of answering the phone.

The point is – it is avoidable. Instilling the practices and discipline can make life so much easier and allow the ops team to plan and improve instead of fighting fires all the time.

Educating the young
Like toddlers, engineers crave guidance and discipline, but as most parents would testify, they will make every attempt to break the rules and push the envelope to test the boundaries of their environment. Experienced parents will tell you that young children feel much more secure when they know the rules and when the rules are being enforced. It has been my experience that whenever I introduced a new set of regulations, such as in Change Management, there was always an initial push-back, mumbling about bureaucracy and attempts to circumvent the rules. But I have always seen a quick adoption of the new regulations, followed by a realization that life would be so much better if we only stuck to the rules – these guys are smart, you know. Many a disaster was avoided by playing the game by the new rules, and I found that the engineers quickly embraced the discipline and started devising ways to improve on and automate the processes.

Just do it!
I recently participated in a round table hosted by HP on the subject of Change Management. Most of the participants were from large IT shops and were talking about adapting to new Change Management processes in terms of six to twelve months. I was astonished. I concede that my background has been with much smaller groups, and I had the full backing of the executive management, but twelve months? Jeez!

The process in my experience was:
· Prepare the documents, templates and work-flows.
· Make a compelling PowerPoint presentation.
· Present to the Engineering, Ops and Support groups.
· Emphasize the consequences of not following the practice (genitalia hanging at high altitude)
And Voila - It works! A few weeks later you have a spiritual following of admirers, because the fruits of the labor are so obvious in a very short time.


Thursday, August 13, 2009

Transparency in SaaS Service Operations

“Life is filigree work. What is written clearly is not worth much, it's the transparency that counts.” - Louis-Ferdinand Celine

Companies like to boast about their transparency, but in practice, information dissemination is highly controlled. At an on-demand company, hiding the backstage operations seems like a smart thing to do. As long as you are servicing the customer, and as long as the customers do not complain, why should you wash your dirty laundry in public?
So what about SLAs? The guiding principle seems to be ‘Don’t worry about them if your customers do not demand them’. And even when they do, there are SLAs and then there are SLAs. There are so many ways to interpret these elusive numbers (assuming you even know the real ones) that most companies will portray better results than those that reflect reality.

Varying degrees
There are different modes of transparency communication: from the nonexistent to the reactive, the proactive and full disclosure.

The reactive type is the common case where there are service disruptions and customers call in to complain. In this case you will determine how much information you would like to divulge. This could be done with a customer call, an RFO (Reason for Outage) that is sent to particular customers or a message on the corporate site.

A proactive approach would have a Service Status Page depicting the current service availability of the various production systems.

A full disclosure mode will provide customers with a historical view of production systems availability and response time, such as Salesforce’s Trust or SAManage’s Status Page.

Advantages of Transparency
My experience has been that the more transparent you are with your customers, the better relationship you will foster with them and the more forgiving they will be when things turn sour. And things do turn sour; it is unavoidable.
Your customers are not dumb (in general, that is – I can relate many amusing stories of individuals that should not have been awarded fourth grade graduation, but that is another story). The people on the other end generally understand that you are dealing with a complex environment with many factors that are not always under your control. They will be willing to accept that scheisse happens, but they also must know that you are ready to accept responsibility and learn from these events. There should be a closure process for each event including Incident Recording, Post Mortem and RFO communication (more on that in Incident Management).
Of course, nothing beats a good, reliable, available and responsive service. If you are not able to provide that, you will end up losing your customers regardless of how much camouflage and finger pointing are used to cover the smell.


How transparent should you be?
I am not advocating that you have to run out and tell the guys every time you messed up or that you should bombard the customers with a technical exposition as part of the RFO document.
Striking the balance is an art that comes with practice and common sense. If an incident occurred that did not disrupt services, you must undergo the full Incident life-cycle practice to ensure that lessons are learned and the incident will not repeat. But you do not necessarily have to go and boast about it.
As for the RFO, in my days I have been asked to put my signature on many customer facing documents that had a bland, general, canned message that meant nothing to the reader. (“service was lost due to a system failure”). I realized that customers will not trust the messaging and will choose to either ignore it while snorting in disgust or have a techie call in and start drilling the poor customer service rep for technical details which would be hard to provide.
I have also seen RFOs that contained multiple pages and read like a PhD dissertation in electronic engineering. I do not know who approved these RFOs or whether the purpose was to wear down the suffering reader so that further RFOs would never be requested.

Company Culture
And finally, keep in mind that if the company’s culture tolerates half-truths and spins when facing the customer, you run the risk of it percolating through the company’s internal activities and reports. Don’t you expect your employees to be truthful, accountable and not shy away from reporting mistakes, even if it makes them look not too great? Your customers have to expect your company to do the same. And, if the results of truthful reporting will cost you a customer then something was probably wrong with the relationship to begin with, and the customer may have been looking for an excuse to break away.

Monday, July 20, 2009

Can SaaS Companies Go Back to Basics?

"Change is a bouncing ball on the circumference of a circle"


I was recently at an AWS conference and met with a substantial number of SaaS companies that are running their full production on the Amazon EC2 and S3, albeit all were relatively early stage, smaller companies. I spoke with half a dozen VP Ops or their equivalents, and all stated that they were satisfied with the service and the uptime, and did not experience major outages.
I have also met recently with a number of successful SaaS companies that were under 20 people total – and that includes R&D, Sales & Marketing and running the 24X7 operations.

So it got me thinking that if SaaS companies can do well while dealing with hardly any aspect of the infrastructure, we may be approaching the completion of a full circle.

History 101

Since the dawn of time (January 1, 1970 – Unix time, that is) there were software companies. If they were successful, they excelled at writing software and testing it, and with time developed good professional services capabilities. And of course they needed to know how to market themselves and sell, and partner – but that was true of any company out there, whether they were manufacturing rubber gaskets or CAD software.

Fast forward to the post-boom, post-ASP era, and a new breed of software vendors appeared on the scene. As they were pioneering the new on-demand model, they all owned their infrastructure, and probably some of them were even hosting the hardware in their own back office.
These new SaaS vendors had to have expertise in their domain and their software, of course, but also in operations, 24X7 customer support, servers, power, storage, DBs, networking, security, performance and load testing, on top of mastering the model of selling services rather than software.

As the market rapidly expanded, SaaS enablement companies grew around these new vendors, and started offering hosting at first (real estate and power), then networking capabilities, and then basic network monitoring services.

Two trends developed. On one hand, Managed Services companies (e.g. IP-Soft) offered to take over all the routine operations of managing the infrastructure up to the application level. That included monitoring and maintaining the network gear, servers, storage, DBs, Web servers and their respective operating systems.
On the other hand, one saw the rise of Managed Hosting companies (e.g. Rackspace) that rented out the hardware itself on top of the real estate and offered ever growing services around the hardware.

And, there are companies (such as OpSource) that offer everything from hosting, to servers, storage, application management, 1st and 2nd tier helpdesk, as well as reading you bedtime stories.

Now we are seeing companies that offer QA services (especially performance, but not limited to), security services, integration services (AKA Professional Services), 24X7 answering services and tucking you into bed.

It is too early to tell how successful this ecosystem will turn out to be, and what percent of SaaS companies will subscribe to this model, but the emerging trend is clear – SaaS companies are offered the opportunity to go back and do what they do best – write software.

Nobody Does it Better

The question is where do you draw the line? Keep in mind that the success of a SaaS company relies mostly on the second S (Service) and less on the first S (Software), or put another way – depending on the execution more than the quality of the software.

I see a number of areas that must be directly managed by the SaaS staff:
  • Product development
  • Customer relationships
  • Application management
As for functional testing, performance testing, security testing – they may all be outsourced, but never relinquish control of these processes.
Ditto for professional/integration services – you may hire an outsourcer/partner to perform these functions, but ultimately, the customer success lies at your door.

My friend and SaaS networking expert, Gil, says that you should never outsource the infra or the management of it because nobody will take care of your baby as well as you will.

I recently had a conversation with a managed services account manager that confided in me that managing the infra of SaaS companies is far more difficult than that of your average enterprise IT, since the SaaS companies are far more sophisticated, have deeper technological understandings and higher availability and response requirements.

Another question is at what point do you want to take back ownership? Does a certain size and complexity of the service and business justify bringing in your own teams of experts to handle those tasks listed above? The cost of doing it yourself will probably start going down as you grow, but the company’s values might dictate sticking to the core competencies – Hey, isn’t that what the SaaS offering is all about?



Wednesday, May 20, 2009

Questions that SaaS executives must be able to answer - KPIs that matter.

“There is much pleasure to be gained from useless knowledge” (Bertrand Russell)

It has been my experience that SaaS executives have trouble answering the most basic questions about their service operations, and mind you, this is what the business is all about.


Again and again, I keep coming back to the conclusion that the state of SaaS Service Operations is so dire because on-demand companies are built on the first ‘S’ (Software) and not the second ‘S’ (Service).

SaaS entrepreneurs are, in general, bright, creative, out-of-the-box thinkers. They are software developers and have no clue about IT practices and disciplines.


The age old premise “if you can't measure it you can't manage it” somehow escapes SaaS companies across the globe, until it becomes a huge problem.

Have you gone through the numbing process of presenting a specific customer with their real SLA adherence? I have. On average, it would take me a few hours of going through multiple sources of data to come up with (sometimes) accurate data.


Following are a number of questions (an incomplete list) that every SaaS executive should be able to answer in her sleep, or at least with a click of a button.

1. Availability management
  • What are your real uptime numbers?
  • What do the trailing twelve months (TTM) look like?
  • Are we better than we were six months ago?
  • How many outages have you had in the last M months?
  • What is the breakdown, based on severity?
  • What is the breakdown, based on downtime causes?
  • How many service disruption incidents were repeated?
  • How quickly do you recover from outages?
  • How many days have gone by without a critical, major outage?
2. SLA Management
  • How does your availability match up to your customer commitments?
  • Which customers were affected most (even if they do not complain)?
3. Change Management
  • How often are changes made to the production environment?
  • What is the breakdown of changes by category?
  • What percent of changes did you have to roll back?
4. Asset management
  • What is the status of your inventory? What box is located where?
  • What function or customer would be impacted by a loss of a certain box?
  • When do your support/software contracts expire and what might it affect?
5. Cost Management
  • What are the actual costs of the operations?
  • How is the budget allocated among the various components?
  • How much does each new (N) customer(s) cost?
  • Are we getting the full value from our supply chain?
6. Churn Management
  • How many customers have you lost in the past 6, 12, 24 months?
  • Is your customer retention improving over time?
  • What percent is your customer churn out of your customer base?
  • What is the average retention time of your customers?
  • What is the breakdown, based on reasons for churn?
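
Several of the questions above reduce to simple arithmetic once the raw records exist somewhere. The following is only a minimal sketch in Python of how a few of them could be computed. The `Outage` record, its field names and the severity labels are hypothetical, and the availability calculation assumes that outages do not overlap one another.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """Hypothetical outage record; field names are illustrative only."""
    start: datetime
    end: datetime
    severity: str  # e.g. "critical", "major", "minor" (assumed labels)


def availability_pct(outages, period_start, period_end):
    """Percent of the reporting period during which the service was up.

    Assumes outages do not overlap each other.
    """
    total = (period_end - period_start).total_seconds()
    down = 0.0
    for o in outages:
        # Clip each outage to the reporting period before counting it.
        start = max(o.start, period_start)
        end = min(o.end, period_end)
        if end > start:
            down += (end - start).total_seconds()
    return 100.0 * (total - down) / total


def mean_time_to_recover(outages):
    """Average outage duration; timedelta(0) when there were no outages."""
    if not outages:
        return timedelta(0)
    total = sum((o.end - o.start for o in outages), timedelta(0))
    return total / len(outages)


def percent(part, whole):
    """Ratio helper, usable for rollback rate (rolled back / all changes)
    or churn rate (customers lost / customer base)."""
    return 100.0 * part / whole if whole else 0.0
```

For example, a single 36 minute outage in a 30 day month works out to roughly 99.92% availability, and 3 rolled back changes out of 40 is a 7.5% rollback rate. The hard part in practice is not the arithmetic but having one repository that holds the outage, change and customer records in the first place.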

I am well aware of the fact that there are no integrated solutions for the SMB supporting a database for these crucial KPIs, but every company should have some form of repository capturing at least some of the data and an easy way of extracting it.

The important issue here is that SaaS companies should be aware of these KPIs and start asking these questions, even if they do not yet have all the answers.

Saturday, March 21, 2009

Maturity Model for SaaS Service Operations

Given the state in which most SaaS companies are, and the fact that within a very short time span, ISVs will be equated with SaaS, I believe it is time to offer a Maturity Model for SaaS Service Operations.

A methodological approach would be to create a table of practices and mark a numeric value, in order to quantify the maturity state of a company. But that is too simplistic and doesn’t take into account that various practices will exist in some form at each maturity level.

The following is a proposal (a first draft), and I invite all to comment and help zero in on the right model.
Note: Release management is not covered in this model. It is arguably a role shared by the Product, Engineering and Ops groups.

Level 1:
This is probably where most of the SaaS startups are right now. The Ops team is either nonexistent or consists of a sys admin, help from engineering and a cat. None of the ITSM processes are defined, and there is no orderly asset management in place. Customer support may be handled by a small dedicated team, or even by Engineering.
In the latter case, 24X7 support consists of the cell phones of the CEO and VP Engineering.
Event management is at a very basic level, reporting whether a server is up or down.
Perhaps, a daily backup of the database is in place.

Level 2:
A small operations team is in place. Probably run by a manager/Director level person. A network engineer and a sys-admin make up the team with some help of a part time DBA.
Asset Management consists of a number of excel sheets, not necessarily up-to-date or all inclusive.
A half baked Change Management process is defined, but not really adhered to. Engineers still have access to the production system. A customer support team is in place. Not yet a 24x7 operation. Incident Management consists of people running around like chickens with their heads cut off, but there is a recording of the incident in the CRM, or perhaps a ticketing system. Event Management is implemented through a tool like Nagios or Cacti (freebies, of course) and email alerts are sent on threshold breach. There may be thousands of email alerts sent a day, so that real alerts drown in the flood.
A full daily backup of the database is in place and an hourly differential backup is taken.
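
One common way to keep real alerts from drowning in such a flood is to suppress repeats of the same alert for a while. Below is a minimal sketch in Python, not tied to Nagios, Cacti or any other tool; the `AlertSuppressor` class, its method name and the 30 minute default window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


class AlertSuppressor:
    """Forward an alert only if the same (host, check) pair has not
    been forwarded within the suppression window (hypothetical helper)."""

    def __init__(self, window=timedelta(minutes=30)):
        self.window = window
        self._last_sent = {}  # (host, check) -> time the last alert was forwarded

    def should_send(self, host, check, now):
        key = (host, check)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False  # repeat inside the window: drop it
        # First occurrence, or the window has expired: forward and remember.
        self._last_sent[key] = now
        return True
```

A mail handler would call `should_send("db1", "disk", datetime.now())` before emailing anyone. Suppressed repeats do not extend the window, so a problem that persists is re-announced once per window rather than silenced forever.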

Level 3:
A VP Service Operations runs the team of Customer Support and Operations. Change Management and Incident Management are defined and implemented. There is an Asset Management DB which is linked to Change and Incident. Change Window is defined and a Change Calendar is used. A semi automated notification process is in place (internal and external notifications). A staging environment is in place, although it does not fully reflect the production environment. Event Management is better controlled, noises are filtered out from the alerts, and some application level instrumentation is incorporated.
Getting individual customers’ SLA is still a manual process, though the information should be available.
A seed of a disaster recovery site is in place. It may take many hours to get it up and running (including transferring the data), but an alternative site with the basic functionality is available.
SAS 70 Type I should be in place at this stage, or you should at least have a good story about how your vendors are all SAS 70 Type II. (Mind you, I am not an advocate of SAS 70, but it seems like the industry is pushing for this, or at least a bunch of compliance consultants are)


Level 4:
Event Management is fully implemented: Application level monitoring is in place. Synthetic transactions are generated from multiple global locations. Alerts have context sensitive pointers to knowledgebase. A 24x7 NOC is implemented with a dashboard of all event feeds.
An Incident DB is implemented, which is used to generate SLA reports, incident analysis and availability analysis. A Change DB is in place, used for Change analysis and for Incident Management. A CAB (Change Advisory Board) is defined and regular meetings are scheduled. A Service Status page is in place with up to the moment status reports on the services. Customer and Internal notifications are automated and a full Incident Management closure process is implemented. Management reports are available for service status, trailing N months, SLAs across customers, availability across customers and production systems.
Customer/Component mapping is in place.
A Staging environment exists, fully mimicking the production functionality (not necessarily the network/server setup).
A secondary site is up and running with full functionality and a synched database. Switching between sites should take less than one hour.
SAS 70 Type II compliance is in place.


Level 5:
Bliss. ITIL practices are implemented across the board. (I am not advocating ITIL proper, but I am using the vocabulary to describe the practices). A functioning, up-to-date CMDB is the heart of the system (yeah, dream on). Application management automation is in place. A full Staging environment is in place, fully representing the production environment 1:1.
Quality of service takes a leading role and continuous improvements are sought.
Transparency and customer communication are at the highest level. Executive management has full visibility into every aspect of the service operations. All practices are linked and managed through a comprehensive ITSM management suite.
A complete disaster recovery site is up and running with the ability to switch between sites on the fly.

There. Step one in defining the Maturity Model for SaaS Service Operations is complete.
I hope to get feedback to validate the model.

Tuesday, March 17, 2009

SaaS and Automated Application Management

A quick blog this time.

I have been asked by a great new company, Nolio, to write a few blog posts for their new blog site.

Nolio automates all key processes needed to service and manage applications across your data center, improving application uptime and quality, while streamlining operations for immediate productivity gains.

I have seen their product and was impressed to the point that I am hooking them up with a number of SaaS companies.

Please read the two blog posts. A third is on the way.

Saturday, January 24, 2009

Of Dinosaurs and Men – Why Traditional ISVs Will Fail On SaaS

Dinosaurs were an extremely successful model. They roamed the earth and multiplied and were the indisputable rulers of this planet for hundreds of millions of years.
They were not very fast, necessarily, nor even nice to their customers (some of them are reputed to have actually eaten their customers), but they were successful because they had spent millennia adapting to the existing environment, perfecting their model to make the most out of it.
Then an unexpected event occurred (some scientists believe a meteor hit the earth creating a nuclear winter while others claim it was fast, cheap internet and tightening IT budgets) and soon all dinosaurs went extinct. Well, not all. Two groups were not annihilated. There were those alligators that stayed in their swamp (niche) and are doing pretty well, thank you, still today. The other group consisted of small reptiles that were driven to grow wings since the larger crawlers ate up all the easily available resources. These birds were lucky to be able to adapt quickly to the changing environment and survive the downfall of their relatives.
The fast moving, warm blooded mammals were better equipped to deal with the brave new world and many have grown to become true behemoths.
In a previous post I revealed the fact that I have mostly stopped advising the traditional on-premise, enterprise, perpetual, software vendor on the transition to on-demand, subscription model, i.e. SaaS.
This is not because I do not believe that it is a smart move, or that the ISVs would not benefit from the transition. Far from it! I envision a world, not too far in the future, where on-premise software would be the exception, not the rule, and even that exception would point to a dwindling model that would survive in niche markets only (swamps).
My experience, which is supported by many famous (SAP and Avaya for starters) and less famous companies, has been that most traditional, on-premise, enterprise ISVs will fail miserably in the transition to SaaS. I have advised companies that started out with great enthusiasm that dwindled to a silent death. They simply do not have the DNA for it.
I am talking about the right STUFF that is inherently lacking in established enterprise ISVs that will allow them to make the successful transition. This is not a comment about these companies’ value or success. It is usually inversely proportionate. The more successful the company is, the more entrenched it is likely to be in doing things the ‘right way’ – right, as far as the traditional model dictates.
These ISVs have a product view, not a service view. Their emphasis is on features not serviceability. There is a lot of push back from every silo in the organization, for change, in general, and the SaaS change in particular. It requires a paradigm shift in the organization, and the bigger, more established that organization is, the more difficult it is to bring about that change. (See Impact on the ISV Organization July 02, for a detailed account)

Until a couple of years ago, one could say that most ISVs just don’t get it. But that is no longer the case.
Many traditional ISVs saw their market share being cannibalized by these fast moving SaaS companies. Many heard their customers ask about an on-demand offering and many understand that it is vital that they have a “me too” offering. One cannot ignore the changes in the market and shrug it off as a fad. SaaS used to be a way to work around IT; now CIOs are building on-demand strategies for their business and even starting to use on-demand tools in IT.
So, there is a much deeper understanding of the need to offer an on demand service, but very few ISVs understand that it means a total commitment from the executive level and down.
Not that it is impossible. I have worked with a company whose board made the decision to go Services. They replaced the CEO, who in turn replaced all the senior staff, save the VP engineering. The new VP Sales brought in a fresh new sales force. Then they went through the process of rewriting most of the application from scratch. This process took about a year. They are now a successful SaaS vendor, but they got as close to re-encoding their DNA as possible.
And, of course, there are known successful enterprises such as Oracle on demand, HP SaaS (former Mercury Managed Services) and others that have successfully launched their on-demand services, but they are the exception to the rule.

Dinosaurs were magnificent creatures and it is sad that we don't have them around any longer (except on isolated islands in the Pacific), but their only fault was that they were too successful for the 'old world' model. I wonder how many software alligators will still be around a decade from now.