Thursday, December 01, 2011

The Black Swan Event in SaaS Operations

 "I find that the harder I work the more luck I seem to have."  - Thomas Jefferson

Nassim Taleb’s eye-opening books 'The Black Swan' and (to a lesser extent) 'Fooled by Randomness' discuss rare, unexpected and almost impossible to predict events that have a major impact (and usually tend to be disastrous). He calls these Black Swan events, and gives examples such as World War I, stock market crashes, the PC, the Internet, and 9/11.
Interestingly enough, all Black Swan events are easily rationalized after the fact, in hindsight.

The Black Swan analogy is borrowed from the notion that while one can induce a hypothesis from observational data (e.g. all swans are white), one cannot prove that hypothesis: no matter how many white swans have been observed, it takes only a single black swan to refute it. Karl Popper, the philosopher of science, made that notion popular in his discussion of the Scientific Method (The Logic of Scientific Discovery).

SaaS and the Black Swan
Have you ever lost your database only to find out that the backup files were deleted the previous day? Have you ever hit a major problem with a component in the system, only to find out that the support contract expired last month?

My own experience, and the experience of the numerous companies I have worked with, has taught me that the next Black Swan is just around the corner, lurking in the dark, and will hit you when you least expect it. Heck, that’s the nature of a Black Swan.

The systems we deal with are so complex and interdependent that one could never analyze (let alone predict) the interconnections that govern the behavior of the services we offer. Luckily, statistics are on our side: most SaaS applications are stable most of the time, and on average we can predict their behavior over time. But that is just what creates a Black Swan. We observe a certain behavior for so long that we tend to accept it as a scientific fact, until it bites us in the behind.

Running a complex SaaS operation with dozens (or hundreds) of servers, network boxes, configuration files, erratic software and all the dependencies we have on our infrastructure providers (power, internet, hardware, communications) is like driving a high-speed car on a congested highway, blindfolded. We have no appreciation of how much Lady Luck is involved.

Keep in mind that the longer the good times last, the harder the Black Swan event hits. Remember the dot-com and the real-estate bubbles; most of us are still licking our wounds.

The Butterfly Effect
All it takes is an overflowing log file that fills the disk and brings the system down. Or a minor, forgotten gadget installed on one of the servers whose license has expired: a pipeline of requests starts filling up, and there goes the system.
How about setting up an image of a new VM whose IP and DNS IP were reversed by mistake? Put it in production and slowly the wrong DNS IP starts propagating through the system. After a while the servers stop communicating with each other and the system freezes.
These tend to be catastrophic events, since they are so hard to detect and resolve. Many times, restarting the whole system is the chosen quick fix, with a prayer that the problem will resolve itself. But in these cases the restarted system behaves just as badly, and by the time one realizes what is happening, major damage to your customers and your brand has been done.
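
The overflowing log file is the easiest of these to catch early. Here is a minimal sketch of a disk-usage check in Python; the function names and the 80% and 95% thresholds are my own assumptions for illustration, not a recommendation for any particular system.

```python
import shutil


def classify_usage(used: int, total: int,
                   warn_at: float = 0.80, crit_at: float = 0.95) -> str:
    """Return 'ok', 'warning' or 'critical' for a used/total byte count.

    The thresholds are illustrative assumptions; tune them per system.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = used / total
    if fraction >= crit_at:
        return "critical"
    if fraction >= warn_at:
        return "warning"
    return "ok"


def check_disk(path: str = "/") -> str:
    """Classify how full the filesystem holding `path` is."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return classify_usage(usage.used, usage.total)


if __name__ == "__main__":
    # Run this from a scheduler and alert on anything other than "ok".
    print(check_disk("/"))
```

A check like this only helps if someone is notified when it stops returning "ok", which is where the practices below come in.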

Words of Wisdom

Do not despair. I am not suggesting that since a Black Swan event is unpredictable, there’s nothing you can do about it. The opposite is true.
The first step is to internalize the fact that it will occur; as the famous saying goes, “s**t happens”.

“Prepare for Failure” is my motto. Take into account that at any given moment something might break.

A number of practices should be implemented early on:
Change Management: To ensure that the events are indeed rare and that one may recover quickly with the knowledge of what went wrong.

Event Management: To be able to detect early on, what is hitting the fan, and respond to it.

Availability Management: Analyze your Single Points of Failure and the impact of each component's failure. Build your backups and your disaster recovery plan (DRP), and practice recovery.

Incident Management: Make sure you cover these practices: Detection, Recording, Classification, Notification, Escalation, Investigation, Diagnosis, Restoration and Closure.
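
To make the Incident Management list concrete, here is a rough Python sketch that treats those nine practices as stages an incident must pass through in order. The `Incident` class and the strict one-after-the-other ordering are my own simplification for illustration; real incidents often loop back between stages.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Stage names come from the list above; treating them as a strict
# sequence is an assumption made for this sketch.
STAGES = [
    "Detection", "Recording", "Classification", "Notification",
    "Escalation", "Investigation", "Diagnosis", "Restoration", "Closure",
]


@dataclass
class Incident:
    title: str
    completed: List[str] = field(default_factory=list)

    def next_stage(self) -> Optional[str]:
        """Return the next stage due, or None once the incident is closed."""
        if len(self.completed) < len(STAGES):
            return STAGES[len(self.completed)]
        return None

    def complete(self, stage: str) -> None:
        """Mark a stage done; refuse to skip ahead or repeat a stage."""
        expected = self.next_stage()
        if stage != expected:
            raise ValueError(f"expected {expected!r}, got {stage!r}")
        self.completed.append(stage)

    @property
    def closed(self) -> bool:
        return len(self.completed) == len(STAGES)


if __name__ == "__main__":
    incident = Incident("Log file filled the disk")
    for stage in STAGES:
        incident.complete(stage)
    print(incident.closed)
```

The point of refusing to skip stages is that nobody gets to jump from Detection straight to Closure without recording and diagnosing what actually happened.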

The Wise and the Smart ones
I was approached by a few (emphasis on few) CEOs and COOs who felt uncomfortable about the fact that everything was going smoothly. Some were on the verge of fast growth and wanted to assure themselves that they were better prepared to hit the highway. Others had a feeling in their bones that “too good for too long” was a recipe for disaster, even if they had not read Nassim Taleb’s book.

But many potential customers I spoke with assured me that they really did not need my services since they were doing very well, thank you. Some are still doing very well; others had a large hat to eat and many letters of regret to write to their customers.

1 comment:

Gil Beth said...

One other way to minimize exposure to Black Swans (I think Taleb had a better name for it, "turn the Black Swans white") is having someone with experience and knowledge as part of your team. It doesn't have to be a full-time employee; if you want to save money, have him as a consultant. If this person is good, you will avoid many Black Swans.