The recent outage suffered by the Amazon Web Services cloud is another stark reminder that despite the best efforts of service providers, outsourced hosting and cloud infrastructure is fundamentally imperfect. Obviously you don’t want outages to happen very often, and there is plenty service providers can do in the areas of engineering and software architecture to minimize risk and impact. But the ideal of 100% uptime, while desirable, is impossible to achieve. Things can and will go down. You can bet on it. Because of this simple fact, service providers need to understand that it is not just about preventing and avoiding outages. It is about how you deal with them when they unfortunately do happen.
The most important thing to do when an outage happens is to be transparent. There is no getting around this and it can’t be emphasized enough. While it might hurt to tell your customers that a failure in your service delivery has disrupted their business, it is a necessary position to take. Being secretive, elusive or, worse, dishonest breaks down the trust customers place in a service provider when they hand over their mission-critical content and applications. Break this trust and there is nothing left.
Communication is key
Transparency is the first step. The next step is communication. And not just communication but effective and proactive communication. When there is an outage, hosters must go on the offensive and reach out to customers directly … and fast. You don’t want customers going to Twitter for information about your service. You don’t want to risk having customers get false or misleading information. To keep that from happening, hosters must communicate accurate and detailed information immediately. Get on Twitter, post on a company blog or Facebook page. Send a mass email to customers. Call as many customers as you can, or at least call your top customers. And be sure to keep the updates coming regularly throughout the outage and after it. Not reaching out could result in customers making decisions about your service based on false information. And that almost always has an undesirable outcome.
Being transparent and providing clear and accurate information also plays a role before the outage happens. Be careful to look over your SLA terms with a lawyer and make sure you can deliver on all the stipulations. Both the customer and provider need to have a frame of reference for when things go wrong. A concise SLA sets expectations for customers and delineates service provider responsibilities. It also keeps you out of legal hot water. And be sure not to promise what you can’t deliver. The 100% SLA is just a disaster waiting to happen. Even if you have liberal definitions of downtime, when things go wrong you are going to look pretty silly with a claim of 100% uptime. Just don’t do it.
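To make the SLA point concrete, here is a minimal sketch of the arithmetic behind an uptime promise. The function name, the 30-day period and the sample percentages are illustrative assumptions, not figures from any particular provider's SLA.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime a provider can incur in a period and still meet the SLA.

    Both the function and the 30-day default period are hypothetical,
    chosen only to illustrate the arithmetic.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The downtime budget is whatever fraction of the period is not promised as uptime.
    return total_minutes * (100 - sla_percent) / 100


if __name__ == "__main__":
    # Sample uptime targets, picked for illustration only.
    for sla in (99.0, 99.9, 99.99, 100.0):
        budget = allowed_downtime_minutes(sla)
        print(f"{sla}% uptime leaves {budget:.2f} minutes of downtime per 30 days")
```

Under these assumptions, a 99.9% promise leaves a budget of about 43 minutes over 30 days, while a 100% promise leaves none at all, which is why a single incident is enough to break it.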
This is not to say, however, that covering your bases is enough. This is just a starting point. Preparation and a legal frame of reference are just a foundation that shapes a conversation. Hosters still have to engage in that conversation, and it is often with customers who are downright irate. Going above and beyond for your customers might be a necessary response. Understand how far you are willing to go for customers in the event of an outage well before it happens.
It is worth repeating. Things can and do go wrong. A good outage management strategy requires preparation, realistic expectations, honesty and transparency. An infrastructure service provider is responsible for the lifeblood of businesses. It is a relationship that runs on technology but is grounded in trust. Preserving that trust is the key to maintaining good relations with customers when things go wrong. Customers understand that the Internet is imperfect and always will be. All they want is a technology partner that appreciates this and acts accordingly.
Thanks to Andy Piper for the pic, and for releasing it under Creative Commons.