Summary of Recent Cluster A Email Service Issues
I’d like to provide further details about the poor service we provided to many of our resellers on Cluster A of our Email Service last week.
As promised, we’re conducting a detailed post-mortem, but I wanted to kick things off by providing you with some high-level analysis of what happened and what action we took.
We have prepared Incident Report #2993 – October 14, 2008 (260K PDF) as the first part of our analysis.
In the coming days, we’ll be addressing some of the deeper issues brought to light by this incident through an even more technical FAQ that is currently in the works.
As our CEO Elliot Noss expressed in his Open Letter, we’re very sorry this happened in the first place and we’re determined to do everything we can to make sure it doesn’t happen again. We want to thank you for the many words of constructive advice you have provided and we can assure you that we’ll be considering every suggestion.
In Elliot’s letter, he mentioned that in addition to dedicating ourselves to reliability, we are committed to taking other elements of our email service to a new level, including monitoring, change management, and emergency protocols and procedures. In the coming weeks, we’ll be posting more about our plans. As always, we welcome your feedback.
Comparison of this incident with the August service interruption
Many of you have been asking us why we have had two outages on the same cluster within a period of three months. We want to clarify that this was NOT a recurrence of the issue that caused the service interruption in August. I have linked the incident report for the August incident below to allow you to compare, but to summarize briefly:
- The August outage was the result of a shelf controller hardware failure. After replacing the defective hardware, we had to rebuild the RAID groups. This process had to be completed sequentially, meaning that we could only bring mailstores back online one volume at a time. After that incident, we made architecture changes so that a similar hardware failure would not trigger a rebuild. (Incident Report #1991 – August 18, 2008 (344k PDF))
- Last week’s degradation in service was caused by two separate issues (one in the underlying Linux kernel and one in the Dovecot mail server software) that corrupted the mail server indexes. This led to an abnormally high server load: users trying to connect received timeout messages and then tried to reconnect. Once all login slots were filled, the resulting logjam led to more timeouts and degraded service for about 40% of users on Cluster A (or about 20% of all Email Service users). Diagnosis took longer than it otherwise would have because we first had to rule out a hardware problem. Once we had done so, further investigation had to proceed while we were also moving mailboxes to new hardware in an attempt to alleviate the high server loads. Once the problems were diagnosed, we were able to work with some of the top contributors from the Linux kernel and the Dovecot mail server open source communities to develop and apply patches as quickly as possible. Unfortunately, the second bug wasn’t discovered until we had completed reindexing the mailboxes after patching the first problem, leading to a longer than anticipated service disruption.
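The login-slot logjam described above can be illustrated with a toy model. This is only a rough sketch and not a reproduction of the actual Dovecot configuration or load: the function name, the slot count, the arrival rate and the session lengths are all invented for the example. It shows how slower sessions alone can turn a healthy server into one where timed-out users pile up and keep retrying.

```python
def simulate_login_slots(slots, new_clients_per_tick, service_ticks, ticks):
    """Toy model of a fixed pool of login slots (all numbers are hypothetical).

    Each tick, new clients plus every client that timed out on the previous
    tick try to log in. An admitted session holds its slot for
    `service_ticks` ticks. Clients that find no free slot time out and
    retry on the next tick. Returns the retry backlog after each tick.
    """
    active = []      # remaining ticks for each occupied slot
    retrying = 0     # clients that timed out and will reconnect
    backlog = []
    for _ in range(ticks):
        # Age existing sessions; finished ones release their slot.
        active = [t - 1 for t in active if t > 1]
        attempts = new_clients_per_tick + retrying
        free = slots - len(active)
        admitted = min(attempts, free)
        active.extend([service_ticks] * admitted)
        retrying = attempts - admitted
        backlog.append(retrying)
    return backlog


# Healthy case: short sessions, slots never run out.
healthy = simulate_login_slots(slots=10, new_clients_per_tick=2,
                               service_ticks=2, ticks=8)

# Degraded case: same traffic, but each session takes five times longer
# (standing in for the extra work caused by corrupted indexes).
degraded = simulate_login_slots(slots=10, new_clients_per_tick=2,
                                service_ticks=10, ticks=8)

print("healthy backlog: ", healthy)
print("degraded backlog:", degraded)
```

In the healthy run the backlog stays at zero. In the degraded run, incoming traffic is unchanged, yet the slots fill up after a few ticks and the number of retrying clients grows on every tick after that.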
Once again, I’d like to personally apologize for the inconvenience to you, our Resellers, and to your customers.
We’ll publish more posts on this issue, and on our efforts to make sure it doesn’t happen again, in the coming days and weeks.


