Summary of Recent Cluster A Email Service Issues

By Ken Schafer on October 14th, 2008
Posted in OpenSRS Services

I’d like to provide further details about the poor service we provided to many of our resellers on Cluster A of our Email Service last week.

As promised, we’re conducting a detailed post-mortem, but I wanted to kick things off by providing you with some high-level analysis of what happened and what action we took.

We have prepared Incident Report #2993 – October 14, 2008 (260K PDF) as the first part of our analysis.

In the coming days, we’ll be addressing some of the deeper issues brought to light by this incident through an even more technical FAQ that is currently in the works.

As our CEO Elliot Noss expressed in his Open Letter, we’re very sorry this happened in the first place and we’re determined to do everything we can to make sure it doesn’t happen again. We want to thank you for the many words of constructive advice you have provided and we can assure you that we’ll be considering every suggestion.

In Elliot’s letter, he mentioned that in addition to dedicating ourselves to reliability, we are committed to taking other elements of our email service to a new level, including monitoring, change management, and emergency protocols and procedures. In the coming weeks, we’ll be posting more about our plans. As always, we welcome your feedback.

Comparison of this incident with the August service interruption

Many of you have been asking us why we have had two outages on the same cluster within a period of three months. We wanted to clarify that this was NOT a recurrence of the issue that caused the service interruption in August. I have published the incident report for the August incident below to allow you to compare, but to summarize briefly:

  1. The August outage was the result of a shelf controller hardware failure. After replacing the defective hardware, we had to rebuild the RAID groups. This process had to be completed sequentially, meaning that we could only bring mailstores back online one volume at a time. After that incident, we made architecture changes to prevent a similar hardware failure from triggering a rebuild. (Incident Report #1991 – August 18, 2008 (344k PDF))
  2. Last week’s degradation in service was caused by two separate issues (one in the underlying Linux kernel and one in the Dovecot mail server software) which corrupted the mail server indexes. This led to an abnormally high server load as users trying to connect received timeout messages and then tried to reconnect. The resulting logjam, with every login slot filled, led to more timeouts and degraded service for about 40% of users on Cluster A (or about 20% of all Email Service users). Diagnosis took longer because we first had to rule out a hardware problem. Once that was done, further investigation had to proceed while we were moving mailboxes to new hardware in an attempt to alleviate the high server loads. Once the problems were diagnosed, we were able to work with some of the top contributors from the Linux kernel and the Dovecot mail server open source communities to develop and apply patches as quickly as possible. Unfortunately, the second bug wasn’t discovered until we had completed reindexing the mailboxes after patching the first problem, leading to a longer than anticipated service disruption.

Once again, I’d like to personally apologize for the inconvenience to you, our Resellers, and to your customers.

We’ll include more posts on this issue and our efforts to make sure it doesn’t happen again in the upcoming days and weeks.

Responses to “Summary of Recent Cluster A Email Service Issues”

  1. carl doppler says:

    Your communication during ANY problem stinks, and whoever updates the status page should be retrained to Properly Communicate. My message here is probably not proper, but then again I am not trying to update LOTS of companies on EMAIL BEING DOWN.

    During problems you act like you just don’t care about your clients.

  2. Ken Schafer says:

    We’re looking at completely rebuilding our Status Page communication tools and processes before the end of the year. Our current tool is very limited in its abilities and we also want to do more to open up communications during major system issues.

  3. John Stewart says:

    Bull***t!!! Your communications tools are not *limited* in their capabilities. Even if the GD website is HTML you can update a page every half hour if you worked more than 7½ hours per day there. The problem is NOT “limited in capability”… it’s f*ing limited in any DESIRE to be updated by your staff. We’ve been through this with the old email, the new email, email defense, and countless other “incidents” over the past eight years. Three strikes, you’re out, kids. Quit making f*ing excuses and start addressing your problems internally: at the executive level, the technical level, the customer service level. You’re lacking in ALL areas. And in the next few days, you’re going to discover that it’s going to cost you literally MILLIONS of dollars in annual sales – and, quite possibly, a BIG class action lawsuit for breach of contract with resellers.

  4. Edward Gore says:

    Well, I am NOT impressed. Your infrastructure is way too complicated to quickly get back online after any failure, hardware or software. It takes way too long for you to even figure out if you have a hardware or software problem.

    Why didn’t you move our mailboxes to cluster B while you were screwing around with cluster A?

    This is the worst uptime performance I’ve ever experienced from any ISP for any service in my life.

  5. James Stewart says:

    As a person that has run an email service, I empathize completely … it is a thankless job. I would like to add this observation: my experience with Linux is that it is fine for general use, but when under full production, it does not hold up! During the 10 years I ran the service, the Linux boxes caused problems at least weekly, whereas FreeBSD was used without an O/S incident. I will say that I stopped using the Linux boxes after the first year, so my experience is now 10 years old. I would be glad to provide details, and help if I can. My heart goes out to you folks.

  6. Nick Sugden says:

    Actually I thought your communication was quite good in comparison to other providers. I appreciate that it can be very hard to give information to customers when you do not initially know yourselves what the problem is or how to fix it. You certainly got across to me that you realised the significance of the problem and that you were deploying significant effort to fix it. I am reassured by your commitment and approach to changing your architecture and procedures to prevent such problems recurring. Well done!

  7. gsyoungblood says:

    Try #2 – I just tried to write a comment and got this:

    This error (HTTP 405 Method Not Allowed) means that Internet Explorer was able to connect to the website, but the site has a programming error.

    For more information about HTTP errors, see Help.
    :(

    Trying one more time…


    OpenSRS Reseller/User group on LinkedIn.
    http://www.linkedin.com/e/gis/1012737

  8. gsyoungblood says:

    OK, that worked, so I’ll type this again.

    Was the Dovecot bug that was fixed related to writing to the index file after the timeout? Or is that issue still there?

    It also sounds like you upgraded your servers to a newer Linux kernel without sufficient testing prior to production deployment. Do you have load-testing capabilities to test changes prior to launch? If so, how/why did this get past that stage?

    Finally, please review the numerous comments in response to “Cluster A Email Service Issues” on October 7th and answer the questions that others and I asked. Some have been answered, and I thank you for them. Also, thanks for the incident report. I believe there are still more questions. If I have time I may try to extract a list and repeat them here.

    OpenSRS Reseller/User group on LinkedIn.
    http://www.linkedin.com/e/gis/1012737

  9. Augustin L says:

    Dear Tucows,

    Maybe it’s Linux’s fault, maybe it’s Dovecot’s, or maybe it isn’t worth probing the frontier of knowledge in a production environment by putting too many users on a single system. Restore time on a huge partition is always crap. The same applies when reindexing too many mailboxes.

    Couldn’t splitting your architecture into smaller but more manageable systems be an option?

    After an outage, I think users are ready to understand they cannot access their old mail for a while (< 48 hours), but they expect to be able to send and receive new mail within 4 hours on a backup system where empty mailboxes have already been created in advance.

  10. Paul O’Hanlon says:

    Re:…[We’re looking at completely rebuilding our Status Page communication tools and processes before the end of the year. Our current tool is very limited in its abilities and we also want to do more to open up communications during major system issues.]…

    I would like to request that you consider two things when you rebuild your status page:

    1) Make the information available via an API or RSS feed of some description, so that we (as a reseller) can publish the content on a status page on OUR OWN Web site to which we can point our angry clients.

    2) Saying “40% of clients of cluster A” is all well and good, but who is that? Can you not provide information on exactly WHICH mailboxes are affected? Again, an API would be useful; we could then work with the clients that ARE affected!

  11. enoss says:

    @carl I would love to talk more about the communication bit. in fact we all would. there is such a fine line between putting out incomplete info while investigating and providing as much visibility as possible.

    at times we just felt damned if we do and damned if we don’t. it was not for lack of resources. during a situation like this we have around the clock war room with a meeting every two hours and call in from outside for those not here.

    I will email you offlist as I would love to run some scenarios by you.

  12. Ken Schafer says:

    @Edward Gore @4 – We’ll address the “why didn’t you move me” question in a technical FAQ we’ll be posting soon.

  13. Ken Schafer says:

    @James Stewart @5 – Thanks for the kind words. The engineers are reading the comments as well so the FreeBSD suggestion will be seen.

  14. Ken Schafer says:

    @gsyoungblood @8 – We’re publishing a technical FAQ soon that will address some of the more, ugh, technical questions, and yours should be covered in there as well.

  15. Ken Schafer says:

    @Paul O’Hanlon @10 – There will definitely be a feed and email notification option for all services via the new Status Page.

    We hadn’t considered the API but that’s interesting.

    In general we recommend AGAINST automatically taking our Status and pushing it to your customers. We’re writing to our Resellers and will at times use references and descriptions of services that may not be appropriate for your users.

    I can imagine a day where an API call could give actual status (i.e. online, degraded, maintenance, offline) but as soon as we go beyond that I think it best that resellers put context around what we’re saying that makes sense to your customers.

    Thanks for the suggestion and the kind words!

  16. Ken Schafer says:

    Thanks to everyone for the technical questions. I’ll make sure they get passed on to the folks writing the Technical FAQ.

    And thanks for the kind words.

    And I’m sorry that some of you are so frustrated by how we’ve handled this – we’ll try harder and I hope to recapture your respect and approval through our actions in the future rather than words here.

  17. Roger Davies says:

    Whilst we were all badly affected by (both of) the outages I personally appreciate the lengths that OpenSRS have gone to to keep us informed of progress, videos, open letters, offers to contact my end users for me, are all excellent responses to a big big problem.

    Of course I would prefer it if the service just stayed up but life is not like that, and the way the problems have been handled just goes to prove to me that I have picked the right supplier, if only my other suppliers were this efficient, mentioning no names (ahem..BT, Dell, Demon Internet(DSVR), Eclipse Internet, LloydsTSB) they could all take a leaf out of OpenSRS’s book of how to manage a problem when things have gone wrong. Not all my other suppliers are rubbish though, OpenSRS are 1 of 3 who I rate, the other two are forlinux.co.uk who phone me up if there is a problem, usually before I have noticed, and RBR Motors who look after all my cars, without these three firms I would have lost all faith in ANY supplier so well done OpenSRS for being able to join the lofty ranks of ‘Rogers Rated Suppliers’, but let’s just keep everything running from here on in, eh? :-)

  18. Donald says:

    Life in our small call center was hell last week.

    I found that the information provided on the status page was enough to tell us that you were working your butts off. Every two hours we’d hope against hope that we’d be getting a solution up and running soon, and the glorious update that said “Everything will be fine after X time”. We know you can’t give that kind of update often, but that was the frustrating part, not so much that “your communication sucked”, as others have said.

    Thanks for the update, and the feedback section.

  19. Troy Jones says:

    I must say that my company is not a long-time customer but in the short time we have been affiliated with Tucows email, it has been less than adequate. There are many issues, but this is simply inexcusable. As an email server admin myself, I do recognize that certain problems are bigger than others but this just smacks of lackadaisical effort and lack of proper fail-safe and redundancy practices. One of my clients, a major real-estate company, couldn’t get their email for a full 6 days. This is the second time a server problem has shut them down. They do a vast majority of their business online and now we are in a position of having to explain this all to them with them really not caring and considering other options. I can only imagine that bigger resellers have even bigger issues but this has soured us to the point that we are unlikely to continue service.

    Because I understand that even the best of companies has issues from time to time, I will not attack any particular group personally, but as a whole, any company that has a “100% uptime guarantee” should have some real intention and ability to provide that. Otherwise, this is what happens – people like you having to explain away a problem to clients who are already too pissed off to really care what the problem was.

  20. Akira says:

    Since the service on cluster A recovered, some of our users have been experiencing another problem.
    They can send and receive, but sometimes a message with an attachment is delivered with a strange header (no sender, no To, no date, no subject).
    Since the header is corrupt, the attachment is not decoded by the e-mail client software.
    Is this in any way related to this outage?

  21. Ken Schafer says:

    @Akira @20 – I doubt that’s related to the problems we had but I’ll pass it on to Support. Generally it’s better to shoot problems like this directly to Support in the first place as they’re better prepared to help with specific issues like this.