New “Soft-Suspension” Process for OpenSRS Email Service

Effective Tuesday, February 9th, 2010, we’re implementing a new “soft-suspension” process for all users of OpenSRS Email Service. This new process is intended to mitigate the potential effects of spam sent by users of OpenSRS Email Service and to ensure good deliverability and reliable service for all users.

Full information about the change has been sent to all OpenSRS Email Service Resellers. Full details are available in the Resource Center – direct link.

Bulk Migrating Mail Into OpenSRS Email Service

Bringing more tools down to the Reseller level from the support or Professional Services level is something we’ve been focusing on for OpenSRS Email Service. You might remember that we recently made it easier to have deleted email and mailboxes restored without requiring customer support intervention.

Bulk Migrate End-user Mail from Other Email Services

Just last week we rolled out the ability to migrate user mail into OpenSRS Email Service mailboxes in bulk. This means Resellers are now able to create a bunch of new accounts on the service, and then move existing mail data from outside servers into those accounts in bulk. This simplifies and streamlines the process of moving users from other email services and servers onto OpenSRS Email and diminishes the need for support or Professional Services involvement on smaller mail migrations.

There’s more detail on page 43 in the Mail Administration Center (MAC) User’s Guide for those looking to start using this tool. If you want to get a sense for how the tool works, here’s a tutorial screencast that shows at a higher level just how it’s done.

What Do You Think?

We’d love to hear what you think of the tools that we’ve already brought down to the Reseller level, be it the ability to restore deleted mail or this new tool to migrate end-user mail over to OpenSRS Email Service mailboxes. I’ve started a forum thread if you want to comment and add your opinions or thoughts. As always, comments right here on the blog are welcome.

Webmail Version 5.5 Preview now available in Test

As mentioned last week, we’re ready to roll out the latest version of our webmail application to Resellers. Webmail 5.5 is now available in the Test environment. We encourage email resellers to log in to the Test environment, enable it on their test accounts and take it for a spin.

Webmail 5.5 is part of the continuing evolution of our webmail application. The overall look and feel is very similar to what users are accustomed to and the transition should be relatively seamless for most.

Here’s what’s changing in Webmail 5.5:

Features:

  • Added a Calendar: The calendar features a full “drag-and-drop” interface and is available in both the standard and basic interfaces. Calendars can be shared within the same domain if the end-user chooses to do so. Users can import calendar data from standard ical/.ics format calendar files or from a public ical/.ics format URL. Calendars can be exported to a standard ical/.ics format file. Resellers can choose whether to offer the calendar to their users using the Branding Tool.
  • Added RSS Feed Reader: Users can now subscribe to and read RSS feeds within webmail. Users can enter their own RSS feeds or select from a predefined catalog of popular feed sources. The ability for Resellers to choose the feeds within the catalog is planned for a future release, but for now, the contents of the catalog are determined by OpenSRS and contain a broad range of different sources. The RSS Feed Reader is currently only available in the standard interface. As with the Calendar, the RSS Feed Reader can be enabled or disabled by Resellers via the Branding Tool.
  • Improved “Basic” interface: The Basic Interface has been upgraded to support all features of the Standard Interface with the exception of the RSS Feed Reader feature. The Basic Interface also now includes a rich text editor and spell checker in the email compose area.

Performance:

  • Improved overall speed and responsiveness: Much work was done to optimize webmail across different browsers and for users on slower connections. The result is a dramatic improvement in speed and performance. The improvement is felt across all browsers, but is most noticeable on Internet Explorer and for users with large mailboxes.

Usability:

  • Added Dashboard hiding: A new “Dashboard” double-arrow button allows the left-hand folder pane area to be hidden and restored.
  • Improved folder icons: Core folders now have custom icons to make them easier to identify.
  • Added/reorganized message list columns: We’ve added Reply-to/Forward flags and the ability to mark a message as flagged or unflagged. Block/Safe Sender functions have been moved to the “More Options” drop-down. A Message-ID column heading and sort option have also been added. The “Attachment” column has been moved to the right side of the message index display.
  • Improved Search: Users can now choose to search by sender, subject, message headers or message bodies. The default continues to be search within the message headers.
  • Added a full message source viewer: Users can view the full, raw source of a message.
  • Improved email composition: We added the ability to request a “read receipt” when sending a message. Users can now also set message priority when composing and sending email. An auto-save feature has been added that automatically saves composed messages as a draft periodically. Spell-check has been added to the basic webmail interface. The rich text editor has been improved to include more formatting options.
  • Improved the Address Book: Added the ability to import from Mozilla Thunderbird address book files.
  • Improved the General Settings area: The Interface Preferences and Display Preferences have been combined under a single tab called “Display Preferences.”
  • Added Secure POP3 Mail retrieval: POP3 external mail fetching now supports SSL connections.

OpenSRS is “Reseller Friendly” and that motto is something we think about every day when making decisions about new products and features on existing products, like OpenSRS Email Service. While we’re excited to get this new webmail out to users, we understand that this kind of upgrade can have an impact on your support departments, and that email resellers will want to provide some warning to their users to help them get up to speed on the new version. With that in mind, we’re rolling out Webmail 5.5 in a preview program for the first little while.

Here’s how the preview program works:

  • Participation in the preview is recommended, but optional, and the default is to not show the preview to end-users. Resellers can enable the preview through the branding tool in the Mail Administration Center (MAC).
  • When the preview is enabled, users are offered a choice between the new webmail interface or the current webmail on the login screen.
  • At the end of the preview period (which will run for at least eight weeks), the preview option will be removed, and all users will automatically use the new webmail.

Throughout the preview period we’re very interested in hearing from you about the new webmail application and how your customers are responding to it. There is a feedback form that is integrated into the login screen for webmail where users can share their thoughts.

Webmail 5.5 was promoted to the Production Test Environment on Thursday, April 23rd, 2009. We suggest that all of our OpenSRS Email resellers log in to the Test environment to see both how to enable the preview for your users and the improved webmail interface.

Right now, we’re on track to make the preview available on the Live servers for both Cluster A and Cluster B beginning on Thursday, April 30th, 2009. At that time, you’ll be able to log in to the MAC and enable the preview for your users. As mentioned, the preview is disabled by default; it is up to you to determine when you are ready to begin the preview.

Here’s a short screencast that shows how to enable the webmail preview using the Branding Tool within the Mail Administration Center.

Webmail 5.5 Preview Release Coming Soon

Since we re-launched our OpenSRS Email Service on our own, unified email platform in mid-2007, we’ve been very busy working to make it even better. A lot of what has been done was focused on the back-end infrastructure – making the service faster, more reliable and more scalable. But not all the work has been focused on the “plumbing.” For the last little while, we’ve also been working on a new release of the webmail application, version 5.5, and it’s nearly ready for its debut.

Over the next few months we’ll be providing our email customers with a way to let their users take the next release of our webmail application for a spin. This opt-in “preview” will run for at least eight weeks, from when we make it available on the live servers until we retire the old webmail application and make version 5.5 the default webmail for all users.

One important thing to note is that this is not a “beta” program. We’re bringing the new webmail out as a completed application, but in a preview mode. The goal is to get some feedback on some new features we’re adding, but more importantly, we want to ensure that our email resellers can ease into the new webmail with as little impact as possible.

Better Performance, Calendar and RSS

This webmail upgrade is really part of the continuing evolution of our current webmail application. It’s really not all that different from an end-user perspective when it comes to reading and sending mail, managing contacts and doing all of the things you do with webmail.

The big change is in the performance of webmail. We’ve done a ton of work to optimize webmail and the result is a dramatic improvement in speed and performance. It’s felt across all browsers, but the difference is most noticeable on Internet Explorer. It’s much more responsive, especially for users with lots of mail.

And aside from the new stuff “under the hood,” there are a couple of new features we’re adding, namely a calendar and RSS feeds.

The calendar is a full-featured, web-based calendar with drag and drop, as you’d expect. The RSS reader allows users to subscribe to, and read RSS feeds alongside their mail.

We’ll be sharing more about the preview program and the rollout of webmail in the coming days. Rest assured, it will be up to you to decide when and how you’ll bring this new webmail upgrade to your customers during the transition period.

Letter to OpenSRS Email Resellers

Dear Customers -

On behalf of all of the members of the OpenSRS team, please accept our sincere and deepest apologies for the service disruption on Cluster A this past weekend.

Many of you have asked, “How could we have let this happen again?” We initially were led to believe that we had a software problem. We have now determined that the string of service problems on Cluster A is related to a hardware problem inside one of our NetApp devices.

Below is a letter of explanation I received from Jeff Goldstein, General Manager at NetApp Canada.

We are not without fault in this situation. Network-attached storage is complex and we trusted our vendor to provide us with accurate advice related to our problems. In hindsight, we should have pressed earlier for replacement hardware.

Please rest assured that we are dedicated to providing a reliable email service and will be working tirelessly to restore your confidence in us. An incident report is available at OpenSRS Status.

Sincerely,
Elliot Noss,
President and CEO, Tucows

Dear Elliot Noss,

I am writing today regarding the recent outage that occurred this past weekend with Cluster A of the OpenSRS Email Service.

As you are aware, Cluster A of the OpenSRS Email Service has experienced a number of service degradations related to issues with our NetApp storage device. Our engineers here at NetApp worked closely with the technical operations and development teams at OpenSRS to trouble-shoot and resolve these issues. In each of the cases, we believed a software fault was the cause.

The intermittent problem turned out to be due to the hardware shelf controller as well as firmware in one of our NetApp storage devices, which caused the issues on Cluster A.

We are deeply sorry for the inconvenience that resulted from these hardware and email service issues.

One of the promises we make to our customers is that our solutions provide highly available data management and in this case we let you down.

To begin to resolve this issue, we’re taking immediate action to replace the hardware and firmware in Cluster A at our expense. Our engineers will then test and evaluate the components involved to determine what specifically went wrong and apply those findings back into our own quality control teams.

Our two companies have been working together for the past nine years. We value our relationship and will work hard to restore your confidence in NetApp and our solutions.

Again, please accept our sincere apologies.

Regards,

Jeff Goldstein
Canadian General Manager
NetApp Canada

The new Spam Settings page for OpenSRS Email Service

As mentioned in the previous posting, we’re in the midst of rolling out a new release of OpenSRS Email Service. The most visible of the changes that will be promoted to the live service next week is the end-user spam management settings page. To help you out, I’ve prepared a short screencast to show you what the end-user experience will be.

You can test it out for yourself in the Production Test Environment (PTE). The new release was promoted to PTE earlier today.

The approach to OpenSRS Email Releases

Earlier today, we promoted the latest version of code for our email service to the Production Test Environment (PTE). If you’re one of our email resellers, you should have already received an email from us letting you know about the release, so you can familiarize yourself with the changes before your users get them a week from now. Complete information about what’s changing in this release can be found on the release notes page.

In addition to feature releases, we’re constantly working to improve performance and reliability. Those releases where there is no end-user or reseller impact, beyond an improved overall experience, usually happen ‘behind the scenes’ and fairly frequently.

While we’re talking about releases, I wanted to take a minute to explain our approach. In general, we have two main goals in our releases:

  1. Address or remove bugs wherever and whenever we can.
  2. Add new features that provide the widest possible benefit across both the end-user and reseller user bases.

For example, in this release:

  • To help both end users and resellers, we added a new settings page that gives users the ability to change how their spam is handled. In particular, POP users can now choose to have their spam tagged and delivered to their Inbox. Then their spam will get downloaded with the rest of their mail, and they’ll no longer have to use Webmail to check the contents of their Spam folder. We have a screencast showing this functionality in another blog post.
  • To help end users, we added the ability to export contacts. Users could previously import contacts from Outlook-format address book files, and now they can export them as well.
  • To help resellers, we’ve added a way to mitigate the effects of phishing attacks targeting their user bases.

Our mission with OpenSRS Email Service is to provide an easy-to-use experience that gives the ‘power user’ enough functionality to keep them happy, while not overwhelming the average user with gizmos and whiz-bang that does nothing to help them read their mail quickly and easily.

Technical Debrief on October Cluster A Email Service Issue

Aj Mirani is Manager of Technical Operations for OpenSRS and is responsible for coordinating the strategic aspects of technical issues, long-term capacity planning and resource allocation for technical projects. He leads the team of Unix administrators and network administrators directly responsible for running the servers, network and storage devices across all platforms. Below he’s answered questions raised by resellers during last week’s service disruption.

Definitions

Linux: a free Unix-type operating system originally created by Linus Torvalds with the assistance of developers around the world. Developed under the GNU General Public License, the source code for Linux is freely available to everyone. OpenSRS uses the Linux operating system on all of our services.

Dovecot: an open source IMAP and POP3 server for Linux/Unix-like operating systems. OpenSRS uses the Dovecot mail server software across our Email platform.

NetApp: a storage and data management provider used by OpenSRS to manage our email service. This technology is used by many major service providers.

Bugs

Can you describe the Linux kernel bug that was found?

In 2.6.19, a patch was introduced to the Linux kernel by developer Neil Brown which added an “optimization” whereby if you have only TCP NFS mounts, the Linux kernel on the client will not listen for UDP NFS lock callback messages, as it was believed that an NFS server would always send lock messages over TCP if all mounts were also TCP.

However well-intentioned, this optimization does not hold true in all cases. For example, NetApp filers will still perform UDP NFS lock callback messages with their clients, even if they are using TCP NFS for all volume mounts.

This is not a bug on the part of NetApp, since there is no specification in the NFS RFCs that TCP mounts necessitate 100% TCP lock messages. Rather, it’s up to the NFS client to be available and listening to all NFS lock messages, whether TCP or UDP.

Moreover, NetApp explicitly chose to use UDP for some lock callback messages as the overhead on short messages of this type is significantly less in UDP than it is in TCP. In short, it scales better to do so for these messages.

The end result is that the NFS clients (IMAP servers in this case) were not able to perform full NFS locking all the time, and that resulted in clobbered writes, leading to corrupted indexes which necessitated full Dovecot index rebuilds.
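
To make the failure mode concrete, here is a minimal toy model in Python. It is not kernel code: the names `LockCallbackListener` and `listener_for_mounts` are invented for illustration, and the only thing modeled is which transports the client listens on for lock callbacks.

```python
from dataclasses import dataclass, field


@dataclass
class LockCallbackListener:
    """Toy model of an NFS client's lock-callback listener (hypothetical)."""
    transports: frozenset
    received: list = field(default_factory=list)

    def deliver(self, transport: str, message: str) -> bool:
        # A callback is only seen if the client listens on that transport.
        if transport in self.transports:
            self.received.append(message)
            return True
        return False


def listener_for_mounts(mount_transports, optimized: bool) -> LockCallbackListener:
    """Choose which transports to listen on for lock callbacks."""
    if optimized:
        # The "optimization": listen only on transports used by the mounts.
        return LockCallbackListener(frozenset(mount_transports))
    # Conservative behaviour: listen on both, whatever the mounts use.
    return LockCallbackListener(frozenset({"tcp", "udp"}))


# All mounts are TCP, but the server sends a lock callback over UDP.
optimized = listener_for_mounts(["tcp"], optimized=True)
conservative = listener_for_mounts(["tcp"], optimized=False)
missed = not optimized.deliver("udp", "lock granted")
seen = conservative.deliver("udp", "lock granted")
```

In this model the optimized client never sees the UDP callback, which is the condition that let two writers believe they each held the lock.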

Can you describe the bug that was found in Dovecot?

There were in fact two bugs found in Dovecot, both of which ultimately led to the same end result, that being a bad index file.

First, if a user logs in to their mailbox and Dovecot detects a problem with their index, it will attempt to reindex their messages. Should that user’s connection be closed for any reason, Dovecot will detect this but still continue with the reindexing if it has not yet completed it. This is good because, in terms of resource utilization, the reindex is a very expensive operation and Dovecot doesn’t want to have to do this more than necessary. The bug, ironically, is that once Dovecot went through the effort of reading every message in the user’s mailbox and it came time to actually write the index, a subroutine would detect that the user was no longer connected and abort the final write operation.
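
The shape of this first bug can be sketched in a few lines of Python. This is an illustration only, not Dovecot's source: the function `rebuild_index`, its parameters and the sample mailbox are all made up for the example.

```python
def rebuild_index(messages, client_connected, abort_write_on_disconnect):
    """Scan every message, then persist the index.

    Returns (index_or_None, messages_scanned).
    """
    index = {}
    scanned = 0
    for uid, body in messages.items():  # the expensive part: read every message
        index[uid] = len(body)
        scanned += 1
    if abort_write_on_disconnect and not client_connected():
        # Buggy behaviour: the scan is finished, but the result is discarded.
        return None, scanned
    # Fixed behaviour: the completed index is written regardless of the client.
    return index, scanned


mailbox = {1: b"first message", 2: b"second message", 3: b"third"}


def disconnected():
    return False


buggy_index, buggy_scanned = rebuild_index(mailbox, disconnected, True)
fixed_index, fixed_scanned = rebuild_index(mailbox, disconnected, False)
```

Both calls pay for a full scan of the mailbox, but only the second leaves an index behind, so in the buggy case the next login starts the same scan again.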

Second, the virtual size is the size of the message within the POP3 protocol, which can differ from the size of the message on disk. Under some conditions, Dovecot would do the work of rebuilding the table of virtual sizes, but would not ever write it out. The POP3 session would function normally from the user’s point of view, as the virtual size table was now in RAM, but the next time the user logged in, the virtual sizes would have to be reconstructed again, which caused a rescan of all mail in the mailbox.
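
As a rough illustration of why the two sizes differ, the sketch below assumes the message is stored on disk with bare LF line endings while POP3 transmits every line ending as CRLF. The function name `virtual_size` is ours, not a Dovecot API.

```python
def virtual_size(raw: bytes) -> int:
    """Size of a message as seen over POP3, where every line ends in CRLF."""
    # Each bare LF (one not already preceded by CR) grows by one byte.
    bare_lf = raw.count(b"\n") - raw.count(b"\r\n")
    return len(raw) + bare_lf


on_disk = b"Subject: hi\n\nbody\n"
physical = len(on_disk)          # bytes on disk
virtual = virtual_size(on_disk)  # bytes on the wire over POP3
```

Computing this requires reading the whole message, which is why a virtual size table that is rebuilt but never written out forces a rescan of the mailbox at every login.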

Linux Questions

What distro and version of Linux are you using?

We have standardized on the Debian ‘stable’ release as a base with security patches. It has a proven track record with us and we like the rigorous prerequisites which need to be met before packages can be considered for this release. The following article describes the life-cycle of packages within Debian distributions prior to being considered for the ‘stable’ release:

Debian Package Life Cycle (Wikipedia)

During our initial architectural design of the mail platform, there was a lot of heated debate as to our selection of standard OS and hardware. In the end, after looking at all the other options, including Solaris and FreeBSD, we decided on Linux as the best candidate. We still believe this was the best choice not only from a reliability standpoint, but also from a performance perspective.

It sounds like you upgraded your servers to a newer Linux kernel without sufficient testing prior to production deployment. Do you have load-testing capabilities to test changes prior to launch? If so, how/why did this get past that stage? (Greg Youngblood)

We are very cautious around any changes made to the production environment. Even small changes are made with a healthy amount of paranoia. Something as major as a kernel upgrade is not taken lightly and goes through a lot of scrutiny before reaching our production servers. Even when we consider an upgrade fit for production, we start with limited deployments on non-customer-impacting servers. As the upgrade proves itself reliable, we begin rolling it out to other environments.

To give you an idea of how long this can take, this particular kernel spent about 45 days running problem-free across approximately 400 servers in our production environment prior to its final rollout to the Dovecot servers. This was after it spent significant time first in our development and subsequently in our QA environments. We do use NFS extensively across all of these environments. We specifically chose this release because of the major enhancements for NFS that it included.

Unfortunately, even with rigorous testing, the reality is that in any environment, bugs do make it into production sometimes. We have many layers of protection in place to mitigate these. In this particular case, the combination of the problem with the kernel in conjunction with the Dovecot bug quickly pushed our load to literally ten times what we normally experience. We do make sure our systems have plenty of spare capacity and we can handle a lot of extra load. Unfortunately, we cannot handle ten times our regular load, which is what we experienced last week.

Have you considered switching from Linux to Solaris? Even though I’ve been primarily using Linux for 14+ years, I’ve seen places where Solaris’ NFSv4 works better than Linux’s. If you’re a heavy NFS shop, perhaps you should consider it, or at least evaluate it and see how it works out. (Greg Youngblood)

This may be the case for NFSv4 at the moment. We don’t believe that NFSv4 is proven enough for us to use in production for now. From NetApp’s perspective, they strongly recommend using the ‘General Availability’ (GA) version of ONTAP (the platform OS) if we are going to implement NFSv4. However, we are not comfortable using anything other than the ‘General Deployment’ (GD) releases, as these are the production-proven versions. ONTAP GA is the equivalent of beta for NetApp and only has a very limited production deployment across their customer base. The GD releases are the most widely deployed, most stable versions. The short answer is that we won’t feel NFSv4 is suitable for production for at least another six to twelve months.

Dovecot Questions

Was the Dovecot bug that was fixed related to writing to the index file after the timeout? Or is that issue still there? (Greg Youngblood)

We worked with Timo Sirainen, author and primary maintainer of Dovecot, to patch the Dovecot source. These patches are currently in production across our production environment. Future versions of Dovecot will include these changes as standard.

If you implement bounces (preferably timeouts during delivery so messages stay in queues and don’t get lost assuming you can deliver them within 5-7 days), can you make it adjustable (if possible) with a setting in MAC or reseller interface? I can certainly understand why some would want bounce notifications, but probably not everyone will. I am on several lists that auto unsubscribe you on bounces, so for myself personally I prefer not to have them bounced. (Greg Youngblood)

Once we accept mail into our system, it will not bounce. Our architecture is tiered such that even if the underlying mail servers are unavailable, mail is queued internally until such time as it is delivered. Mail that is rejected does bounce at the perimeter only.
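
The tiering described here can be sketched roughly as follows. This is a simplified model under our own assumptions, not the actual OpenSRS implementation: the class `TieredMailFlow` and its methods are hypothetical names.

```python
from collections import deque


class TieredMailFlow:
    """Toy model: reject at the perimeter, queue everything that is accepted."""

    def __init__(self, is_acceptable):
        self.is_acceptable = is_acceptable  # perimeter policy (hypothetical)
        self.queue = deque()
        self.delivered = []

    def receive(self, message) -> str:
        if not self.is_acceptable(message):
            return "rejected"  # the only point at which a bounce can occur
        self.queue.append(message)
        return "accepted"

    def flush(self, mailstore_up: bool) -> int:
        """Deliver queued mail if the mailstore is reachable; else keep it."""
        if not mailstore_up:
            return 0
        count = 0
        while self.queue:
            self.delivered.append(self.queue.popleft())
            count += 1
        return count


flow = TieredMailFlow(lambda msg: "spam" not in msg)
flow.receive("hello")              # accepted and queued
flow.receive("buy spam now")       # rejected at the perimeter
flow.flush(mailstore_up=False)     # outage: nothing delivered, nothing lost
flow.flush(mailstore_up=True)      # recovery: queued mail is delivered
```

The point of the design is that an outage behind the perimeter delays accepted mail but does not turn it into a bounce.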

What can be done by OpenSRS about secondary MX records? Whilst emails are being queued by the primary MX, I presume that nothing would go to a secondary MX.

Currently the only way mail would go to the secondary MX is if the primary MX does not accept mail. Even during the worst spam attacks, we have had sufficient capacity to keep accepting valid mail, so we have yet to experience this situation.
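
For readers less familiar with MX behaviour, the selection logic on the sending side looks roughly like this. It is a simplified sketch: the hostnames are placeholders, and real mail servers add details such as randomizing between records of equal preference.

```python
def delivery_order(mx_records):
    """Return hosts sorted by MX preference (lowest number is tried first)."""
    return [host for _pref, host in sorted(mx_records)]


def pick_mx(mx_records, accepts_mail):
    """Return the first host, in preference order, that accepts the mail."""
    for host in delivery_order(mx_records):
        if accepts_mail(host):
            return host
    return None  # nothing accepted; the sender retries later


# Hypothetical records: (preference, hostname)
records = [(20, "mx2.example.com"), (10, "mx1.example.com")]

normal = pick_mx(records, lambda host: True)
primary_down = pick_mx(records, lambda host: host != "mx1.example.com")
```

As long as the primary keeps accepting connections, even if it is only queuing the mail internally, the secondary is never consulted.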

System Architecture/Miscellaneous Questions

What type of monitoring system are you using? Why wasn’t this abnormal system behavior caught by your monitoring system in the early stages?

We are using a number of systems to trend and monitor. Primarily we use Nagios for monitoring with the help of numerous custom plugins we developed to provide a more robust testing suite. In Nagios alone we have in excess of 1500 monitoring points across the Email platform. Furthermore, we use Munin to trend additional metrics for long term planning and visibility into scalability trends.

On a micro-scale of, say, a few thousand users, these bugs were individually negligible (and extremely difficult to detect). The user experience would have been normal during this period. It took many days of digging by an assembled team of our top engineers and system administrators working in shifts 24/7, in conjunction with some of the best developers in the Open Source world and Tier 3 NetApp engineers, to nail this down.

The cumulative nature of the kernel locking problem combined with the Dovecot reindexing bug, both compounded by millions of users accessing the system, was the confluence of issues that caused the outage.

As a result of this incident we have added a number of additional monitoring points that will add insight and provide an early warning in the future. While it would be almost impossible to detect the specific failure, we’re in a very good position to detect and track things like locks and other NFS/NetApp interactions.
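
To give a flavour of what a custom Nagios check looks like, here is a small sketch. It follows the standard Nagios plugin convention of exit codes 0 to 3 and a one-line status with performance data, but the metric (a count of held NFS locks), the thresholds and the function name `check_lock_count` are assumptions made for the example, not one of our actual plugins.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_lock_count(lock_count, warn, crit):
    """Map a sampled lock count to a Nagios (exit_code, status_line) pair."""
    if lock_count is None:
        return UNKNOWN, "UNKNOWN - could not read lock count"
    perfdata = f"locks={lock_count};{warn};{crit}"
    if lock_count >= crit:
        return CRITICAL, f"CRITICAL - {lock_count} NFS locks held | {perfdata}"
    if lock_count >= warn:
        return WARNING, f"WARNING - {lock_count} NFS locks held | {perfdata}"
    return OK, f"OK - {lock_count} NFS locks held | {perfdata}"


# Hypothetical sample and thresholds.
code, line = check_lock_count(150, warn=100, crit=500)
```

A real plugin would print the status line and exit with the returned code, and the same perfdata can be fed into a trending tool.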

Why wasn’t Cluster B affected? (Greg Youngblood)

Cluster B, much like the half of Cluster A which remained online, was in fact affected. Both Clusters use identical hardware and software. We were able to resolve the issue before we crossed the cascade threshold on those environments.

Why didn’t you move our mailboxes to Cluster B? (Edward Gore)

The short answer to this is that it would take well over a week to migrate every user from Cluster A to Cluster B. We are not able to disclose the exact amount of space being used by mailboxes, but it is measured in many terabytes. Cluster B resides in an entirely separate geographically different data center, so we would be limited to Internet transfer speeds between these two data center providers in order to conduct such a migration.
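
The arithmetic behind this kind of estimate is simple enough to sketch. The figures used below are purely hypothetical, chosen only to show the order of magnitude; they are not our actual storage size or link speed.

```python
def transfer_days(terabytes, megabits_per_second, efficiency=1.0):
    """Days needed to copy `terabytes` of data over a link of the given speed."""
    bits = terabytes * 1e12 * 8  # decimal terabytes to bits
    seconds = bits / (megabits_per_second * 1e6 * efficiency)
    return seconds / 86400


# Hypothetical: 50 TB over a fully used 500 Mbit/s link.
days = transfer_days(50, 500)
```

With those made-up inputs the copy alone takes a little over nine days, before allowing for protocol overhead, verification or normal traffic sharing the link.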

Saying ‘40% of clients of Cluster A’ is all well and good, but who is that? Can you not provide information on exactly WHICH mailboxes are affected – via an API would be useful, we could then work with our clients who ARE affected! (Paul O’Hanlon)

Our hashing algorithm that distributes users across the cluster is highly efficient. We have found that it very evenly distributes users on a large scale. It is possible for us to provide lists of users on affected mailstores although it is unlikely it will be a feature that will be implemented into the API. In the case of last week’s incident, producing those reports would have pulled our engineers away from resolving the root cause and restoring mailboxes.
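
To illustrate the general idea, here is a sketch of hash-based placement and of how a list of affected users could be derived from it. The actual algorithm is not described in this post; SHA-256 with a modulo is a stand-in, and `mailstore_for` and `affected_users` are hypothetical names.

```python
import hashlib


def mailstore_for(user: str, n_stores: int) -> int:
    """Map a mailbox name to one of `n_stores` mailstores (stand-in scheme)."""
    digest = hashlib.sha256(user.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_stores


def affected_users(users, n_stores, affected_stores):
    """Return the users whose mailbox lives on one of the affected stores."""
    affected = set(affected_stores)
    return [u for u in users if mailstore_for(u, n_stores) in affected]


# Hypothetical mailboxes and store count.
users = [f"user{i}@example.com" for i in range(1000)]
on_bad_stores = affected_users(users, n_stores=10, affected_stores=[0, 1, 2, 3])
```

Because placement depends only on the mailbox name, a reseller's users end up spread across every mailstore, which is why an incident on part of a cluster touches a slice of almost every reseller's user base.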

After an outage, I think users are ready to understand they cannot access their old mail for a while (< 48 hours) but they expect to be able to send and receive new mail within 4 hours on a backup system where empty mailboxes would have already been created in advance. Have you considered this idea? (Augustin L)

We have considered this and are still investigating options. There are some considerations which need to be carefully weighed, especially surrounding clients such as Outlook/Thunderbird/MacMail coming out of message UID sync with the backend when all of a sudden historical mail is not available. One possibility is to offer a webmail-only emergency solution, though this is also not entirely ideal. We’ll keep you posted.

Is this reindexing you did the same or similar to what you did in August?

No, this was a file-level reindex of the user’s mailbox. In August we had multiple hardware failures that resulted in hard drives requiring a RAID-level rebuild from parity.

Shouldn’t the system reindex itself?

Yes, it should, and once we put the Dovecot software patches in place the reindexing was successfully completed by the system itself. Cluster B and half of Cluster A were left to naturally reindex but we chose to take part of Cluster A offline entirely in order to perform a global reindex because of the sheer number of mailboxes which were affected in that specific portion of the Cluster. This reduced the time to restore service.

While mailboxes are evenly distributed, user access fluctuates and is not entirely predictable on a large scale. While access load does balance itself out in the grand scheme, there are times when some portions of the cluster are more used than others.

Why wouldn’t you have redundancies in place to avoid this?

There are many layers of redundancy currently in place on both the hardware and software fronts. Redundancy was not the solution in this particular case. We would have needed to have an order of magnitude more capacity/redundancy to be able to ‘weather’ this, and even then it likely would not have been enough. The best way to avoid this type of situation in the future is through early detection and I strongly believe we’re in a very good position to do that now.

Might splitting your architecture into smaller but more manageable systems be an option? (Augustin L)

Our current architecture evolved from that type of environment so we know firsthand the caveats of splitting the system into smaller sub-systems. Splitting up the Clusters further would, in fact, make them less manageable and less able to distribute load. During our time running the old platform we saw a lot of this and it translated to a much more unpredictable user experience. In short, smaller sub-systems do not scale well.

However, we have engineered in the benefits of smaller manageable systems, as evidenced by only a portion of the cluster being affected.

Summary of Recent Cluster A Email Service Issues

I’d like to provide further details about the poor service we provided to many of our resellers on Cluster A of our Email Service last week.

As promised, we’re conducting a detailed post-mortem but I wanted to kick things off by providing you with some high-level analysis of what happened and what action we took.

We have prepared Incident Report #2993 – October 14, 2008 (260K PDF) as the first part of our analysis.

In the coming days, we’ll be addressing some of the deeper issues brought to light by this incident through an even more technical FAQ that is currently in the works.

As our CEO Elliot Noss expressed in his Open Letter, we’re very sorry this happened in the first place and we’re determined to do everything we can to make sure it doesn’t happen again. We want to thank you for the many words of constructive advice you have provided and we can assure you that we’ll be considering every suggestion.

In Elliot’s letter, he mentioned that in addition to dedicating ourselves to reliability, we are committed to taking other elements of our email service to a new level including: monitoring, change management, emergency protocols and procedures. In the coming weeks, we’ll be posting more about our plans. As always, we welcome your feedback.

Comparison of this incident with the August service interruption

Many of you have been asking us why we have had two outages on the same cluster within a period of three months. We want to clarify that this was NOT a recurrence of the issue that caused the service interruption in August. I have published the incident report for the August incident below to allow you to compare, but to summarize briefly:

  1. The August outage was the result of a shelf controller hardware failure. After replacing the defective hardware, we had to rebuild the RAID groups. This process had to be completed sequentially, meaning that we could only bring mailstores back online one volume at a time. After that incident, we made architecture changes to prevent a similar hardware failure from triggering a rebuild. (Incident Report #1991 – August 18, 2008 (344k PDF))
  2. Last week’s degradation in service was caused by two separate issues (one in the underlying Linux kernel and one in the Dovecot mail server software) which caused corruption in the mail server indexes. This led to an abnormally high server load as users trying to connect received timeout messages and then tried to reconnect. The resulting logjam as all login slots were filled led to more timeouts and degraded service for about 40% of users on Cluster A (or about 20% of all Email Service users). It took us longer to diagnose because we had to rule out a hardware problem first. After that was confirmed, further investigations had to be completed at the same time as we were moving mailboxes to new hardware in an attempt to alleviate the high server loads. Once the problems were diagnosed, we were able to work with some of the top contributors from the Linux kernel and the Dovecot mail server open source communities to develop and apply patches as quickly as possible. Unfortunately, the second bug wasn’t discovered until we had completed reindexing the mailboxes after patching the first problem, leading to a longer than anticipated service disruption.

Once again, I’d like to personally apologize for the inconvenience to you, our Resellers, and to your customers.

We’ll include more posts on this issue and our efforts to make sure it doesn’t happen again in the upcoming days and weeks.

Open Letter To Our Email Service Resellers

Dear Resellers,

I am writing today to speak to you directly about what happened this week with Cluster A of our Email Service. This will not refer to specific elements of the outage; there are other venues for that. The things I most want to communicate are my deep sorrow, why it won’t happen again and what we will do for you.

More than anything, one thought keeps going through my mind as I think about this: the future determines the past. I will return to this thought.

First, and most importantly, we are sorry. I am sorry. I have been in this business a long time and do not know if I have ever been more sad about what we have done to you, to your customers and to how people think about us. An email outage in 1995 was different from one in 2000 and even more different from one in 2008. I know what this does to your reputations, to your customers and to your staff – and I and so many people here are just sad about that.

While it seems trite right now, we really define ourselves by how we make it easier for you in your businesses and with your customers and in our deep understanding of those relationships. That means the pain here is that much greater and believe me I know our pain here does not matter, yours does. Just know we are grieving.

Second, what will we do about it and why will this never happen again? I know for some of you that doesn’t matter, you are done with us, but I want to express this for the rest of you. Let me start here with things that were not the problem: old equipment, people, capacity or redundancy. The equipment is new, the people are great, we have plenty of capacity and redundancy. What this clearly means for us is the need to take the other elements of the service to a completely new level. Here I mean monitoring, change management, emergency protocols and procedures and operating efficiencies.

We had decided long before this that the most important part of email was reliability, not features, not groupware, not web 2.0 integration but reliability and deliverability. I have been at this a long time and really believe that these people and this service can be the best in the world, better than Google, Yahoo or Microsoft and most importantly the best partner for service providers. We owe you this and will deliver it.

Lastly, what will we do for you as a result of this? Let me start here by saying two things: we will certainly be doing something, and there is nothing we can do that will make up for your loss of reputation in your customers’ eyes. We know that. The people who will participate in that decision are fried right now, as I know even in your anger you can well imagine. I will ask your indulgence that you give us this week to make our plan in this regard.

There is one thing that I can offer now. I would like to make myself personally available to any of you who would like me to either reach out to your customers, or to any specific customer, with a letter, an email or a phone call. I know this will not often matter but perhaps in a few cases it might. My message here would be simple, this was our fault not yours and while you are responsible for the suppliers you pick, you had good reason to pick us and it was us who let you down. This offer stands whether you are leaving or staying.

In closing, the future determines the past. If we move forward and run the most reliable, service-provider-focused email service the world has ever seen, this will be remembered as the few days that turned it around, as being a very important event in forging our mutual future. If we have no change in reliability or in service levels this will barely be remembered. It will just be a point on a mediocre line. I will do everything in my power to make it the former, not the latter.

Regards,

Elliot Noss
