
Cluster A Email Service Issues

At approximately 17:30 UTC on October 6, Resellers on Cluster ‘A’ began experiencing issues accessing mailboxes. We’ve put together some questions and answers to better inform Resellers about what happened and what we’re doing to restore full service as quickly as possible.

The latest status updates are available at the OpenSRS Status Page and we will continue to update you here as well.

Q: How can I tell whether I’m affected?

A: Only Resellers on Cluster ‘A’ of our email service are affected. If you log in to the Mail Administration Center (MAC) at http://admin.hostedemail.com/ or http://admin.a.hostedemail.com/ you are on Cluster ‘A’. If you log in at http://admin.b.hostedemail.com/ then you are on Cluster ‘B’ and you are not affected.
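If you would rather check this in a support script, the rule above can be sketched as a small lookup. This is only an illustration: the function names `cluster_for_mac_url` and `is_affected` are hypothetical, and the hostnames are simply the ones listed in the answer above.

```python
from urllib.parse import urlparse

# Hostnames taken from the answer above; the mapping itself is illustrative.
MAC_HOST_TO_CLUSTER = {
    "admin.hostedemail.com": "A",
    "admin.a.hostedemail.com": "A",
    "admin.b.hostedemail.com": "B",
}


def cluster_for_mac_url(url):
    """Return 'A' or 'B' for a known MAC login URL, or None if unrecognized."""
    host = (urlparse(url).hostname or "").lower()
    return MAC_HOST_TO_CLUSTER.get(host)


def is_affected(url):
    """Only Cluster 'A' is affected by this incident."""
    return cluster_for_mac_url(url) == "A"
```

An unrecognized URL returns None from `cluster_for_mac_url`, so it is treated as not affected; you may prefer to flag those cases for a manual check instead.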

Q: Does this affect both inbound and outbound mail?

A: No. Only inbound email is affected. Outbound messages will be sent as normal.

Q: What will end users see?

A: At this point, there are only intermittent issues accessing mail. Users logging into webmail may see a “Service Unavailable” error message. Users accessing mail via POP or IMAP may experience a denied login depending on the specific email client being used.

Q: When will this be fixed?

A: There’s no firm restore time at this point. Our Network Operations Center is working on the issue and the goal is to restore service for all users as soon as possible.

Q: Where can I find out more information about what’s happening?

A: You can track the status of the email service at http://status.opensrs.com/ where we post all status updates for OpenSRS services. You might also want to ensure that your contact information in the Mail Administration Center (MAC) is complete and accurate as status updates are automatically sent to the Emergency and Maintenance contacts listed in the MAC.

We’ll also post to the Reseller Blog when we need to provide more information than the status website allows.

Q: What about data? Is user data safe and what’s happening to incoming mail?

A: All user data (mail, contacts, filters, etc.) is safe. Incoming mail is being queued locally, on our system, and will be delivered as soon as possible.

Q: Is this related to the Cluster ‘A’ service interruption you experienced in August?

A: No. While it IS on the same Cluster, we have no indication that this is in any way related.

Improving the stats in OpenSRS Email Service

The underlying philosophy behind the statistics functions inside the OpenSRS Email Service is to provide the most flexibility possible. Rather than providing a small subset of stats to all users inside the Mail Administration Center (MAC), we felt our users would be better served by a full, extensive stats package that lets you take the raw stats data and manipulate or analyze it in whatever ways you like.

You’re free to take all that data, download it as .csv files, import it into your own databases or spreadsheets (Microsoft Excel can read these files, for instance) and mash it up in whatever way you wish. If you want to know what percentage of users prefer webmail, it’s in there. If you want to spot trends in how often users are logging in to check for mail, it’s in there.
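As a rough sketch of that kind of analysis, the following Python reads a stats export and works out what share of logins came through each access method, which is one way to approximate how many users prefer webmail. The column names (`user`, `login_type`, `logins`) and the sample rows are assumptions for illustration only; the actual columns in the downloaded .csv files may differ, so adjust them to match your export.

```python
import csv
import io
from collections import Counter


def login_share_by_type(csv_file):
    """Return {login_type: percentage of total logins} from a stats CSV.

    Assumes (hypothetically) one row per user and login type, with
    'login_type' and 'logins' columns.
    """
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[row["login_type"].strip().lower()] += int(row["logins"])
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {kind: 100.0 * count / grand_total for kind, count in totals.items()}


# Made-up sample data standing in for a downloaded export.
sample = io.StringIO(
    "user,login_type,logins\n"
    "alice@example.com,webmail,30\n"
    "alice@example.com,imap,10\n"
    "bob@example.com,pop,40\n"
    "bob@example.com,webmail,20\n"
)
shares = login_share_by_type(sample)
print(f"webmail: {shares['webmail']:.1f}% of logins")
```

To run it against a real export, pass an open file instead of the in-memory sample, for example `open("stats.csv", newline="")` (the filename is a placeholder).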

You can even have those stats sent to you daily so you can create a system to automatically update and track usage across your email customer/user base.

We think that the stats engine inside OpenSRS Email satisfies the requirements of a wide range of potential users and uses, from support staff who need an easy way to see a particular user’s account activity, to the executive at a large ISP who wants to know how customers are accessing email.

But as I said, we wanted to provide flexibility for all customers and while that type of stats collection and distribution might work well for some of our larger customers, we also realize that some of our customers will want an easy way to look at certain stats without having to do any complex data mining. So we built in some simple, automatic graphing features for some of the most widely used of the stats we collect.

Inside the MAC, depending on your admin level, you can view stats for things like total logins, broken down across the various login types (POP, IMAP and webmail). You can view a range of stats for your entire company, or for a single domain, and even for a single user. You can see visually if a user hasn’t accessed their email in a long time or if there was a sudden spike in outgoing mail from a specific account or domain indicating a possible abuse situation. You can monitor things like quota usage and get a better understanding of how your customers use the email service.
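If you work from the raw data instead of the graphs, a similar spike check can be approximated in a few lines. The sketch below flags days where an account’s outgoing message count is well above the average of the preceding days. The function name, the 7-day window, the 3x factor and the minimum count are all illustrative assumptions, and this is not how the MAC graphs are produced.

```python
def find_spikes(daily_counts, window=7, factor=3.0, min_count=50):
    """Return indexes of days whose count exceeds `factor` times the
    average of the previous `window` days.

    daily_counts: outgoing message counts, oldest first.
    min_count: ignore small absolute numbers so quiet accounts don't trip.
    """
    spikes = []
    for i in range(window, len(daily_counts)):
        baseline = sum(daily_counts[i - window:i]) / window
        today = daily_counts[i]
        if today >= min_count and today > factor * max(baseline, 1.0):
            spikes.append(i)
    return spikes


# Made-up example: a quiet account that suddenly sends 900 messages.
counts = [12, 9, 15, 11, 10, 14, 13, 12, 900, 11]
print(find_spikes(counts))
```

A flagged day is only a prompt to look closer; a legitimate newsletter mailing would trip the same check, so the threshold is worth tuning to your own customer base.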

To better explain the stats features within the MAC, we put together a quick screencast that shows some of the functionality that’s available to you.

Closing notes on the Cluster A Email Service Interruption

First off, I’d like to apologize again for the disruption caused by last week’s problems on Cluster A of our email service. Email is a mission-critical service. We know how awful it is to have your personal and business communications disrupted. We are deeply sorry for any problems that resulted from this interruption.

After around-the-clock work last week to restore full service to our impacted resellers, and their end-users on Cluster A, our team took some time today to review what happened with last week’s service degradation.

Last Tuesday, a shelf controller hardware failure meant that 14 disks required a rebuild. This resulted in the degradation of multiple storage volumes. This failure affected 50% of customer mailboxes on OpenSRS Email Service – Cluster A. The affected devices had to be restored one after another, so the restoration took a number of days to complete. To resolve the issue, we replaced the shelf controller and rebuilt the 14 disks. During the service interruption, we made temporary mail stores available to customers. On Friday, once restoration was complete, all mail content (messages and folders) was merged from the temporary volumes into each user’s original mailbox.

As with any service problem of this magnitude, it is essential we take steps to make sure it does not happen again. Before the end of the month we are making storage architecture changes to Cluster A to ensure that we eliminate the chance that a similar event with storage will occur in the future.

Again, let me say that we are incredibly sorry about the impact this undoubtedly had on you and many of your customers.

RESTORED: Email Outage Update: August 15, 4:00 P.M. ET

Update: August 15, 4:00 P.M. ET:

On August 15, 2008, at 4:00 P.M. ET, full service to OpenSRS – Cluster A was restored. No mail or data was lost. We truly apologize for this service disruption.

Update: August 15, 09:57 A.M.:

Regular updates are available at http://status.opensrs.com/

As of this morning, we can report that about 80% of users on Cluster ‘A’ have full email service. The remaining 20% of customers on Cluster ‘A’ have limited access, via webmail only. Additionally, we have updated our original estimate for restoration of service to all customers.

We now estimate that full service will be restored by Friday, August 15, 4:00PM ET (20:00 UTC).

Email Outage Update: August 14, 9:30 A.M. ET

We continue to provide regular updates on the progress of restoration at http://status.opensrs.com/

We’re pleased to report that more than two-thirds of users on Cluster ‘A’ now have full email service, meaning they can log in via POP, IMAP or webmail, as usual, to send and receive mail.

The remaining users that are still affected by this outage (about 1/3 of users on Cluster A) have been provided limited access, by webmail only, to their email. They can log into their webmail normally and send and receive email as usual, and view any messages received since the outage began.

We’re continuing to monitor the rebuild of the additional two storage volumes. Our current estimate is that full service to the remaining affected customers will be restored by Saturday, August 16, at 4:00 P.M. ET (20:00 GMT/UTC).

Once again, we deeply apologize to you and your customers who have been affected by this service interruption, and we appreciate your patience as we work to restore full service to all users.

Email Outage Update: 8:00 PM ET

For OpenSRS Email Service Customers on Cluster A, we have the following update:

At this time 50% of mailboxes on Cluster A cannot access email. Current status has not changed significantly since our last update but we do have additional information to share on the scope and timeline for restoration of services.

Due in large part to the nature and severity of the hardware failure we suffered, we now estimate that rebuilding the impacted email databases will take longer than we had anticipated.

Our current estimate is that full service to all customers will be restored on Saturday, August 16, 20:00 UTC.

We are naturally working on reducing the time these customers are offline in any way possible – this is our best estimate on the current impact. If at all possible we will restore access to users earlier than that, so you may find that some portion of your customers will regain normal access to their email earlier than our estimated time for full restoration of services.

We understand that email is critical to you and many of your customers and we are in the process of rolling out limited access to mailboxes via webmail. Over the next few hours your customers who log in via webmail will find that they can send and receive new messages. They will however not see old mail that has been stored on our servers. This mail is stored on the volumes that are currently being rebuilt and it is therefore not possible for us to give them access to it.

We have included a System Status message in each impacted user’s webmail inbox notifying them that they have access to new mail but not to email archived on the system. The text of the message can be found in the message posted to the System Status page.

We will continue to provide updates approximately every two hours until all mailboxes are fully restored.

Once again, we deeply apologize to you and your customers who have been impacted by this service interruption.

Further Updates On Cluster A Email Service Issues

I wanted to provide Resellers with a more complete update on the status of the email service as work to restore full service continues.

OpenSRS phone support is seeing very high call volumes at this time. As much information as possible is being provided via System Status at http://status.opensrs.com/ and via the Reseller Blog. We’re updating the status page as information becomes available.

We’ll update and add to this blog post as things change and as further information becomes available.

In the meantime, we’ve attempted to answer a few of the questions that you may have:

Q: How can I tell whether I’m affected?

A: Only customers on Cluster ‘A’ of our email service are affected. If you log in to the Mail Administration Center (MAC) at http://admin.hostedemail.com/ or http://admin.a.hostedemail.com/ you are on Cluster ‘A’. If you log in at http://admin.b.hostedemail.com/ then you are on Cluster ‘B’ and you are not affected.

Q: How can I tell how many, or which of my customers are specifically affected?

A: Unfortunately, there is no way to tell exactly how many, or which of your customers are affected. We do know that as of the latest estimates, approximately 50% of the total users on Cluster ‘A’ are affected. The service is designed to spread mailboxes out over multiple volumes as opposed to bunching a single customer’s users together as a group.

If a customer can log in to email, they are not affected, and email will be fully functional for them.

Q: What will end users see?

A: Users logging into webmail will see a “Service Unavailable” error message. Users accessing mail via POP or IMAP will experience a denied login depending on the specific email client being used.

Q: When will this be fixed?

A: There’s no firm restore time at this point. Our Network Operations Center is working on the issue and the goal is to restore service for all users as soon as possible.

Q: Where can I find out more information about what’s happening?

A: You can track the status of the email service at http://status.opensrs.com/ where we post all status updates for OpenSRS services. You might also want to ensure that your contact information in the Mail Administration Center (MAC) is complete and accurate as status updates are automatically sent to the Emergency and Maintenance contacts listed in the MAC.

We’ll also post to the Reseller Blog when we need to provide more information than the status website allows.

Q: What about data? Is user data safe and what’s happening to incoming mail?

A: All user data (mail, contacts, filters, etc.) is safe. Incoming mail is being queued locally, on our system. In the case of mail for the 50% of users that are not affected, mail is being received and delivered as normal.

Update On Cluster A Email Service Issues

I’d like to provide you with an update on an email outage we are experiencing on Cluster A of our Email Service.  Resellers on Cluster B are not impacted by this.

First off, let me say that we are incredibly sorry about the impact this is undoubtedly having on our resellers and many of their customers. This shouldn’t have happened, and while our focus right now is on letting the team get the system back online, we will be looking very closely at how this happened to ensure it doesn’t happen again.

Here’s what we know at this time:

About eight hours ago we suffered a major hardware failure in a NetApp file storage system that is an integral part of our Email Service. The system is of course built with all kinds of redundancy but this hardware failure came at the worst possible time.  A hardware issue affecting redundancy had been found earlier and was corrected in Cluster B last week, with a correction for Cluster A awaiting the arrival of the needed hardware.  However, today a separate hardware issue arose when a disk shelf controller in one of our NetApps failed.  It is the confluence of these two scenarios that puts us in this situation.

We have now replaced the faulty hardware and we are close to having the NetApp back online and in service.

Unfortunately, about half of the mailboxes on Cluster A will now be largely offline while we rebuild the parts of the system that were impacted.  Without drilling down too much into the overall system architecture there are a number of “volumes” within each cluster that store and manage mailboxes.  Three of those volumes now need to be rebuilt, in sequence.  As a volume is fully restored we will bring it back online.

The entire Operations Team and our Network Operations Center Team are working with developers throughout the evening and will continue to work on this until it is fully resolved. Our CEO Elliot and I are in regular contact with all these teams as well as our communications and support teams but, to be honest, we’re trying to stay out of their hair and let them do their jobs.

The big unknown right now is how long it will take to restore each one of these volumes within Cluster A.  Until we let the rebuilding process continue for 6 to 8 hours we are not confident that any estimates we give at this point will have any meaning.

Right now our focus (in order) is a) rebuilding the volumes and putting them online as soon as we can, b) looking for alternative workarounds that may reduce either the scope or length of the outage, and c) creating and refining our estimates on when full service will be restored.

It’s probably worth noting that the distribution of mailboxes within the volumes is largely random and therefore most resellers on Cluster A will find that some but not all of their customers are impacted.

Once we have the service fully restored we will start the process of investigating root causes and determining how we can make sure this doesn’t happen again.

Once again, on behalf of everyone I want to state how deeply sorry we are that this has happened and we truly appreciate your patience and understanding as we work to resolve the issue at hand.

The best place to get updates on our progress is the OpenSRS Status page.  I’ll post further updates here if we need to share more than we normally do in a Status Update.

Ken Schafer
VP, Product Management & Marketing

ISP-Planet talks Email with Rohan Jayasekera

Recently, ISP-Planet’s Alex Goldman had a chat with Rohan Jayasekera, Director, Tucows Email Service.

In the article, posted today, Rohan explains our philosophy in creating and running the Tucows Email Service. He also points to some of the features and benefits that make our hosted email solution a sensible choice for service providers.

If you’re in Chicago this week for ISPCON Spring 2008, drop by booth 114 for a chat and to see the Tucows Email Service first hand. And you can hear more of what Rohan and other industry experts think about the future of email on Wednesday, May 14th, at 8:45 AM in Room 9. Rohan will be participating in a panel discussion, “Who Should Be Running Your Email?”

Even if you’re not going to be in Chicago, you can still kick the tires – you can get a Tucows Email Service demo account here.

The latest Tucows Email Service updates are now live

Last week we told you about a number of enhancements we were making to the Tucows Email Service. Many of those efforts focused on making the webmail interface more brandable. Those updates, including the new branding features, are now available in both the Live and Test email environments.

Existing brands aren’t affected by these changes.

There’s more information in the updated documentation. A release note listing the changes made in this release has also been published.
