
Posts by Aj Mirani

SuperAdmin (SysAdmin) Appreciation Day


SysAdmins can be thought of as perhaps the medical doctors of the technical world – heroes in their own right. They are the silent protectors, the caped defenders of our electronic cities. Armed with nothing but their sharp wit and lightning-fast fingers, they bring to life an otherwise lifeless world. They rush to the aid of ailing servers and networks. They build enormous arrays of redundant storage, take the pulse of an ocean of applications and keep a watchful eye on row upon row of blinking servers. Who are these masked crusaders and what do they seek, you ask? And more importantly, do they have any super-powers? Fear not, they do not swing from buildings or trees, nor do they fly through the sky or have super-human strength. No, they view their world through an electronic window, taking comfort only in maintaining order for all within their domain – ever vigilant.

At OpenSRS, we have a small army of Unix-variety SysAdmin heroes. They work tirelessly day and night to ensure we maintain a detailed analysis of our diverse environments. Behind the scenes these heroes spend endless hours planning and executing preventative measures against our common foes: downtime and degradation. They work with developers and engineers to bring new products and services to life in the most resilient and scalable configuration available. Poring over millions upon millions of lines of logs each day, the tiny footprints in the sand left behind by each application, they ferret out problems before they surface. They build walls of fire to protect us from the ever-evolving number of viruses and exploits lurking in the ether. And they bring balance to the chaotic world of packets, TCP and UDP alike.

Their work, frequently so successful and seamlessly integrated, by design goes unnoticed. Their super-hero preemptive work is critical for the achievement of nerd nirvana (or nerdvana, as we like to call it here). In general though, we tend to recognize our SysAdmin heroes most when they bring incidents to resolution. Well, today is their day. So to all of the SysAdmins here at OpenSRS and out there (you know who you are): we salute you!

Some of Our SysAdmins demonstrating Whois Privacy (they are camera-shy)

Photo: Paul Tichonczuk, Senior Web Application Developer

Geeks invade LISA’08

LISA 2008 Conference

Last week I was thrilled to attend the Large Installation System Administration (LISA) ’08 Conference put on by USENIX in San Diego, California. I was joined by two of our system administrators, Ted and Shaw. This was my first year attending LISA, and based on the wide range of workshop topics, I knew we were going to be treated to a very exciting week.

The atmosphere was electric as we met some of the technical world’s top celebrities and learned about the latest advancements in the industry. In some cases, LISA was the official unveiling of state-of-the-art new tech tools. A perfect example of this was a paper presented by a group from the IBM Almaden Research Center on a new Large-Window Compression tool geared toward the efficient storage of Virtual Machines (VM).

IZO Compression

This tool, called IZO, will work alongside traditional compression tools we already use such as gzip and bzip2. IZO efficiently compresses VMs by chunking the data. It then hashes each chunk, indexes the hashes and removes duplicate chunks, a process akin to data deduplication. Once this process is complete, the data footprint is significantly smaller and can be passed through gzip, for example, which does the traditional Small-Window Compression. The result is a dramatically smaller final footprint for the original data. As a bonus, this method is often faster than using gzip alone. I won’t go into too much detail in this post, but you can read more about IZO and how it works, as well as comparisons to existing Large-Window Compression tools such as rzip or lrzip, in the full paper, posted here: IZO: Applications of Large-Window Compression to Virtual Machine Management.

For the simplified summary, you can think about this compression method in terms of packing a suitcase: if you were, for example, to take all your clothes and simply stuff them into your suitcase, you would pack in as much as you could, then maybe sit on your suitcase to try to compress it enough to get the zipper closed. However, if you first folded your clothes neatly, then placed them in your suitcase and only then closed the lid, sat on it to compress and zipped it up, you would be able to get a lot more clothes packed into the same space. Using IZO on your files is essentially like folding them neatly before trying to pack them into a gzip archive (only much faster and cooler than folding shirts).
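To make the "fold first, then pack" idea concrete, here is a minimal Python sketch of deduplicating chunks before handing the data to gzip. It is not IZO itself: the fixed 4096-byte chunk size, the SHA-256 digest and the function names are all assumptions made for illustration, and the paper describes the real design.

```python
import gzip
import hashlib


def dedup_chunks(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep one copy of each unique chunk.

    Returns (unique_chunks, order); order holds, for every original chunk
    position, the index of that chunk inside unique_chunks.
    """
    index = {}            # chunk digest -> position in unique_chunks
    unique_chunks = []
    order = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        if digest not in index:
            index[digest] = len(unique_chunks)
            unique_chunks.append(chunk)
        order.append(index[digest])
    return unique_chunks, order


def restore(unique_chunks, order) -> bytes:
    """Rebuild the original byte stream from the deduplicated form."""
    return b"".join(unique_chunks[i] for i in order)


def compressed_sizes(data: bytes, chunk_size: int = 4096):
    """Return (gzip_only, dedup_then_gzip) sizes in bytes for comparison."""
    unique_chunks, order = dedup_chunks(data, chunk_size)
    # The chunk order must be stored too, so it is counted in the size.
    order_bytes = b"".join(i.to_bytes(4, "big") for i in order)
    payload = order_bytes + b"".join(unique_chunks)
    return len(gzip.compress(data)), len(gzip.compress(payload))
```

The gain shows up when identical regions sit far apart in the input, as they often do across similar VM images. gzip only looks back over a small window, so it cannot see those repeats on its own, while the chunk index can.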

High Performance Computing

High Performance Computing (HPC) was another topic featured at LISA. The advances and challenges in HPC today are a good indication of what is on the industry’s horizon. I start feeling nostalgic thinking back to 1999/2000 when I was helping build out 100+ node Beowulf clusters as a more economical alternative to the Silicon Graphics Origin 3000 series (which ran on the MIPS processors and distributed shared memory architecture). This was a time before multi-core technology even existed. Today, these systems are dwarfed by several orders of magnitude.

Argonne National Laboratory is currently running one of the largest HPC clusters in the world. Their jaw-dropping 40,960 node cluster, housed in only 40 racks, is based on IBM’s Blue Gene/P system. They presented a paper outlining the experiences and challenges of running such a massive system: Petascale System Management Experiences. The days of CPU bottlenecks are over and the era of true cloud computing is fast approaching. At OpenSRS, we are already seeing and assessing these trends. A number of components in our architecture use the concepts of clustered computing and can be organically expanded and contracted to fit our needs.

The idea of virtualized systems has been around for a while, but has always been tightly tied to the physical platform. Today we have already started to divorce the two by deploying virtual machines essentially at will. The ease of deployment solves many problems and increases our overall flexibility. We have seen some of the same problems with nodes becoming I/O bound in some cases while competing for resources. By keeping a vigilant eye on dimensioning we have thankfully been able to keep these sorts of caveats in check.

Log, Trend and Relation Analysis

There are some other challenges which are often overshadowed by the focus on performance and availability. When dealing with the scale of the systems described above, the volume of logs generated by the system can be astounding. For example, a single incident on this system can generate up to 160,000 messages. The ability to efficiently parse and run diagnoses on this volume of data is essential. Currently, on the OpenSRS Email platform we generate over 100 gigabytes of logs daily. All of this is with debug mode/verbose logging turned off. This is just the tip of the iceberg if you include logs produced on the OpenSRS Domains platform and pure system-level logging. These volumes will balloon as the platform grows and we process more transactions. Tools to address these challenges are becoming readily available with the emergence of cloud computing.

Splunk is an example of a piece of commercial software designed to help analyze these large volumes of log information. If you haven’t seen it before, the free version will allow you to parse 500 MB of raw log data daily and is available for download from their site. Also interesting is a piece of software developed by the University of Notre Dame called ENAVis (Enterprise Network Activities Visualization). ENAVis offers a unique visualized view of a platform. It parses system statistics at regular intervals to create links between hosts, users and processes, providing a single picture of the entire platform. The interface allows one to drill down and look at a vast number of metrics. To get more details on this project read their paper here.

My personal focus at LISA was around virtualization, massive storage and compute clusters, a major focus for many organizations this year. There was no shortage of people willing to share their experiences on these subjects. I’ve touched on some of the highlights above, but it’s nearly impossible to capture the atmosphere this conference provided. The whole point of any professional conference is to help people make better decisions. Being able to have candid conversations and share experiences is what makes it all worthwhile. It’s clear that the challenges are the same for everyone operating at massive scale. The innovative solutions being developed are truly exciting. We will continue to analyze these developments and see how they fit with our needs to serve you.

Technical Debrief on October Cluster A Email Service Issue

Aj Mirani is Manager of Technical Operations for OpenSRS and is responsible for coordinating the strategic aspects of technical issues, long-term capacity planning and resource allocation for technical projects. He leads the team of Unix administrators and network administrators directly responsible for running the servers, network and storage devices across all platforms. Below he’s answered questions raised by resellers during last week’s service disruption.

Definitions

Linux: a free Unix-type operating system originally created by Linus Torvalds with the assistance of developers around the world. Developed under the GNU General Public License, the source code for Linux is freely available to everyone. OpenSRS uses the Linux operating system on all of our services.

Dovecot: an open source IMAP and POP3 server for Linux/Unix-like operating systems. OpenSRS uses the Dovecot mail server software across our Email platform.

NetApp: a storage and data management provider used by OpenSRS to manage our email service. This technology is used by many major service providers.

Bugs

Can you describe the Linux kernel bug that was found?

In 2.6.19, a patch was introduced to the Linux kernel by developer Neil Brown which added an “optimization” whereby if you have only TCP NFS mounts, the Linux kernel on the client will not listen for UDP NFS lock callback messages, as it was believed that an NFS server would always send lock messages over TCP if all mounts were also TCP.

However well-intentioned, this optimization does not hold true in all cases. For example, NetApp filers will still perform UDP NFS lock callback messages with their clients, even if they are using TCP NFS for all volume mounts.

This is not a bug on the part of NetApp, since there is no specification in the NFS RFCs that TCP mounts necessitate 100% TCP lock messages. Rather, it’s up to the NFS client to be available and listening to all NFS lock messages, whether TCP or UDP.

Moreover, NetApp explicitly chose to use UDP for some lock callback messages as the overhead on short messages of this type is significantly less in UDP than it is in TCP. In short, it scales better to do so for these messages.

The end result is that the NFS clients (IMAP servers in this case) were not able to perform full NFS locking all the time, and that resulted in clobbered writes, leading to corrupted indexes which necessitated full Dovecot index rebuilds.
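The effect of a lock that does not actually hold can be illustrated with a small, deterministic Python simulation. This is only a sketch of the general "lost update" pattern, not the NFS lock protocol or Dovecot's index format; the `IndexFile` class and the entry names are hypothetical.

```python
class IndexFile:
    """Stand-in for a shared index file on an NFS volume."""

    def __init__(self):
        self.entries = []
        self.locked_by = None


def unlocked_interleaving(index):
    """Two clients read, then both write: the second write clobbers the first."""
    copy_a = list(index.entries)     # client A reads the index
    copy_b = list(index.entries)     # client B reads before A has written
    copy_a.append("msg-from-A")
    copy_b.append("msg-from-B")
    index.entries = copy_a           # A writes its version
    index.entries = copy_b           # B overwrites it; A's entry is lost
    return index.entries


def locked_interleaving(index):
    """Each client holds the lock for its whole read-modify-write cycle."""
    for client, entry in (("A", "msg-from-A"), ("B", "msg-from-B")):
        assert index.locked_by is None   # the lock must be free to proceed
        index.locked_by = client
        copy = list(index.entries)
        copy.append(entry)
        index.entries = copy
        index.locked_by = None           # release before the next client runs
    return index.entries
```

In the unlocked case only one of the two updates survives, which is the kind of clobbered write that leaves an index inconsistent with the mailbox it describes.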

Can you describe the bug that was found in Dovecot?

There were in fact two bugs found in Dovecot, both of which ultimately led to the same end result, that being a bad index file.

First, if a user logs in to their mailbox and Dovecot detects a problem with their index, it will attempt to reindex their messages. Should that user’s connection be closed for any reason, Dovecot will detect this but still continue with the reindexing if it has not yet completed it. This is good because, in terms of resource utilization, the reindex is a very expensive operation and Dovecot doesn’t want to have to do this more than necessary. The bug, ironically, is that once Dovecot went through the effort of reading every message in the user’s mailbox and it came time to actually write the index, a subroutine would detect that the user was no longer connected and abort the final write operation.
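The shape of this first bug can be sketched in a few lines of Python. This is purely illustrative and is not Dovecot's code: the function name, the `client_connected` callback and the dictionary used as an index are all assumptions.

```python
def rebuild_index(messages, client_connected, abort_if_disconnected):
    """Rebuild an index and return what gets persisted, or None if nothing is.

    messages: dict mapping a message UID to its raw bytes.
    client_connected: callable that returns True while the session is open.
    abort_if_disconnected: True mimics the buggy behaviour described above.
    """
    # The expensive part: every message in the mailbox has to be read.
    index = {uid: len(body) for uid, body in messages.items()}

    if abort_if_disconnected and not client_connected():
        # Buggy path: the finished work is discarded at the last step,
        # so the next login triggers the same full rescan again.
        return None

    # Patched path: the completed index is always written out.
    return index
```

With the check removed from the final write, a dropped connection costs one rescan at most instead of one rescan per login.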

The second bug involved message virtual sizes. The virtual size is the size of the message as presented within the POP3 protocol, which can differ from the size of the message on disk. Under some conditions, Dovecot would do the work of rebuilding the table of virtual sizes, but would never write it out. The POP3 session would function normally from the user’s point of view, as the virtual size table was now in RAM, but the next time the user logged in, the virtual sizes would have to be reconstructed again, which caused a rescan of all mail in the mailbox.
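One common reason the two sizes differ is line endings: a message stored with bare LF line endings is sent over the wire with CRLF. Assuming that is the only difference, a minimal sketch of the calculation looks like this (the function name is hypothetical):

```python
def virtual_size(on_disk: bytes) -> int:
    """Size of a message once every line ending is counted as CRLF."""
    # Line feeds that are not already preceded by a carriage return
    # each grow by one byte when the message is sent.
    bare_lf = on_disk.count(b"\n") - on_disk.count(b"\r\n")
    return len(on_disk) + bare_lf
```

Note that the whole message has to be read to count its line endings. That is why the result is worth caching, and why failing to write the cache forces a rescan of every message at the next login.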

Linux Questions

What distro and version of Linux are you using?

We have standardized on the Debian ’stable’ release as a base with security patches. It has a proven track record with us and we like the rigorous prerequisites which need to be met before packages can be considered for this release. The following article describes the life-cycle of packages within Debian distributions prior to being considered for the ’stable’ release:

Debian Package Life Cycle (Wikipedia)

During our initial architectural design of the mail platform, there was a lot of heated debate as to our selection of standard OS and hardware. In the end, after looking at all the other options, including Solaris and FreeBSD, we decided on Linux as the best candidate. We still believe this was the best choice not only from a reliability standpoint, but also from a performance perspective.

It sounds like you upgraded your servers to a newer Linux kernel without sufficient testing prior to production deployment. Do you have load-testing capabilities to test changes prior to launch? If so, how/why did this get past that stage? (Greg Youngblood)

We are very cautious around any changes made to the production environment. Even small changes are made with a healthy amount of paranoia. Something as major as a kernel upgrade is not taken lightly and goes through a lot of scrutiny before reaching our production servers. Even when we consider an upgrade fit for production, we start with limited deployments on non-customer-impacting servers. As the upgrade proves itself reliable, we begin rolling it out to other environments.

To give you an idea of how long this can take, this particular kernel spent about 45 days running problem-free across approximately 400 servers in our production environment prior to its final rollout to the Dovecot servers. This was after it spent significant time first in our development and subsequently in our QA environments. We do use NFS extensively across all of these environments. We specifically chose this release because of the major enhancements for NFS that it included.

Unfortunately, even with rigorous testing, the reality is that in any environment, bugs do make it into production sometimes. We have many layers of protection in place to mitigate these. In this particular case, the combination of the kernel problem and the Dovecot bug quickly pushed our load to literally ten times what we normally experience. We do make sure our systems have plenty of spare capacity and we can handle a lot of extra load, but unfortunately not the ten times regular load that we experienced last week.

Have you considered switching from Linux to Solaris? Even though I’ve been primarily using Linux for 14+ years, I’ve seen places where Solaris’ NFSv4 works better than Linux’s. If you’re a heavy NFS shop, perhaps you should consider it, or at least evaluate it and see how it works out. (Greg Youngblood)

This may be the case for NFSv4 at the moment. We don’t believe that NFSv4 is proven enough for us to use in production for now. From NetApp’s perspective, they strongly recommend using the ‘General Availability’ (GA) version of ONTAP (the platform OS) if we are going to implement NFSv4. We are not comfortable using anything other than the ‘General Deployment’ (GD) releases, however, as these are the production-proven versions. ONTAP GA is the equivalent of beta for NetApp and only has a very limited production deployment across their customer base. The GD releases are the most widely deployed, most stable versions. The short answer is that we won’t feel NFSv4 is suitable for production for at least another six to twelve months.

Dovecot Questions

Was the Dovecot bug that was fixed related to writing to the index file after the timeout? Or is that issue still there? (Greg Youngblood)

We worked with Timo Sirainen, author and primary maintainer of Dovecot, to patch the Dovecot source. These patches are currently deployed across our production environment. Future versions of Dovecot will include these changes as standard.

If you implement bounces (preferably timeouts during delivery so messages stay in queues and don’t get lost assuming you can deliver them within 5-7 days), can you make it adjustable (if possible) with a setting in MAC or reseller interface? I can certainly understand why some would want bounce notifications, but probably not everyone will. I am on several lists that auto unsubscribe you on bounces, so for myself personally I prefer not to have them bounced. (Greg Youngblood)

Once we accept mail into our system, it will not bounce. Our architecture is tiered such that even if the underlying mail servers are unavailable, mail is queued internally until such time as it is delivered. Mail that is rejected bounces only at the perimeter.

What can be done by OpenSRS about secondary MX records? Whilst emails are being queued by the primary MX, I presume that nothing would go to a secondary MX.

Currently the only way mail would go to the secondary MX is if the primary MX does not accept mail. Even through the worst spam attacks we have had sufficient capacity to still accept valid mail, so we have yet to experience a situation in which the primary MX could not accept it.
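For readers less familiar with MX fallback, the sending side's logic can be sketched as follows. This is a simplified Python illustration of the usual preference ordering, not any particular mail server's implementation; the host names are hypothetical.

```python
def delivery_order(mx_records):
    """Sort (preference, host) pairs; the lowest preference is tried first."""
    return [host for _, host in sorted(mx_records)]


def pick_mx(mx_records, accepts_mail):
    """Return the first host, in preference order, that accepts the message.

    accepts_mail: callable taking a host name and returning True or False.
    """
    for host in delivery_order(mx_records):
        if accepts_mail(host):
            return host
    return None  # nothing accepted; the sender queues the message and retries


# Hypothetical records for illustration only.
records = [(20, "mx2.example.com"), (10, "mx1.example.com")]
```

As long as the primary keeps accepting connections, even while mail is queued behind it, `pick_mx` never reaches the secondary.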

System Architecture/Miscellaneous Questions

What type of monitoring system are you using? Why wasn’t this abnormal system behavior caught by your monitoring system in the early stages?

We are using a number of systems to trend and monitor. Primarily we use Nagios for monitoring with the help of numerous custom plugins we developed to provide a more robust testing suite. In Nagios alone we have in excess of 1500 monitoring points across the Email platform. Furthermore, we use Munin to trend additional metrics for long term planning and visibility into scalability trends.
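For those who have not written one, a custom Nagios check is just a program that prints one status line and exits with a conventional code: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The Python sketch below shows that convention with a simple threshold test. The metric name `pending_nfs_locks` and the thresholds are hypothetical and are not one of our actual plugins.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_threshold(value, warn, crit, label="pending_nfs_locks"):
    """Return (exit_code, status_line) following the Nagios plugin convention."""
    if value is None:
        return UNKNOWN, f"UNKNOWN - {label} could not be read"
    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    # Text after the pipe is performance data: label=value;warn;crit
    perfdata = f"{label}={value};{warn};{crit}"
    return code, f"{STATUS_NAMES[code]} - {label} is {value} | {perfdata}"
```

A wrapper script would collect the value, print the returned line and pass the code to `sys.exit()`. The performance data after the pipe is what allows the same check to feed trending graphs.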

On a micro-scale of, say, a few thousand users, these bugs were individually negligible (and extremely difficult to detect). The user experience would have been normal during this period. It took many days of digging by an assembled team of our top engineers and system administrators working in shifts 24/7, in conjunction with some of the best developers in the Open Source world and Tier 3 NetApp engineers, in order to nail this down.

The cumulative nature of the kernel locking problem combined with the Dovecot reindexing bug, both compounded by millions of users accessing the system, was the confluence of issues that caused the outage.

As a result of this incident we have added a number of additional monitoring points that will add insight and provide an early warning in the future. While it would be almost impossible to detect the specific failure, we’re in a very good position to detect and track things like locks and other NFS/NetApp interactions.

Why wasn’t Cluster B affected? (Greg Youngblood)

Cluster B, much like the half of Cluster A which remained online, was in fact affected. Both clusters use identical hardware and software. We were able to resolve the issue before we crossed the cascade threshold on those environments.

Why didn’t you move our mailboxes to Cluster B? (Edward Gore)

The short answer to this is that it would take well over a week to migrate every user from Cluster A to Cluster B. We are not able to disclose the exact amount of space being used by mailboxes, but it is measured in many terabytes. Cluster B resides in an entirely separate, geographically distinct data center, so we would be limited to Internet transfer speeds between these two data center providers in order to conduct such a migration.
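The arithmetic behind an estimate like this is straightforward. The Python sketch below is a back-of-the-envelope calculation only. The figures in the usage note are hypothetical round numbers, since the real volume and link speed are not disclosed here.

```python
def transfer_days(terabytes: float, megabits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move a volume of data over a link of a given speed.

    terabytes: data volume in decimal terabytes (10**12 bytes).
    megabits_per_second: nominal link speed.
    efficiency: fraction of the nominal speed actually achieved (0 to 1).
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6 * efficiency)
    return seconds / 86400
```

For example, `transfer_days(50, 1000)` gives a little over four and a half days for a hypothetical 50 TB over a fully saturated 1 Gbit/s link. That is before protocol overhead, verification, and the need to keep serving live mail during the copy.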

Saying ‘40% of clients of Cluster A’ is all well and good, but who is that? Can you not provide information on exactly WHICH mailboxes are affected – via an API would be useful, we could then work with our clients who ARE affected! (Paul O’Hanlon)

Our hashing algorithm that distributes users across the cluster is highly efficient. We have found that it very evenly distributes users on a large scale. It is possible for us to provide lists of users on affected mailstores, although this is unlikely to be implemented as an API feature. In the case of last week’s incident, producing those reports would have pulled our engineers away from resolving the root cause and restoring mailboxes.
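To show why a hash spreads users so evenly, and why the affected users are not a recognizable group such as one reseller or one domain, here is a generic Python sketch of hash-based placement. It is not our actual algorithm: the SHA-256 digest, the modulo step and the function names are assumptions for illustration.

```python
import hashlib
from collections import Counter


def mailstore_for(user: str, store_count: int) -> int:
    """Map a user name to a mailstore number in the range 0..store_count-1."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then reduce it.
    return int.from_bytes(digest[:8], "big") % store_count


def distribution(users, store_count):
    """Count how many of the given users land on each mailstore."""
    return Counter(mailstore_for(user, store_count) for user in users)
```

Because placement depends only on the hash of the name, users from any single domain end up scattered across all mailstores, and each mailstore holds a near-identical share of the total.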

After an outage, I think users are ready to understand they cannot access their old mail for a while (< 48 hours) but they expect to be able to send and receive new mail within 4 hours on a backup system where empty mailboxes would have already been created in advance. Have you considered this idea? (Augustin L)

We have considered this and are still investigating options. There are some considerations which need to be carefully weighed, especially surrounding clients such as Outlook/Thunderbird/MacMail coming out of message UID sync with the backend when all of a sudden historical mail is not available. One possibility is to offer a webmail-only emergency solution, though this is also not entirely ideal. We’ll keep you posted.

Is this reindexing you did the same or similar to what you did in August?

No, this was a file-level reindex of the user’s mailbox. In August we had multiple hardware failures that resulted in hard drives requiring a RAID-level rebuild from parity.

Shouldn’t the system reindex itself?

Yes, it should, and once we put the Dovecot software patches in place, the reindexing was successfully completed by the system itself. Cluster B and half of Cluster A were left to reindex naturally, but we chose to take part of Cluster A offline entirely in order to perform a global reindex because of the sheer number of mailboxes which were affected in that specific portion of the cluster. This reduced the time to restore service.

While mailboxes are evenly distributed, user access fluctuates and is not entirely predictable on a large scale. While access load does balance itself out in the grand scheme, there are times when some portions of the cluster are more used than others.

Why wouldn’t you have redundancies in place to avoid this?

There are many layers of redundancy currently in place on both the hardware and software fronts. Redundancy was not the solution in this particular case. We would have needed to have an order of magnitude more capacity/redundancy to be able to ‘weather’ this, and even then it likely would not have been enough. The best way to avoid this type of situation in the future is through early detection and I strongly believe we’re in a very good position to do that now.

Might splitting your architecture into smaller but more manageable systems be an option? (Augustin L)

Our current architecture evolved from that type of environment so we know firsthand the caveats of splitting the system into smaller sub-systems. Splitting up the Clusters further would, in fact, make them less manageable and less able to distribute load. During our time running the old platform we saw a lot of this and it translated to a much more unpredictable user experience. In short, smaller sub-systems do not scale well.

However, we have engineered in the benefits of smaller manageable systems, as evidenced by only a portion of the cluster being affected.