When Time Stopped

Note: The following is a recount of an incident that happened while I was working as an SRE at Twitter. I do not have access to the original outage report, and am simply re-telling a story. While I may not fully remember all the specifics, the general gist of what is told here is as I remember it. Any mistakes, omissions, etc, are wholy mine.

Info: Unfortunately, my graph drawing extension is not working yet. Once it does, I’ll have to come back and instrument this blog post with some graphs.

Background

Twitter (the organization that existed pre-X) built its own internal Kubernetes like cloud infrastructure. It was built on a couple of systems called Mesos, and Aurora. Broadly speaking the Mesos system was the system that existed on every system as a client, with a central piece running on an odd number of systems. The Mesos system kept track of resources and tasks running across all of the containers within a cluster of machines. The Aurora system was the scheduler for jobs that were ultimately run on the Mesos system. The Aurora system was also a set of leader elected jobs, although, if memory serves, the Aurora system used Zookeeper to do its leader election.

Warning: At the time we were using Mesos and Aurora, it was common to use the unfortunate phrasing for indicating the “masters” and “slaves” of this system. Personal growth has made me aware that such phrasing is not only not desired, but also hugely hurtful and counter productive. Any remaining undesirable phrasing is mine and in error.

The original Mesos system used the terms “master” and “slave” extensively in the documentation. I will use the terms “leader” (or “leaders”) and “follower” (or “followers”) instead. The mapping to the original documentation is roughly the following:

leader –> a single elected “master”
leaders –> a collection of nodes that could become “master”
follower –> a single “slave” node
followers –> a set of “slave” nodes

The Mesos leaders were an odd number of machines running the Mesos leader code, participating in a Paxos quorum to elect a leader that would receive updates from all followers, communicate with the scheduler, and then send commands to the followers to run commands within resources (memory, etc) that the followers had. In this way we coordinated many thousands of machines and millions of tasks within a single cluster. Twitter had multiple of these clusters spread around the internet. The incident I’m going to describe was called (to the best of my recollection) IM-911.

The Leadup

While we were building the Mesos and Aurora systems, we would regularly run into various issues. Some of these were easy to resolve, others took the coordinated effort of multiple teams to figure out what exactly happened. This particular incident was one of the latter. Unfortunately, time has erroded my memory of what exactly was part of the lead up to the incident, but suffice it to say it was likely something along the lines of:

network driver issue that resulted in some sort of kernel or Mesos issue
kernel interrupt issue that resulted in some sort of perfomance issue
kernel network code issue due to our use of cgroups in interesting ways
something else that skips my very fallible memory

Basically, I vaguely remember that we tracked our issue down to something along the lines of:

only occurs on the currently elected Mesos leader
might be something systemic, not isolated to the Mesos binary
did not seem to be machine specific
might be machine type specific
might be firmware or BIOS specific

At this point, we were pretty stuck. Forcing a leadership change would resolve the issue for a short, indeterminate amount of time, after which we would have another incident where the performance of the Mesos leader was not able to cope with the load from the followers. At this point we were running out of options, and reached out to our very capable kernel team to have them give us a helping hand and diagnose what is going on.

The Setup

As part of our investigations, the kernel team asked us (the Mesos SRE team) if we would be ok with them taking a kernel core and stack dump of all threads on the current Mesos leader machine. We were not too worried about this, as we fully expected one of the other Mesos leaders to use Paxos to establish a new leader and for our cluster of followers to continue working just fine. After all, we’ve failed leaders many times in the past without any issues. So having our elected Mesos leader hiccup for (what we thought) a short time, should not cause any issues.

As such, the kernel team obtained a serial console on the machine and proceeded to issue the required keystrokes to have the kernel dump all the thread stacks, all the memory regions, etc. Surprisingly, this took a long time; a lot longer than we expected. However, during the 15+ minutes that this took, we verified that the remaining running Mesos leaders noticed that the old leader was not healthy anymore, and they negotiated a new leader. Things were being mitigated and handled as expected. Unbeknownst to us, we were entering an “interesting timeline” that was about to present us with many valuable lessons…

The Trigger

Once the kernel team had the information that they required to do a full analysis, we debated shortly about what we should do: reboot the machine that we collected the information from, or simply hit “continue” and let it go about its way. We opted for what we believed was the least intrusive option, we told the kernel to “continue”. This was a mistake… but at the same time, we learned a lot about our system in the next 9+ hours that we would not have learned otherwise.

Soon after we hit “continue”, we noticed that there were a couple of things happening that were unexpected. We noticed that our leading scheduler was very busy, that many of our tasks were being rescheduled, and that we had many of our Mesos followers killing off tasks that they were happily running just moments before. We also noticed that a substantial, what looked like all of our Mesos followers were going through a reconnect with the Mesos leader, which did not make much sense. None of our monitoring indicated any sort of global network connectivity issue.

The Mitigation

Even as a single cluster was rather unhappy, we were attempting to diagnose exactly what was going on. The scheduler was queueing up tasks to get scheduled and our Mesos leader was busy communicating new assignments to followers, and attempting to keep up with any changes in followers and the resources they had available. For the most part, we were maintaining a sinking ship by pumping water out of our bilge as fast as we could. Many of the jobs we were running for Twitter engineers had many tasks killed, but due to rack and machine diversity requirements within our scheduling algorithms, most of our top line metrics were kept within SLO. We ultimately did pull the trigger on a full cluster fail-over, which was started about 15 minutes too late and resulted in a 2 minute fail-whale. Mea culpa…

The Incident Review

After the dust had settled and we finally had time to digest the logs and all the data we gathered during the incident we realized that what had happened, was that the time had stood still for one part of the Mesos code, while another happily knew that time had continued on its merry way. Roughly, what had transpired was the following:

We froze the kernel (and userspace) of the Mesos leader via serial console.
The other leaders noticed that the current leader was absent, and elected a new leader.
The scheduler failed over to the new Mesos leader. The Mesos followers reported their status and resources to the remaining two Mesos leaders, with the one elected Mesos leader acting on both the Aurora scheduler and Mesos follower information (as designed).
Some time later, the old, frozen Mesos leader was resumed. The kernel resumed scheduling processes, including the Mesos process.
The thawed Mesos process started going through its queue of Mesos followers and checking when the last time it saw a status update from a Mesos follower was. Noticing that each host in the list of Mesos followers had not checked in with a health check for too long, it assumed that all these Mesos followers were unhealthy.
Believing it was still the leader, the thawed Mesos process started to mitigate by initiating “lost follower” handling for each and every single one of the Mesos followers (every single cluster host, thousands of hosts, hundreds of thousands of tasks).
The thawed Mesos process sent a message to the Aurora scheduler for each and every task that the particular task was “lost”. This caused the Aurora scheduler to reschedule every single task that was reported “lost”. It blindly believed that only a currently elected Mesos leader would send such messages.
As the Mesos followers started to check in with the Mesos leaders and report their health, the thawed Mesos leader noticed that hosts with “lost” tasks were reporting in. To make the world correct with its world view, it promptly told each of these Mesos followers to kill all their tasks.
The Aurora scheduler got very busy rescheduling tasks that were reported “lost” (followed by them being terminated), as well as all the churn within the cluster from Mesos followers being told to summarily terminate all existing tasks.

There are a number of lessons that we took away from this outage:

The “local” view of time is not really something you can rely on being consistent. Do not make decisions on time if you do not absolutely need to.
Leader election and the concept of leadership is at best a guess and at worst incorrect. It is an optimization. Code accordingly. Code defensively.
None of the operations (marking Mesos followers as “lost”, etc) had any kind of rate limiting. Always have a controlled rate of change for changes to your system(s).
Don’t be afraid to drain a full datacenter when things look like they’re going to go sidways. Keep that drain operation fresh and practiced.

Conclusion

After this outage the code was changed to make it more robust to Mesos leaders that were only “sick” as opposed to gone and restarted. A number of places were changed to add checks for current leadership (not perfect obviously). Other changes were made to rate limit the number of changes the code would act on, giving other parts of the system a fighting chance to keep up and mitigate.