Note: The following is a recounting of an incident that happened while I was working as an SRE at Twitter. I do not have access to the original outage report, and am simply re-telling a story. While I may not fully remember all the specifics, the general gist of what is told here is as I remember it. Any mistakes, omissions, etc., are wholly mine.
Info: Unfortunately, my graph drawing extension is not working yet. Once it is, I’ll have to come back and instrument this blog post with some graphs.
Background
Twitter (the organization that existed pre-X) built its own internal Kubernetes-like cloud infrastructure. It was built on a pair of systems called Mesos and Aurora. Broadly speaking, Mesos ran as a client on every machine, with a central piece running on an odd number of machines. Mesos kept track of resources and tasks running across all of the containers within a cluster of machines. Aurora was the scheduler for jobs that were ultimately run on Mesos. The Aurora schedulers were also a set of leader-elected jobs, although, if memory serves, Aurora used Zookeeper to do its leader election.
Warning: At the time we were using Mesos and Aurora, it was common to use the unfortunate phrasing for indicating the “masters” and “slaves” of this system. Personal growth has made me aware that such phrasing is not only undesirable, but also hugely hurtful and counterproductive. Any remaining undesirable phrasing is mine and in error.
The original Mesos system used the terms “master” and “slave” extensively in the documentation. I will use the terms “leader” (or “leaders”) and “follower” (or “followers”) instead. The mapping to the original documentation is roughly the following:
- leader –> a single elected “master”
- leaders –> a collection of nodes that could become “master”
- follower –> a single “slave” node
- followers –> a set of “slave” nodes
The Mesos leaders were an odd-numbered set of machines running the Mesos leader code, participating in a Paxos quorum to elect a leader that would receive updates from all followers, communicate with the scheduler, and then send commands to the followers to run tasks within the resources (memory, etc.) that the followers had available. In this way we coordinated many thousands of machines and millions of tasks within a single cluster. Twitter had multiple of these clusters spread around the internet. The incident I’m going to describe was called (to the best of my recollection) IM-911.
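To make that leader/follower arrangement a bit more concrete, here is a minimal sketch of Zookeeper-based leader election, the same general pattern the Aurora schedulers used. The kazoo library, the znode path, and the callback below are my own stand-ins for illustration, not Twitter’s actual code:

```python
# Minimal sketch of Zookeeper-based leader election (illustrative only; the
# kazoo library, znode path, and callback are my stand-ins, not Twitter code).
import socket
import time

from kazoo.client import KazooClient


def serve_as_leader():
    # Only the elected leader runs this: receive updates from followers,
    # talk to the scheduler, and hand out work. Returning from this function
    # (or the process dying) releases leadership to the other candidates.
    while True:
        time.sleep(1)  # placeholder for the real leader loop


client = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
client.start()

# Each candidate creates an ephemeral, sequential znode under this path; the
# candidate holding the lowest sequence number becomes leader. kazoo wraps
# that recipe behind Election.
election = client.Election("/illustrative/scheduler/election", socket.gethostname())

# Blocks until this process wins the election, then invokes the callback. If
# the leader process disappears, its ephemeral znode is removed and the
# remaining candidates elect a new leader automatically.
election.run(serve_as_leader)
```

The important property for this story is the failure mode: when the elected leader goes quiet, the rest of the quorum simply elects a new one, which is exactly what we relied on later.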
The Leadup
While we were building the Mesos and Aurora systems, we would regularly run into various issues. Some of these were easy to resolve; others took the coordinated effort of multiple teams to figure out what exactly had occurred. This particular incident was one of the latter. Unfortunately, time has eroded my memory of what exactly was part of the lead-up to the incident, but suffice it to say it was likely something along the lines of:
- network driver issue that resulted in some sort of kernel or Mesos issue
- kernel interrupt issue that resulted in some sort of performance issue
- kernel network code issue due to our use of cgroups in interesting ways
- something else that skips my very fallible memory
Basically, I vaguely remember that we tracked our issue down to something along the lines of:
- only occurs on the currently elected Mesos leader
- might be something systemic, not isolated to the Mesos binary
- did not seem to be machine specific
- might be machine type specific
- might be firmware or BIOS specific
At this point, we were pretty stuck. Forcing a leadership change would resolve the issue for a short, indeterminate amount of time, after which we would have another incident where the performance of the Mesos leader was not able to cope with the load from the followers. We were running out of options, and reached out to our very capable kernel team to have them lend a hand and diagnose what was going on.
The Setup
As part of our investigations, the kernel team asked us (the Mesos SRE team) if we would be ok with them taking a kernel core and stack dump of all threads on the current Mesos leader machine. We were not too worried about this, as we fully expected one of the other Mesos leaders to use Paxos to establish a new leader and our cluster of followers to continue working just fine. After all, we had failed leaders many times in the past without any issues. So having our elected Mesos leader hiccup for (what we thought would be) a short time should not cause any issues.
As such, the kernel team obtained a serial console on the machine and proceeded to issue the required keystrokes to have the kernel dump all the thread stacks, all the memory regions, etc. Surprisingly, this took a long time; a lot longer than we expected. However, during the 15+ minutes that this took, we verified that the remaining running Mesos leaders noticed that the old leader was no longer healthy, and they negotiated a new leader. Things were being mitigated and handled as expected. Unbeknownst to us, we were in an “interesting timeline” that was about to present us with many valuable lessons…
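For the curious: I no longer remember the exact break sequence the kernel team typed over the serial console, but the Linux “magic SysRq” interface exposed through /proc is functionally equivalent. A rough sketch of the kind of dumps involved (this assumes root and that sysrq is enabled; it is my approximation, not the kernel team’s actual procedure):

```python
# Rough illustration of kernel thread-stack and memory dumps via the Linux
# "magic SysRq" /proc interface. This is my approximation, not the kernel
# team's actual serial-console procedure. Requires root and sysrq enabled.
from pathlib import Path

SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def sysrq(command: str) -> None:
    """Send a single SysRq command character to the kernel."""
    SYSRQ_TRIGGER.write_text(command)


sysrq("t")  # dump the stack trace of every task to the kernel log
sysrq("m")  # dump current memory usage information to the kernel log
```

On a machine hosting many thousands of threads, dumping every stack over a serial console is exactly the kind of thing that takes 15+ minutes.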
The Trigger
Once the kernel team had the information that they required to do a full analysis, we debated briefly about what we should do: reboot the machine that we had collected the information from, or simply hit “continue” and let it go about its way. We opted for what we believed was the least intrusive option: we told the kernel to “continue”. This was a mistake… but at the same time, we learned a lot about our system in the next 9+ hours that we would not have learned otherwise.
Soon after we hit “continue”, we noticed a couple of things happening that were unexpected. Our leading scheduler was very busy, many of our tasks were being rescheduled, and many of our followers were killing off tasks that they had been happily running just moments before. We also noticed that a substantial number of our Mesos followers (in fact, what looked like all of them) were going through a reconnect with the Mesos leader, which did not make much sense. None of our monitoring indicated any sort of global network connectivity issue.
The Mitigation
Even as this single cluster was rather unhappy, we were attempting to diagnose exactly what was going on. The scheduler was queueing up tasks to get scheduled, and our Mesos leader was busy communicating new assignments to followers and attempting to keep up with any changes in followers and the resources they had available. For the most part, we were keeping a sinking ship afloat by pumping water out of the bilge as fast as we could. Many of the jobs we were running for Twitter engineers had tasks killed, but due to rack and machine diversity requirements within our scheduling algorithms, most of our top-line metrics were kept within SLO. We ultimately did pull the trigger on a full cluster fail-over, which was started about 15 minutes too late and resulted in a 2 minute fail-whale.
Warning: Not complete, to be continued…