Note: The following is a recounting of an incident that happened while I was working as an SRE at Twitter. I do not have access to the original outage report, and am simply retelling a story. While I may not fully remember all the specifics, the general gist of what is told here is as I remember it. Any mistakes, omissions, etc., are wholly mine.
The Background
Twitter (the organization that existed pre-X) built a social network that was the envy of many technology companies in Silicon Valley. While never as large as Facebook (now Meta), it was a force to be reckoned with in its day. While you could read public tweets without logging in, the best experience naturally came when you created an account and used Twitter while authenticated to the service. The incident I’m going to describe was called IM-xxx (if anyone remembers the actual IM number, please let me know).
As many software engineers know, handling anything related to time within software is notoriously difficult. Seriously, read that last link. This general issue, combined with the notoriously error-prone date and time libraries (modules, packages, etc.) in most language environments, makes hard-to-find errors a part of nearly all software out there. To get an appreciation for the complexities involved, have a look at the Java SE 8 DateTimeFormatter documentation. Pay particular attention to the meanings of the formatting characters `Y`, `y`, `m`, and `d`.
The Leadup
Twitter was available via a number of clients: SMS (yes, really), Web, native iOS, and native Android. The most prominent one in this story was the native Android client. The diversity of the Android ecosystem, with its variability in hardware, operating systems, software, cell providers, and geographic locations, made it the most variable environment we supported. One way this manifested itself was that there was no single reliable way to establish “the current time” on the client. Some Android devices were simply horrible at keeping accurate time, or even at tracking the progression of time.
So, on the Android clients we had code where we would use a combination of the device’s local time, as well as the date and time in the headers from the HTTPS API endpoint as it responded to any API calls from the device, to fabricate a local “current time”.
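As a rough illustration of that idea, here is a minimal Java sketch. It is not the actual client code: the class name `ClientClock`, the five-minute tolerance, and the "trust the server when the two disagree" rule are all assumptions made for the example. Only the header parsing via `DateTimeFormatter.RFC_1123_DATE_TIME` is standard library behavior.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch; names and thresholds are illustrative assumptions.
final class ClientClock {
    // Assumed tolerance; the real client's rule is not described in this post.
    static final Duration MAX_TRUSTED_DRIFT = Duration.ofMinutes(5);

    // Parse an HTTP Date header such as "Sun, 26 Jan 2025 20:57:16 GMT".
    static Instant parseDateHeader(String header) {
        return ZonedDateTime.parse(header, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
    }

    // Keep the device clock when it roughly agrees with the server,
    // otherwise fall back to the server's idea of "now".
    static Instant estimateNow(Instant deviceNow, Instant serverNow) {
        Duration drift = Duration.between(deviceNow, serverNow).abs();
        return drift.compareTo(MAX_TRUSTED_DRIFT) <= 0 ? deviceNow : serverNow;
    }

    public static void main(String[] args) {
        Instant server = parseDateHeader("Sun, 26 Jan 2025 20:57:16 GMT");
        Instant device = Instant.parse("2019-03-01T00:00:00Z"); // a badly drifted device clock
        System.out.println(estimateNow(device, server)); // prints the server's instant
    }
}
```

Note the trade-off in any scheme like this: once the client leans on the server's header, a wrong date from the server becomes the client's date too.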
Twitter, as a social network, existed in the global community. It crossed significant boundaries, from socio-economic, to geographic, to political. In many of the places Twitter was used, it was a trusted source of information beyond, and sometimes in spite of, the officially sanctioned sources the local government provided or allowed. As such, Twitter had to contend with any number of rather sophisticated threats against both itself and its users. A quick Google search will reveal any number of investigations and reports.
The Setup
As part of hardening the client to API endpoint connection, several options were implemented as a strategy for a “defense in depth” approach to securing our users. One of these was that a client would be logged out (access and/or session token revoked) if things looked “weird and/or out of place”. Any number of things could make this happen. In general, the user of the session would simply re-authenticate with the login server, using whatever authentication mechanism they had set up. This could include two-factor challenges and other similar mechanisms. One of the items that was checked, and would cause a revocation of the session ticket, was the client’s idea of the current date. This check was pretty loose. I seem to recall that it took a multi-day difference between the client and the server for the session to be revoked.
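A check of this kind can be sketched in a few lines of Java. This is an illustration only: the class `DateSkewCheck` and the three-day threshold are hypothetical, since all I recall is that the real threshold was "multiple days".

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of a loose date-skew check; not the real server code.
final class DateSkewCheck {
    // Assumed value; the post only recalls "a multiple day difference".
    static final long MAX_SKEW_DAYS = 3;

    // True when the client's date is too far from the server's, in either direction.
    static boolean shouldRevoke(LocalDate clientDate, LocalDate serverDate) {
        long skewDays = Math.abs(ChronoUnit.DAYS.between(serverDate, clientDate));
        return skewDays > MAX_SKEW_DAYS;
    }

    public static void main(String[] args) {
        LocalDate server = LocalDate.of(2014, 12, 29);
        System.out.println(shouldRevoke(LocalDate.of(2014, 12, 30), server)); // false
        System.out.println(shouldRevoke(LocalDate.of(2015, 12, 29), server)); // true
    }
}
```

A client that is a day off passes; a client that is a full year off does not.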
I’m a little fuzzy on the next piece; it’s been 10+ years! However, if I recall correctly, this particular Twitter API endpoint was terminated by a Java HTTP/S server. The API code generated the current date and time and sent it back in the response headers:
HTTP/1.1 200 OK
Date: Sun, 26 Jan 2025 20:57:16 GMT
The code that did this used industry standard methods and objects to format this response. In particular, it used the format string `E, d MMM YYYY HH:mm:ss z`.
The astute among you, or those who read the Java DateTimeFormatter docs in depth and remembered and understood all the format characters, will be rather triggered by this format string. TL;DR: `YYYY` (the week-based year) is very different from the correct `yyyy` (the calendar year). Note that for most of any given year, `YYYY` and `yyyy` produce the same value.
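The difference is easy to see with a small Java example. This is a sketch, not the original server code: the class `WeekYearDemo` is made up, and `Locale.UK` is chosen only because `Y` is locale-sensitive and that locale uses ISO-style weeks (Monday start).

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Illustrative only: compares the calendar year (y) with the week-based year (Y).
final class WeekYearDemo {
    static String format(LocalDate date, String pattern) {
        // 'Y' depends on the locale's week definition; Locale.UK follows ISO-style weeks.
        return date.format(DateTimeFormatter.ofPattern(pattern, Locale.UK));
    }

    public static void main(String[] args) {
        LocalDate midYear = LocalDate.of(2014, 6, 15);
        LocalDate yearEnd = LocalDate.of(2014, 12, 29);
        System.out.println(format(midYear, "yyyy-MM-dd")); // 2014-06-15
        System.out.println(format(midYear, "YYYY-MM-dd")); // 2014-06-15
        System.out.println(format(yearEnd, "yyyy-MM-dd")); // 2014-12-29
        System.out.println(format(yearEnd, "YYYY-MM-dd")); // 2015-12-29
    }
}
```

The two patterns agree in the middle of the year and disagree only in the few days where the week belongs to a different year than the date does.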
The Trigger
The end of 2014 was a time when many Twitter engineers took a well deserved rest. The preceding year had been one of immense growth (we went public in November of 2013) as well as many significant events. We had survived everything from the World Cup, to the Hong Kong protests, to the #BlackLivesMatter protests. Every one of these events culminated in our infrastructure becoming more resilient, as we had finally mastered the fail-whale. As such, many of our engineers took an extended vacation around the end of year holidays. Deployments were frozen, and only a few SRE, TCC, and other on-call folks were around.
As the calendar below shows, by taking off a few strategic days, you could end up getting 15 days off.
Everything was humming along smoothly until Monday, December 29th (GMT). At this point, our authentication service was unable to keep itself upright. The load on this service quickly increased to the point where its tasks could not keep running, or even stay up long enough to become healthy. At first we thought that we had some type of code issue, but that was quickly ruled out, as no rollouts had been happening. Next up was the thought that we were being hit by some type of denial of service attack. This took significantly longer to rule out.
Dumping the headers of the affected traffic showed that clients were sending the wrong date to the API endpoint: they were sending a date of `2015-12-29` while it was still the year 2014.
At this point I was pretty sure what was going on. A quick code search confirmed my suspicions. Someone had used the “ISO Year with week” version of the year formatting character to format the date. As one can see in the table below, the year reported by that format character (the ISO Year/Week column) was “incorrect” for the 29th, 30th, and 31st of December in 2014.
Actual Date | Numeric Date | ISO Year/Week |
---|---|---|
Sun, 21 Dec 2014 | 2014-12-21 | 2014/51 |
Mon, 22 Dec 2014 | 2014-12-22 | 2014/52 |
Tue, 23 Dec 2014 | 2014-12-23 | 2014/52 |
Wed, 24 Dec 2014 | 2014-12-24 | 2014/52 |
Thu, 25 Dec 2014 | 2014-12-25 | 2014/52 |
Fri, 26 Dec 2014 | 2014-12-26 | 2014/52 |
Sat, 27 Dec 2014 | 2014-12-27 | 2014/52 |
Sun, 28 Dec 2014 | 2014-12-28 | 2014/52 |
Mon, 29 Dec 2014 | 2014-12-29 | 2015/1 |
Tue, 30 Dec 2014 | 2014-12-30 | 2015/1 |
Wed, 31 Dec 2014 | 2014-12-31 | 2015/1 |
Thu, 01 Jan 2015 | 2015-01-01 | 2015/1 |
Fri, 02 Jan 2015 | 2015-01-02 | 2015/1 |
Sat, 03 Jan 2015 | 2015-01-03 | 2015/1 |
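The ISO Year/Week column in the table above can be reproduced with the standard `java.time.temporal.IsoFields` fields. The class `IsoWeekTable` is just a name for this sketch.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;

// Illustrative only: prints the calendar date next to its ISO week-based year and week.
final class IsoWeekTable {
    static String isoYearWeek(LocalDate date) {
        return date.get(IsoFields.WEEK_BASED_YEAR) + "/" + date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
    }

    public static void main(String[] args) {
        LocalDate end = LocalDate.of(2015, 1, 3);
        for (LocalDate d = LocalDate.of(2014, 12, 21); !d.isAfter(end); d = d.plusDays(1)) {
            System.out.println(d + " | " + isoYearWeek(d));
        }
    }
}
```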
The incorrect year was then sent to the client as part of the API response headers. The Android client used it when attempting to divine the correct current date and time, and as a result sent a date and time a year in the future with its next API call. At this point the server’s threat mitigation mechanisms kicked in and revoked the session, which ultimately led to the client requesting a new login.
As all the Android clients worldwide began to have their sessions revoked, they all attempted to connect to our login service, a service that was not sized for this increase in traffic.
The Mitigation
Once we understood the trigger and root cause of the outage we moved into mitigation mode. This was significantly troublesome due to several factors:
- Most engineers were on vacation and unreachable
- The authentication service was a bit of a snowflake, a bit different than most
- Scaling up the authentication service (by over 20x) did not appreciably help
- There was no quick way to push an update to substantially all Android clients
- We could not identify any experiment, or other flag, that we could flip to turn off this feature
- The only other control surface we could identify was global in nature and would impact all clients attempting to authenticate
It was a bit frustrating that the only lever we could identify was to globally load shed all traffic destined for the authentication service. By shedding that traffic before it reached the service, we could ensure that the service stayed operational. At the peak of the mitigation, we were shedding more than 95% of the load on the authentication service. Note that at this point the authentication service had been scaled up by over 20x and was still not able to handle the increased load. Also note that this mitigation impacted all our clients: iOS, Web, etc.
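Conceptually, load shedding of this kind is a probabilistic admission check placed in front of the service. The sketch below is only an illustration of that idea: the class `LoadShedder` and its interface are hypothetical and say nothing about how Twitter’s traffic layer was actually built.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of fraction-based load shedding.
final class LoadShedder {
    private final double shedFraction; // 0.0 admits everything, 1.0 rejects everything

    LoadShedder(double shedFraction) {
        if (shedFraction < 0.0 || shedFraction > 1.0) {
            throw new IllegalArgumentException("shedFraction must be within [0, 1]");
        }
        this.shedFraction = shedFraction;
    }

    // Decide using a caller-supplied sample in [0, 1); handy for deterministic tests.
    boolean admit(double sample) {
        return sample >= shedFraction;
    }

    // Decide using a fresh random sample, as a request filter would.
    boolean admit() {
        return admit(ThreadLocalRandom.current().nextDouble());
    }

    public static void main(String[] args) {
        LoadShedder shedder = new LoadShedder(0.95); // shed roughly 95% of requests
        int admitted = 0;
        for (int i = 0; i < 100_000; i++) {
            if (shedder.admit()) admitted++;
        }
        System.out.println("admitted " + admitted + " of 100000 requests");
    }
}
```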
This particular outage required an emergency patch, build, and deployment of the API endpoint. Once that was rolled out, the system recovered quickly.
The Incident Review
This incident brought a number of issues up in the review process. While a recurrence of the exact same instance was unlikely (or at least a year away), there was some significant risk that a different set of circumstances could lead to a very similar incident. The items that I recall being identified as ripe for systemic fixing were:
- Search the whole codebase for the string `YYYY` in all Java source code and ensure that it was the correct string (depending on the context). We found multiple instances in the code that were incorrect. I vaguely remember that out of all the instances found, only a single one was correct.
- Educate engineers on the “ISO Year with week-of-year” part of the Date/Time libraries that most languages have.
- Create an automatic test that runs as part of code review to identify potentially troublesome code with the incorrect format characters.
- Enable per-service load shedding for the authentication service. Ensure other outside facing API services had load shedding enabled.
- Design and implement a way to spread out the revocation of session tokens due to this type of condition. In particular, identify and use risk as a method to delay, randomize, and spread out revocations.
- Create a lever where we could gain local control of the rate of revocations, including being able to turn off the feature completely.
- Move the authentication service to have a standardized build and deployment process.
- Ensure oncall rotations did not schedule employees on vacation to be on call.
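To make the revocation-related items above more concrete, here is one way a risk-weighted, jittered revocation delay with a kill switch could look. Everything in it is an assumption for illustration: the class `RevocationScheduler`, the six-hour maximum, the risk score in [0, 1], and the linear formula are not from the actual review.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical sketch: spread revocations out over time, weighted by risk.
final class RevocationScheduler {
    static final Duration MAX_DELAY = Duration.ofHours(6); // assumed upper bound

    // enabled:   the "lever"; false turns revocation off entirely.
    // riskScore: 0.0 (low risk) to 1.0 (revoke immediately).
    // jitter:    random sample in [0, 1) so clients do not all re-login at once.
    static Optional<Duration> revocationDelay(boolean enabled, double riskScore, double jitter) {
        if (!enabled) {
            return Optional.empty();
        }
        long millis = (long) (MAX_DELAY.toMillis() * (1.0 - riskScore) * jitter);
        return Optional.of(Duration.ofMillis(millis));
    }

    public static void main(String[] args) {
        System.out.println(revocationDelay(true, 1.0, 0.7));  // Optional[PT0S]
        System.out.println(revocationDelay(true, 0.0, 0.5));  // Optional[PT3H]
        System.out.println(revocationDelay(false, 1.0, 0.5)); // Optional.empty
    }
}
```

High-risk sessions are still revoked right away, while a mass low-risk condition (like a date mismatch hitting every client at once) is smeared across hours instead of arriving as a single spike.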
Note: One of the reasons this string format is so insidious, IMHO, is that the string `YYYY-MM-dd` is incorrect, while `yyyy-MM-dd` is correct. From a human aesthetic standpoint, it just seems incorrect to have your capitalization be in this weird “inconsistent” state.
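Because the wrong pattern looks more "right" than the correct one, an automated check is a better defense than human review. The following is a deliberately crude sketch of such a check, assuming a line-by-line scan of Java source; the class `WeekYearLint` is hypothetical, and a regex like this will produce false positives that still need a human to confirm.

```java
import java.util.regex.Pattern;

// Hypothetical lint sketch: flags string literals containing a run of upper-case 'Y'.
final class WeekYearLint {
    // A double-quoted literal with two or more consecutive 'Y' characters in it.
    private static final Pattern SUSPECT = Pattern.compile("\"[^\"]*YY+[^\"]*\"");

    static boolean isSuspicious(String sourceLine) {
        return SUSPECT.matcher(sourceLine).find();
    }

    public static void main(String[] args) {
        System.out.println(isSuspicious("DateTimeFormatter.ofPattern(\"YYYY-MM-dd\")")); // true
        System.out.println(isSuspicious("DateTimeFormatter.ofPattern(\"yyyy-MM-dd\")")); // false
    }
}
```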
Conclusion
Dates (and time) are hard. If you don’t have to use or manipulate them in any fashion, you will do yourself (and others) a huge service by not touching them in code. Pretty much any time you use a point in time, compare times and dates, or use a duration, you’re opening up a large set of potential errors. As such, if you can avoid them, do so.
If you absolutely need to use the concept of time within your code, ensure you practice robust coding techniques. Use industry standard packages, libraries, and utilities. Fully understand their correct use cases. Write plenty of tests for all edge conditions. Write internal APIs and libraries tailored to your company’s needs to minimize the flexibility of the interface surface and minimize the possibility for mistakes and errors.
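As one small example of a narrow internal API, a helper for HTTP date headers can hide the pattern string entirely by relying on the predefined `DateTimeFormatter.RFC_1123_DATE_TIME`. The class `HttpDates` is a hypothetical name for this sketch, and the loop in `main` is a minimal example of testing across year boundaries.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical narrow API: callers never get to write a date pattern themselves.
final class HttpDates {
    private HttpDates() {}

    // Formats an instant as an HTTP Date header value in GMT.
    static String formatHeader(Instant instant) {
        return DateTimeFormatter.RFC_1123_DATE_TIME.format(instant.atOffset(ZoneOffset.UTC));
    }

    public static void main(String[] args) {
        // Edge-condition check: the last week of December across several years.
        for (int year = 2010; year <= 2030; year++) {
            for (int day = 25; day <= 31; day++) {
                Instant t = LocalDate.of(year, 12, day).atStartOfDay(ZoneOffset.UTC).toInstant();
                if (!formatHeader(t).contains(" " + year + " ")) {
                    throw new AssertionError("wrong year for " + t);
                }
            }
        }
        System.out.println(formatHeader(Instant.parse("2014-12-29T00:00:00Z")));
        // Mon, 29 Dec 2014 00:00:00 GMT
    }
}
```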