
Toby “Nutty Swiss” Weingartner


When Time Jumped

Created 2025-01-26

Note: The following is an account of an incident that happened while I was working as an SRE at Twitter. I do not have access to the original outage report, and am simply re-telling a story. While I may not fully remember all the specifics, the general gist of what is told here is as I remember it. Any mistakes, omissions, etc., are wholly mine.

The Background

Twitter (the organization that existed pre-X) built a social network that was the envy of many technology companies in Silicon Valley. While never as large as Facebook (now Meta), it was a force to be reckoned with in its day. While you could read public tweets without logging in to Twitter, the best experience was naturally when you created an account and used Twitter while authenticated to the service. The incident I’m going to describe was called IM-xxx (anyone remembering the actual IM number, please let me know).

As many software engineers know, handling anything related to time within software is notoriously difficult. Seriously, read that last link. This general issue, combined with the notoriously error-prone date and time libraries (modules, packages, etc.) in most language environments, means that hard-to-find errors are part of nearly all software out there. To get an appreciation for the complexities involved, have a look at the Java SE 8 DateTimeFormatter documentation. Pay particular attention to the meanings of the formatting characters Y, y, m, and d.

The Leadup

Twitter was available via a number of clients: SMS (yes, really), Web, native iOS, and native Android. The most prominent one in this story was the native Android client. The Android ecosystem was the most variable environment we supported, with the greatest diversity in hardware, operating systems, software, cell providers, and geographic locations. One way this manifested itself was that there was no single reliable way to obtain “the current time” on the client. Some of the Android devices were simply horrible at keeping any sort of accurate time, or even progression of time.

So, on the Android clients we had code where we would use a combination of the device’s local time, as well as the date and time in the headers from the HTTPS API endpoint as it responded to any API calls from the device, to fabricate a local “current time”.
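A minimal sketch of what such a blended clock could look like, written in Python for brevity rather than in the client's actual language. The function name, the five-minute tolerance, and the "trust the server when the two disagree" rule are all assumptions for illustration; I don't recall the real algorithm.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Hypothetical tolerance: how far the device clock may drift from the
# server before the client stops trusting its own clock.
MAX_DEVICE_SKEW = timedelta(minutes=5)


def estimate_current_time(device_now, date_header):
    """Blend the device clock with the server's HTTP Date header.

    device_now must be a timezone-aware datetime; date_header is the raw
    header value, e.g. "Sun, 26 Jan 2025 20:57:16 GMT".
    """
    server_now = parsedate_to_datetime(date_header)
    if abs(device_now - server_now) <= MAX_DEVICE_SKEW:
        # The clocks roughly agree, so keep the device's finer-grained time.
        return device_now
    # The clocks disagree, so assume the device is wrong and trust the server.
    return server_now
```

Note the design consequence: a client built this way believes whatever the server says whenever the two clocks disagree, including a wrong year.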

Twitter, as a social network, existed in the global community. It crossed significant boundaries, from socio-economic, to geographic, to political. In many of the places Twitter was used, it was a trusted source of information beyond, and sometimes in spite of, the officially sanctioned sources the local government provided or allowed. As such, Twitter had to contend with any number of rather sophisticated threats against both itself and its users. Any quick Google search will reveal any number of investigations and reports.

The Setup

As part of hardening the client to API endpoint connection, several options were implemented as a strategy for a “defense in depth” approach to securing our users. One of these was that a client would be logged out (access and/or session token revoked) if things looked “weird and/or out of place”. Any number of things could make this happen. In general, the user of the session would simply re-authenticate with the login server, using whatever authentication mechanism they set up. This could include 2 factor challenges, and other similar mechanisms. One of the items that was checked, and would cause a revocation of the session ticket, was the client’s idea of what date the client believed it was. This check was pretty loose. I seem to recall that it took a multiple day difference between the client and the server for the session to be revoked.
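The date check described above can be sketched roughly as follows. This is an illustration only: the function name and the two-day tolerance are assumptions, since all I recall is that the threshold was "multiple days".

```python
from datetime import datetime, timedelta, timezone

# Assumed tolerance; the real value was some number of days.
MAX_CLOCK_DIFFERENCE = timedelta(days=2)


def should_revoke_session(client_time, server_time,
                          tolerance=MAX_CLOCK_DIFFERENCE):
    """Return True when the client's clock is implausibly far from the server's."""
    return abs(client_time - server_time) > tolerance
```

With a loose threshold like this, ordinary clock drift or a time zone mix-up never trips the check, but a client that believes it is a year ahead always does.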

The next piece I’m a little fuzzy on, it’s been 10+ years! However, if I recall correctly, this particular Twitter API endpoint was terminated by a Java HTTP/S server. As part of the response the API code generated the current date and time to send back as part of the response headers:

HTTP/1.1 200 OK
Date: Sun, 26 Jan 2025 20:57:16 GMT

The code that did this used industry standard methods and objects to format this response. In particular, it used the format string: E, d MMM YYYY HH:mm:ss z. The astute among you, or the ones that read the Java DateTimeFormatter docs in depth, and remembered and understood all the format characters will be rather triggered by this format string. TL;DR – YYYY is very different from the correct yyyy. Note, for most of any given year, YYYY and yyyy are the same.
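To see where the two notions of "year" diverge, here is a small sketch in Python. It compares the calendar year (what yyyy gives you) with the ISO 8601 week-based year (the concept behind YYYY). One caveat: Java's Y pattern letter is locale-sensitive, so the exact rollover day can differ from the pure ISO rule used here. The helper name is made up for this example.

```python
from datetime import date


def years(d):
    """Return (calendar year, ISO week-based year) for a date."""
    return d.year, d.isocalendar()[0]


samples = (
    date(2014, 6, 15),   # mid-year: both agree
    date(2014, 12, 28),  # Sunday, still ISO week 52 of 2014
    date(2014, 12, 29),  # Monday, first day of ISO week 1 of 2015
    date(2015, 1, 1),    # both agree again
)

for d in samples:
    cal, week_based = years(d)
    flag = "MISMATCH" if cal != week_based else ""
    print(d.isoformat(), cal, week_based, flag)
```

Of these four dates, only 2014-12-29 is flagged as a mismatch, which is exactly why a bug like this can sit unnoticed for most of the year.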

The Trigger

The end of 2014 was a time when many Twitter engineers took a well-deserved rest. The previous year was one of immense growth (we went public in November of 2013) as well as many significant events. We had survived everything from the World Cup, to the Hong Kong protests, to the #BlackLivesMatter protests. Every one of these events culminated in our infrastructure becoming more resilient, as we had finally mastered the fail-whale. As such, many of our engineers took an extended vacation around the end-of-year holidays. Deployments were frozen, and only a few SRE, TCC, and other on-call folks were around.

As the calendar below shows, by taking off a few strategic days, you could end up getting 15 days off.

December 2014 Calendar

Everything was humming along smoothly until Monday December 29th, GMT. At this point, our authentication service was unable to keep itself upright. The load on this service quickly increased to the point where a task was unable to keep running, not even to get healthy. At first we thought that we had some type of code issue, but that was quickly ruled out, as no rollouts had been happening. Next up was the thought that we were being attacked with some type of denial of service attack. This took significantly longer to rule out.

Dumping headers of the affected traffic showed that clients were sending the wrong date to the API endpoint. They were sending a date of 2015-12-29 in the year 2014. At this point I was pretty sure what was going on. A quick code search confirmed my suspicions. Someone had used the “ISO Year with week” version of the year formatting character to format the date. As one can see in the table below, the year as reported by the different format characters was “incorrect” for the 29th, 30th, and 31st of December in 2014.

| Actual Date      | Numeric Date | ISO Year/Week |
|------------------|--------------|---------------|
| Sun, 21 Dec 2014 | 2014-12-21   | 2014/51       |
| Mon, 22 Dec 2014 | 2014-12-22   | 2014/52       |
| Tue, 23 Dec 2014 | 2014-12-23   | 2014/52       |
| Wed, 24 Dec 2014 | 2014-12-24   | 2014/52       |
| Thu, 25 Dec 2014 | 2014-12-25   | 2014/52       |
| Fri, 26 Dec 2014 | 2014-12-26   | 2014/52       |
| Sat, 27 Dec 2014 | 2014-12-27   | 2014/52       |
| Sun, 28 Dec 2014 | 2014-12-28   | 2014/52       |
| Mon, 29 Dec 2014 | 2014-12-29   | 2015/1        |
| Tue, 30 Dec 2014 | 2014-12-30   | 2015/1        |
| Wed, 31 Dec 2014 | 2014-12-31   | 2015/1        |
| Thu, 01 Jan 2015 | 2015-01-01   | 2015/1        |
| Fri, 02 Jan 2015 | 2015-01-02   | 2015/1        |
| Sat, 03 Jan 2015 | 2015-01-03   | 2015/1        |

This incorrect year was then sent to the client as part of the API response headers. The Android client used it when attempting to divine the correct current date and time, which resulted in the client sending a date and time that was a year in the future with its next API call. At this point the server’s threat mitigation mechanisms kicked in and revoked the session, which ultimately led to the client requesting a new login.

As all the Android clients worldwide began to have their sessions revoked, they all attempted to connect to our login service. A service that was not sized for this increase in traffic.

The Mitigation

Once we understood the trigger and root cause of the outage we moved into mitigation mode. This was significantly troublesome due to several factors:

It was a bit frustrating that the only lever we could identify was to globally load shed all traffic hitting the authentication service. By load shedding traffic destined to the authentication service before it hit the authentication service, we could ensure that the authentication service could stay operational and survive. At the peak of the mitigation, we were shedding more than 95% of the load on the authentication service. Note, at this point the authentication service had been scaled up by over 20x and was still not able to handle the increased load. Also note that this mitigation impacted all our clients, iOS, Web, etc.
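For readers unfamiliar with load shedding, the idea can be sketched in a few lines of Python: reject a configurable fraction of requests before they ever reach the struggling backend. This is a toy illustration with made-up names, not how our edge infrastructure actually implemented it.

```python
import random


class LoadShedder:
    """Reject a configurable fraction of requests in front of a backend."""

    def __init__(self, shed_fraction, rng=None):
        if not 0.0 <= shed_fraction <= 1.0:
            raise ValueError("shed_fraction must be between 0.0 and 1.0")
        self.shed_fraction = shed_fraction
        # An injectable random source keeps the behavior testable.
        self._rng = rng or random.Random()

    def admit(self):
        """Return True if the request should be passed to the backend."""
        return self._rng.random() >= self.shed_fraction
```

A shedder created with LoadShedder(0.95) lets through roughly one request in twenty. The blunt part in our case was scope: the lever applied to everything headed for authentication, not just the misbehaving Android traffic.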

This particular outage required an emergency patch, build, and deployment of the API endpoint. Once that was rolled out, the system recovered quickly.

The Incident Review

This incident brought a number of issues up in the review process. While a recurrence of the exact same instance was unlikely (or at least a year away), there was some significant risk that a different set of circumstances could lead to a very similar incident. The items that I recall being identified as ripe for systemic fixing were:

  1. Search the whole codebase for the string YYYY in all Java source code and ensure that it was the correct string (depending on the context). We found multiple instances in the code that were incorrect. I vaguely remember that out of all the instances found, only a single one was correct.
  2. Educate engineers on the “ISO Year with week-of-year” part of the Date/Time libraries that most languages have.
  3. Create an automatic test that runs as part of code review to identify potentially troublesome code with the incorrect format characters.
  4. Enable per-service load shedding for the authentication service. Ensure other outside facing API services had load shedding enabled.
  5. Design and implement a way to spread out the revocation of session tokens due to this type of condition. In particular, identify and use risk as a method to delay, randomize, and spread out revocations.
  6. Create a lever where we could gain local control of the rate of revocations, including being able to turn off the feature completely.
  7. Move the authentication service to have a standardized build and deployment process.
  8. Ensure oncall rotations did not schedule employees on vacation to be on call.
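Items 5 and 6 could look something like the sketch below: rather than revoking a suspicious session immediately, pick a randomized delay that shrinks as the assessed risk grows, and provide a switch to turn revocation off entirely. Everything here (the names, the six-hour window, the 0.0 to 1.0 risk score) is hypothetical; I don't recall what was actually built.

```python
import random
from datetime import timedelta

# Hypothetical upper bound on how long a low-risk revocation may be deferred.
MAX_DELAY = timedelta(hours=6)


def revocation_delay(risk, rng=random, enabled=True):
    """Return how long to wait before revoking a session, or None.

    risk is a score in [0.0, 1.0]; 1.0 means revoke immediately, while lower
    scores are spread randomly over a longer window. enabled models the
    operator lever from item 6: when False, nothing is revoked.
    """
    if not enabled:
        return None
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be between 0.0 and 1.0")
    window = MAX_DELAY * (1.0 - risk)
    # Uniform jitter inside the window spreads logins out over time.
    return window * rng.random()
```

Spreading a global wave of low-risk revocations over hours instead of seconds turns a login stampede into a manageable trickle, while genuinely high-risk sessions are still cut off quickly.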

Note: One of the reasons this format string is so insidious, IMHO, is that the string YYYY-MM-dd is incorrect, while yyyy-MM-dd is correct. From a human aesthetic standpoint, it just seems wrong for the capitalization to be in this weird “inconsistent” state.
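An automated check along the lines of item 3 does not need to be sophisticated. Below is a rough Python sketch that flags string literals containing YYYY so a human can review them. The regular expression is a deliberately simple heuristic of my own (it can mis-pair quotes on lines with several strings), and it only flags candidates, since an occasional use of YYYY is legitimate.

```python
import re

# Matches a double-quoted string literal on one line that contains "YYYY".
SUSPECT_LITERAL = re.compile(r'"[^"\n]*YYYY[^"\n]*"')


def find_suspect_patterns(source):
    """Return (line number, literal) pairs for literals containing YYYY."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in SUSPECT_LITERAL.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Illustrative Java snippet, not taken from any real codebase.
sample = (
    'DateTimeFormatter.ofPattern("E, d MMM YYYY HH:mm:ss z");\n'
    'DateTimeFormatter.ofPattern("yyyy-MM-dd");\n'
)
print(find_suspect_patterns(sample))
```

Run against the sample, this reports only the first line, the one carrying the week-based year.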

Conclusion

Dates (and time) are hard. If you don’t have to use or manipulate them, you will do yourself (and others) a huge service by not touching them in code. Pretty much any time you use a point in time, compare times and dates, or use a duration, you’re opening up a large set of potential errors. As such, if you can avoid them, do so.

If you absolutely need to use the concept of time within your code, ensure you practice robust coding techniques. Use industry standard packages, libraries, and utilities. Fully understand their correct use cases. Write plenty of tests for all edge conditions. Write internal APIs and libraries tailored to your company’s needs to minimize the flexibility of the interface surface and minimize the possibility for mistakes and errors.
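As a small illustration of that last point, here is a Python sketch of a deliberately narrow internal helper: it does exactly one thing (produce an HTTP Date header value), exposes no format string for callers to get wrong, and comes with a test that walks across many year boundaries. The function names are made up for this example.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def http_date(moment):
    """Format a timezone-aware datetime as an HTTP Date header value."""
    if moment.tzinfo is None:
        raise ValueError("http_date requires a timezone-aware datetime")
    # Normalize to UTC; the standard library then emits the "GMT" suffix.
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def check_year_boundaries(first_year=2000, last_year=2040):
    """Assert the formatted year matches the calendar year around New Year."""
    for year in range(first_year, last_year + 1):
        start = datetime(year, 12, 25, 12, 0, 0, tzinfo=timezone.utc)
        for offset in range(14):
            day = start + timedelta(days=offset)
            # Fields: weekday, day, month, year, time, zone.
            assert http_date(day).split()[3] == str(day.year)


check_year_boundaries()
```

A test like check_year_boundaries is cheap, and it would have failed on the very first year boundary it examined had the helper used a week-based year.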