On March 14th at approximately 2:00 AM Central Time, the Edlink API experienced a period of degraded performance, followed by a service outage.
First and foremost, we apologize to our clients, and to their customers, for the experience they may have had during this time. We know you depend on us to enable access (and more) to your products and platforms for your teachers and students, and we take that privilege very seriously.
As part of our service, Edlink stores a number of time-series datasets for our clients. For example, we store all changes to our primary dataset (e.g. when a new person is created) and logs of all inbound and outbound API requests. In general, we offer a 30-day retention period for this data so that it doesn’t grow out of hand.
To deliver better performance on these datasets, we use a common database pattern called “partitioning” to segregate data by day. This improves query times and allows us to quickly and easily drop old data after the retention period. Each day, a new partition is created and the oldest one is dropped. Specifically, as our databases are all powered by PostgreSQL, we use the built-in Postgres partitioning feature.
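As a rough illustration of what daily range partitioning looks like in Postgres (the table, column, and partition names here are hypothetical, not our actual schema):

```sql
-- A parent table partitioned by day on its timestamp column.
CREATE TABLE request_logs (
    id          bigint GENERATED ALWAYS AS IDENTITY,
    occurred_at timestamptz NOT NULL,
    payload     jsonb
) PARTITION BY RANGE (occurred_at);

-- One child table per day.
CREATE TABLE request_logs_p2024_03_14
    PARTITION OF request_logs
    FOR VALUES FROM ('2024-03-14') TO ('2024-03-15');

-- Enforcing the 30-day retention window is just a DROP: no slow
-- row-by-row DELETE, and the whole day's data disappears at once.
DROP TABLE request_logs_p2024_02_13;
```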
The process of creating new partitions and dropping old ones is managed by an automated tool called PG Partman (the pg_partman extension). This process typically runs every 15 minutes without issue. Occasionally, a run is interrupted and fails. This failure case is usually benign: the worst-case scenario is that data accumulates in a “default” partition until the correct partitions are created.
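Roughly, the moving parts look like this (schema name as in a default pg_partman install; the table name is a hypothetical carried over from the sketch above):

```sql
-- The scheduled maintenance call: pre-creates upcoming daily
-- partitions and detaches/drops ones past the retention window.
SELECT partman.run_maintenance();

-- Rows that arrive before their day's partition exists have nowhere
-- to go, so they land in the DEFAULT partition until maintenance
-- catches up.
CREATE TABLE request_logs_default
    PARTITION OF request_logs DEFAULT;
```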
The partition management process began failing regularly a couple of weeks ago, and since it was a relatively low-priority issue for us, we started addressing it earlier this week.
The process of fixing the partition manager is straightforward but highly manual. Essentially, we have to migrate data from the “default” partition to the correct partition (depending on the day each event occurred). What added complexity is that, at our current scale, two of our tables, event deltas and inbound requests, have grown quite large.
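A manual sketch of that migration, assuming hypothetical table, column, and date names, and assuming the rows left behind in the default partition don’t overlap any existing partition:

```sql
BEGIN;

-- Detach the default partition first: Postgres refuses to create a
-- new partition while the default still holds rows in its range.
ALTER TABLE event_deltas DETACH PARTITION event_deltas_default;

-- Create the daily partition the rows should have landed in.
CREATE TABLE event_deltas_p2024_03_10
    PARTITION OF event_deltas
    FOR VALUES FROM ('2024-03-10') TO ('2024-03-11');

-- Move that day's rows across; the INSERT routes them to the new
-- partition automatically.
INSERT INTO event_deltas
SELECT * FROM event_deltas_default
WHERE occurred_at >= '2024-03-10' AND occurred_at < '2024-03-11';

DELETE FROM event_deltas_default
WHERE occurred_at >= '2024-03-10' AND occurred_at < '2024-03-11';

-- Re-attach the (now smaller) default partition.
ALTER TABLE event_deltas ATTACH PARTITION event_deltas_default DEFAULT;

COMMIT;
```

pg_partman also ships a helper for this (partman.partition_data_time), but either way the detach/attach steps take an ACCESS EXCLUSIVE lock on the parent table, which is why this kind of maintenance has to run off-peak.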
Over the past few days, the engineering team has been working on restoring the request and event delta partitions to their correct state. This process has been slow because the migration briefly locks each table (so nobody else can read or write), which means we can only perform the maintenance late at night when traffic is minimal. We completed the migration of the inbound requests table last night around 12:00 AM CT and decided to wait an additional day to complete the migration of the event deltas table.
The critical error that led to the service degradation (and ultimately to the outage) was that we did not think to disable the automated partition management process, PG Partman. It had been running and failing harmlessly for the past few weeks, and we incorrectly assumed that this would continue. Once we had cleared up the inbound requests table (and a couple of others), the process moved on to the event deltas table, where it got stuck with the table in a locked state.
The lock on this table began at around 2:00 AM. Our US regional database is large enough that this initially had little impact on overall service (as far as most users were concerned). The trouble started whenever traffic picked up (e.g. a client syncing roster or event data from Edlink overnight). When these spikes occurred, the US region would quickly run out of available database connections. Some requests would hang and ultimately time out.
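When connection slots run out like this, the symptoms are visible in pg_stat_activity (assuming a session can still be opened at all):

```sql
-- How many connection slots are in use, and in what state?
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;

-- Which sessions are stuck waiting on a lock, and on what?
SELECT pid, wait_event_type, wait_event, left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```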
The problem became more serious (and obvious) a little before 7:00 AM CT when more teachers and students began to access the platform. By 7:00 AM CT, the majority of API requests that involved student data in the US Central region were timing out, and Edlink’s US region was substantially down.
Our automated API uptime checks exercised only our API servers, not access to our regional databases. As our API servers were technically up and responsive throughout the night, our downtime monitor never alerted us. We started investigating around 8:00 AM CT, when initial reports came in from clients that their schools were unable to sign in.
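A deeper health check would exercise the database path as well, e.g. a probe that must complete a trivial query against each regional database within a short deadline (the timeout value here is an arbitrary example):

```sql
-- Run by the uptime monitor against each regional database. A short
-- timeout turns a saturated database into a fast, visible failure
-- instead of a silent hang.
SET statement_timeout = '2s';
SELECT 1;
```

On the client side, libpq’s connect_timeout parameter covers the case where no connection slot is available at all, so the probe fails rather than hanging at connect time.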
We quickly identified that the outage was caused by our database running out of open connections, but we were unable to connect to the US region locally (due to a lack of connection slots), so we sought to shed as many connections as possible. First, we moved to turn off all non-essential services:
Ultimately, none of these strategies were able to free up connections - at least not immediately. They probably would have worked over the next few minutes as our connection pooler closed idle connections, but we didn’t have time to wait.
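One blunter option for reclaiming slots, assuming a superuser session can still be opened (Postgres reserves a few slots for superusers via superuser_reserved_connections for exactly this situation), is to terminate idle backends directly:

```sql
-- Kill idle sessions to free their connection slots, sparing our own.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND pid <> pg_backend_pid();
```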
We ultimately decided to restart the US database, thereby forcing a closure of all open connections. The database took approximately 15 seconds to restart, after which API access was restored at 9:06 AM CT. Because we had disabled our partition maintenance functions in the interim, we were not at immediate risk of the same issue recurring.
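For illustration, if the maintenance job is scheduled with pg_cron (an assumption; the job name here is made up), it can be paused without being deleted:

```sql
-- Pause the job; setting active back to true re-enables it later.
UPDATE cron.job
SET active = false
WHERE jobname = 'partman-maintenance';
```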
We connected to the database locally and confirmed that no requests were being blocked. After this, we slowly started bringing non-essential services back online at around 9:15 AM CT.
Finally, after we were confident that systems were performing as expected, we resumed logging API requests to the database at approximately 11:50 AM CT.
We intend to make several changes to our product and operational procedures in light of the issues this morning. The reality of this situation is that a small issue became dramatically more impactful to clients for two major reasons: