US Region - Database Connections Issue
Incident Report for Edlink
Postmortem

On March 14th at approximately 2:00 AM Central Time, the Edlink API experienced a period of degraded performance, followed by a service outage.

First and foremost, we apologize for the experience our clients, and their customers, may have had during this time. We know you depend on us to enable access (and more) to your products and platforms, for your teachers and students, and we take that privilege very seriously.

The Upshot

  • The API experienced degraded performance starting around 2:00 AM CT.
  • The API began dropping a substantial number of requests around 7:00 AM CT.
  • The service outage lasted about 2 hours, from 7:00 to 9:06 AM CT.
  • The incident affected only the United States region (not our international regions, or our metadata database that powers much of the Edlink Dashboard).
  • No school data was lost or corrupted; this was purely an API issue. However, we did disable our API request logging feature for a few hours to reduce load on our system while we investigated, so you may notice missing API requests on the Logs page of the Edlink Dashboard.

Background

Partitions

As part of our service, Edlink stores a number of time series data points for our clients. For example, we store all changes to our primary dataset (e.g. when a new person is created) and logs of all inbound and outbound API requests. In general, we offer a 30-day retention period for this data so it doesn’t grow out of hand.

To deliver better performance on these datasets, we use a common database pattern called “partitioning” to segregate data by day. This improves query times and allows us to quickly and easily drop old data after the retention period. Each day, a new partition is created and an old one is dropped. Specifically, as our databases are all powered by PostgreSQL, we use its built-in declarative partitioning feature.
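
For illustration, here is a minimal sketch of what daily range partitioning looks like in PostgreSQL. The table and column names are simplified examples, not our actual schema:

    -- Parent table, partitioned by the day each request was received.
    CREATE TABLE requests (
        received_at timestamptz NOT NULL,
        method      text,
        path        text,
        status_code integer
    ) PARTITION BY RANGE (received_at);

    -- One child partition per day. Queries filtering on received_at only
    -- need to scan the relevant daily partitions.
    CREATE TABLE requests_2024_03_14 PARTITION OF requests
        FOR VALUES FROM ('2024-03-14') TO ('2024-03-15');

    -- Rows that do not match any existing partition land here.
    CREATE TABLE requests_default PARTITION OF requests DEFAULT;

    -- Enforcing the 30-day retention period is just dropping a partition.
    DROP TABLE requests_2024_02_13;

One detail that matters later: PostgreSQL will refuse to create a new daily partition while the default partition still holds rows belonging to that day, so a backlog in the default partition eventually has to be cleaned up.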

The process of creating new partitions and dropping old ones is managed by an automated process called PG Partman (the pg_partman PostgreSQL extension). This process typically runs every 15 minutes without issue. Occasionally, the process is interrupted for some reason and fails. This failure case is usually not very interesting: the worst-case scenario is that data gets added to a “default” partition until the correct partitions are created.
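
For reference, the maintenance step itself is a single function call; how it is scheduled is not important to this incident (pg_cron is shown below purely as an illustration, and the partman schema name assumes a default install):

    -- Create upcoming partitions and drop ones past their retention period.
    -- (Assumes pg_partman is installed in a schema named "partman".)
    SELECT partman.run_maintenance();

    -- Run every 15 minutes, e.g. via pg_cron (illustrative only):
    SELECT cron.schedule(
        'partman-maintenance',
        '*/15 * * * *',
        $$SELECT partman.run_maintenance()$$
    );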

The partition management process began failing regularly a couple of weeks ago. As it is a somewhat low-priority issue for us, we only began to address it earlier this week.

The process of fixing the partition manager is pretty straightforward but highly manual. Essentially, we have to migrate data from the “default” partition to the correct partition (depending on the day the event occurred). What added some complexity to this process is that, given our current scale, two of our tables (event deltas and inbound requests) have grown quite large.

Over the past few days, the engineering team has been working on restoring the request and event delta partitions to their correct state. Progress has been slow because the migration briefly locks each table (so nothing else can read from or write to it), which means we can only perform this maintenance late at night when traffic is minimal. We completed the migration of the inbound requests table last night around 12:00 AM CT and decided to wait an additional day to complete the migration of the event deltas table.
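
To make the manual work concrete, here is a simplified sketch of the kind of cleanup involved, using the hypothetical requests table from above; our actual procedure differs in its details, and pg_partman also ships helpers that move rows out of the default partition in batches. The detach and attach steps are what briefly lock the table:

    -- Simplified sketch: move one day of misrouted rows out of the default
    -- partition and into the correct daily partition.
    -- ("requests" and its columns are hypothetical names.)
    BEGIN;

    -- Detach the default partition so the missing daily partition can be
    -- created without conflicting rows in the way.
    ALTER TABLE requests DETACH PARTITION requests_default;

    -- Create the partition the rows should have landed in.
    CREATE TABLE requests_2024_03_01 PARTITION OF requests
        FOR VALUES FROM ('2024-03-01') TO ('2024-03-02');

    -- Re-insert the misrouted rows through the parent (which routes them
    -- to the new partition) and delete them from the old default.
    WITH moved AS (
        DELETE FROM requests_default
        WHERE received_at >= '2024-03-01' AND received_at < '2024-03-02'
        RETURNING *
    )
    INSERT INTO requests SELECT * FROM moved;

    -- Re-attach the default partition to catch any future stragglers.
    ALTER TABLE requests ATTACH PARTITION requests_default DEFAULT;

    COMMIT;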

PG Partman

The critical error that led to the service degradation (and ultimately to the outage) was that we did not think to disable the automated partition management process, PG Partman. It had been running, and failing harmlessly, for the past few weeks, and we incorrectly assumed that it would continue to do so. As we had cleared up the inbound requests table (and a couple of others), the process moved on to the event deltas table, where it got stuck with the table in a locked state.

Degraded Service

The lock on this table began at around 2:00 AM CT. Our US regional database is large enough that this had little impact on overall service as far as most users were concerned. The trouble started whenever traffic picked up, for example when a client synced roster or event data from Edlink overnight. During these spikes, the US region quickly ran out of available database connections, and some requests would hang and ultimately time out.
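
This failure mode shows up clearly in PostgreSQL’s pg_stat_activity view: sessions pile up waiting on the lock until the server reaches its max_connections limit. Two generic diagnostic queries that illustrate the symptoms (nothing here is specific to our schema):

    -- How close is the server to its connection limit?
    SELECT count(*) AS connections_in_use,
           current_setting('max_connections') AS max_connections
    FROM pg_stat_activity;

    -- Which sessions are stuck waiting on a lock, and for how long?
    SELECT pid,
           now() - query_start AS waiting_for,
           wait_event_type,
           left(query, 60)     AS query
    FROM pg_stat_activity
    WHERE wait_event_type = 'Lock'
    ORDER BY waiting_for DESC;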

The problem became more serious (and obvious) a little before 7:00 AM CT when more teachers and students began to access the platform. By 7:00 AM CT, the majority of API requests that involved student data in the US Central region were timing out, and Edlink’s US region was substantially down.

Initial Attempts at Resolution

Our automated API uptime checks only exercised our API servers, not access to our regional databases. As our API servers were technically up and responsive throughout the night, our downtime monitor never alerted us. We started investigating around 8:00 AM CT, when the first reports came in from clients that their schools were unable to sign in.

We quickly identified that the outage was caused by our database running out of open connections, but we were unable to connect to the US region locally (due to a lack of connection slots), so we sought to shed as many connections as possible. First, we moved to turn off all non-essential services:

  • We temporarily disabled a number of Google Cloud Functions that were running.
  • We temporarily disabled all data pipeline services, including syncing and materialization.
  • We temporarily disabled our partition maintenance function.
  • We temporarily reduced the number of active API nodes handling requests.
  • We temporarily disabled request logging.

Ultimately, none of these strategies freed up connections, at least not immediately. They probably would have worked over the next few minutes as our connection pooler closed idle connections, but we didn’t have time to wait.
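
For completeness, the more surgical option, had we been able to open a session at all (for example through a reserved superuser connection slot), would have been to terminate idle backends directly. A sketch of what that looks like; this is not what we ended up doing:

    -- Terminate sessions that have been idle for over a minute, freeing
    -- their connection slots. (Illustrative threshold; a real cleanup
    -- would also filter by application or user.)
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'idle'
      AND state_change < now() - interval '1 minute'
      AND pid <> pg_backend_pid();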

Resolution

We ultimately decided to restart the US database, thereby forcing a closure of all open connections. The database took approximately 15 seconds to restart, after which API access was restored at 9:06 AM CT. Because we had disabled our partition maintenance functions in the interim, we were not at immediate risk of the same issue recurring.

We connected to the database locally and confirmed that no requests were being blocked. After this, we slowly started bringing non-essential services back online at around 9:15 AM CT.

Finally, after we were confident that systems were performing as expected, we resumed logging API requests to the database at approximately 11:50 AM CT.

Aftermath

  • We rebuilt the default partitions to ensure the query planner could accurately plan queries to our request logs table.
  • We are still working on migrating the data for event deltas.
  • We will re-enable the automated manager afterward and watch to make sure the process completes as expected.
  • We introduced a monitoring dashboard to catch similar events in the future.

Mitigation

We intend to make several changes to our product and operational procedures in light of the issues this morning. The reality of this situation is that a small issue became dramatically more impactful to clients for two major reasons:

  1. Our automated outage detection systems were not fully monitoring all of the things that could go wrong with our API.
  2. Our first responders were not fully equipped to resolve the issue and it had to be escalated, which was tricky given the time of day and the size of our team.

Partition Management

  • Review all partition management settings and correct any errors.
  • Create alerts for when our partition manager fails to run as expected on its cadence (one simple signal is sketched after this list).
  • Escalate the issue if it fails for more than 24 hours consecutively.
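
One simple signal that the partition manager is falling behind is rows accumulating in a default partition, since that only happens when the expected daily partition was never created. A sketch of the kind of check an alert could run (the table name is illustrative):

    -- Any rows in the default partition mean a daily partition was not
    -- created on schedule; a non-zero count should trigger an alert.
    SELECT count(*) AS misrouted_rows
    FROM requests_default;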

Incident Escalation

  • We plan to upgrade our incident monitoring functions to test a wider variety of API functionality.
  • We plan to implement a new series of triggers based on general anomalous logging events (e.g. when a certain percentage of API requests fail for any reason); a sketch of one such check follows this list.
  • We will create a training guide for all engineers that contains a list of troubleshooting tasks they can perform at their level of access to our systems.
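
Because we already log inbound API requests, one straightforward form of such a trigger is a periodic query over recent logs. A sketch using a hypothetical request_logs table (our real logging schema differs):

    -- Fraction of API requests in the last five minutes that returned a
    -- server error. Alert when this crosses a chosen threshold.
    -- (Hypothetical table and column names.)
    SELECT (count(*) FILTER (WHERE status_code >= 500))::numeric
           / NULLIF(count(*), 0) AS error_rate
    FROM request_logs
    WHERE created_at > now() - interval '5 minutes';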

General Changes

  • An additional change that we are going to make is shortening our default query timeouts. We believe this will mitigate future issues of this type by more quickly freeing database connections that are held up by a locked table, allowing other queries to continue without effectively bringing down the database.
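
For reference, these are the PostgreSQL settings involved. The values and the “edlink_us” database name are illustrative placeholders, not the numbers we will ultimately ship:

    -- Illustrative values only; final numbers will come out of testing.
    ALTER DATABASE edlink_us SET statement_timeout = '30s';
    -- Give up quickly if a query cannot acquire a lock (e.g. on a locked table).
    ALTER DATABASE edlink_us SET lock_timeout = '5s';
    -- Do not let transactions sit idle while holding locks and a connection.
    ALTER DATABASE edlink_us SET idle_in_transaction_session_timeout = '60s';
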
Posted Mar 14, 2024 - 14:54 UTC

Resolved
As of approximately 9:09 AM CT, our core services are back online, including the API and dashboard. We're bringing up auxiliary services (syncing, materializations, etc.) slowly over the next half hour. We're working on investigating exactly what happened and for how long. Until we've completed our investigation, all of our resources will be dedicated to ensuring services continue to stay online as expected. We'll report back with more details when we have them.

In the meantime, please don't hesitate to report anything unusual that you experience.
Posted Mar 14, 2024 - 14:09 UTC
Investigating
Several clients are reporting that their integrations are not loading. We believe our database has reached the maximum number of connections its current configuration can support. We're working on reconfiguring it now.
Posted Mar 14, 2024 - 13:32 UTC