Degraded API Performance

Incident Report for Edlink

Postmortem

Date of Incident: May 4, 2026

Duration: 12:07 PM – 1:04 PM CT (57 minutes)

Impact: Near-total interruption of user login capabilities.

Status: Resolved

Executive Summary

Between 12:07 PM and 1:04 PM CT on May 4, 2026, Edlink experienced a severe API degradation that prevented substantially all users from logging into client products via SSO. The incident was caused by database connection starvation, triggered by a service deployment that used an older pinned version of the PostgreSQL (PG) library. The issue was mitigated by bringing down the problematic service to free up connections, followed by a full revert of the change. Normal operations have been fully restored.

Impact

  • User Experience: Substantially all users attempting to access Edlink during the 57-minute window were unable to log in or authenticate; only a small portion of attempts succeeded normally.
  • System Impact: Critical services, including the authentication/login service, were starved of database connections and timed out or dropped requests.

Root Cause

The incident was traced to a deployment at 12:07 PM CT on May 4, in which our sync service (the service that retrieves data from school data sources) was upgraded.

Background on Dependency Version Pinning:

To provide context on why an older version was in use: we recently implemented a strict policy for managing our software dependencies. This security measure was enacted following an industry-wide supply chain attack in March of 2026. Edlink was not directly affected by that incident, but the attack vector was not one we were comfortable leaving open, so we proactively updated how we manage our dependencies.

To safeguard Edlink's platform and ensure our systems remain secure, we instituted a policy to "pin," or lock, all of our software dependencies to explicitly verified, “safe” versions. This prevents our systems from automatically downloading new, unverified updates, which was the attack vector used in the March supply chain attack.
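For illustration only (the package names and versions below are hypothetical, not our actual manifest), pinning in a Node.js project means declaring exact dependency versions rather than semver ranges such as ^8.x, so an install can never silently pull in a newer, unreviewed release:

    {
      "dependencies": {
        "pg": "8.11.3",
        "express": "4.18.2"
      }
    }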

While this policy is crucial for preventing malicious, unverified updates from automatically infiltrating our systems, in this instance it had an unintended side effect: it forced our deployment to use an older version of the PG library than we had intended.

This older version of the library managed Postgres connections inefficiently and leaked connections rather than releasing them. Upon deployment, the service rapidly consumed the available database connection pool. Because connections were not being released back to the pool, other critical infrastructure (notably the service handling user logins) was left without available database connections, leading to the system-wide login failures.
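The defect lived inside the older library version itself, but its effect on the database was the same as the classic connection-leak pattern. The following is a minimal TypeScript sketch (illustrative only, not our production code, and assuming the node-postgres Pool API) showing how a handler that checks out a connection and never releases it will starve every other consumer of the same database:

    import { Pool } from "pg";

    // A finite pool; every checked-out client also holds a real server-side
    // connection that counts toward Postgres's max_connections limit.
    const pool = new Pool({ max: 10, connectionTimeoutMillis: 5000 });

    // Leaky pattern: the client is never released, so after enough calls the
    // pool (and eventually the database server) has no connections left for
    // anyone else, including unrelated services such as authentication.
    async function leakyQuery(userId: string) {
      const client = await pool.connect();
      const result = await client.query("SELECT * FROM users WHERE id = $1", [userId]);
      return result.rows; // missing client.release()
    }

    // Correct pattern: the client is always returned to the pool.
    async function safeQuery(userId: string) {
      const client = await pool.connect();
      try {
        const result = await client.query("SELECT * FROM users WHERE id = $1", [userId]);
        return result.rows;
      } finally {
        client.release();
      }
    }

Spinning the leaking service down, as described in the timeline below, forcibly closes its held connections, which is why that was the fastest way to restore database access for the login service.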

Timeline

  • 12:07 PM: The deployment containing the outdated PG library goes live. Database connection usage immediately begins to spike. Users start experiencing login failures.
  • ~12:45 PM: Emergency mitigation is enacted: the problematic service is brought down, immediately releasing its held connections back to the pool and restoring baseline functionality for the login service.
  • 1:04 PM: The deployment is officially reverted to the previous stable state. System stability is confirmed, and the incident is closed.

Resolution and Recovery

The immediate bleeding was stopped by intentionally spinning down the problematic service, which allowed the authentication services to recover. The underlying code change was then completely reverted in version control and redeployed to ensure the service could be brought back online safely without causing a secondary outage.

Preventative Measures

To ensure we can more quickly identify and resolve similar issues in the future, we have integrated additional monitoring tools focused on detecting failed logins. This will allow us to immediately detect authentication issues and mitigate them before they cause a prolonged system-wide impact.
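As a rough sketch of the kind of signal we now watch (metric and label names here are illustrative, assuming a prom-client style counter, not our actual instrumentation), the login service increments a failure counter with a reason label, and an alert fires when database-related failures spike above baseline:

    import { Counter } from "prom-client";

    // Illustrative metric; real names and labels differ.
    const loginFailures = new Counter({
      name: "auth_login_failures_total",
      help: "Failed SSO/login attempts, labeled by failure reason",
      labelNames: ["reason"],
    });

    // Called from the login handler whenever an authentication attempt fails.
    export function recordLoginFailure(
      reason: "db_timeout" | "invalid_credentials" | "upstream_error"
    ) {
      loginFailures.inc({ reason });
    }

A failure signature like this incident's would appear as a sudden spike in database-timeout failures starting at 12:07 PM, rather than surfacing through user reports around 12:30 PM.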

Additionally, we have conducted a review of any other dependency packages that may have been inadvertently downgraded during this “pinning” process.

Posted May 04, 2026 - 21:22 UTC

Resolved

We've identified the root cause and the pipeline has resumed. We will publish a full post-mortem shortly.
Posted May 04, 2026 - 19:57 UTC

Update

We've isolated the issue to our sync service that retrieves data from school data sources. We are re-enabling enrichment & materialization. We are going to revert to the previous stable version of the sync application and re-enable it in a limited capacity for monitoring.
Posted May 04, 2026 - 18:30 UTC

Monitoring

Reports of degraded API & SSO performance started coming in around 12:30pm CT.
After investigating our logs, it appears that the degraded performance started at exactly 12:07pm Central Time (about 23 minutes prior to the first reports).
As of 1:04pm Central Time, we have reallocated database resources and the immediate issue has been resolved. SSO & API requests have returned to expected levels.
We are continuing to investigate the root cause of the issue before restoring normal pipeline services.

Please expect delays in sync and materialization queues while we investigate the issue.
Posted May 04, 2026 - 18:16 UTC
This incident affected: Edlink Core Systems (Edlink API, Edlink Dashboard).