Edlink Database Unscheduled Downtime

Incident Report for Edlink

Resolved

Due to a previously unknown constraint on our managed database, we suffered several hours of downtime on Thursday morning. In the course of normal operations, our database disk exceeded 95% capacity. We were aware of the limited disk space and had already triggered an upgrade of our resources. What we did not know was that our hosting provider automatically disables writes to the database, in order to "prevent disk corruption", when a database reaches this threshold. Our standby (backup) database was not triggered in this instance because the primary node was not "down"; it was simply stuck in read-only mode.
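A minimal sketch of the gap described above, assuming a hypothetical health probe that reports reachability, the result of a small test write, and disk usage. The names `ProbeResult`, `classify`, `needs_attention`, and `disk_alert` are illustrative only and do not reflect Edlink's or the hosting provider's actual tooling; the 80% warning level is an assumption, while the 95% cutoff is the threshold from this incident. The point is that a check based only on reachability reports a read-only primary as healthy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NodeState(Enum):
    HEALTHY = "healthy"
    READ_ONLY = "read_only"  # reachable, but writes are rejected
    DOWN = "down"


@dataclass
class ProbeResult:
    """Hypothetical result of one health probe against the primary node."""
    reachable: bool       # did the node accept a connection?
    write_ok: bool        # did a small test write succeed?
    disk_used_pct: float  # reported disk usage, 0 to 100


WRITE_CUTOFF_PCT = 95.0  # threshold at which the provider disabled writes
WARN_PCT = 80.0          # assumed early-warning level, not from the report


def classify(probe: ProbeResult) -> NodeState:
    """Distinguish a read-only node from one that is fully down."""
    if not probe.reachable:
        return NodeState.DOWN
    if not probe.write_ok:
        return NodeState.READ_ONLY
    return NodeState.HEALTHY


def needs_attention(probe: ProbeResult) -> bool:
    """Treat a read-only primary as unhealthy, not only an unreachable one."""
    return classify(probe) is not NodeState.HEALTHY


def disk_alert(probe: ProbeResult) -> Optional[str]:
    """Return a warning message before or at the write cutoff, else None."""
    if probe.disk_used_pct >= WRITE_CUTOFF_PCT:
        return "critical: disk at or above write cutoff"
    if probe.disk_used_pct >= WARN_PCT:
        return "warning: disk approaching write cutoff"
    return None


# The situation in this incident: reachable, writes rejected, disk above 95%.
incident = ProbeResult(reachable=True, write_ok=False, disk_used_pct=96.0)
print(classify(incident).value, needs_attention(incident), disk_alert(incident))
```

Whether failing over would have helped here depends on the standby's own disk usage, which this report does not state, so the sketch only flags the condition instead of choosing a remedy.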

The resolution was to wait for the database resize to complete. No data was lost or accessed improperly during this time. Once the resize finished, API operations proceeded normally.

While we were already planning a migration to a more distributed database architecture, we had not intended to make the shift for several more months. Because of the extremely rapid growth of our product among software companies and school districts, we will be prioritizing this transition. The new architecture will greatly reduce the risk of downtime, improve performance, and allow us to add database resources without disruption to existing users.

Posted Sep 10, 2020 - 12:30 UTC