Kubernetes Upgrade
Incident Report for Edlink
Resolved
We finalized our upgrades to all necessary Kubernetes services.
Posted May 18, 2023 - 11:00 UTC
Update
We continue to work on upgrading additional Kubernetes services and will provide our next update around 11am CST.
Posted May 17, 2023 - 06:23 UTC
Monitoring
Please note that this message was composed by a member of the non-technical team (Amanda) and may therefore be technically inaccurate for the time being.

These messages are not reviewed by the technical team until after the incident is resolved, as that is their first priority in the meantime.

A fix was found and we're monitoring the situation.

We upgraded a number of services relating to the NGINX ingress and NGINX. This allowed us to start serving traffic again.

We are still working on upgrading a handful of other services, but we believe that we will remain online throughout the upgrades from this point forward.

Assuming that we do not experience any further downtime, we will provide our next update once we finish the upgrades, with a postmortem to follow.

We're grateful to our team and our clients for their patience while we work diligently through the night to ensure the issue is resolved and complete the remaining upgrades. We apologize for any inconvenience this may have caused.
Posted May 17, 2023 - 04:30 UTC
Identified
Please note that this message was composed by a member of the non-technical team (Amanda) and may therefore be technically inaccurate for the time being.

These messages are not reviewed by the technical team until after the incident is resolved, as that is their first priority in the meantime.

We're quite certain at this point that the issue lies with Kubernetes.

Tonight at 7:12pm CST, Google updated their status page saying that the Legacy Image API was returning a high number of errors and that the US region we use, Iowa, was affected. No workaround was provided.

Google continued to provide updates throughout the incident, which you can read here:
https://status.cloud.google.com/incidents/LTRFobVHV4eSL5vfgasv#RP1d9aZLNFZEJmTBk8e1

As of 8:10pm CST, Google still did not know the root cause of the issue, and as of 9pm CST, Google still had no ETA for mitigation.

At 9:47pm, Google updated their status page to say that the issue had been resolved. However, when reviewing our Kubernetes dashboard, we found a note stating that the nodes in our node pool were being updated. It continued:

"For node version upgrades, it typically takes 4-5 minutes to upgrade a single node or longer (e.g., due to pod disruption budget or grace period). For updates to node metadata like labels, taints and tags, it takes less than a minute per node and it does not recreate the nodes or cause any disruption to running workloads."

The estimated time remaining for the update as of 11:10pm CST was 79 minutes.

Further, we've identified that the specific part of our Kubernetes configuration that is not operating properly is NGINX.

We'll continue to troubleshoot and test until the issue is resolved.

We're grateful to our team and our clients for their patience while we work diligently through the night to resolve the issue. We apologize for any inconvenience this may be causing.

I'll continue to share relevant information as we uncover it.
Posted May 17, 2023 - 04:23 UTC
Investigating
Please note that this message was composed by a member of the non-technical team (Amanda) and may therefore be technically inaccurate for the time being.

These messages are not reviewed by the technical team until after the incident is resolved, as that is their first priority in the meantime.

At approximately 9:23pm CST, the Edlink website, Dashboard, and API stopped responding. We were not performing updates at the time and have no reason to believe this is an Edlink-created bug.

The engineers on call are currently investigating the issue, but our earliest lead is that Google is performing unannounced maintenance on our Kubernetes clusters.

I'll continue to share relevant information as we uncover it.
Posted May 17, 2023 - 02:23 UTC
This incident affected: Edlink Core Systems (Edlink API, Edlink Dashboard).