A Single Point of Failure Triggered the Amazon Outage Affecting Million

An anonymous reader quotes a report from Ars Technica: The outage that hit Amazon Web Services and took out vital services worldwide was the result of a single failure that cascaded from system to system within Amazon’s sprawling network, according to a post-mortem from company engineers. […] Amazon said the root cause of the outage was a software bug in software running the DynamoDB DNS management system. The system monitors the stability of load balancers by, among other things, periodically creating new DNS configurations for endpoints within the AWS network. A race condition is an error that makes a process dependent on the timing or sequence events that are variable and outside the developers’ control. The result can be unexpected behavior and potentially harmful failures.

In this case, the race condition resided in the DNS Enactor, a DynamoDB component that constantly updates domain lookup tables in individual AWS endpoints to optimize load balancing as conditions change. As the enactor operated, it “experienced unusually high delays needing to retry its update on several of the DNS endpoints.” While the enactor was playing catch-up, a second DynamoDB component, the DNS Planner, continued to generate new plans. Then, a separate DNS Enactor began to implement them. The timing of these two enactors triggered the race condition, which ended up taking out the entire DynamoDB. […] The failure caused systems that relied on the DynamoDB in Amazon’s US-East-1 regional endpoint to experience errors that prevented them from connecting. Both customer traffic and internal AWS services were affected.

The damage resulting from the DynamoDB failure then put a strain on Amazon’s EC2 services located in the US-East-1 region. The strain persisted even after DynamoDB was restored, as EC2 in this region worked through a “significant backlog of network state propagations needed to be processed.” The engineers went on to say: “While new EC2 instances could be launched successfully, they would not have the necessary network connectivity due to the delays in network state propagation.” In turn, the delay in network state propagations spilled over to a network load balancer that AWS services rely on for stability. As a result, AWS customers experienced connection errors from the US-East-1 region. AWS network functions affected included the creating and modifying Redshift clusters, Lambda invocations, and Fargate task launches such as Managed Workflows for Apache Airflow, Outposts lifecycle operations, and the AWS Support Center. Amazon has temporarily disabled its DynamoDB DNS Planner and DNS Enactor automation globally while it fixes the race condition and add safeguards against incorrect DNS plans. Engineers are also updating EC2 and its network load balancer.

Further reading: Amazon’s AWS Shows Signs of Weakness as Competitors Charge Ahead


Source link

Leave a Reply

Your email address will not be published. Required fields are marked *