Google’s recent cloud outage caused by updates stuck in canary test deployment

At the end of last month, newly created Google Compute Engine instances, cloud VPNs and network load balancers were left unavailable for just over two hours, in a wide-scope incident.

The affected servers had public IP addresses, but could neither be reached from the outside world nor send any traffic, and load balancing health checks failed as well.
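A load balancer's health check typically just probes each backend and pulls it from rotation on failure. The sketch below is a minimal, generic TCP health check for illustration; the host, port, and function name are placeholders, not details of Google's actual checks.

```python
import socket

# Illustrative only: a generic TCP health check of the kind a load
# balancer might run against a backend. Names and parameters here are
# hypothetical, not Google's implementation.
def tcp_health_check(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within
    the timeout; a backend that fails the check is taken out of rotation."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # connection refused, unreachable, or timed out
        return False
```

During the incident, instances were unreachable at their public IPs, so a check like this would have reported them unhealthy even though the VMs themselves were running.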

“[W]e apologize for the wide scope of this issue and are taking steps to address the scope and duration of this incident as well as the root cause itself,” Google said.

Updates stuck in testing phase

The Alphabet subsidiary attributed the outage to “a large set of updates which were applied to a rarely used load balancing configuration…exposed to an inefficient code path which resulted in canary [testing deployment] timing out.”

Further updates then queued up behind the timed-out changes, though Google was able to resolve the issue in a few hours.

“To prevent this issue from reoccurring in the short term, Google engineers are increasing the canary timeout so that updates exercising the inefficient code path merely slow network changes rather than completely stop them,” it said.
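The mechanism Google describes can be sketched as a canary gate with a wall-clock deadline: an update is validated on the canary first, and if validation exceeds the timeout the change is stopped, with later updates queuing behind it. The snippet below is a toy illustration under those assumptions; all names and the timeout value are hypothetical, not Google's pipeline.

```python
import time

# Hypothetical canary timeout -- raising it is the short-term fix the
# report describes: slow-but-valid updates merely slow changes down
# instead of being stopped outright.
CANARY_TIMEOUT_S = 0.5

def apply_with_canary(update, validate, timeout_s=CANARY_TIMEOUT_S):
    """Validate an update on the canary; treat it as timed out if
    validation exceeds the deadline, so it never reaches production."""
    start = time.monotonic()
    ok = validate(update)
    elapsed = time.monotonic() - start
    if elapsed > timeout_s:
        # Too tight a timeout stops the change completely, and
        # subsequent updates queue behind it.
        return "timed-out"
    return "applied" if ok else "rejected"

# A common configuration validates quickly and is applied.
print(apply_with_canary({"cfg": "common"}, lambda u: True))  # applied

# A rarely used configuration hits a slow code path and times out;
# a larger timeout_s would let it through, just more slowly.
print(apply_with_canary({"cfg": "rare"},
                        lambda u: (time.sleep(0.6), True)[1]))  # timed-out
```

Increasing `timeout_s` in this model is exactly the trade-off quoted above: updates exercising the inefficient code path slow network changes rather than completely stopping them.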

A further long-term solution has been announced: “new tests are being written to test behaviour on a wider range of configurations.”

Engineers are also producing additional metrics and alerting that will allow issues such as this one to be identified sooner, leading to faster resolution.

Canary is also the name of Google’s experimental Chrome build, of which the firm warns: “it’s designed for developers and early adopters, and can sometimes break down completely.”


Edited from source by Cecilia Rehn.