That time Google Cloud Platform bricked the Internet…
Updated: July 1, 2025
Summary
The video discusses a recent incident in which bad code pushed to production caused major outages across popular internet services. Google Cloud took responsibility and apologized for the disruption, underscoring how much of the internet depends on big infrastructure providers like Google Cloud. The financial implications of such outages are explored, along with the potential impact on Google Cloud's market share relative to Azure and AWS. The incident was traced back to an API management service in which a binary crashed because of missing error handling, underscoring the need for proper development practices even as AI plays a growing role in writing code. Additionally, a new policy change triggered the chaos, but a rollback procedure with a 'big red button' was used to restore normalcy.
TABLE OF CONTENTS
Introduction to Bad Code Incident
Impact of Bad Code on Internet Services
Apology from Google Cloud
Power of Big Infrastructure
Financial Impact of Major Outages
Service Level Agreements with Cloud Providers
Market Share Impact on Google Cloud
Root Cause Analysis of the Incident
AI in Technology and Human Error
Policy Change Leading to API Loop
Rollback Procedure and Recovery
Introduction to PostHog AI Product
Introduction to Bad Code Incident
Discussion of a recent incident where bad code pushed into production caused major outages across the internet, including Snapchat, Spotify, Discord, and Cloudflare's Workers KV service.
Impact of Bad Code on Internet Services
Exploration of the repercussions of the bad code incident on various internet services and websites, which experienced significant error rates and downtime.
Apology from Google Cloud
Google Cloud taking responsibility for the bad code incident and apologizing for the disruption caused to widely used apps and services.
Power of Big Infrastructure
Highlighting the importance and power of big infrastructure like Google Cloud in today's technological landscape.
Financial Impact of Major Outages
Discussion on the financial implications of major outages like the recent incident, which can result in significant losses for companies.
Service Level Agreements with Cloud Providers
Explanation of service level agreements with cloud providers and the criteria for financial compensation in case of violations.
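For concreteness, here is a minimal Python sketch of how such an SLA credit might be calculated. The uptime thresholds and credit percentages below are hypothetical placeholders, not Google Cloud's actual SLA terms, which vary by service and contract.

def monthly_uptime(downtime_minutes, minutes_in_month=30 * 24 * 60):
    # Monthly uptime percentage given total downtime in minutes.
    return 100.0 * (1 - downtime_minutes / minutes_in_month)

def credit_percent(uptime_pct):
    # Map uptime to a service-credit percentage (hypothetical tiers, for illustration only).
    if uptime_pct >= 99.9:
        return 0      # SLA met, no credit
    elif uptime_pct >= 99.0:
        return 10
    elif uptime_pct >= 95.0:
        return 25
    else:
        return 50

# A roughly three-hour outage in a 30-day month:
uptime = monthly_uptime(180)          # about 99.58%
print(f"uptime: {uptime:.2f}%, credit: {credit_percent(uptime)}% of the monthly bill")

The point of the example is that even a multi-hour outage may only cross the first credit tier, which is why the video frames the real cost as lost revenue and trust rather than SLA payouts.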
Market Share Impact on Google Cloud
Assessment of the impact of the outage on Google Cloud's market share in comparison to Azure and AWS.
Root Cause Analysis of the Incident
Investigation into how the bad code incident occurred, involving an API management service issue and a binary crash due to lack of proper error handling.
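The video does not show the actual code, but a minimal Python sketch (all names hypothetical, not Google's code) illustrates the failure mode: a policy record arrives with a blank field, and code that assumes every field is populated crashes instead of rejecting the bad record.

def enforce_policy_unsafe(policy):
    # Assumes every field is populated; crashes when quota_limit is blank,
    # mirroring the unhandled null described above.
    return policy["request_count"] <= policy["quota_limit"]["max_value"]

def enforce_policy_safe(policy):
    # Defensive version: malformed policy data is handled gracefully
    # instead of crashing the whole binary.
    limit = policy.get("quota_limit") or {}
    if "max_value" not in limit or "request_count" not in policy:
        return True  # fail open (or log and skip the bad record)
    return policy["request_count"] <= limit["max_value"]

bad_policy = {"request_count": 5, "quota_limit": None}  # blank field from the policy push
print(enforce_policy_safe(bad_policy))    # True: handled without crashing
# enforce_policy_unsafe(bad_policy)       # TypeError: 'NoneType' object is not subscriptable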
AI in Technology and Human Error
Exploring the role of AI in technology and addressing human errors in code development, emphasizing the need for error handling mechanisms.
Policy Change Leading to API Loop
Detailing a policy change on May 29th, 2025, that triggered a crash loop in the API management service because a newly added feature path had never been properly exercised, causing chaos and panic.
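A hedged sketch of why this became a loop rather than a one-off crash: assuming the bad policy data had already been replicated, each restart re-reads the same record and dies again, so restarting alone cannot recover the service. All names here are hypothetical.

import time

BAD_POLICY = {"quota_limit": None}        # the blank field, replicated to every region

def start_service(policy):
    # Simulates the binary's startup path hitting the same unhandled blank field.
    return policy["quota_limit"]["max_value"]   # TypeError: None is not subscriptable

restarts = 0
while restarts < 3:                       # stand-in for a supervisor restarting the process
    try:
        start_service(BAD_POLICY)
    except TypeError:
        restarts += 1
        print(f"crashed on startup, restart #{restarts}")
        time.sleep(0.1)                   # brief backoff before the next attempt
print("still down: only rolling back the policy or the binary breaks the loop")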
Rollback Procedure and Recovery
Implementation of a rollback procedure with a 'big red button' to address the incident and restore normalcy after the chaos caused by the bad code.
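A minimal sketch of what a 'big red button' could look like in practice, assuming it is implemented as a kill-switch flag that every instance checks before the new code path. The names are hypothetical and this is not Google's actual mechanism; the idea is that operators can disable the risky behaviour everywhere without a redeploy.

KILL_SWITCH = {"quota_policy_checks_enabled": True}   # stand-in for a shared flag/config service

def press_big_red_button():
    # In a real system this would write to a flag service that every instance polls.
    KILL_SWITCH["quota_policy_checks_enabled"] = False

def handle_request(policy):
    if not KILL_SWITCH["quota_policy_checks_enabled"]:
        return "served (new policy checks disabled)"   # safe fallback path
    if not policy.get("quota_limit"):
        raise ValueError("malformed policy")
    return "served (policy enforced)"

press_big_red_button()                            # roll the behaviour back instantly, no redeploy
print(handle_request({"quota_limit": None}))      # served (new policy checks disabled)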
Introduction to PostHog AI Product
Introduction to PostHog's AI-powered product, Max, integrated within the PostHog app to enable functionality such as answering natural language questions and working with feature flags.
FAQ
Q: What was the recent incident discussed in the video?
A: The recent incident was about bad code being pushed into production, causing major outages across the internet and affecting services like Snapchat, Spotify, Discord, and Cloudflare's Workers KV service.
Q: Who took responsibility for the bad code incident and offered apologies?
A: Google Cloud took responsibility for the bad code incident and offered apologies for the disruption caused to various apps and services.
Q: What are the financial implications of major outages like the recent incident?
A: Major outages like the recent incident can result in significant losses for companies that rely on internet services for their operations.
Q: What triggered the crash loop on May 29th, 2025, causing chaos and panic?
A: A policy change on May 29th, 2025, triggered a crash loop because a newly added feature path had never been properly exercised, leading to chaos and panic.
Q: What is the purpose of the 'big red button' mentioned in the video?
A: The 'big red button' refers to a rollback procedure implemented to address incidents like the bad code incident and restore normalcy after chaos.
Q: How did the bad code incident occur according to the video?
A: The bad code incident occurred due to an API management service issue and a binary crash caused by the lack of proper error handling.
Q: What role does AI play in technology according to the video?
A: AI is mentioned in the video as a tool to address human errors in code development, emphasizing the importance of error handling mechanisms.
Q: What is PostHog's AI-powered product, Max, used for in the PostHog app?
A: Max is integrated within the PostHog app to enable functionality such as answering natural language questions and working with feature flags.
Q: How did the recent incident impact Google Cloud's market share compared to Azure and AWS?
A: The video discusses the impact of the recent incident on Google Cloud's market share in comparison to Azure and AWS, though specific details on the impact are not provided.