On November 18, 2025, Cloudflare, a key player in global internet infrastructure, faced a major outage that disrupted services for millions of users worldwide.
Starting at 11:20 UTC, the company’s network began returning HTTP 5xx error codes, blocking access to customer websites and apps hosted on its platform.
This incident affected high-profile sites like X (formerly Twitter), ChatGPT, and Canva, causing widespread frustration as users encountered error pages instead of content.
The outage lasted several hours, with core services recovering by 14:30 UTC and complete restoration by 17:06 UTC.
Cloudflare CEO Matthew Prince detailed the root cause in a blog post, emphasizing it stemmed from an internal configuration error rather than a cyberattack.
Initially, teams suspected a massive DDoS attack due to fluctuating error patterns and a coincidental outage of the company’s externally hosted status page.
What Triggered The Failure
The problem originated in Cloudflare’s ClickHouse database cluster, used for analytics and bot detection.
At 11:05 UTC, engineers updated permissions to enhance security for distributed queries, explicitly granting users access to the underlying tables in the “r0” database.
This change aimed to run subqueries under each user’s account, improving isolation and preventing a faulty query from affecting others.

However, the update also exposed metadata from both the “default” and “r0” databases. A query such as SELECT name, type FROM system.columns WHERE table = 'http_requests_features' had previously returned only columns from the default database; now it also returned duplicates from r0, more than doubling the output rows.
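The effect of that missing database filter can be sketched in a few lines. This is a hypothetical illustration, not Cloudflare's schema or code: the struct, sample column names, and helper functions are invented for the example.

```rust
// Hypothetical model of system.columns rows: (database, table, column name).
#[derive(Clone, Copy)]
struct ColumnRow {
    database: &'static str,
    table: &'static str,
    name: &'static str,
}

// Sample metadata; when `include_r0` is true, the same columns are also
// visible through the underlying "r0" database, as after the permissions change.
fn sample_rows(include_r0: bool) -> Vec<ColumnRow> {
    let default_rows = vec![
        ColumnRow { database: "default", table: "http_requests_features", name: "feat_a" },
        ColumnRow { database: "default", table: "http_requests_features", name: "feat_b" },
    ];
    let mut rows = default_rows.clone();
    if include_r0 {
        rows.extend(default_rows.iter().map(|r| ColumnRow { database: "r0", ..*r }));
    }
    rows
}

// Mirrors the query in the text: it filters only on the table name, so each
// column is returned once per database that exposes the table.
fn select_columns(rows: &[ColumnRow], table: &str) -> Vec<&'static str> {
    rows.iter().filter(|r| r.table == table).map(|r| r.name).collect()
}

fn main() {
    let before = select_columns(&sample_rows(false), "http_requests_features");
    let after = select_columns(&sample_rows(true), "http_requests_features");
    println!("before: {} columns, after: {} columns", before.len(), after.len());
}
```

Adding `WHERE database = 'default'` to the filter would have kept the result set stable regardless of what the user could see.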
This query fed into the generation of a “feature file” for Cloudflare’s Bot Management system, a machine learning tool that scores requests to distinguish bots from humans based on traits such as IP patterns and user agents.
The feature file, refreshed every five minutes, ballooned in size from about 60 features to over 200 due to these duplicates.
Propagated rapidly across Cloudflare’s edge servers, the file exceeded a hardcoded limit in the core proxy software, known as FL (Frontline), and its newer FL2 version; the limit exists so that memory for the feature set can be preallocated for performance. When the Bot Management module tried to load the oversized file, it hit an unhandled error and panicked: “thread fl2_worker_thread panicked: called Result::unwrap() on an Err value,” triggering 5xx errors for any traffic relying on the proxy.
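The failure mode can be sketched as follows. This is a minimal sketch, not Cloudflare's actual code: the constant, function names, and error strings are assumptions. A loader enforces a hardcoded cap and returns an Err for an oversized file; calling .unwrap() on that Err aborts the worker thread, exactly the pattern in the quoted panic message.

```rust
// Assumed preallocated capacity for the feature set (hypothetical value
// matching the ~200-feature limit described in the text).
const MAX_FEATURES: usize = 200;

// Rejects a feature file that exceeds the preallocated capacity.
fn load_features(lines: &[&str]) -> Result<Vec<String>, String> {
    if lines.len() > MAX_FEATURES {
        return Err(format!(
            "feature file has {} entries, limit is {}",
            lines.len(),
            MAX_FEATURES
        ));
    }
    Ok(lines.iter().map(|s| s.to_string()).collect())
}

fn main() {
    let oversized = vec!["feature"; 300];

    // The outage pattern: unwrapping the Err aborts the thread.
    // let features = load_features(&oversized).unwrap(); // would panic

    // A defensive alternative: handle the Err and keep serving traffic,
    // e.g. by retaining the last known-good feature set.
    match load_features(&oversized) {
        Ok(features) => println!("loaded {} features", features.len()),
        Err(e) => eprintln!("rejecting bad feature file: {}", e),
    }
}
```

Handling the Result instead of unwrapping it is what Cloudflare's later remediation of proxy failure modes points toward.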
The phased rollout of the database update caused intermittent failures: good files generated by unaffected nodes allowed temporary recovery, while bad ones from updated nodes caused crashes.
This oscillation, combined with increased CPU usage from debugging tools that added latency to responses, masked the issue and fueled suspicion of a DDoS.
Impacted services included core CDN and security, Turnstile (captcha), Workers KV (key-value store), Dashboard logins, Email Security (reduced spam detection), and Access (authentication failures).
On FL2, users saw 5xx errors; on legacy FL, bot scores defaulted to 0, potentially blocking legitimate traffic due to false positives.
Path To Recovery and Lessons Learned
Engineers detected the issue via automated alerts at 11:31 UTC and launched an incident response by 11:35 UTC.
At 13:05 UTC, they bypassed the proxy for Workers KV and Access, reducing downstream errors. By 13:37 UTC, focus shifted to rolling back the feature file to a known-good version.
At 14:24 UTC, propagation of new files halted, and a clean file was manually inserted into the queue.
Testing confirmed recovery, leading to global deployment by 14:30 UTC. Remaining services restarted amid traffic surges, fully normalizing by 17:06 UTC.
Cloudflare called this its worst outage since 2019, apologizing for the disruption to the internet ecosystem.
Future fixes include treating internal config files like untrusted input with validation, adding global kill switches, curbing debug overload, and reviewing proxy failure modes.
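One of those fixes, a global kill switch, can be sketched with a single atomic flag. This is a hypothetical illustration of the concept, not Cloudflare's design: the flag name, scoring function, and placeholder score are invented.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical global kill switch: operators can disable the bot-scoring
// module so the proxy fails open instead of returning 5xx errors.
static BOT_MODULE_ENABLED: AtomicBool = AtomicBool::new(true);

// Returns a bot score for a request, or None when the module is disabled.
fn score_request(_path: &str) -> Option<u8> {
    if !BOT_MODULE_ENABLED.load(Ordering::Relaxed) {
        return None; // module disabled: no score, request still proceeds
    }
    Some(50) // placeholder score standing in for the bot model's output
}

fn main() {
    assert!(score_request("/login").is_some());
    BOT_MODULE_ENABLED.store(false, Ordering::Relaxed); // flip the kill switch
    assert!(score_request("/login").is_none());
    println!("kill switch bypasses scoring");
}
```

The point of the design is that disabling one feature pipeline degrades gracefully rather than taking down the whole proxy.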
These steps aim to bolster resilience in its anycast network, which handles roughly 20% of global web traffic.