November 20, 2025

Cloudflare Outage Analysis of November 18, 2025: Causes and Solution

On November 18, 2025, much of the Internet experienced outages due to a failure in Cloudflare's network. In this article, we break down what exactly happened, why it wasn't a cyberattack, and how the incident was resolved.

On November 18, 2025, at 11:20 UTC, users around the world started seeing HTTP 5xx errors when trying to access websites protected by Cloudflare. What initially seemed like a massive attack turned out to be a complex internal error. The company has published a detailed post-mortem analysis explaining the incident.

When a crisis occurs in a company, communication becomes essential to inform every party affected by it. In technology companies, a detailed analysis of what happened is key to resolving any doubts the incident may have raised. Cloudflare’s handling of this event is an example of how to report a security incident.


What Caused the Outage?

Computer science is built on limits; nearly everything has an allowed maximum. For example, your mobile phone has a certain number of GB of storage or RAM, phone companies cap your browsing after a month, your internet connection is rated at 300 Mb, and so on. These limits restrict what a user is allowed to do, but in other circumstances they stop a fault from escalating into an even more serious incident, or let you shut a system down during an attack.

Despite initial speculation, Cloudflare has confirmed that this was not a cyberattack or malicious activity. The root cause was technical and originated from a permissions update in its database system:

  1. The trigger: A change in the permissions of a database system (ClickHouse) caused duplicate records to be generated in a "feature file".
  2. The Ripple Effect: This file is vital to Cloudflare’s bot management system. Due to the duplicates, the file size doubled, exceeding the size limit that the network software could handle.
  3. The failure: Upon receiving a larger file than expected, the software in charge of directing traffic on the machines on the network failed, causing the interruption of the service.

Interestingly, the system was periodically trying to recover when a correct file was generated, which caused fluctuations that confused the engineers, leading them to initially think they were under a DDoS attack.

Affected Services

The disruption was not total, but it was widespread, affecting critical components:

  • CDN and Security: 5xx errors visible to end users.
  • Turnstile: Failed to load.
  • Workers KV: Considerable increase in failures.
  • Dashboard: Inaccessible due to dependence on Turnstile.
  • Cloudflare Access: Widespread authentication failures.

Timeline of the Resolution

Before looking at the timeline, a note on timestamps: RFC 3227 explicitly states that one must "note the difference between the system clock and UTC" and, "for each timestamp, indicate whether UTC or local time is used."

Key reasons for using UTC:

Global Event Correlation: In a security incident involving multiple systems in different time zones, UTC allows events to be correlated with absolute accuracy.

Legal Admissibility: Courts require accurate and verifiable timestamps. UTC is the internationally recognized standard, which makes the evidence more admissible.

Avoiding Ambiguity: Daylight saving time changes, time zone differences, and local settings can create confusion. UTC removes these variables.

Attack Reconstruction: To understand the exact sequence of an attack (especially coordinated or distributed attacks), you need an accurate and unified timeline.

Clock Drift: RFC 3227 specifically mentions "recording system clock drift": this is crucial because compromised systems can have manipulated or desynchronized clocks.
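In practice, recording timestamps the way RFC 3227 recommends is straightforward in most languages. A minimal Python sketch of a log line that carries an unambiguous UTC timestamp alongside the machine's local offset (function name and format are illustrative assumptions, not part of any standard):

```python
from datetime import datetime, timezone

def log_entry(message: str) -> str:
    """Build a log line with an unambiguous UTC timestamp.

    The ISO 8601 string ends in +00:00, so a reader never has to guess
    whether local time or UTC was used; the local offset is recorded
    too, so events can later be correlated with local records.
    """
    now_utc = datetime.now(timezone.utc)
    local_offset = datetime.now().astimezone().utcoffset()
    return f"{now_utc.isoformat(timespec='seconds')} (local offset {local_offset}) {message}"
```

A line such as `2025-11-18T11:20:00+00:00 (local offset 1:00:00) 5xx errors begin` can be correlated exactly with logs from any other time zone.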

The chronology of the incident, in Coordinated Universal Time (UTC), was as follows:

  • 11:20 UTC: Network failures begin.
  • Diagnostics: After discarding an attack, the corrupted file was identified.
  • Workaround: Failed file propagation was stopped, an older valid version was manually inserted, and the central proxy was forced to restart.
  • 14:30 UTC: Main traffic began to flow normally.
  • 17:06 UTC: All systems returned to 100% operation.

Incident Response Plan

NIST (National Institute of Standards and Technology) provides guidelines for managing cybersecurity incidents. These guidelines are documented in the special publication NIST SP 800-61 Rev. 2. The goal is to minimize the impact of incidents and restore operations as soon as possible.

The main phases of an incident’s lifecycle are:

1. Preparedness: Implement policies and tools to be prepared.

2. Detection and Analysis: Identify and analyze the incident.

3. Containment, Eradication, and Recovery: Limit damage, eliminate threat, and restore systems.

4. Lessons Learned: Review the incident and improve processes.
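The four phases above form a cycle: findings from Lessons Learned feed back into Preparedness. A small sketch (a hypothetical helper, not part of NIST SP 800-61) of tracking an incident through that cycle:

```python
from enum import Enum

class IncidentPhase(Enum):
    """The four phases of the NIST SP 800-61 incident response lifecycle."""
    PREPARATION = 1
    DETECTION_AND_ANALYSIS = 2
    CONTAINMENT_ERADICATION_RECOVERY = 3
    LESSONS_LEARNED = 4

def next_phase(phase: IncidentPhase) -> IncidentPhase:
    """Advance to the next lifecycle phase.

    Lessons Learned loops back into Preparation, since post-mortem
    findings improve readiness for the next incident.
    """
    order = list(IncidentPhase)
    return order[(order.index(phase) + 1) % len(order)]
```

Modeling the lifecycle as a loop rather than a straight line captures the point of the post-mortem: the analysis only pays off if it changes the preparation for next time.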

Cloudflare’s response followed this lifecycle and is an example of how to act.

Conclusion

Cloudflare has apologized for the outage, acknowledging that an outage of its network is unacceptable given its importance in the Internet ecosystem. The company has assured that it is implementing measures to prevent a file validation error of this type from bringing down its services again in the future.

Source: Cloudflare Blog

Avelino Dominguez

Biologist - Teacher - Statistician #SEO #SocialNetwork #Web #Data ♟Chess - Galician
