Cloudflare Improves Systems After Data Loss Incident

On November 14, 2024, Cloudflare encountered a serious issue that disrupted its logging services for 3.5 hours. During this time, 55% of the logs that were meant to be sent to customers were lost permanently. The outage stemmed from a faulty software update in a service called Logpush, which is responsible for delivering bundled logs to customers.

The update unintentionally signaled to another service, Logfwdr, that no logs were configured for delivery. Although the Cloudflare team identified the mistake and reversed it within five minutes, the error triggered another issue in Logfwdr.

This resulted in logs for every customer being processed and forwarded downstream, instead of only the logs belonging to customers with active configurations. The overwhelming influx of data caused the system to malfunction, leading to widespread data loss.

 

Why Did The System Fail?

 

The failure was caused by a cascading overload that spread through key parts of Cloudflare’s infrastructure. Logfwdr, the service responsible for forwarding logs, suddenly began sending logs for all customers. This unexpected surge overwhelmed Buftee, a critical system buffer that manages logs separately for each customer.

Buftee is designed to prevent one customer’s data from interfering with another’s, but it was not equipped to handle such a dramatic increase in load. The service began creating buffers at 40 times its normal capacity, far beyond what it could handle. Safeguards intended to prevent this situation were in place but had not been properly configured. This left Buftee unable to function, requiring a complete system reset.
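
The safeguard described here amounts to a cap on buffer creation that only takes effect once it is configured. The Go sketch below, using entirely hypothetical names and limits, illustrates the idea: a cap left at its default value does nothing, so a surge creates buffers unchecked, while a configured limit rejects the overflow.

```go
// Hedged sketch of the kind of safeguard the article says Buftee had but had
// not configured: a cap on how many per-customer buffers may exist. All names
// and numbers are illustrative assumptions, not Cloudflare's actual settings.
package main

import (
	"errors"
	"fmt"
)

type bufferPool struct {
	maxBuffers int                 // 0 means "not configured": no cap is enforced
	buffers    map[string][]string // one buffer per customer
}

// ensureBuffer creates a buffer for a customer if one does not already exist.
func (p *bufferPool) ensureBuffer(customerID string) error {
	if _, ok := p.buffers[customerID]; ok {
		return nil
	}
	// With maxBuffers left at its zero value, this guard never trips,
	// which is analogous to a safeguard that exists but was never turned on.
	if p.maxBuffers > 0 && len(p.buffers) >= p.maxBuffers {
		return errors.New("buffer limit reached; refusing to create more")
	}
	p.buffers[customerID] = nil
	return nil
}

func main() {
	// Unconfigured pool: a 40x surge sails straight through.
	unconfigured := &bufferPool{buffers: map[string][]string{}}
	for i := 0; i < 40; i++ {
		_ = unconfigured.ensureBuffer(fmt.Sprintf("customer-%d", i))
	}
	fmt.Println("buffers created with no cap configured:", len(unconfigured.buffers))

	// Configured pool: the same surge is rejected once the cap is hit.
	capped := &bufferPool{maxBuffers: 10, buffers: map[string][]string{}}
	for i := 0; i < 40; i++ {
		if err := capped.ensureBuffer(fmt.Sprintf("customer-%d", i)); err != nil {
			fmt.Println("surge rejected at", len(capped.buffers), "buffers:", err)
			break
		}
	}
}
```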

The incident revealed that Cloudflare’s systems were unprepared for the combination of fail-open errors and untested configurations, making the failure unavoidable under the circumstances.

 

How Does Cloudflare’s Log System Work?

 

Cloudflare’s log system is built to handle huge amounts of data with ease, so that customers receive the logs they need without being overwhelmed by volume. It operates using several interconnected services, illustrated in the sketch that follows the list:

Logpush: Bundles logs into manageable file sizes and sends them to customers at regular intervals.
Logfwdr: Determines which logs to forward and where they should go, based on customer configurations.
Logreceiver: Sorts logs into batches for each customer and forwards them to be buffered.
Buftee: Acts as a buffer, holding logs temporarily to manage variations in system demand and prevent data loss.
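
As a rough illustration of how these services fit together, the Go sketch below models the flow the descriptions above imply: configuration-driven filtering, per-customer batching, per-customer buffering, and a final push to customers. Every name and type here is a hypothetical stand-in, not Cloudflare's actual code.

```go
// Minimal, illustrative model of the log pipeline described above. The
// lower-case function names only mirror the roles the article assigns to
// each service; they are assumptions, not real interfaces.
package main

import "fmt"

// Event stands in for a single log line belonging to one customer.
type Event struct {
	CustomerID string
	Payload    string
}

// logfwdr forwards only events for customers that have a delivery
// configuration, mirroring Logfwdr's configuration-driven filtering.
func logfwdr(events []Event, configured map[string]bool) []Event {
	var out []Event
	for _, e := range events {
		if configured[e.CustomerID] {
			out = append(out, e)
		}
	}
	return out
}

// logreceiver sorts events into per-customer batches before buffering,
// matching the role the article gives Logreceiver.
func logreceiver(events []Event) map[string][]Event {
	batches := make(map[string][]Event)
	for _, e := range events {
		batches[e.CustomerID] = append(batches[e.CustomerID], e)
	}
	return batches
}

// buftee holds one buffer per customer so that one customer's volume does
// not interfere with another's, the isolation role described for Buftee.
type buftee struct {
	buffers map[string][]Event
}

func (b *buftee) add(batches map[string][]Event) {
	for id, evs := range batches {
		b.buffers[id] = append(b.buffers[id], evs...)
	}
}

// logpush bundles each customer's buffered logs and "delivers" them.
func logpush(b *buftee) {
	for id, evs := range b.buffers {
		fmt.Printf("pushing %d events to customer %s\n", len(evs), id)
	}
}

func main() {
	events := []Event{
		{CustomerID: "a", Payload: "GET /"},
		{CustomerID: "b", Payload: "POST /api"},
		{CustomerID: "a", Payload: "GET /img"},
	}
	configured := map[string]bool{"a": true} // only customer "a" has delivery configured
	b := &buftee{buffers: make(map[string][]Event)}
	b.add(logreceiver(logfwdr(events, configured)))
	logpush(b)
}
```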

 

On a typical day, Cloudflare processes over 50 trillion event logs, with 4.5 trillion sent directly to customers. While this system usually runs smoothly, the incident exposed weaknesses in handling unexpected spikes in demand.

 

What Were The Root Causes?

 

Two main issues led to the outage. First, a bug in Logfwdr’s configuration system caused it to misinterpret its settings, believing no customers had logs configured for delivery. This issue alone would have been manageable, but it triggered a second, hidden bug in Logfwdr’s fail-safe mechanism.

This fail-safe, designed to prevent data loss, instead sent logs for all customers, overwhelming the system. The resulting flood of data caused Buftee to create an unmanageable number of buffers. Although Buftee had built-in protections to handle such situations, they had not been configured, leaving the system vulnerable.
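
The following Go sketch captures that failure mode with hypothetical function names rather than Cloudflare's code: the buggy fallback treats an empty configuration as "forward logs for everyone", while a safer variant falls back to the last known-good configuration instead.

```go
// A minimal sketch of the fail-open behaviour described above, assuming a
// hypothetical configuration-refresh step; it is not Cloudflare's actual code.
package main

import "fmt"

// chooseCustomersBuggy decides which customers to forward logs for, given
// the latest configuration push. Treating an empty configuration as
// "forward for everyone" is the second, hidden bug: a blank update floods
// the rest of the pipeline.
func chooseCustomersBuggy(configured, allCustomers []string) []string {
	if len(configured) == 0 {
		return allCustomers // fail-open amplification
	}
	return configured
}

// chooseCustomersSafe keeps the last known-good configuration when an
// update arrives empty, so a bad push degrades gracefully instead of
// amplifying the load.
func chooseCustomersSafe(configured, lastGood []string) []string {
	if len(configured) == 0 {
		return lastGood
	}
	return configured
}

func main() {
	all := []string{"a", "b", "c", "d"}
	lastGood := []string{"a"}
	empty := []string{} // the faulty update: "no logs configured"

	fmt.Println("buggy fallback:", chooseCustomersBuggy(empty, all))
	fmt.Println("safe fallback: ", chooseCustomersSafe(empty, lastGood))
}
```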

Cloudflare admitted that these failures were foreseeable but not adequately tested. The lack of configuration and testing led to a situation where systems designed to prevent failure ended up contributing to it.

 

What Is Cloudflare Doing To Prevent This?

 

Cloudflare is taking several steps to avoid similar incidents. Automated alerts are being implemented to quickly identify and deal with misconfigurations before they escalate. These alerts are expected to catch problems that might otherwise go unnoticed.
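
One simple shape such an alert could take, sketched in Go with assumed names and thresholds, is a check that flags any configuration refresh reporting drastically fewer customers than the previous one.

```go
// Hedged sketch of an automated misconfiguration alert of the kind the
// article mentions. The threshold and function name are assumptions for
// illustration only.
package main

import "fmt"

// configDropAlert returns true when the new customer count falls below half
// of the previous count, a crude proxy for "a push accidentally wiped the
// configuration".
func configDropAlert(previous, current int) bool {
	if previous == 0 {
		return false
	}
	return current < previous/2
}

func main() {
	fmt.Println(configDropAlert(100_000, 99_800)) // false: normal churn
	fmt.Println(configDropAlert(100_000, 0))      // true: blank config, raise an alert
}
```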

The company is also expanding its testing procedures to simulate overload scenarios, similar to the conditions that caused this failure. These tests will help make sure that safety measures like those in Buftee are configured and functioning as intended. Also, Cloudflare is revisiting its fail-safe mechanisms so that they respond appropriately during unexpected events.

These measures are designed to make the system more resilient and reduce the likelihood of similar problems in the future.

 

How Did Cloudflare Acknowledge Its Responsibility?

 

Cloudflare has taken full accountability for the outage. The company admitted that while the tools to prevent such an incident were already in place, they had not been configured or tested properly.

Comparing the situation to having a seatbelt but failing to buckle it, Cloudflare acknowledged that the safeguards were useless without proper activation.

In an apology to its customers, Cloudflare detailed the causes of the failure and outlined the steps it is taking to improve.