Cloudflare Engineers Uncover Hidden ClickHouse Bottleneck Threatening Billion-Dollar Billing Pipeline
Cloudflare discovers hidden ClickHouse bottleneck slowing billion-dollar billing pipeline; three patches fix the issue and enable per-namespace data retention.
Billing Pipeline Grinds to a Crawl
Cloudflare’s daily billing aggregation jobs, which are responsible for hundreds of millions of dollars in usage revenue, unexpectedly slowed to a crawl after a recent migration. The delay threatened to disrupt invoice reconciliation and downstream systems, including fraud detection.

“It was a big problem when daily aggregation jobs slowed down,” said a Cloudflare engineer who worked on the fix. “Everything we normally check—I/O, memory, rows scanned, parts read—appeared normal. That’s when we knew it was something deeper.”
Hidden Bottleneck Discovered Inside ClickHouse
The bottleneck was traced to a subtle inefficiency in ClickHouse’s internals, specifically in how the database handles data that is sorted per namespace. The affected system, Ready-Analytics, stores petabytes of data from hundreds of applications in a single massive table, sorted by namespace, indexID, and timestamp.
“We had to dig deep into ClickHouse’s query execution logic to find the culprit,” another engineer explained. “It wasn’t a resource issue—it was a design flaw in our own schema and retention policy.”
Background: The Rise of Ready-Analytics
Cloudflare built Ready-Analytics in early 2022 to simplify data onboarding for internal teams. Instead of creating custom tables, teams stream data into one unified table with a standard schema of 20 float fields, 20 string fields, a timestamp, and an indexID. The indexID is a string that forms part of the primary key, allowing each namespace’s data to be sorted optimally for its queries.
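The layout described above can be illustrated with a small Python model. This is only a sketch of the idea, not Cloudflare's actual table definition: the `Row` class, the field names, and the sample namespaces are hypothetical, and it assumes the namespace is a string. It shows how ordering rows by namespace, indexID, and timestamp keeps each namespace's data contiguous while still letting every namespace choose its own indexID.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple

NUM_FLOATS = 20   # per the article: 20 float fields
NUM_STRINGS = 20  # per the article: 20 string fields


@dataclass
class Row:
    """Hypothetical in-memory stand-in for one row of the unified table."""
    namespace: str   # assumed to be a string here
    index_id: str    # chosen by each team to suit its own queries
    timestamp: datetime
    floats: List[float] = field(default_factory=lambda: [0.0] * NUM_FLOATS)
    strings: List[str] = field(default_factory=lambda: [""] * NUM_STRINGS)

    def sort_key(self) -> Tuple[str, str, datetime]:
        # Mirrors the sort order named in the article.
        return (self.namespace, self.index_id, self.timestamp)


def ts(hour: int) -> datetime:
    """Build a sample UTC timestamp on an arbitrary day."""
    return datetime(2024, 12, 1, hour, tzinfo=timezone.utc)


rows = [
    Row("billing", "account-b", ts(2)),
    Row("dns", "zone-a", ts(1)),
    Row("billing", "account-a", ts(3)),
    Row("billing", "account-a", ts(1)),
]

# Sorting groups rows by namespace first, then indexID, then time.
ordered = sorted(rows, key=Row.sort_key)
for r in ordered:
    print(r.namespace, r.index_id, r.timestamp.isoformat())
```

Because the namespace leads the key, a query restricted to one namespace only has to touch that namespace's contiguous range, and the indexID then narrows the range further.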
By December 2024, Ready-Analytics held over 2 petabytes of data and ingested millions of rows per second. But its retention policy—dropping partitions older than 31 days—was a blunt instrument. Teams requiring longer retention had to skip Ready-Analytics entirely, opting for a much more complex conventional setup.

The Problem: One-Size-Fits-All Retention
Cloudflare’s use of ClickHouse predates the database’s native Time-to-Live (TTL) features, so the company built its own retention system based on daily partitioning. The Ready-Analytics table applied a single 31-day retention window to every namespace, which forced teams with legal or contractual obligations to store data for years to build separate infrastructure.
“This restriction meant many use cases couldn’t use Ready-Analytics,” a product manager noted. “We needed a per-namespace retention solution that didn’t require abandoning the platform.”
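To make the limitation concrete, here is a minimal Python sketch of partition-based retention with one global window. It is not Cloudflare's retention system: the function name `expired_partitions` and the sample dates are hypothetical, and it assumes one partition per calendar day. It only computes which daily partitions fall outside a 31-day window.

```python
from datetime import date, timedelta
from typing import Iterable, List

RETENTION_DAYS = 31  # the global retention window cited in the article


def expired_partitions(
    partitions: Iterable[date],
    today: date,
    retention_days: int = RETENTION_DAYS,
) -> List[date]:
    """Return the daily partitions strictly older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if p < cutoff)


# Hypothetical example: 35 daily partitions ending on 2024-12-31.
today = date(2024, 12, 31)
partitions = [today - timedelta(days=n) for n in range(35)]

to_drop = expired_partitions(partitions, today)
for p in to_drop:
    print("would drop partition", p.isoformat())
```

Each date returned would correspond to dropping one whole daily partition. Since a partition here holds every namespace's rows for that day, the cutoff cannot differ per namespace, which is why a team needing multi-year retention could not be accommodated under this scheme.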
What This Means for Cloudflare and Users
The three patches written to fix the bottleneck not only restored billing pipeline performance but also enabled per-namespace retention, opening Ready-Analytics to teams that previously had to avoid it. The engineers have documented their approach to share with the ClickHouse community.
“The fix eliminated the hidden bottleneck and gave us the flexibility we needed,” said a lead engineer. “Now teams can set their own retention periods without impacting the entire cluster.”
Cloudflare expects the improvements to accelerate onboarding for internal teams and reduce operational overhead. Users will benefit from more accurate and timely billing, while the company avoids revenue reconciliation headaches.