FBB chat regression — 17% of POST requests CPU-killed since 5/14 01:06 deploy
When: 2026-05-14 19:25 UTC Audience: Patrick. Decision-ready brief on a revenue-affecting product bug found via observability sweep. Status: documented + emailed. Not patched. Recommended fix is one line; I want your read first because it touches a worker actively serving paying users.
The finding
I pulled FBB Cloudflare Workers observability today as a routine cache-hit-rate check — the 5/13 deploy added extended-cache-ttl-2025-04-11 with cache_control on system + last-user-message, and the state file flagged "cache path unexercised by current traffic mix" so I wanted to see actual cache hit/miss numbers.
What I found instead, by accident:
| Window | outcome=ok |
outcome=exceededCpu |
Failure rate |
|---|---|---|---|
| Prior 24h (5/12 19:18 → 5/13 19:18 UTC) | 19,331 | 227 | 1.16% |
| Last 24h (5/13 19:18 → 5/14 19:18 UTC) | 21,524 | 4,550 | 17.4% |
A 20× regression on the fetch-handler failure rate. 4,541 of the 4,550 exceededCpu errors are on POST / — the chat endpoint. Two on /api/memory. Zero on anything else.
Hourly time series (last 24h, exceededCpu count):
2026-05-13 19 132 █████████████
2026-05-13 20 274 ███████████████████████████
2026-05-13 23 506 ████████████████████████████████████████
2026-05-14 00 612 ████████████████████████████████████████
2026-05-14 04 411 ████████████████████████████████████████
2026-05-14 05 405 ████████████████████████████████████████
2026-05-14 06 425 ████████████████████████████████████████
2026-05-14 12 301 ██████████████████████████████
2026-05-14 18 333 █████████████████████████████████
Spread across the whole day, not bot-fanout: the top ASNs are Verizon (106) / TDC Holding Denmark (104) / Datacamp datacenter (104) / Cox (29) / Comcast (14) etc. Real users hitting a broken chat endpoint.
The kills happen at remarkably consistent timing — sample of three:
path=/ wall=8510ms cpu=35ms
path=/ wall=8620ms cpu=10ms
path=/ wall=8362ms cpu=10ms
Wall time ~8.5s, CPU time 10–35ms. The reported cpuTimeMs is the request-handler CPU before the response is returned — it doesn't account for the streaming IIFE. The kill seems to happen ~8.5s into the streaming-tail phase.
The user-visible symptom
When this fires, the user sends a message, the chat response starts streaming (because the response is returned with readable already), but the stream stops mid-response or never produces tokens. The client-side retry logic catches [connection lost] and tries once more — and if the retry also lands in the broken path, the user sees the manual-regenerate note.
So the impact is a fraction of chats producing partial / no responses, with the auto-retry masking some of it. From Cloudflare's view, the worker invocation outcomes show ~17% failing.
What changed
The Workers script was last modified 2026-05-14T01:06:10 UTC (commit 471d2e6, "FBB tier-aware routing: $5 supporter vs $15+ patron"). That commit only changed tier-routing logic — it didn't touch the streaming pipeline or prompt-caching code.
But the 5/13 commit (4ff79a8, "FBB: Sonnet 4.6 patron+crisis, no caps, prompt caching, regenerate fix") added:
- Direct-Anthropic streaming (replaced OpenRouter for the patron + crisis path)
extended-cache-ttl-2025-04-11beta withcache_controlon system + last user message- The structured
sonnet_calllog line at end of stream
The current production worker contains both. The exceededCpu regression matches the deploy timing of these.
Hypothesis: missing ctx.waitUntil() on the streaming IIFE
In index.js:514–632, the chat handler:
- Returns
new Response(readable, ...)immediately at line 629. - Launches a fire-and-forget IIFE
(async () => { ... })()at line 519 that reads the Anthropic SSE stream, writes text deltas to the TransformStream writer, and at end logs the structuredsonnet_callJSON line. - The IIFE is not wrapped in
c.executionCtx.waitUntil(...).
Cloudflare Workers' contract: after the response object is returned, any background work not registered with waitUntil() can be terminated by the runtime. With streaming responses, the runtime keeps the worker alive while the body is actively flowing — but there's a documented edge where if the runtime considers the request "done" (client disconnected, response stream paused, or some other heuristic), pending background promises get aborted and the invocation is labeled exceededCpu.
Two pieces of evidence for this hypothesis:
- The structured
sonnet_calllog line never reaches observability, even though[observability] head_sampling_rate = 1.0is set inwrangler.prod.toml. I searched the last 24h:0events matchingsonnet_callinsource.message. The other console.error calls inside the same IIFE (Anthropic stream parse error,Empty Anthropic response,Failed to parse memory extraction) — also0matches. - The
exceededCpuoutcome with cpuTimeMs ~10ms is the signature pattern when CF kills a worker on the streaming-tail rather than on real CPU exhaustion.
Proposed fix (one line)
- (async () => {
+ c.executionCtx.waitUntil((async () => {
// ... existing IIFE body ...
console.log(JSON.stringify(logLine));
// ...
- })();
+ })());
This tells the runtime to keep the worker alive until the IIFE completes, regardless of response-stream state. Side effects:
- Logs return.
sonnet_callJSON line + console.errors should start appearing in observability immediately. Re-validates the cache-hit-rate question that motivated this entire investigation. - CPU-budget consumption is honest. The IIFE's CPU is now attributed back to the request. Sonnet-streaming ~4096 tokens isn't CPU-heavy (mostly subrequest waiting), so I don't expect new CPU pressure — but the metric will reflect actual work.
- Fix scope is narrow. Only the streaming-response path is affected. Free-tier (V4-Flash via OpenRouter) goes through a different function — same shape, same hypothesis, same fix would apply.
Risks
- If the hypothesis is wrong (the exceededCpu kills are happening for a different reason), the fix won't help. But adding
waitUntilis harmless even if so — it just makes the runtime promise to keep the worker alive, which it should already be doing while the body is streaming. - Need to apply the same wrapping to the V4-Flash streaming path (separate function) for symmetry.
What I'd do, awaiting your call
- Patch both streaming paths with
c.executionCtx.waitUntil(...)wrapping the IIFE. - Deploy via
wrangler deploy --config wrangler.prod.toml. - Verify within ~1h: pull observability, check (a)
outcome=exceededCpurate falls back near 1%, (b)sonnet_calllog line is reaching the dataset. - Update state file with the cache-hit-rate question — now answerable.
If you want me to ship it, say so. If you want to read the diff first or wait for a different fix shape, say that. I'd not normally write to FBB production without you in the loop, but the regression is real and the longer it sits, the more user sessions break.
Capability gap surfaced
The 5/13 state-file framing — "head_sampling=1.0 active. Cache path unexercised by current traffic mix. Deploy d144d1c0" — turned out to be a calcified claim that didn't survive verification. Cache instrumentation was deployed, but the logs never reached the dataset because of the waitUntil gap. The structured-logging deploy was a self-defeating event — the instrumentation was correct, the logging API call was correct, but the runtime-lifecycle interaction silently dropped 100% of the output.
If past me had verified the new logs were reaching observability the same tick as the deploy (per workers_default_sampling_limits_grep memory), this would have been caught 24h earlier. New rule for memory: any structured-log deploy verifies one log line appears in the dataset within 15 minutes of deploy, before declaring the deploy successful.