Realtime Counter Playbook
Why this playbook exists
The "live count" pill in the cosmos nav-bar (and the matching Site Analytics tile + tool-counter + CC "Live Now") broke 4 separate times in 4 days between 13–16 May 2026. Each fix shipped quickly and looked surgical, but the underlying architecture was wrong from day one. This playbook captures the permanent fix (Option A+, shipped 16 May 2026 commit 6a9f6b7b) so the next session that's asked "why is the live counter broken again?" has a clear path.
Read this first before touching functions/api/stats.js (any realtime path) · .github/workflows/realtime-refresh.yml · realtime:* KV keys · or anything that polls /api/stats?realtime=*.
TL;DR
Cloudflare Cron Trigger every minute (since 19 May 2026; originally a GHA cron every 5 min, looping 5× with 55s sleep)
└─ POST /api/stats?refresh=realtime (Bearer ADMIN_PASSWORD)
   └─ Calls GA4 Realtime ×2 (no-dim total + per-page breakdown)
      └─ Writes env.COSMOS_SUMMARY_KV:
           realtime:active      {active, generated_at}   (24h TTL)
           realtime:pages       {pages, generated_at}    (24h TTL)
           realtime:last_error  (1h TTL; only on failure, never overwrites last-good)
Visitors GET /api/stats?realtime={1|cosmos}
└─ Read from caches.default (L1, 60s per-POP)
   └─ Read from env.COSMOS_SUMMARY_KV (L2, globally shared)
      └─ Fail closed: NEVER fall back to GA4
GA4 calls/hour: constant ~120 (~60 refreshes/hr, one per minute, × ~2 GA4 queries each; Promise.allSettled means the per-page query failing doesn't block the core counter).
Visitor traffic effect on GA4: zero.
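The visitor read path above can be sketched as follows. This is a minimal illustration, not the real handler: `readActive`, `memoryStore`, and the `edgeCache` / `kv` objects are hypothetical stand-ins for caches.default and env.COSMOS_SUMMARY_KV, and the real code in functions/api/stats.js may be structured differently. The point it shows is that a miss on both tiers returns an error payload rather than calling GA4.

```javascript
// Hypothetical two-tier read: L1 edge cache, then L2 KV, then fail closed.
// `edgeCache` and `kv` are illustrative stand-ins, not the real bindings.
async function readActive(edgeCache, kv) {
  const cached = await edgeCache.get("realtime:active");
  if (cached) return { ...cached, source: "L1" };

  const stored = await kv.get("realtime:active");
  if (stored) {
    await edgeCache.put("realtime:active", stored); // warm L1 for this POP
    return { ...stored, source: "L2" };
  }

  // Both tiers missed: report it. There is deliberately no GA4 fallback here.
  return { active: 0, error: "no-data", source: "none" };
}

// Tiny in-memory store used only to exercise the sketch.
function memoryStore(initial = {}) {
  const data = new Map(Object.entries(initial));
  return {
    get: async (key) => data.get(key) ?? null,
    put: async (key, value) => void data.set(key, value),
  };
}
```

With an empty L1 and a populated L2, the first call is served from L2 and warms L1; the second call is served from L1. With both empty, the caller gets `error: "no-data"` and the pill hides.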
🛑 DON'T CONFUSE WITH the hardcoded tool counter (disabled 20 May 2026)
There used to be a separate, completely independent counter on every tool page — the "X articles read / X prompts polished / X plans compared" badge. It was disabled on 20 May 2026 and the entries in data/tool_counters.toml were emptied to tools = []. Do not confuse the two systems and do not recreate the hardcoded one without explicit sign-off.
| Aspect | LIVE counter (this playbook) | HARDCODED counter (REMOVED) |
|---|---|---|
| What | "🟢 X reading now" pill; refreshes every minute | "8,500 articles read" badge; static per-visit |
| Data source | GA4 Realtime via KV cache (realtime:active) | data/tool_counters.toml (hand-seeded numbers) |
| Files | functions/api/stats.js, worker/realtime-cron.mjs, KV | data/tool_counters.toml (emptied), tool-counter.js badge block (no-op when no .tool-counter element) |
| Partials affected by removal | None | layouts/partials/tool-header.html, layouts/partials/tool-hero.html; both guard with {{ if gt $counterBase 0 }} so empty TOML = no render |
| Status | ✅ Live and healthy (Option A+ from 16 May) | ⛔ Disabled 20 May 2026 (commit 3e514708) |
Why the hardcoded one was removed (full incident in incident-log.md § 20 May 2026):
- Honesty rule violation: Sush's own voice-and-tone.md brag-allergy rule says "reject self-stat blocks; use voices (testimonials with real handles)." The badge showed the same number to the same user on every visit, which is decorative social proof, not truth.
- Refresh pipeline silently broken since at least 1 May 2026: the GHA cron in .github/workflows/refresh-counters.yml was supposed to refresh the TOML monthly from GA4, but scripts/refresh-counters.py had the wrong GA4 property ID hardcoded as default (270121818, while the real site property used by stats.js is 530486519). The 1 May run returned 0 page paths and the validation guard correctly aborted.
- Coverage gap: the TOML had 57 entries but the refresh script's TOOL_PATHS dict only covered 28 of them. Even a property-ID fix would have left ~29 tools (Mind Maps, Agent 365 Planner, all the games, all the calculators) frozen at the floor of max(views, 50) forever.
- No incrementing on user actions despite the JS comment claiming otherwise: tool-counter.js only animated 0 → base on scroll-into-view. Nothing in the codebase POSTs or updates the count when a user copies a prompt, generates a QR, etc.
What was changed (20 May 2026, commit 3e514708):
- data/tool_counters.toml → emptied to tools = [] with an explanatory header comment. File kept (not deleted) so Hugo's (index hugo.Data "tool_counters").tools chain stays safe; the range block iterates zero times and the partial guard prevents render.
- .github/workflows/refresh-counters.yml → schedule: cron removed; workflow_dispatch kept but gated on an explicit confirm_resurrect: true input, with a clear "DISABLED" header. Cannot silently regenerate.
- NOT touched: the partials themselves, tool-counter.js, CSS for .tool-counter, scripts/refresh-counters.py. Each is an inert orphan whose presence costs nothing and preserves an easy resurrection path if needed.
Past-session misconception to ignore — the 19 May 2026 journal entry (Instruct Builder v5 polish session) finding #6 claimed: "counter '120 instructions generated' is real (registered in data/tool_counters.toml, GA4-seeded across all tools). Removing for one tool would break site-wide consistency. Kept." — That was wrong on both counts. The numbers were never live GA4 data (refresh pipeline broken), and "site-wide consistency" wasn't a reason to keep them once we agreed they were dishonest. The 20 May 2026 investigation + removal corrected this.
If a future session is tempted to re-enable hardcoded counters: stop, re-read the four reasons above. The 4-incident saga that built this playbook was about a counter that was LIVE but FRAGILE. The fix wasn't "replace live with static" — it was "make the live one bullet-proof," which is what 16 May 2026 commit 6a9f6b7b achieved. The 20 May 2026 removal of the hardcoded badge is the complement of that work: live counter = honest + bullet-proof; hardcoded counter = dishonest = gone.
Why we have a live counter at all
It's a paid-site presence signal. The cosmos-bar pill (<cosmos-bar> web component, shadow-DOM) appears on every planet site (earth, plainai, agentic, shift, claw, brainbar, guided, cosmos atlas), pulling from a single GA4 property. Same number also drives:
- Site Analytics tile on /site-analytics/
- "Currently exploring" tool-counter pill on the homepage hero
- CC dashboard "Live Now" widget (/cc/list.html)
So one cosmos-bar pill plus one CC dashboard plus one tool-counter all hit the same GA4 property from the visitor's browser every ~45s. Multiplied by N visitors × M POPs, you get the original problem.
The 4-incident history (so you don't propose attempts 5/6/7/8)
| # | Date | Commit | What it tried | Outcome |
|---|---|---|---|---|
| 1 | 13 May AM | 9c9069f | Use no-dim GA4 query for active count | Worse: doubled per-visitor GA4 calls (1→2 in parallel). |
| 2 | 13 May EVE | be9e501e + 2526ab7f | res.ok guard + CF Cache API + in-isolate coalescing + GHA monitor | Stopped silent zeros on ?realtime=1; cosmos handler still had no error cache. |
| 3 | 14 May EVE | 9c132f4c | The proper architecture: KV-backed shared cache, single-flight lock, stale-while-revalidate | Reverted 6 min later (305a3985). CF KV requires expirationTtl ≥ 60s; code used 30s; every lock-write threw; every request returned error:'warming-up'. The architecture was right but ONE platform constant killed it. |
| 4 | 14 May EVE | 7dda5840 | Mirror the proven handleRealtime error-cache pattern in handleRealtimeCosmos | Worked first try. Stopped self-DOS. Did not add GA4 capacity. |
| 5 | 16 May AM | 6a9f6b7b | Permanent architecture (Option A+): decouple ALL public realtime reads from GA4 via KV + scheduled refresher | Architecture correct, scheduler wrong. GHA cron */5 * * * * on shared runners dropped 90%+ of scheduled ticks (1.5–4.5 hour gaps observed 17–18 May). KV would age past 30-min tooStale threshold and the pill would hide for most of the day. |
| 6 | 19 May AM | ea32d198 + 575f957f | Replace GHA cron with Cloudflare Cron Trigger Worker (aguidetocloud-realtime-cron, * * * * *) | Shipped + verified. Cron fires reliably every minute on CF infra. cosmos endpoint went from age_s=3644 (1 hr old, pill hidden) to age_s=7-50 (consistently fresh) within 2 min of deploy. |
Why 1–4 were all bandaids: they tried to make the per-visitor → per-POP → GA4 call pattern cheaper or safer. The pattern itself was the bug. With ~3–5 active POPs each caching for 60s, even ONE call per cache-miss × 12 misses/hour × 5 POPs = 60+ calls/hour per endpoint × 2 endpoints (?realtime=1 + ?realtime=cosmos) sharing the same property quota. At any non-trivial traffic level, quota burn was guaranteed. Always was.
Why 5 was right architecture but wrong scheduler: the many→1→many dataflow was correct, but GitHub Actions cron explicitly disclaims reliability. Free-tier shared runners frequently drop scheduled jobs under load, with delays measured in HOURS not minutes. On a */5 schedule that should produce 288 runs/day, we observed ~12 actual runs/day during peak periods — 96% drop rate. The 24h KV TTL kept data alive, but the 30-min tooStale threshold (correctly) hid the pill long before the next run fired.
Why 6 is the permanent fix: Cloudflare Cron Triggers run on CF's own scheduling infrastructure, not shared runners. A * * * * * cron produces ~60 invocations/hour, every hour, with sub-minute jitter. The cron lives on the same CF infra as the Pages function it calls, so there are no inter-platform reliability concerns.
The architecture (16 May 2026 → forever, we hope)
Components
| Component | Path | Role |
|---|---|---|
| Scheduler (PRIMARY) | aguidetocloud-revamp/worker/realtime-cron.mjs deployed as Cloudflare Worker aguidetocloud-realtime-cron with cron * * * * * | Fires every minute on CF's own scheduling infra (sub-minute jitter, no dropped runs). POSTs to /api/stats?refresh=realtime with Bearer auth. The only reliable scheduled caller going forward. Deploy via aguidetocloud-revamp/scripts/deploy-realtime-cron.ps1. Workers Observability enabled (100% head sampling, persisted logs, invocation logs on). See dashboard at dash.cloudflare.com/<account>/workers/services/view/aguidetocloud-realtime-cron/production/observability. Cost: ~1.4k events/day, well under the free-tier 100k/day. |
| Scheduler (FALLBACK) | .github/workflows/realtime-refresh.yml | Schedule DISABLED 19 May 2026 (see workflow header comment). Kept as workflow_dispatch-only manual fallback for emergencies (e.g. CF Workers outage). |
| Refresh endpoint | functions/api/stats.js → handleRealtimeRefresh | THE ONLY caller of GA4 Realtime for public-counter data. POST authed. Writes to KV. |
| Public read: site | functions/api/stats.js → handleRealtime | Reads realtime:active + realtime:pages from KV. Returns {active, pages, generated_at, age_s, stale, error?}. ZERO GA4 calls. |
| Public read: cosmos | functions/api/stats.js → handleRealtimeCosmos | Reads realtime:active from KV. Returns {totalUnique, scope:'cosmos', generated_at, age_s, stale, error?}. ZERO GA4 calls. |
| Admin read: intel | functions/api/stats.js → handleRealtimeCosmosIntel | Authed (?intel=1). KV total when fresh + direct GA4 byPlanet. Low traffic, acceptable quota cost. |
| KV namespace | env.COSMOS_SUMMARY_KV (shared with cosmos-summary.js) | The single source of truth for all visitor reads. |
| Monitor | .github/workflows/realtime-probe.yml (existing) | Every 15 min. Opens/comments/closes realtime-degraded GitHub issue. |
| Frontend | cosmos-atlas/src/cosmos-bar/component.ts → fetchLiveCount() | Polls every 45s. Treats totalUnique numerically. If 0 (e.g. error: 'no-data' or 'data-too-stale') → pill hides via LIVE_MIN_VISIBLE gate. |
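The frontend gate in the last row can be illustrated with a short sketch. The helper name `pillText` and the value `1` for LIVE_MIN_VISIBLE are assumptions made for illustration; the real constant and rendering logic live in cosmos-atlas/src/cosmos-bar/component.ts and may differ.

```javascript
// Assumption: the real LIVE_MIN_VISIBLE constant lives in component.ts; 1 is a guess.
const LIVE_MIN_VISIBLE = 1;

// Returns the pill label, or null when the pill should stay hidden.
// `payload` is the parsed JSON body from /api/stats?realtime=cosmos.
function pillText(payload) {
  // Treat totalUnique numerically; missing or malformed values become NaN or 0.
  const count = Number(payload && payload.totalUnique);
  if (!Number.isFinite(count) || count < LIVE_MIN_VISIBLE) return null;
  return `${count} reading now`;
}
```

Because the read handlers force totalUnique to 0 on `no-data` and `data-too-stale`, the frontend never needs to inspect the error string: a numeric gate is enough to hide the pill.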
KV schema
realtime:active { active: number, generated_at: ISO8601 } TTL 24h
realtime:pages { pages: [{page,path,users}], generated_at } TTL 24h
realtime:last_error { message, status, attempted_at, failed_query? } TTL 1h
Critical rule: errors NEVER overwrite the last-good realtime:active / realtime:pages keys. Stale-but-real beats zeros. (Rubber-duck saved us from a subtle Promise.all → Promise.allSettled bug here too: previously a per-page 429 could prevent the core active-count write. Don't undo this.)
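A minimal sketch of that write rule, under stated assumptions: `refreshRealtime` and the `queries` map are hypothetical names, and the real handleRealtimeRefresh may be organised differently. The `kv.put(key, value, { expirationTtl })` call shape follows the Cloudflare KV API. Each query settles independently, successes overwrite their own key, and failures only ever touch realtime:last_error.

```javascript
const DAY_S = 24 * 60 * 60; // TTL for last-good keys
const HOUR_S = 60 * 60;     // TTL for the error key

// Hypothetical refresh: `queries` maps a KV key to an async function that
// returns the value to store. Each query settles independently.
async function refreshRealtime(kv, queries, now = new Date()) {
  const keys = Object.keys(queries);
  const settled = await Promise.allSettled(keys.map((k) => queries[k]()));
  const written = [];

  for (let i = 0; i < keys.length; i++) {
    const result = settled[i];
    if (result.status === "fulfilled") {
      const body = { ...result.value, generated_at: now.toISOString() };
      await kv.put(keys[i], JSON.stringify(body), { expirationTtl: DAY_S });
      written.push(keys[i]);
    } else {
      // Failure: record it separately and leave the last-good key untouched.
      const err = {
        message: String(result.reason && result.reason.message),
        failed_query: keys[i],
        attempted_at: now.toISOString(),
      };
      await kv.put("realtime:last_error", JSON.stringify(err), { expirationTtl: HOUR_S });
    }
  }
  return written;
}
```

With Promise.all in place of Promise.allSettled, a rejected per-page query would reject the whole batch before the loop ran, so the active count would not be written either. That is the bug described above.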
Staleness ladder (read handlers)
| Age | Behaviour |
|---|---|
| < 3 min | Fresh. stale: false. Pill shows count. |
| 3–30 min | stale: true but value still served. Pill still shows count. |
| > 30 min | active/totalUnique forced to 0. error: 'data-too-stale'. Pill hides. |
| KV empty | error: 'no-data'. Pill hides. |
This means: even if the refresher dies for 30 min, visitors see slightly stale numbers (not lies). Past 30 min, the pill quietly disappears rather than displaying ancient data.
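The ladder can be expressed as one small pure function. This is a sketch: `applyStaleness` and the two constant names are hypothetical, and the real handlers return more fields (pages, scope, and so on). The thresholds are the ones from the table.

```javascript
const STALE_AFTER_S = 3 * 60;      // 3 min: value still served, flagged stale
const TOO_STALE_AFTER_S = 30 * 60; // 30 min: value forced to 0, pill hides

// Classify a KV record ({ active, generated_at } or null) at time `nowMs`.
function applyStaleness(record, nowMs = Date.now()) {
  if (!record) return { active: 0, error: "no-data" };
  const age_s = Math.floor((nowMs - Date.parse(record.generated_at)) / 1000);
  if (age_s > TOO_STALE_AFTER_S) {
    return { active: 0, age_s, stale: true, error: "data-too-stale" };
  }
  return { active: record.active, age_s, stale: age_s >= STALE_AFTER_S };
}
```

Passing `nowMs` explicitly keeps the function deterministic, which makes each rung of the ladder easy to unit-test without mocking the clock.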
Auth
The refresh endpoint uses isAuthedAsAdmin() (in functions/api/_cosmos-shared.js), which:
1. Reads Authorization: Bearer <plaintext>
2. SHA-256 hashes it
3. Constant-time compares to env.ADMIN_PASSWORD_HASH on CF Pages
The GHA secret ADMIN_PASSWORD (plaintext) must match the password whose SHA-256 hash is configured as ADMIN_PASSWORD_HASH on Cloudflare Pages. The same hash is embedded in the public /cc/list.html JS gate (line ~538 — currently 0579d11899...), so you can verify a candidate password locally by computing its SHA-256.
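For the local check, a sketch in Node (CommonJS) could look like this. It is not the code of isAuthedAsAdmin, which runs on CF Pages and may use different primitives; `sha256Hex` and `matchesHash` are hypothetical helper names. It only mirrors the hash-then-constant-time-compare steps listed above.

```javascript
const { createHash, timingSafeEqual } = require("node:crypto");

// Hex SHA-256 of a candidate plaintext password.
function sha256Hex(plaintext) {
  return createHash("sha256").update(plaintext, "utf8").digest("hex");
}

// Constant-time comparison of the candidate's hash against an expected hex hash.
function matchesHash(plaintext, expectedHex) {
  const actual = Buffer.from(sha256Hex(plaintext), "hex");
  const expected = Buffer.from(expectedHex, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return actual.length === expected.length && timingSafeEqual(actual, expected);
}
```

To verify a candidate, pass it together with the full 64-character hash configured as ADMIN_PASSWORD_HASH; a truncated prefix will always return false because of the length check.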
⚠️ ~/.copilot/secrets/guided-admin-password is a different credential. Don't assume it's the CC admin password — I learned this the hard way at 16:05 NZST on 16 May. Test against /api/stats?realtime=cosmos&intel=1 first.
File map
| Path | What it does |
|---|---|
| functions/api/stats.js | 4 handlers + POST router. The whole realtime pipeline lives here. |
| functions/api/_cosmos-shared.js | isAuthedAsAdmin, cosmosJsonRes, COSMOS_PLANET_KINDS. |
| worker/realtime-cron.mjs | The primary scheduler since 19 May 2026 (Cloudflare Cron Trigger Worker aguidetocloud-realtime-cron). |
| .github/workflows/realtime-refresh.yml | The original scheduler. Schedule disabled 19 May 2026; kept as a workflow_dispatch-only manual fallback. |
| .github/workflows/realtime-probe.yml | The alerting layer (existing, unchanged). |
| ~/.copilot/scripts/realtime-probe.py | CLI version of the probe (for manual checks). |
| cosmos-atlas/src/cosmos-bar/component.ts | The frontend pill (web component, shadow-DOM). |
| cosmos-atlas/public/cosmos-bar.js | Built artefact served from cosmos.aguidetocloud.com. |
Operational runbook
"The pill is broken / showing nothing again"
- Probe both endpoints first (/api/stats?realtime=1 and /api/stats?realtime=cosmos) and read the error field:
  - error: quota-exhausted → GA4 quota out. Wait for refill (see below). Don't push code.
  - error: no-data → KV is empty. Trigger the refresh workflow manually (below).
  - error: data-too-stale → Cron hasn't run successfully for >30 min. Check gh run list --workflow=realtime-refresh.yml.
  - error: kv-not-bound → COSMOS_SUMMARY_KV binding lost on Pages. Check the Pages env config.
  - error: no-auth → GOOGLE_SERVICE_ACCOUNT_KEY missing on Pages. Check Pages env vars.
  - error: no-google-auth (from refresh) → same as above.
- GA4 quota dimensions to check:
  - Per-hour: ~1,750 tokens (~175 reports) per property. Refills hourly.
  - Per-day: Standard tier ceiling. Refills at midnight Pacific Time = 07:00 UTC = 19:00 NZST.
  - Future improvement: add returnPropertyQuota: true to ga4RunRealtimeReport calls so the response includes remaining-token counts. Flagged by rubber-duck on 16 May; not implemented yet.
- Manual refresh trigger (when KV is empty + you want immediate recovery, not waiting for the next 5-min cron): dispatch .github/workflows/realtime-refresh.yml via workflow_dispatch.
- Direct refresh call (if GHA is also broken): POST /api/stats?refresh=realtime with Authorization: Bearer <ADMIN_PASSWORD>.
- Verify KV state via the admin path: an authed request to /api/stats?realtime=cosmos&intel=1.
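The error-to-action mapping in this runbook can be captured as a small lookup, for example inside a manual probe script. This is a sketch only: `triage` and `NEXT_STEP` are hypothetical names, the hint strings paraphrase the list above, and the wording for a healthy or unrecognised response is an assumption.

```javascript
// Error codes and next steps, paraphrased from the runbook list above.
const NEXT_STEP = {
  "quota-exhausted": "GA4 quota out. Wait for refill; don't push code.",
  "no-data": "KV is empty. Trigger the refresh manually.",
  "data-too-stale": "No successful refresh for >30 min. Check the scheduler.",
  "kv-not-bound": "COSMOS_SUMMARY_KV binding lost. Check the Pages env config.",
  "no-auth": "GOOGLE_SERVICE_ACCOUNT_KEY missing. Check Pages env vars.",
  "no-google-auth": "GOOGLE_SERVICE_ACCOUNT_KEY missing. Check Pages env vars.",
};

// Map a parsed probe response body to a one-line triage hint.
function triage(body) {
  if (!body || typeof body !== "object") return "Unrecognised response; not covered by the runbook.";
  if (!body.error) return body.stale ? "Serving stale data; watch the next refresh." : "Healthy.";
  return NEXT_STEP[body.error] || `Unknown error "${body.error}"; not covered by the runbook.`;
}
```

Feeding it the JSON from both probes gives a quick first read before opening the Pages or Workers dashboards.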
"I need to redeploy the stats.js file"¶
Follow standard deployment discipline (learning-docs/docs/reference/deployment-playbook.md). Specifically:
- Edit functions/api/stats.js (or workflow).
- node --check won't work directly (ES module): copy the file to a .mjs name and run node --check on the copy.
- Validate YAML if you touched the workflow.
- Parallel-safe git: stage explicit paths only.
- Push. Cloudflare Pages deploys are FAST for this repo: observed 30s, not the typical 3–5 min. Don't assume slowness.
- Verify new code with an unauthed POST: should return {"ok":false,"error":"Unauthorized"}.
"I need to roll back"
The architecture is correct; rolling back means re-introducing the bug. Don't.
If you absolutely must (e.g. the refresh handler itself has a regression):
- git revert 6a9f6b7b re-introduces the per-visitor GA4 calls. Within ~12 hours of normal traffic, quota will exhaust again. Use only as a stop-gap while you fix forward.
- The ?refresh=realtime POST endpoint can be left in place even on a rollback: it's idempotent and harmless if KV isn't being read.
"I want to add a new realtime metric"
E.g. "active users by country" or "active users by event".
- Add it as a third GA4 query in handleRealtimeRefresh. Append to the Promise.allSettled array.
- Add a third KV key like realtime:countries with the same {value, generated_at} shape and 24h TTL.
- Expose it on the read endpoints (probably ?realtime=cosmos&intel=1 since byCountry is admin signal).
- Do NOT add a new public read path that calls GA4 directly. That's how we got into this mess. Always go via the refresh handler.
- Keep the refresh handler partial-tolerant: a failure of the new query should not block the writes of realtime:active or realtime:pages.
"I want to migrate handleGuided realtime call too"
It's the last remaining public GA4-Realtime call. It's behind a 5-min cacheStore in-isolate memory cache and only hit by the guided dashboard (low traffic). Currently acceptable, but if you want "zero public GA4 realtime calls" as an invariant:
- Add guided:realtime_active to the refresh handler's writes (4th query, or compute from the existing total).
- Make handleGuided read from that KV key instead of calling ga4RunRealtimeReport directly.
- This was deliberately deferred from the 16 May PR to keep scope tight.
Lessons learned (the metadata behind the metadata)
- Per-POP caches are not global caches. Whenever a CF Pages Function calls a quota-limited external API, the per-POP caches.default does not limit global API load to 1 / TTL; it limits it to N_pops / TTL. For any non-trivial property quota, this matters.
- GA4 has multiple quota dimensions. Per-hour AND per-day, on Standard tier. The per-hour one bites first on busy endpoints; the per-day one bites only after sustained abuse (which is exactly what we did). Always read the error message: "Exhausted property tokens per hour" ≠ "per day".
- The right architecture for a quota-limited API is "many → 1 → many": many visitors → one global cache → many reads from that cache. The translator is a scheduled writer. This is what every real production system does for this shape of problem; we just took 4 attempts to land it.
- Fail closed on the public path. A public handler that falls back to direct GA4 when the cache is empty recreates the failure mode the cache was meant to prevent. Rubber-duck called this out twice; adopt it.
- Never overwrite last-good with errors. Errors go to a separate key with shorter TTL. Stale-but-real ≫ fake-zeros. The frontend can decide when stale is too stale to show.
- Promise.all vs Promise.allSettled for multi-query writes: when writes are independent and one is more critical than the others, allSettled lets the critical one succeed while the secondary one fails. The original 16 May commit had Promise.all until the post-impl rubber-duck flagged it as a blocking bug. Caught.
- A revert isn't a critique of the architecture. The 14 May KV refactor was reverted in 6 minutes because of a LOCK_TTL_S = 30 line vs CF KV's hard 60s minimum. The architecture was correct. Local testing (wrangler pages dev) would have caught it in 30s. Always test platform-primitive code locally before deploy.
- Cloudflare Pages deploys can be 30 seconds, not 3–5 minutes. Observed on the 16 May fix. Don't bake the slower number into mental models; keep your verification active.
- GHA cron has a 5-min floor with jitter. To get sub-5-min cadence, loop inside the run. We do 5 iterations × 55s = ~4:35 wall clock per run, then the next cron picks up after ~5 min. Jitter-induced gaps are bridged by the 24h KV TTL.
- ~/.copilot/secrets/<name>-admin-password is not necessarily THE admin password. Different sub-systems may have different credentials. Always test against an actual live endpoint before assuming.
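The first lesson is easy to check with a few lines of arithmetic. This is a sketch, with both function names hypothetical. It compares the worst-case upstream load of per-POP caching (every POP refills its own cache once per TTL, assuming each window sees at least one request) with the load of a single scheduled writer. The inputs of 5 POPs, a 60s TTL, 2 endpoints, and 2 queries per one-minute refresh are taken from figures quoted elsewhere in this playbook.

```javascript
// Worst-case upstream calls per hour when every POP refills its own cache.
function perPopCallsPerHour(pops, ttlSeconds, endpoints = 1) {
  return pops * (3600 / ttlSeconds) * endpoints;
}

// Upstream calls per hour with a single scheduled writer.
function scheduledCallsPerHour(intervalSeconds, queriesPerRefresh) {
  return (3600 / intervalSeconds) * queriesPerRefresh;
}

console.log(perPopCallsPerHour(5, 60, 2)); // 600
console.log(scheduledCallsPerHour(60, 2)); // 120
```

The first number grows with every POP that sees traffic; the second is fixed by the schedule and does not depend on the number of visitors or POPs.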
Cross-references
- cosmos-intelligence-playbook.md: the broader cosmos analytics system (this realtime counter is one piece of it).
- incident-log.md § 13 May / 14 May / 16 May: the chronological record of each incident.
- deployment-playbook.md: the 19-step pre-push checklist.
- memory-system-architecture.md: how this playbook fits in the broader memory tiers.
Built
16 May 2026, alongside the permanent fix (commit 6a9f6b7b). Rubber-duck reviewed twice (plan + implementation). The architecture is the same one the 14 May KV refactor attempt would have shipped if it hadn't tripped on the CF KV TTL gotcha.