Production Incident Log¶

This page exists because of a hard-won principle:

The Growing Guardrail Rule

Every new bug found in production MUST be added as an automated check before the fix is deployed. The test suite only grows — it never shrinks. Pattern: find bug → write test that catches it → fix bug → verify test passes → deploy.

This log is the chronological record of every guardrail's origin story. Future-Copilot reads this when tempted to remove a check that "looks paranoid" — every check in the codebase has a 12-hour outage behind it.

2026¶

13 May — Realtime counter silent-zero (GA4 quota exhaustion)¶

Surface: aguidetocloud.com Site Analytics tile · cosmos.aguidetocloud.com cosmos-bar live pill · / tool-counter pill · CC dashboard "Live Now" — anywhere on the site that displays an active-user count.

Symptom: Sush noticed Site Analytics + cosmos nav both reading 0 live users for hours. Real traffic was healthy that day (469 active users, 888 page views via daily-reports API). Every realtime-counter UI was lying.

Why static checks didn't catch it: node -c syntax check + curl smoke tests pass cleanly. /api/stats?realtime=1 returns HTTP 200 with valid JSON {"active":0,"pages":[]} — no error to detect statically. The only signal was the displayed number being wrong, and only against a human's mental baseline. Site-health.yml and post-deploy-smoke.yml had no realtime-counter check.

Root cause: Two coupled bugs.

Silent 429 swallow in functions/api/stats.js::ga4RunRealtimeReport():
```
const res = await fetch(...);
return res.json();   // ← no res.ok check
```
When GA4 returned HTTP 429 — Exhausted property tokens for a project per hour, the error JSON parsed cleanly (no rows, no totals). Downstream parseInt(undefined) || 0 returned 0. No error surfaced to the response. Every consumer displayed 0.
Earlier same-day commit 9c9069f ("fix(stats): use no-dim GA4 query for realtime active count") was correct in math but DOUBLED the GA4 calls per request (1 → 2 in parallel: a no-dim total + a per-page breakdown). The endpoint had no server-side cache. Site Analytics dashboard (30s poll), CC dashboard (30s poll), tool-counter on every hero page (60s poll × N visitors) all called this endpoint independently. Combined load × doubled-per-request cost blew through GA4's per-property realtime quota (~175 reports/hr at Standard tier). The commit message read "well under our 5% daily budget" — that's correct against the project-per-day quota (1.25M tokens), but the limit that bit is property-per-hour (~1,750 tokens / 10 per report = 175 reports). Two different quota dimensions.

Fix:

Throw typed error on !res.ok in ga4RunRealtimeReport only:
```
if (!res.ok) {
  const txt = await res.text();
  const err = new Error(`GA4 realtime ${res.status}: ${txt.slice(0, 200)}`);
  err.status = res.status;
  throw err;
}
```
Left ga4RunReport permissive (used by handleCommandCentre, handleBioLinks, handleGuided — those parallel many calls and expect permissive degradation, not throws).
CF Cache API on ?realtime=1 mirroring the cosmos pattern: 120s TTL on success, 90s TTL on 429-quota errors (prevents thundering herd while quota recovers). Cosmos cache bumped 45s → 60s preemptively (shares the same property quota).
Structured error envelope in response: {active: 0, pages: [], error: 'quota-exhausted' | 'realtime-failed' | 'no-auth' | 'partial'} so future debugging surfaces cause, not silent zero.
In-isolate request coalescing — module-level realtimeInFlight = null promise dampens cold-cache stampedes within a single CF isolate. Best-effort, not global (CF fans out across isolates within a POP).
buildRealtimePayload extracted from handleRealtime and returns plain {body, cacheControl} instead of a Response, so concurrent awaiters each create their own (avoids Response-body-stream-already-consumed when multiple callers share a single Response object).

New guardrails added:

~/.copilot/scripts/realtime-probe.py — CLI probe of both endpoints with --json mode. Exit 0 healthy / 1 degraded. Joins gsc-weekly.py + stripe-weekly.py as the 3rd pillar of the Sunday drive.
aguidetocloud-revamp/.github/workflows/realtime-probe.yml — every-15-min cron. Probes both endpoints, classifies as healthy/degraded/broken, opens a realtime-degraded labelled issue on detection. Smart de-dup: one issue, comments on subsequent failures, auto-closes on recovery with a 🟢 comment. Optional ntfy.sh push on first failure if NTFY_TOPIC repo secret is set. Discovered in <15 min vs hours.
copilot-instructions.md Sunday ritual updated to include realtime-probe.py and reference the continuous GHA monitor.

Lessons:

No-res.ok-check is a class of silent-failure bug that hides behind permissive JSON parsing. Audit every fetch().then(r => r.json()) for the same pattern. Especially relevant for quota-throttled APIs (Google, Stripe, Cloudflare) where 429 returns valid JSON.
GA4 quota math has two dimensions you need to check: project-per-day (1.25M tokens) AND property-per-hour (~1,750 tokens = ~175 reports). The per-hour limit bites first on busy endpoints. Each runRealtimeReport costs 10 tokens.
CF Cache API is per-POP, not global. When sizing TTL, multiply expected GA4 load by active-POP count. With ~3-5 POPs handling NZ/AU/US English traffic, a 60s cache fetches GA4 ~3-5× per minute, not once globally. 120s TTL + ~5 POPs ≈ 60 fetches/hr per endpoint per query — under property quota with headroom.
Cache the error response, not just the success. Without caching 429s, every visitor poll re-hits the throttled API while quota recovers — self-DOS pattern. Cache the error briefly (90s) so the recovery window doesn't get drowned by your own clients.
The growing-guardrail rule scales to non-test surfaces. Guided has test-guided-qa.cjs. Cosmos has qa-audit.mjs. The main site now has the CLI probe + GHA monitor as its equivalent. The pattern: find bug → write the check that would have caught it → ship both.

Commits:

Pre-incident: 9c9069f (13 May 2026 morning, "fix(stats): use no-dim GA4 query for realtime active count") — well-intentioned doubling of GA4 calls.
be9e501e (13 May 2026 evening) — res.ok guard + CF cache + structured errors.
2526ab7f (13 May 2026 evening) — in-isolate coalescing + GHA monitor.

Cross-reference: GSC + Stripe Ownership Rules (now extended to include realtime probe).

14 May — Half-fix completed (cosmos error-cache) + KV refactor attempt failed¶

Same outage recurred next morning. Reason: yesterday's evening fix hardened handleRealtime but handleRealtimeCosmos still returned Cache-Control: no-cache on errors. The two endpoints SHARE the GA4 property quota, so cosmos's no-cache thundering herd kept the property starved → handleRealtime (which had correct error caching) could never recover.

Two attempts the next evening:

KV-backed shared cache refactor (9c132f4c, 14 May) — replaced per-POP caches.default with COSMOS_SUMMARY_KV for global cross-POP cache + single-flight via KV lock + stale-while-revalidate via ctx.waitUntil. Rubber-ducked before shipping. BROKE PROD within 3 min: every request returned error: 'warming-up', no recovery. Reverted in 305a3985.
Surgical fix (7dda5840, 14 May) — mirror the proven handleRealtime pattern in handleRealtimeCosmos: classify status (429 vs other), cache error response 90s/30s. +11/-4 lines. Worked first try.

Root cause of the refactor break: CF KV expirationTtl has a hard 60-second minimum. I used LOCK_TTL_S = 30. Every put() for the lock threw, my catch returned false, every request thought the lock was held → infinite warming-up loop. No successful KV writes ever happened.

The kicker: sister file functions/api/cosmos-summary.js literally uses const LOCK_TTL_S = 120; two lines from the top. The 60s minimum was sitting in the codebase the whole time.

Lessons:

CF platform gotchas need a dedicated quickref. KV expirationTtl ≥ 60s minimum, KV read edge-cache up to 60s, ctx.waitUntil timeout bounds, isolate eviction — none of these are visible in code review. The rubber-duck flagged "KV-as-lock is risky" but didn't name the 60s rule specifically. Read CF docs whenever you touch platform primitives.
Local testing is non-negotiable for Pages Function refactors. wrangler pages dev (or equivalent local CF runtime) would have caught the TTL bug in 30 seconds. A speculative deploy to production is not a test.
When prod is healthy, prefer surgical over architectural. The actual cause of the 13 May outage was a single missing line of error caching in one handler. The per-POP fanout concern, while real, is theoretical at current traffic. Don't let "while I'm in here" turn a 15-line fix into a 375-line refactor.
Promise discipline: the agent told Sush "I'll have it shipped by the time you're back from yoga" while he could not autonomously work between user messages. Don't imply background autonomy. Either ship in the same turn or be honest about the wait.

Surgical fix commit: 7dda5840 (14 May 2026, fix(stats): cache cosmos realtime error responses).

Reverted attempt: 9c132f4c (KV refactor) → 305a3985 (revert). Architecture still sound — would work with LOCK_TTL_S ≥ 60s + local testing. Parked, not abandoned.

16 May — Permanent fix (Option A+): decouple public realtime from GA4 quota¶

The 4th incident in 4 days. Same root cause as 13–14 May (visitor polling × per-POP cache × GA4 property quota), but this time it ran long enough to exhaust both the per-hour AND the per-day GA4 quotas. New error message: "Exhausted property tokens per day" — different ceiling, different reset window (midnight Pacific Time, ~07:00 UTC, ~19:00 NZST).

Why bandaids 1–4 weren't going to be enough: the previous fixes all tried to make the per-visitor → per-POP → GA4 call pattern cheaper or safer. The pattern itself was the bug. With ~3–5 active POPs and any non-trivial traffic, the property quota burn was guaranteed. Always was.

The permanent fix — Option A+ (6a9f6b7b, 16 May 2026):

NEW handleRealtimeRefresh (POST /api/stats?refresh=realtime, authed) — THE ONLY caller of GA4 Realtime for public-counter data. Writes to env.COSMOS_SUMMARY_KV with 24h TTL. Uses Promise.allSettled so a per-page query failure can't block the core realtime:active write.
REWRITE handleRealtime + handleRealtimeCosmos to READ from KV only. Fail closed — no fallback to GA4. Stale-but-real ≫ fake-zeros. Errors written to separate realtime:last_error key — NEVER overwrites last-good.
NEW .github/workflows/realtime-refresh.yml — cron */5 * * * *, loops 5× with 55s sleeps for ~60s effective cadence. concurrency group prevents overlapping runs. Auth via existing ADMIN_PASSWORD GHA secret (no new secret to manage).
NEW handleRealtimeCosmosIntel — admin path (?intel=1) still calls GA4 directly for byPlanet breakdown, but reuses KV total when fresh to halve GA4 calls.
DELETED: realtimeInFlight in-isolate coalescing (no longer needed), buildRealtimePayload (folded), 90s error caching on public paths (replaced by "fail closed").

Result: GA4 calls/hour drops from O(visitors × poll_rate × N_pops) to CONSTANT ~60. Property quota becomes irrelevant regardless of traffic spikes.

Rubber-duck contributions:

Flagged that fixing only cosmos leaves ?realtime=1 (Site Analytics) as the next quota burner → migrated both endpoints in the same PR.
Flagged that the public path must fail closed, never fall back to GA4 → prevented recreating the failure mode.
Flagged that realtime:active and realtime:pages should be written with Promise.allSettled, not Promise.all → a per-page 429 won't take down the core counter. Caught a blocker before push.
Flagged that last-good keys need long TTL (24h not 60s) → survives missed cron runs.
Flagged that errors must go to a SEPARATE key → stale-but-real always wins over zeros.

Verification at deploy time:

✅ Cloudflare Pages deploy was 30 seconds (not the typical 3–5 min — keep mental models active).
✅ Unauthed POST returns {"ok":false,"error":"Unauthorized"} — proves new code is live.
✅ Manual GHA workflow run authed correctly (Bearer ADMIN_PASSWORD → 200).
⏳ KV pre-warming blocked on GA4 daily quota refill (~19:00 NZST). The cron will auto-recover on its first successful attempt after the quota window resets. No further action needed.

New gotcha (own its own row in the file map): ~/.copilot/secrets/guided-admin-password is NOT the CC admin password. Different sub-system, different SHA-256 hash. Verified by computing SHA-256 of the file content and comparing to the embedded hash in layouts/cc/list.html line 538 — they don't match. Future sessions must use the CC admin password (whose SHA matches 0579d11899d8171a96c04302aa2f2f7250adfce591e1a118f332f40c70027be8) for any isAuthedAsAdmin endpoint. Sush set the GHA secret manually via web UI.

Lessons (separate from the architecture lessons — those live in the playbook):

Quota errors have multiple dimensions. GA4 per-hour bites first on busy endpoints; GA4 per-day bites only after sustained abuse. Always read the actual error message — "per hour" ≠ "per day", different reset windows, different reset times.
A revert isn't a critique of the architecture. The 14 May KV refactor revert was a 30-line bug fix away from being correct. The 16 May fix is the same architecture with the lesson applied (long TTL + simpler scheduler design via GHA cron, sidestepping the KV-lock-races entirely).
CF Pages deploys can be 30 seconds. Don't assume the slower 3–5 min. Actively verify with curl between commits.
Sub-1-min refresh cadence with GHA cron requires a loop inside the run, because GHA cron has a 5-min floor with jitter. 5×55s = ~4:35 wall-clock, picks up after next cron at ~5 min mark.
Promise discipline (carried from 14 May): the agent must never imply background autonomy between user messages. Either ship in the same turn or be honest about waiting for a quota window.

Full operational playbook: realtime-counter-playbook.md — architecture, file map, runbook ("the pill is broken / showing nothing again" → step-by-step), and the why-not for re-introducing any of the reverted bandaids.

Commit: 6a9f6b7b (16 May 2026, feat(stats): decouple public realtime from GA4 quota via scheduled KV refresh).

10 May — Plain AI mobile-collapse blocked by CSP in production (false-green local)¶

Surface: plainai.aguidetocloud.com/ mobile homepage (commit 1366cf7, 10 May 2026)

Symptom: First pass at the homepage mobile-collapse used <details open> plus an inline <script> that ran during HTML parsing to remove the open attribute on mobile (document.currentScript.parentElement.removeAttribute('open')). Local npm run qa was 14/14 green. Pushed (1366cf7). Ran npm run qa:live after CF Pages deploy: 2 failures — 7 cat-sections visible on iPhone (expected 1), collapse open by default (expected closed). Mobile homepage rendered the entire 78-topic wall instead of the trimmed cold-start view.

Why static checks didn't catch it: python -m http.server (used locally by npm run qa's auto-spun server) doesn't send Content-Security-Policy headers. Inline <script> tags ran fine in the local browser. The CSP only fires in production, where Cloudflare Pages serves the _headers file. The local QA was a false-green for any CSP-affected feature.

Root cause: public/_headers contains the strict script-src 'self' https://cosmos.aguidetocloud.com directive — no 'unsafe-inline'. Browsers silently block inline <script> tags under that policy. The HTML parser still inserts the script into the DOM, but the script never executes. So the close-on-mobile logic never ran, and <details open> stayed open.

Fix: Refactored to a CSS-only checkbox + label trick (commit 59d8529):

Hidden <input type="checkbox" id="cat-collapse-toggle"> (visually-hidden, keyboard-focusable as fallback)
<label for="cat-collapse-toggle" role="button"> as the visible toggle
<div class="cat-collapse-mobile-content"> wraps the 6 sibling sections
Mobile (<720px): content display: none by default; :checked ~ content { display: block }
Desktop (≥720px): toggle + label display: none; content uses display: contents so 6 sections render flat under <main> exactly as before

No JS, no CSP risk, layout-transparent on desktop, native CSS toggle on mobile, accessible via label-for-checkbox keyboard pattern. Verified live: 14/14 ALL CHECKS PASS.

New guardrails added:

scripts/qa.mjs lesson-1 + mobile-collapse checks now assert the post-fix invariants — .cat-collapse-mobile-content display: contents on desktop, :checked is false by default on mobile, only 1 cat-section visible by checkVisibility() on mobile (rest collapsed via the :checked ~ content rule). The growing-QA suite caught the bug in qa:live; the assertions are now permanent so future regressions fail in qa (local) too — the suite's coverage of CSP-sensitive code paths grew.
Documented in cosmos/plain-ai/curriculum-build.md (CSP rule section) — comprehensive playbook of what to use instead of inline scripts (CSS-only checkbox+label, external templates/<feature>.js files served from 'self').
Permanent rule: any solution involving inline <script> on Plain AI is a non-starter. Even when local QA is green. Future sessions: don't repeat this.

Lesson: Local QA without CSP is a false-clear for anything inline-script-shaped. npm run qa:live is the ONLY place the production CSP is exercised against your build output — always run it after deploy, never skip it. Also: prefer CSS-only patterns over JS where possible, especially for interactive UI primitives like collapse / toggle / disclose. The CSS :checked pattern works in every browser back to IE9, has no CSP exposure, and is keyboard-accessible without ARIA gymnastics.

8 May — Guided homepage v1 SampleExam focus mode broken (transparent overlay)¶

Surface: aguidetocloud.com/guided/ (Guided homepage v1, shipped 7 May 2026)

Symptom: After v1 deploy, Sush tested the new "Try a sample question" component overnight. Reported next morning: "the focus mode breaks". When the user clicked the focus toggle, the component DID enter focus state (CSS class added, position: fixed), but the overlay was transparent — the rest of the page bled through. Reading the quiz was impossible.

Why static checks didn't catch it: test-guided-home.cjs had a check that focus mode entered (is-focus class added) and that body scroll was locked. Both passed. The test did NOT verify the computed background colour of the overlay. The visual diagnostic that was added 8 May caught it: background-color: rgba(0, 0, 0, 0).

Root cause: <SampleExam> CSS used background: var(--base); for the focus overlay. But this design system uses --color-base (defined in src/styles/global.css), not --base. CSS treats undefined variables as transparent — silent fail.

Fix: Sush asked for a better approach instead of fixing the SampleExam: replace it entirely with the embedded real <PracticeQuiz> component (same one cert pages use). PracticeQuiz has its own working CSS focus mode (uses correct token bg-base utility class) and the existing free tier (FREE_QUESTION_LIMIT = 20) covers the "20 free questions" cap natively. v2 deletes SampleExam entirely.

New guardrails added:

test-guided-home.cjs rewritten for the embedded PracticeQuiz (18 checks). Checks the embed renders, AZ-900 cert code visible, surface-card wrapper provides container styling, no quiz load error, mobile responsive.
Visual diagnostic principle: when adding a CSS overlay, verify computed background-color is NOT rgba(0, 0, 0, 0). Adding to the home QA loop as a future enhancement.
Token verification: before using var(--something), grep the design system for the exact token name. Don't assume --base exists — confirm in src/styles/global.css.

Lesson: Tests that verify "the overlay class was added" don't verify "the overlay is visible". A passing test on a transparent overlay is a false-clear. Computed-style checks > class-name checks for visual properties.

Better approach lesson: when a real component exists for a similar purpose (PracticeQuiz for "show a quiz"), embed don't fork. The custom SampleExam was ~700 lines of mockup code that reproduced PracticeQuiz badly. The embed is one line of JSX.

Commits: - 0ca214c (7 May) — v1 ship with buggy SampleExam - 0e128e7 (8 May) — v2 ship: deleted SampleExam, embedded PracticeQuiz, ~700 lines net deletion

Cross-reference: Guided Homepage Design, Guided Brand Assets

Surface: plainai.aguidetocloud.com (Plain AI)

Symptom: After tightening the Plain AI masthead to a single row at <=600px, audit data reported overflow=0 and sameRow=true. But the visual was broken: the link "What changed" was breaking across two lines as "What" / "changed" inside its flex item, and the bubble brand mark sat visually fused to "Topics" with no separation.

Why static checks didn't catch it: The audit tested wrap by comparing offsetHeight > 32 to detect 2-line links. But min-height: 44px (set for tap targets) inflated EVERY link's bounding box to 44 regardless of actual line count. Detector returned false negatives.

Root cause: .masthead-link had no white-space: nowrap. Multi-word labels wrap by default in flex items when narrower than text width.

Fix:

white-space: nowrap on .masthead-link — keeps each link on one line
Hairline border-left + padding-left on .primary-nav — visual divider between bubble brand and nav
Removed the overflow-x: auto trick that was silently clipping "About" — see the next entry below in this log
Hide "About" via :nth-child(4) at <=480px — the 4 nav items + bubble + search + theme don't all fit at narrow phones; About is recoverable from footer

New guardrail: When detecting wrap programmatically, compare offsetHeight to scrollHeight instead of using a height threshold (the threshold breaks under min-height rules). Or measure natural single-line height first.

Lesson: Visual review > audit data. Audit booleans like sameRow=true can be technically true while looking broken. Always inspect screenshots at multiple viewport widths, not just numerical metrics.

Cross-reference: Cosmos Nav Playbook → Common gotchas (gotcha #3 — white-space defaults)

7 May — `overflow-x:auto` silently clips nav items¶

Surface: plainai.aguidetocloud.com (Plain AI)

Symptom: First attempt at single-row mobile masthead used overflow-x: auto on .masthead-actions so the nav could scroll horizontally if it didn't fit. Audit reported overflow=0 (because clipped content didn't extend documentElement.scrollWidth). But the LAST nav link "About" was off-screen with NO visible scrollbar (because scrollbar-width: none had hidden it). Users saw 3 of 4 links and couldn't reach the 4th.

Why static checks didn't catch it: documentElement.scrollWidth measures the WHOLE page. Content clipped INSIDE a overflow-x:auto container doesn't contribute. The audit thought the layout was clean.

Root cause: overflow-x: auto + invisible scrollbar = silent content amputation.

Fix:

Removed overflow-x: auto from .masthead-actions
Replaced with explicit display: none on .primary-nav .masthead-link:nth-child(4) at <=480px — drops the link visibly in DOM order, recoverable from footer
Added flex-wrap: nowrap on the parent so other items don't wrap unexpectedly

New guardrail: Don't use overflow-x:auto to "hide" content. Either let it overflow visibly (so users can see what they're missing) OR explicitly control which items render via display: none. The audit script now also checks el.scrollWidth > el.clientWidth per-element to catch container-clipped overflow.

Lesson: "It fits" can mean "looks like it fits while content is silently amputated." Inspect for visible content, not just dimensions.

Cross-reference: Cosmos Nav Playbook → Common gotchas (gotcha #2 — overflow-x:auto)

6 May — Phantom `// other wires` label after layout move¶

Surface: shift.aguidetocloud.com (Shift)

Symptom: Mobile audit screenshots showed cosmos chips stacking VERTICALLY beside the brand on Shift, with a phantom // OTHER WIRES mono-caps label dangling above them. At narrow widths (375/414) the chips overlapped with the brand text. Looked chaotic.

Root cause: When I moved Shift's cosmos rail FROM end-of-masthead-nav INTO .brand-group, the legacy @media (max-width: 860px) { .cosmos-rail::before { content: "// other wires"; ... } } rule kept firing. That rule was for the OLD layout where cosmos sat below stacked nav links and needed a label. Inside .brand-group it was nonsense — but the selector still matched, so the pseudo-element rendered.

Plus the same media block had flex-wrap: wrap; border-top: 1px solid rules that forced the chips into a vertical stack inside the cramped brand-group cell.

Fix: Removed the entire ::before content block AND the wrap/border rules that referenced the old layout. New rule: .brand-group .cosmos-rail { display: none } at <=860px — Shift's primary surface on mobile is the role lookup, not cross-planet nav.

New guardrail: When moving a CSS-styled element to a new parent, search the entire stylesheet for ALL rules selecting it (selectors with that class, attribute, or descendant relationship). Either delete or rewrite them. New rules don't deactivate old rules — both will fire if they match.

Lesson: Layout changes have a CSS half-life. The old layout's media-query rules don't auto-archive; you have to find and remove them. grep for the class name across all media queries before adding new ones.

Cross-reference: Cosmos Nav Playbook → Common gotchas (gotcha #1 — stale CSS from previous layout)

6 May — Shift `.tape` marquee leaks 60px past viewport¶

Surface: shift.aguidetocloud.com (Shift)

Symptom: After fixing the cosmos-rail visual mess on Shift mobile, the audit STILL reported 60px document overflow at 375px. Visually, the masthead looked clean. The hidden culprit was OFF-SCREEN.

Root cause: The .tape ticker element ("HEAVY AI PRESSURE · ACCOUNTANT…" scrolling marquee) had no overflow: hidden. Its inner .tape-content was 9700px wide (full role list × 2 for the seamless loop) with padding-left: 100% + transform: translateX(-50%) for the scroll animation. Without overflow:hidden on the parent, that 9700px width contributed to document.documentElement.scrollWidth. Pre-existing bug — the ticker has been leaking since Shift launched. Surfaced by the post-nav-rework mobile audit.

Fix: .tape { overflow: hidden } — clips the inner content to the visible band. Animation still works; document no longer overflows.

New guardrail: Audit script now drills into elements where right > viewport + 1 OR width > viewport + 1, not just total document width. Catches off-screen leakage from clipped marquees, decorative absolute-positioned elements, etc.

Lesson: Marquee tickers ALWAYS need overflow:hidden on the container. Default to it from day one.

Cross-reference: Cosmos Nav Playbook → Common gotchas (gotcha #5 — marquee tickers)

6 May — Shift footer `.byline` + `.masthead-foot-meta` force horizontal scroll¶

Surface: shift.aguidetocloud.com (Shift)

Symptom: After fixing the tape marquee, Shift mobile STILL had 60px overflow at 375. Drill-down revealed it was the FOOTER, not the masthead. The .byline text "— THE AI JOB-CHANGE WIRE — VOL. 1 · NO. 19" rendered as a single 401px-wide line. The .masthead-foot-meta row had 5 footer links separated by · with no flex-wrap, forcing 411px width.

Root cause:

.byline had no overflow-wrap — long em-dash phrases don't break at em-dashes by default
.masthead-foot-meta had display: flex but no flex-wrap: wrap — items extended beyond container width

Fix:

.byline { overflow-wrap: anywhere } — em-dashes and long words can break to fit
.masthead-foot-meta { flex-wrap: wrap } — footer links wrap onto multiple lines on narrow viewports

Lesson: Don't assume only the masthead causes mobile overflow. Audit the WHOLE page at narrow widths — footer, side panels, decorative content all contribute.

6 May — Moon `/explore/` cert grid: 37px overflow from long unbreakable cert slugs¶

Surface: aguidetocloud.com/guided/explore/ (Moon)

Symptom: Moon's cert grid at 375px viewport was 37px wider than viewport. Cards like paloalto-netsec-professional and hashicorp-terraform-associate had unbreakable slug strings forcing the 2-column grid track wider than viewport.

Root cause: CSS Grid auto-tracks (e.g., grid-template-columns: 1fr 1fr) respect each item's content minimum width. A long unbreakable string inside a grid item PUSHES the track wider, and if the word exceeds the cell width, the track keeps growing. The longest single card extended to 412px in a 375 viewport.

Fix:

@media (max-width: 640px) {
  .zt-lic-card { min-width: 0; }    /* allow grid track to shrink below content min-width */
  .zt-lic-card-code, .zt-lic-card-name {
    overflow-wrap: anywhere;
    word-break: break-word;
  }
}

min-width: 0 is the key — it tells flexbox/grid that the item is allowed to be smaller than its content's natural minimum width. Without it, the grid track stretches to fit the longest unbreakable word.

Lesson: Anywhere CSS Grid or Flexbox holds user-generated content (cert names, file paths, URLs), apply min-width: 0 + overflow-wrap: anywhere defensively.

Cross-reference: Cosmos Nav Playbook → Common gotchas (gotcha #4 — CSS Grid tracks expand)

6 May — PowerShell `Invoke-WebRequest` returns false 500 errors¶

Surface: SLA verification scripts

Symptom: Mid-session SLA check via Invoke-WebRequest (the alias curl resolves to in PowerShell) returned HTTP 500 on https://www.aguidetocloud.com/guided/data/questions/az-900.json AND on the practice page URL. Looked like the paid product was down. Almost triggered a revert.

Why I almost panicked: The 500 errors looked real. The error message was "Response status code does not indicate success: 500 (Internal Server Error)." in PowerShell.

Root cause: PowerShell's Invoke-WebRequest has TLS handshake or User-Agent quirks that fail on certain Cloudflare-fronted endpoints. Direct curl.exe (the actual binary, not the alias) returned all green: 200 with valid JSON, 200 on practice page, valid Stripe URL on checkout API.

Fix: Always invoke curl.exe explicitly (full name, with extension). The bare curl in PowerShell maps to the alias.

New guardrail: SLA scripts in this codebase MUST use curl.exe. Never use Invoke-WebRequest for production verification.

Lesson: When a SLA check goes red unexpectedly, before panicking — verify with a SECOND tool. The check tool itself can be wrong.

6 May — Hugo's HTML minifier strips attribute quotes; regex tests false-flag¶

Surface: Production verification scripts after Hugo deploys

Symptom: After deploying Brain Bar with the new .bb-header-left wrapper, the verification regex class="bb-header-left" reported 0 matches. Looked like the deploy was broken or the old version was still cached.

Root cause: Hugo's HTML minifier strips quotes around attribute values when they're safe (no whitespace, no special chars). class="bb-header-left" becomes class=bb-header-left. My regex was too strict.

Fix: Use lenient regexes that accept both quoted and unquoted forms: class=["]?bb-header-left. Or query the live HTML's DOM via Playwright instead of string-matching.

Lesson: Don't assume Hugo (or any minifier) preserves source attribute formatting. Build verifications against the MINIFIED output, not the source.

6 May — Tap targets across all planets were 36×36, not 44×44¶

Surface: All 5 planets

Symptom: Mobile audit flagged dozens of tap targets under the iOS 44×44 standard: - Earth/Moon: .site-logo-img-wrap (36×36), .theme-toggle (36×36) - Brain Bar: .bb-theme-toggle (39×33) - Shift: cosmos chips (36×36), hamburger ☰ (39×32), theme toggle ◐ (36×27) - Plain AI: cosmos chips (32×36), search ⌕ (28×40)

Why: The 36×36 was inherited from earlier visual design choices when the chips/buttons were borders-and-padding-driven. WCAG 2.2 AA passes at 24×24, so these all "passed accessibility" — but iOS HIG calls for 44×44 and that's the better target for thumb taps on phones.

Fix (one batch across 3 repos): Bumped every nav-area touchable to min-width: 44px; min-height: 44px. The visible icon stays its intrinsic size (24px or 30×24); the wrapper padding extends the tap area to 44.

New guardrail: Audit script flags any <a>, <button> in header / .site-nav / .bb-header / .masthead with clientWidth < 44 AND clientHeight < 44. Run before every nav-touching commit.

Lesson: WCAG ≠ iOS HIG. Pass both. The bigger spec is the better target.

5 May — Cloudflare `_redirects` silently not honoured¶

Surface: aguidetocloud.com (Hugo)

Symptom: Created static/_redirects with standard Cloudflare 301 syntax to handle the dead-page redirects from the May 2026 site cleanup (see Site Audit). After Cloudflare Pages deployed at the correct SHA (verified via gh api, deploy "completed: success"), all redirect rules returned 404 instead of 301, including a deliberately-added /redirect-test/ diagnostic rule. No build error, no deploy failure — _redirects was just silently ignored.

Why static checks didn't catch it: Hugo build was clean, public/_redirects was 1387 bytes locally, file format was textbook Cloudflare syntax. The only signal something was wrong was the missing 301 in the curl -sI response — which is exactly the verification any redirect work needs anyway.

Root cause: Unknown — multiple variants tried (CRLF→LF, UTF-8→ASCII, comments→no comments, static/_redirects→repo root). All failed identically across production AND deploy preview URL (<deploy-id>.aguidetocloud-revamp.pages.dev). Possibilities (none verified): build-output-dir setting in CF dashboard, functions/ directory routing conflict, undocumented CF parser behaviour.

Fix: Switched to Hugo's built-in aliases: frontmatter. Each destination page (content/cloud-labs/_index.md, content/blog/_index.md, content/service-health/_index.md, content/_index.md) added an aliases: list. Hugo generates static <meta http-equiv=refresh> + <link rel=canonical> HTML at the old paths. Returns 200, browser follows, canonical preserves SEO. Verified live across all 5 dead URLs.

New guardrail: When a redirect is required on aguidetocloud.com, default to Hugo aliases, not _redirects. Pattern documented in Site Audit → The Cloudflare _redirects gotcha. The static/_redirects file is kept in repo as a backup for if CF parser ever starts honouring it (would then issue proper 301s, served before static files — no harm in coexistence).

Lesson: Don't trust silent success from third-party platforms. Always verify URL-level behaviour with curl -sI after a redirect-related change, regardless of how clean the build looks. The cf-cache-status: DYNAMIC header was misleading too — it suggested no caching, but production was still serving an older version of /start/ for a while after deploy due to a deeper edge layer.

3 May — Cert-landing ↔ practice page checkout LOOP¶

Surface: Paid product (guided.aguidetocloud.com/{cert}/practice)

Symptom: Clicking "Unlock — $9" on the practice page navigated to /guided/{cert}/?checkout=1 (the cert landing page). Cert landing's slug-extraction was buggy on 2-segment URLs so checkout silently failed. Clicking again on cert landing routed back to practice. Paid product unclickable.

Why static SLA checks didn't catch it: All three SLA curl checks passed (questions JSON 200, practice 200, checkout API returns Stripe URL). But the user-facing click flow was broken.

Root cause: Both pages used <a href="?checkout=1"> for the unlock button. Cert landing's URL handler had a slug !== 'guided' guard but practice didn't. The slug-extraction bug compounded.

Fix: - Replaced both ?checkout=1 URL-nav patterns with direct <button> + inline <script> calling /guided/api/checkout and redirecting to Stripe. - Cert landing must contain id="cert-unlock-btn" (button, not <a>). - Practice page must contain id="practice-unlock-btn" (button, not <a>). - Neither page may have <a href="/guided/{cert}/?checkout=1"> or <a href="/guided/{cert}/practice/?checkout=1"> (the loop URLs).

New guardrail: testCheckoutFlow() in test-guided-qa.cjs — intercepts the API call, verifies the certCode payload, asserts no loop-href regex matches in the page HTML.

Lesson: Static-load tests cannot catch interaction bugs. Always add a Playwright click test for any new CTA on the paid product.

Cross-reference: Full session context in May 2026 Zen Shipping Log (see "P0 — Cert landing ↔ practice page checkout LOOP" section).

2 May — Hugo↔Astro nav drift (One Body Two Organs)¶

Surface: Public site nav (aguidetocloud.com and guided.aguidetocloud.com)

Symptom: Hugo nav got underline-active + Mind Maps + group dividers + ko-fi CTA in commit f3dda5c. Astro Header.astro was overlooked. Guided pages still showed the old rectangle pill and were missing Mind Maps for ~14 hours until 3 May session caught it.

Root cause: Single-organ thinking. The session believed it had "shipped the nav update" because the Hugo file was committed. The Astro pair was never opened.

Fix: Updated Header.astro with the same changes (translated CSS variable names from Hugo to Astro prefix conventions).

New guardrail: Pre-commit gate — before git push on any pair file, ask: "Did I touch the matching organ?" If no → DO NOT push. The full parity-file map is now in Zen Quickref.

Lesson: Hugo and Astro are one product, two implementations. Treat them as a single deploy unit.

2 May — Cross-session WIP contamination¶

Surface: Public site

Symptom: A homepage/nav session unintentionally shipped 2 WIP files from a parallel mind-maps session. Code that wasn't ready went live.

Root cause: git stash pop left files staged in the index. The next session did git add <paths> for its own files but didn't realise other files were already staged. git diff --staged --stat showed extra files, but the session assumed "stash pop must have grabbed extra" and committed anyway.

Fix: Reverted the bad commit. Re-staged only intended files.

New guardrail: Parallel-Safe Git Rules — never git add . or -A. Always explicit paths. Always read git status -s for M (capital M, leading space = staged by something else).

Lesson: In a multi-session workflow, the index is shared state. Treat it like a database connection — assume someone else might have written to it.

30 April — Practice exam outage (12 hours)¶

Surface: Paid product

Symptom: Practice exams stopped loading entirely. Three bugs shipped together: 1. Wrong URL path (/api/ vs /data/) 2. React Rules of Hooks violation (early return before useState/useMemo) 3. Untested SSR output

Root cause: A CC dashboard session changed question loading from inline props to runtime fetch without testing the SSR'd output. The early-return-before-hooks pattern is silently fatal in React.

Damage: ~12 hours of broken practice exams. Refund risk. Reputation damage.

Fix: Reverted the offending change. Implemented a different fetch pattern.

New guardrails: - The whole ~/.copilot/copilot-instructions.md § Practice Exam SLA section. - React Rules of Hooks codified: NEVER put early return before useState/useMemo/useEffect. All conditional returns go AFTER all hooks. - Build output check: after npm run build, verify (a) dist/data/questions/ has JSON files, (b) practice HTML page is <50KB (not 1MB — means data is being inlined again), (c) dataUrl in HTML points to /guided/data/questions/, NOT /guided/api/questions/. - Post-deploy curl check: curl -s https://www.aguidetocloud.com/guided/data/questions/az-900.json | head -c 100 must return JSON, not HTML. - "Revert first, investigate second" principle.

Lesson: Paid product = SLA. Don't debug on production. Revert, get it green, then investigate offline.

24 April — Practice quiz options not rendering (string format)¶

Surface: Paid product (ab-100 and similar certs)

Symptom: Practice quiz options had empty text and clicking one selected all of them.

Root cause: Question options can be string[] ("a. text") or Option[] ({id, text}). The normalizeQuestion() function handled both, but it was being skipped when perfData was empty (early return path), so string-format options got passed through raw.

Fix: Removed the early-return path. Made normalizeQuestion() always run.

New guardrail: test-guided-qa.cjs checks option text rendering on both ab-100 (string format) AND ab-730 (object format).

Lesson: When supporting two data formats, test on both — explicitly. Don't trust that "the code path looks like it handles both".

April — 16 tools shipped without FAQPage JSON-LD¶

Surface: Public site SEO

Symptom: During an SEO audit, 16 tools that had FAQ frontmatter were missing FAQPage JSON-LD. No rich snippets in Google.

Root cause: The new-tool checklist didn't include FAQPage JSON-LD. Each tool's template was a copy of an earlier tool, and the earlier tool didn't have the JSON-LD either.

Fix: Batch update of all 16 templates.

New guardrail: Tool Integration Checklist step 11 — FAQPage JSON-LD in template, mandatory for every tool with FAQ frontmatter.

Lesson: Copy-paste templates propagate omissions. Audit the whole class periodically.

April — Tool registered, shipped invisible¶

Surface: Public site

Symptom: A tool was added to data/toolkit_nav.toml but was missing from tool-ecosystem.html. The homepage card existed; the ecosystem strip didn't include it.

Root cause: Single-file thinking. The session thought "I added it to the registry, done."

Fix: Added to tool-ecosystem.html.

New guardrail: Tool Integration Checklist step 14 — full-site audit covers all 7 registry points + ecosystem groups + nav counts + preview images.

Lesson: "Registered" ≠ "shipped". Visibility is a separate concern from data presence.

20 May — Hardcoded tool counter badges removed (honesty rule + dead pipeline)¶

Surface: Every tool page on aguidetocloud.com — the small pill badge next to the title showing "X articles read / X prompts polished / X plans compared" etc. Rendered by layouts/partials/tool-header.html (Zen variant) and layouts/partials/tool-hero.html (older variant). 57 tools had entries in data/tool_counters.toml.

Symptom: Sush asked for an investigation of all tool counters — were they live or hardcoded? Investigation surfaced three independent counters on the site (full breakdown in the Realtime Counter Playbook § "DON'T CONFUSE"):

The "X reads" tool badge — claimed to be live, was hardcoded seed numbers in TOML.
Blog post "X reads" — synthetic formula (days × 8) + (words / 50), never real.
"🟢 X reading now" pill in nav/headers — genuinely live via GA4 Realtime + KV cache + cron worker.

User asked to remove #1 only. #2 and #3 untouched.

Why no static check would have caught it: the badge rendered perfectly. HTML validated, JS ran without errors, the animation looked credible. The bug was semantic — a number that should mean "this many people did X" actually meant "Sush typed this number into a TOML file on 13 April 2026."

Root causes (compounding):

Honesty rule violation. voice-and-tone.md explicitly says "reject self-stat blocks; use voices (testimonials with real handles)." Yet the entire data/tool_counters.toml system existed to display static numbers as if they were live usage data.
Refresh pipeline silently broken since at least 1 May 2026. .github/workflows/refresh-counters.yml ran monthly and was supposed to update the TOML from GA4. The script scripts/refresh-counters.py had the wrong GA4 property ID as default (270121818 — an old/unused property). The real site property used by functions/api/stats.js and functions/api/cosmos-summary.js is 530486519. The workflow's env: block did NOT pass GA4_PROPERTY_ID, so the broken default ran. 1 May 2026 run returned 0 page paths; validation guard correctly aborted; nobody noticed.
Coverage gap. The TOML had 57 [[tools]] entries. The refresh script's TOOL_PATHS dict only covered 28 of them. Even fixing the property ID would leave Mind Maps, Agent 365 Planner, all the games, and most of the calculator tools frozen at the floor of max(views, 50) forever — because the script doesn't know they exist.
JS comment lied about behaviour. static/js/tool-counter.js had a header comment claiming "JS animates the count on scroll and increments on user actions". No code in the repo POSTs an increment on click/copy/submit anywhere. The second clause was wishful documentation.
Past-session belief propagation. The 19 May 2026 Instruct Builder v5 polish session left a journal note (finding #6) saying "counter '120 instructions generated' is real... GA4-seeded across all tools. Removing for one tool would break site-wide consistency. Kept." — this was wrong on both clauses but had been accepted by a future session as authoritative. Lesson: journal claims should be re-verified against code, not inherited.

Fix (commit 3e514708, 20 May 2026):

data/tool_counters.toml → emptied to explicit tools = [] with a multi-line header comment explaining why. The file is kept, not deleted — Hugo's (index hugo.Data "tool_counters").tools chain on a missing TOML can be brittle in templates; an explicit empty array is safer than betting on nil.tools behaviour. Both partials already guard the render with {{ if gt $counterBase 0 }}, so empty array = range iterates zero times = $counterBase stays 0 = guard prevents render. No template change needed.
.github/workflows/refresh-counters.yml — removed the monthly schedule: cron trigger. Kept workflow_dispatch but gated the job on an explicit confirm_resurrect == 'true' input. Header comment marks the file DISABLED 2026-05-20. A manual run cannot silently regenerate the TOML — someone has to type true into the dispatch dialog.
NOT touched (deliberately):
tool-counter.js — its other code paths (freshness badge + live "🟢 reading now" pill) still work. The badge-animation block (if (el) { ... }) is now a safe no-op when no .tool-counter element renders.
tool-header.html / tool-hero.html partials — kept the data-lookup logic. They become no-ops naturally via the existing guard.
CSS for .tool-counter in static/css/style.css + static/css/zt-reading.css — orphan but harmless.
scripts/refresh-counters.py — orphan script left alone.
single.html blog post counter — independent code path, user said tools only.

Verification (post-deploy, live site):

tool-counter element gone from /prompts/, /ai-news/, /prompt-polisher/, /m365-roadmap/, /cert-tracker/roadmap/ (both partial paths covered).
Header structure (Focus Mode button + Ko-fi button) still renders cleanly; flex layout reflows naturally with no gap collapse.
/api/stats?realtime=1 returns active: 23 ✓ (live counter system untouched).
/api/stats?realtime=cosmos returns fresh data with age_s: 12 ✓.
/guided/data/questions/az-900.json returns 200 ✓ (SLA-protected practice exams unaffected).

Guardrail / lessons:

"Live" claims must include refresh evidence. When a file/system claims to be "auto-refreshed from X" (here: GA4), the next session that touches it must verify the refresh actually ran in the last expected window. Look at the workflow run history, not the file header comment.
Validation guards must escalate, not just abort. The validate_results() function in scripts/refresh-counters.py correctly refused to overwrite the TOML with 0-data, but the workflow's only signal of failure was a red dot in Actions that nobody watched. A failed validation that points at a critical data file should open a realtime-degraded-style auto-issue (mirroring the realtime probe pattern from 13 May) so silent breakage becomes visible.
"Site-wide consistency" is not a reason to keep something dishonest. A consistent lie is still a lie. If a feature fails the honesty rule, the right move is remove-everywhere, not keep-everywhere.
Re-verify inherited journal claims against code before relying on them. The 19 May session's "counter is real" finding looked authoritative but was unverified against the refresh pipeline's actual run history. Always check.
Empty data file > deleted data file when partials look it up via index hugo.Data. Reversibility + template safety + zero risk of nil chains.
Hard-disable the manual escape hatch when disabling a cron. Removing schedule: from a GHA workflow leaves workflow_dispatch as a one-click foot-gun for any future session that finds the workflow and tries to "test" it. Gate with a required confirmation input.

Cross-references:

Realtime Counter Playbook § "DON'T CONFUSE" — the disambiguation between this (now-removed) hardcoded counter and the still-live "X reading now" pill.
voice-and-tone.md — the honesty rule + brag-allergy this incident enforces.

How to use this log¶

Before adding a check that you think might be redundant: search this log first. The check probably has a 12-hour outage behind it.
Before removing a check that "looks paranoid": find the incident that drove it. If you can't find one, it's safer to leave the check.
After a new incident: add it here BEFORE deploying the fix. The guardrail goes live with the fix, not after.

Cross-references¶

Deployment Playbook — every step's incident origin
Parallel-Safe Git Rules — the 2 May incident expanded
Tool Integration Checklist — what shipped invisible and why
Zen System Quickref — the One Body Two Organs incident

Production Incident Log¶

2026¶

13 May — Realtime counter silent-zero (GA4 quota exhaustion)¶

14 May — Half-fix completed (cosmos error-cache) + KV refactor attempt failed¶

16 May — Permanent fix (Option A+): decouple public realtime from GA4 quota¶

10 May — Plain AI mobile-collapse blocked by CSP in production (false-green local)¶

8 May — Guided homepage v1 SampleExam focus mode broken (transparent overlay)¶

7 May — Plain AI mobile masthead "What changed" wraps to 2 lines¶

7 May — overflow-x:auto silently clips nav items¶

6 May — Phantom // other wires label after layout move¶

6 May — Shift .tape marquee leaks 60px past viewport¶

6 May — Shift footer .byline + .masthead-foot-meta force horizontal scroll¶

6 May — Moon /explore/ cert grid: 37px overflow from long unbreakable cert slugs¶

6 May — PowerShell Invoke-WebRequest returns false 500 errors¶

6 May — Hugo's HTML minifier strips attribute quotes; regex tests false-flag¶

6 May — Tap targets across all planets were 36×36, not 44×44¶

5 May — Cloudflare _redirects silently not honoured¶

3 May — Cert-landing ↔ practice page checkout LOOP¶

2 May — Hugo↔Astro nav drift (One Body Two Organs)¶

2 May — Cross-session WIP contamination¶

30 April — Practice exam outage (12 hours)¶

24 April — Practice quiz options not rendering (string format)¶

April — 16 tools shipped without FAQPage JSON-LD¶

April — Tool registered, shipped invisible¶

20 May — Hardcoded tool counter badges removed (honesty rule + dead pipeline)¶

How to use this log¶

Cross-references¶

7 May — `overflow-x:auto` silently clips nav items¶

6 May — Phantom `// other wires` label after layout move¶

6 May — Shift `.tape` marquee leaks 60px past viewport¶

6 May — Shift footer `.byline` + `.masthead-foot-meta` force horizontal scroll¶

6 May — Moon `/explore/` cert grid: 37px overflow from long unbreakable cert slugs¶

6 May — PowerShell `Invoke-WebRequest` returns false 500 errors¶

5 May — Cloudflare `_redirects` silently not honoured¶