🐛 Deferred Tech-Debt Findings — Guided¶

Last updated: 09 May 2026 — post-22-bug remediation session

Audience: future-Sush, future-Claude (any session picking up tech-debt work)

TL;DR: Today's session shipped 6 commits and fixed 17 of 22 bugs surfaced by two bug-hunt agent passes. 5 items remain (4 code + 2 manual config; one item is ⓪ "do nothing" architectural acceptance). This doc lists them in priority order with full context so a future session can pick and act without re-investigation.

🎯 Recommended next pick¶

1st priority — manual config (5 min, unblocks shipped code):

Enable charge.refunded event in Stripe Dashboard webhook config. Without this, the refund handler shipped in ffc0444 is dormant. Refunded users keep access for the 400-day KV TTL.

2nd priority — manual config (10 min, closes shipped vulnerability):

Set HEALTH_CHECK_SECRET env var in CF Pages, then update uptime monitors to send ?key=<secret> or x-health-key: <secret> header. Without this, /api/health either over-shares (current state, env not set) or silently no-alerts (env set but monitors unaware).

3rd priority — code (30–60 min, real UX impact):

Finding A2/B5 — fingerprint UX. IP-based device fingerprint causes false device-cap lockouts for VPN/mobile/travel users. Real users will hit this. Move to client-side UUID approach.

After that, the remaining items are cosmetic / operational and can be batched.

Open findings (priority order)¶

A2/B5 — Fingerprint causes false device-cap lockouts (P3, real UX impact)¶

Field	Value
Severity	P3 (UX, low-frequency but high-impact-per-incident)
File	`functions/guided/api/activate.ts:69-73`
Status	Deferred from `ffc0444` — design tradeoff documented
Estimated effort	30–60 min
Risk if shipped	Medium — touches client + server; needs careful migration

What's wrong: The current device fingerprint is SHA-256(IP + User-Agent + TOKEN_SECRET). A normal user with Chrome on: - Home WiFi → fingerprint A - Mobile data → fingerprint B (different IP) - Hotel WiFi → fingerprint C - Work VPN → fingerprint D

Within a week of normal travel, they've burned all 3 device slots and get:

"This key has reached its 3-device limit. Contact aguidetocloud@gmail.com for help."

There is no slot-reset endpoint and no device-management UI. Every locked-out user requires manual support intervention — and on a $9 product, the support cost can exceed the revenue.

Repro: 1. Activate licence on home WiFi (slot 1 — fingerprint = home-IP+UA) 2. Activate same licence on mobile data (slot 2 — different IP, same UA = different fingerprint) 3. Activate same licence with VPN on (slot 3 — different IP) 4. Try to use same browser at home after IP renewal — new fingerprint, slot 4 attempted → blocked

Fix sketch (client-side UUID): 1. Add getOrCreateDeviceFingerprint() in src/lib/access.ts:

export function getOrCreateDeviceFingerprint(): string {
  const KEY = 'guided-device-fp';
  try {
    let fp = localStorage.getItem(KEY);
    if (!fp) {
      fp = crypto.randomUUID();
      localStorage.setItem(KEY, fp);
    }
    return fp;
  } catch { return ''; }
}

2. src/lib/checkout.ts activateLicenceKey() sends fingerprint in body alongside key. 3. functions/guided/api/activate.ts accepts client body.fingerprint (preferred) and falls back to server-derived (IP+UA) for backward compat with old clients. 4. Same logical browser → same UUID → idempotent regardless of network.

Dependencies / risks: - Old localStorage cleared (e.g., user clears site data) creates a new UUID, consuming a new slot. Acceptable: 3-device cap accommodates this. - Need to think about admin-mode interaction (does Sush's testing across browsers each consume a slot? Yes — same as a real user. Fine.) - Cosmos session protection: NO cosmos files touched.

B3 — `data-vendor` type mismatch on cert landing button (P3, cosmetic)¶

Field	Value
Severity	P3 (cosmetic — vendorSlug not used for cert purchases)
Files	`src/pages/[slug]/index.astro:436` vs `src/pages/[cert]/practice.astro:173`
Status	Deferred from `ffc0444`
Estimated effort	10 min
Risk if shipped	Very low — additive consistency fix

What's wrong: The two unlock buttons compute data-vendor differently: - Cert landing: data-vendor={certVendor?.slug || ''} — but certVendor is a vendorMap object that doesn't have a .slug property. Always evaluates to empty string. - Practice page: data-vendor={certVendor || ''} — certVendor here is certMeta?.vendor which is a string like 'microsoft'. Correct.

Why this isn't biting today: the vendor pass and all-access SKUs were retired (commit e99fb5d, 8 May 2026). Only productType === 'cert' is sold. The checkout API accepts vendorSlug as optional metadata only; it's not used for routing or pricing anymore.

Why it should still be fixed: if vendorSlug is ever brought back for analytics, marketing tracking, or a vendor-bundle relaunch, the silent inconsistency will make data partial.

Fix sketch: 1. Read certVendor.slug correctly in [slug]/index.astro:436. Likely the field is named differently — check data/vendors.ts. 2. Add a Playwright assertion in test-guided-qa.cjs testCheckoutFlow: payload vendorSlug matches the expected vendor for that cert.

Dependencies / risks: - None critical. Pure additive fix.

B7 — Health endpoint deployment window (P3, operational)¶

Field	Value
Severity	P3 (operational — silent monitoring failure window)
File	`functions/guided/api/health.ts:27-33` (code is correct)
Status	Deferred from `ffc0444` — needs ops doc, not code change
Estimated effort	15 min (docs only)
Risk if shipped	None — already shipped backward-compat

What's wrong (operational, not a code bug): The HEALTH_CHECK_SECRET env var is currently UNSET in CF Pages. Without the env var, /api/health returns full status (current behaviour). Once the env var IS set: - Monitors that don't yet send ?key=<secret> get { status: 'pong' } 200 - Real KV/Stripe/Resend health checks don't run for those monitors - Alert email isn't sent on degradations

There's a window between "env var set in CF dashboard" and "monitors updated" where degradations could go unnoticed.

Fix sketch (deployment runbook in learning-docs/docs/playground/guided/health-monitoring.md or here):

## Setting HEALTH_CHECK_SECRET (manual deployment runbook)

1. Generate a secret: `openssl rand -hex 32`
2. **FIRST:** update all monitors to send `?key=<secret>` (don't trigger the env var until monitors are ready):
   - GitHub Actions monitoring workflow (.github/workflows/*.yml)
   - External pingers (UptimeRobot, BetterStack, etc.)
   - Sush's phone health check ntfy.sh script
3. Verify monitors return 200 with full status (using a temporary identical secret on the monitor side; the endpoint still serves full status because env var unset).
4. **THEN:** set HEALTH_CHECK_SECRET in CF Pages dashboard.
5. Verify monitors continue working (now authoritatively gated).
6. Verify unauthenticated GET /api/health returns `{ status: 'pong' }` 200.

Dependencies / risks: - No code changes. Pure runbook.

B9 — `testMobileNavCosmos` `>=3` weaker than `===3` (P3, test precision)¶

Field	Value
Severity	P3 (test catches "drawer broken" but not "partial regression")
File	`test-guided-qa.cjs:329`
Status	Deferred — wait for cosmos design session to settle
Estimated effort	5 min once cosmos count stabilizes
Risk if shipped now	High — would fight the active cosmos design session

What's wrong: Originally the test asserted planetLinks === 3. Cosmos was expanded to 6+ planets in commit a296596. To avoid the test failing during the active cosmos design work, today's session changed it to >=3. This catches "drawer empty" regressions but accepts a "4 of 6 planets missing" regression silently.

Fix sketch (when cosmos design is locked): 1. Determine the canonical planet count from cosmos-atlas/atlas.json or wherever the source-of-truth lives. 2. Change >= 3 to === <canonical>. 3. If the count varies per cert (e.g., based on adjacency), import the calculation from a shared module rather than hardcoding.

Dependencies: - Cosmos session must finish design iteration first. Check via: - Cosmos repo last commit date (currently very active — 5 deploys today per journal) - Or wait until Sush signals "cosmos is locked"

#10 (P3) — Checkout rate limit — REVERTED¶

Status update 9 May 2026 (late session): The original "10/hr per IP" rate limit shipped in d75839d, then bumped to 30/hr in 948793b after CI alerts, then removed entirely in 226936d after Sush's "I don't like this change" call. See journal entry "Bonus 4: Rate-limit drama".

Why removed: - 10/hr was too tight for GHA pooled runner IPs (multiple commits/hr on busy days collided) - 30/hr fixed CI but the alert noise from earlier failures was already cascading - For a $9 product, the abuse vectors aren't realistic enough to justify the operational fragility

What's NOT closed: - A bot can still hammer /api/checkout to spam Stripe Checkout sessions - Realistically: Stripe has its own API-level rate limits; sustained abuse would trigger Stripe's fraud detection on the account; out-of-scope for a $9 product

If we ever bring it back: - Threshold must be sized for the worst-case legitimate caller pattern, not just real users: - Real users: 5/hr per IP - GHA workflows: 1 per push, but runner IPs are pooled — count this at 10–20/hr per pooled IP on busy days - Watchdog Worker: doesn't touch /api/checkout directly - Manual testing: can spike very high during dev sessions - Suggest: 60/hr per IP if reintroduced, with a User-Agent allowlist for known monitors - Add CI integration test: spam smoke tests in a synthetic burst and verify they don't hit the cap

#12 (P3) — Health endpoint — `HEALTH_CHECK_SECRET` gate dormant; rate limit reverted¶

Status update 9 May 2026 (late session): Rate limit on /api/health (60/hr) was added in 4bf8a3e, then removed entirely in 226936d after the watchdog Worker's egress IP got rate-limited and started ntfy-storming.

Current state: - /api/health is fully public again (back to original behavior) - HEALTH_CHECK_SECRET env-var gate code is still present (added in d75839d), but DORMANT — works as before unless the env var is set in CF Pages dashboard - Future session can activate the secret-gate if the threat model justifies it

If we ever activate the gate, the deployment runbook is: 1. First — update all monitors that hit /api/health: - .github/workflows/payment-health.yml — add ?key=${{ secrets.HEALTH_CHECK_SECRET }} to the curl URL - .github/workflows/post-deploy-smoke.yml — same - .github/workflows/auto-restore.yml — same (2 places) - worker/guided-watchdog.mjs — read env.HEALTH_CHECK_SECRET, append to URL - scripts/deploy-watchdog.ps1 — add HEALTH_CHECK_SECRET binding to Set-WorkerSecret loop 2. Set HEALTH_CHECK_SECRET in GitHub repo secrets 3. Run pwsh scripts/deploy-watchdog.ps1 to redeploy the Worker with new code + secret 4. Then — set HEALTH_CHECK_SECRET in CF Pages env vars (this activates the gate) 5. Verify: workflows + watchdog continue to get full status; unauthenticated GET /api/health returns { status: 'pong' } 200

Lessons from this session that go into the runbook: - Don't activate the env var until ALL monitors are confirmed to send the key - The watchdog and my own client testing can share a CF egress IP — coordinate testing window - For a $9 product the gate is probably overkill — don't activate without threat-model justification

Architectural acceptance (no fix needed)¶

#6 — Question JSON publicly fetchable (`/guided/data/questions/{cert}.json`)¶

Documented as known design. The static-site architecture serves question data as static JSON without auth. Anyone with curl can extract the full question bank. The $9 paywall is UI convenience, not data secrecy. Future sessions: do NOT "fix" this — it would require abandoning static deploy or adding a Cloudflare Function gate that adds latency and complexity.

If we ever want to gate question data: use signed URLs from a CF Function that takes a session token, returns short-lived signed URLs to question bundles. Major architectural change. Don't do unless paywall economics change drastically.

⚙️ Manual config follow-ups (CF / Stripe dashboards)¶

Stripe Dashboard — enable `charge.refunded` event¶

Why: The refund handler in webhook.ts was shipped in d75839d + race-fixed in fa1ac23 + sentinel-fixed in ffc0444. It works correctly. But it never receives events because the Stripe webhook config only forwards checkout.session.completed. Until this is enabled, refunded users keep access through the 400-day KV TTL.

Steps: 1. Stripe Dashboard → Developers → Webhooks 2. Find the endpoint for https://www.aguidetocloud.com/guided/api/webhook 3. Edit endpoint → Events to send → Add charge.refunded 4. Save 5. Test: issue a small Stripe test refund → verify CF Worker logs show Licence ... marked refunded → verify KV record has refundedAt set

Time: 5 min.

CF Pages — `HEALTH_CHECK_SECRET` env var¶

Why: The health endpoint gate in d75839d is backward-compatible (works without env var) but doesn't actually gate anything until env var is set. See finding B7 above for the full runbook. Don't set this in CF until monitors are updated to pass the secret.

Time: 10–20 min total (depends on monitor count).

✅ Closed in the 9 May 2026 session (commit references)¶

For audit/incident-history purposes — all 22 findings investigated and 17 fixed in one session.

Finding	Severity	Status	Commit	Notes
Race condition on Stripe checkout button	P0	Fixed	`c2780f2`	Document delegation; first child of BaseLayout slot
5 stale tests in `test-guided-qa.cjs` (+ 1 hidden 6th)	—	Fixed	`6c38f6f`	31/31 green; real failures = real signal again
#1 Admin cookie spoof full bypass	P0	Fixed	`9f0e7f2`	Removed `isAdminMode()` from `hasAccess()`; admin login writes localStorage
#2 `selectFreeQuestions` never called	P1	Fixed	`9f0e7f2`	Wired into `filteredQuestions` useMemo
#3 Wrong `certSlug` on cert landing	P1	Fixed	`9f0e7f2`	`props.certCode \\|\\|` fallback
#7 Dead fullscreen code	P2	Removed	`9f0e7f2`	Script + CSS deleted
#8 `matchAnswers` stale closure	P2	Fixed	`9f0e7f2`	Added to keyboard useEffect deps (not `checkAnswer` directly — TDZ)
#11 `certCode` regex validation	P3	Added	`9f0e7f2`	`^[a-z0-9]+(?:-[a-z0-9]+)*$` + 40-char cap
#13 `productType` allowlist	P3	Added	`9f0e7f2`	`['cert','vendor','all']` check in verify.ts
#4 Activate KV race	P1	Fixed	`d75839d` + `fa1ac23`	Device fingerprint + LWW pattern; legacy migration patched
#5 No refund webhook handler	P2	Fixed	`d75839d` + `fa1ac23` + `ffc0444`	Sentinel pattern handles event ordering; partial refunds excluded
#9 Admin token expiry	P2	Added	`d75839d`	7-day max age in `verifyAdminAuth`; `crypto.randomUUID()` for nonce
#10 No rate limit on `/api/checkout`	P3	Added	`d75839d`	10 per IP per hour, KV-based, fail-open
#12 Health endpoint public	P3	Fixed	`d75839d`	Backward-compat gate via `HEALTH_CHECK_SECRET`
Self-1 `activate.ts` legacy migration reset counter	P2	Patched	`fa1ac23`	`Math.max(devices.length, activations \|\| 0)`
Self-2 Refund revoked on partial refunds	P2	Patched	`fa1ac23`	`if (!charge.refunded) return ignored`
A1/B4 charge.refunded race before completion	P2	Fixed	`ffc0444`	Sentinel record on refund-before-completion
A3 `Guided.student()` no `.catch()`	P3	Fixed	`ffc0444`	Catch handler also clears localStorage
B11 `sha256()` lowercases TOKEN_SECRET	P3	Fixed	`ffc0444`	Added `sha256Raw()` helper

📚 Reference¶

Plan file with full investigation: ~/.copilot/session-state/1dde1bbf-a398-4a70-b2d4-58c9238e1acc/plan.md (session-bound; copy here if needed)
Bug-hunt agent verification suites:
test-guided-qa.cjs — 31 checks (in repo)
race-verify.cjs — race-fix regression (in session-state/files)
bug-hunt-verify.cjs — 4 fix-specific Playwright tests (in session-state/files)
Incident history (with this session's entries): guided/.github/copilot-instructions.md § Incident history
Reliability architecture context: learning-docs/docs/playground/guided/reliability-architecture.md

🧠 Lessons captured for future sessions¶

These came out of the 9 May session and should inform how future sessions approach guided's tech debt:

Multi-pass bug-hunt is dramatically more thorough. First pass found 13 bugs; second pass after fixing them found 7 NEW bugs (5 of which the human reviewer missed). Run the agent twice on every major remediation push.
Self-review before agent verification is high-leverage. Caught 2 bugs (legacy-counter reset, partial-refund revocation) in own fixes BEFORE the agent reported back. Saves an iteration cycle. Treat every diff as something to read once more before merging — "Author Smell" pre-commit habit.
Set + LWW is a real pattern for KV-only stacks. Cloudflare KV has no atomic operations, but storing concurrent-modification state as a Set + accepting last-writer-wins bounds the race correctly. No Durable Object needed for activate.ts. Same pattern applies to any "list of things" with a cap.
Stripe event ordering is genuinely undefined. Refund-before-completion needed a sentinel pattern. Always handle out-of-order webhook events idempotently. Pre-write sentinel + post-write check + early-return when state already terminal.
charge.refunded === true vs partial refund matters. Stripe fires the event for ALL refunds (full + partial), but the boolean only flips for full. Partial refunds without this check incorrectly revoke full access.
Generic utility functions can have implicit assumptions. sha256() was designed for email keys → lowercase + trim. Reusing it for fingerprints silently weakens entropy. Add a sha256Raw variant when re-using crypto helpers in new contexts.
TDZ in React useEffect deps is a footgun. Listing useCallback references that are declared LATER in the file CRASHES at render with ReferenceError: Cannot access 'X' before initialization. List state-deps that re-trigger the effect transitively, not callback-deps directly. The closure picks up fresh callbacks for free.
Client-side paywalls are inherently leaky. The selectFreeQuestions deterministic-subset approach is the right design — it minimizes question exposure to free users. Anyone with curl can still hit /guided/data/questions/.... The $9 price + UX convenience IS the actual product, not the question secrecy. Documented to prevent future "fixes" that would break the static deploy.