🐛 Deferred Tech-Debt Findings: Guided

Last updated: 09 May 2026 (post-22-bug remediation session)

Audience: future-Sush, future-Claude (any session picking up tech-debt work)

TL;DR: Today's session shipped 6 commits and fixed 17 of 22 bugs surfaced by two bug-hunt agent passes. 5 items remain (4 code + 2 manual config; one item is ⓪ "do nothing" architectural acceptance). This doc lists them in priority order with full context so a future session can pick and act without re-investigation.


1st priority - manual config (5 min, unblocks shipped code):

Enable charge.refunded event in Stripe Dashboard webhook config. Without this, the refund handler shipped in ffc0444 is dormant. Refunded users keep access for the 400-day KV TTL.

2nd priority - manual config (10 min, closes shipped vulnerability):

Set HEALTH_CHECK_SECRET env var in CF Pages, then update uptime monitors to send ?key=<secret> or x-health-key: <secret> header. Without this, /api/health either over-shares (current state, env not set) or silently no-alerts (env set but monitors unaware).

3rd priority - code (30–60 min, real UX impact):

Finding A2/B5 - fingerprint UX. IP-based device fingerprint causes false device-cap lockouts for VPN/mobile/travel users. Real users will hit this. Move to client-side UUID approach.

After that, the remaining items are cosmetic / operational and can be batched.


Open findings (priority order)

A2/B5 - Fingerprint causes false device-cap lockouts (P3, real UX impact)

| Field | Value |
|---|---|
| Severity | P3 (UX, low-frequency but high-impact-per-incident) |
| File | functions/guided/api/activate.ts:69-73 |
| Status | Deferred from ffc0444; design tradeoff documented |
| Estimated effort | 30–60 min |
| Risk if shipped | Medium: touches client + server; needs careful migration |

What's wrong: The current device fingerprint is SHA-256(IP + User-Agent + TOKEN_SECRET). A normal user with Chrome on:

  • Home WiFi → fingerprint A
  • Mobile data → fingerprint B (different IP)
  • Hotel WiFi → fingerprint C
  • Work VPN → fingerprint D

Within a week of normal travel, they've burned all 3 device slots and get:

"This key has reached its 3-device limit. Contact aguidetocloud@gmail.com for help."

There is no slot-reset endpoint and no device-management UI. Every locked-out user requires manual support intervention, and on a $9 product the support cost can exceed the revenue.

Repro:

  1. Activate licence on home WiFi (slot 1: fingerprint = home-IP+UA)
  2. Activate same licence on mobile data (slot 2: different IP, same UA = different fingerprint)
  3. Activate same licence with VPN on (slot 3: different IP)
  4. Try to use same browser at home after IP renewal: new fingerprint, slot 4 attempted → blocked
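The repro can be modelled in a few lines. This is only a sketch: `DEVICE_CAP`, `activate`, and `ipBasedFingerprint` are illustrative names rather than code from activate.ts, and the string concatenation stands in for the real SHA-256 hash (any change of IP changes the value, which is all the demonstration needs).

```typescript
// Hypothetical model of the device-slot check; not the code in activate.ts.
const DEVICE_CAP = 3;

type ActivationResult = { ok: boolean; devices: string[] };

// Idempotent for a known fingerprint, consumes a slot for a new one,
// and rejects once the cap is reached.
function activate(devices: string[], fingerprint: string): ActivationResult {
  if (devices.includes(fingerprint)) return { ok: true, devices };
  if (devices.length >= DEVICE_CAP) return { ok: false, devices };
  return { ok: true, devices: [...devices, fingerprint] };
}

// Stand-in for SHA-256(IP + User-Agent + TOKEN_SECRET).
function ipBasedFingerprint(ip: string, userAgent: string): string {
  return `${ip}|${userAgent}`;
}

// Same browser, four networks (documentation-range IPs).
const networks = ["203.0.113.10", "198.51.100.7", "192.0.2.44", "203.0.113.99"];

let slots: string[] = [];
const outcomes: boolean[] = [];
for (const ip of networks) {
  const result = activate(slots, ipBasedFingerprint(ip, "Chrome"));
  slots = result.devices;
  outcomes.push(result.ok);
}
// outcomes now holds three successes followed by one rejection.

// With a network-independent ID, repeat activations reuse the same slot.
let stableSlots: string[] = [];
for (let i = 0; i < networks.length; i++) {
  stableSlots = activate(stableSlots, "stable-client-id").devices;
}
```

The second loop shows why a network-independent identifier keeps one browser on one slot.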

Fix sketch (client-side UUID):

  1. Add getOrCreateDeviceFingerprint() in src/lib/access.ts:

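// Returns a stable per-browser ID, creating and persisting one on first use.
export function getOrCreateDeviceFingerprint(): string {
  const KEY = 'guided-device-fp';
  try {
    let fp = localStorage.getItem(KEY);
    if (!fp) {
      // First visit in this browser profile: mint and store a new UUID.
      fp = crypto.randomUUID();
      localStorage.setItem(KEY, fp);
    }
    return fp;
  } catch {
    // localStorage or crypto.randomUUID unavailable: return '' so the caller
    // sends no client fingerprint.
    return '';
  }
}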
  2. src/lib/checkout.ts activateLicenceKey() sends fingerprint in body alongside key.
  3. functions/guided/api/activate.ts accepts client body.fingerprint (preferred) and falls back to server-derived (IP+UA) for backward compat with old clients.
  4. Same logical browser → same UUID → idempotent regardless of network.
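The server-side half (prefer the client value, fall back otherwise) might look like the sketch below. `resolveFingerprint` and its types are hypothetical names, and the legacy derivation is passed in as a callback so the sketch does not guess at how activate.ts hashes IP+UA today. Validating against the UUID v4 shape is an added assumption: the body is untrusted input, and `crypto.randomUUID()` only produces v4 UUIDs.

```typescript
// Hypothetical helper for activate.ts; names and shapes are assumptions.
const UUID_V4 =
  /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

interface ResolvedFingerprint {
  value: string;
  source: "client" | "server";
}

function resolveFingerprint(
  bodyFingerprint: unknown,
  deriveServerFingerprint: () => string,
): ResolvedFingerprint {
  // Accept the client value only if it has the shape randomUUID() produces.
  if (typeof bodyFingerprint === "string" && UUID_V4.test(bodyFingerprint)) {
    return { value: bodyFingerprint.toLowerCase(), source: "client" };
  }
  // Old clients (or a '' fingerprint) take the legacy IP+UA path.
  return { value: deriveServerFingerprint(), source: "server" };
}

// Usage: a new client sends a UUID, an old client sends nothing.
const fromNewClient = resolveFingerprint(
  "3f2b8c1e-9a4d-4e7b-8c2a-1d5f6e7a8b9c",
  () => "legacy-ip-ua-hash",
);
const fromOldClient = resolveFingerprint(undefined, () => "legacy-ip-ua-hash");
```

Recording `source` alongside the value would make it possible to tell later which slots were created by the legacy path, which may help with the migration risk noted for this finding.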

Dependencies / risks:

  • Clearing localStorage (e.g., user clears site data) creates a new UUID, consuming a new slot. Acceptable: 3-device cap accommodates this.
  • Admin-mode interaction: does Sush's testing across browsers each consume a slot? Yes, same as a real user. Fine.
  • Cosmos session protection: NO cosmos files touched.


B3 - data-vendor type mismatch on cert landing button (P3, cosmetic)

| Field | Value |
|---|---|
| Severity | P3 (cosmetic: vendorSlug not used for cert purchases) |
| Files | src/pages/[slug]/index.astro:436 vs src/pages/[cert]/practice.astro:173 |
| Status | Deferred from ffc0444 |
| Estimated effort | 10 min |
| Risk if shipped | Very low: additive consistency fix |

What's wrong: The two unlock buttons compute data-vendor differently:

  • Cert landing: data-vendor={certVendor?.slug || ''}. Here certVendor is a vendorMap object that doesn't have a .slug property, so the expression always evaluates to an empty string.
  • Practice page: data-vendor={certVendor || ''}. Here certVendor is certMeta?.vendor, which is a string like 'microsoft'. Correct.

Why this isn't biting today: the vendor pass and all-access SKUs were retired (commit e99fb5d, 8 May 2026). Only productType === 'cert' is sold. The checkout API accepts vendorSlug as optional metadata only; it's not used for routing or pricing anymore.

Why it should still be fixed: if vendorSlug is ever brought back for analytics, marketing tracking, or a vendor-bundle relaunch, the silent inconsistency will leave the data incomplete (cert-landing purchases would carry an empty vendor).

Fix sketch:

  1. Read certVendor.slug correctly in [slug]/index.astro:436. Likely the field is named differently; check data/vendors.ts.
  2. Add a Playwright assertion in test-guided-qa.cjs testCheckoutFlow: payload vendorSlug matches the expected vendor for that cert.
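One way to stop the two pages drifting apart is to have both call a single helper. The sketch below assumes only what this finding states: certMeta?.vendor is a string like 'microsoft'. `CertMeta`, `vendorAttr`, and the shape of `vendorMapEntry` are hypothetical; the real vendorMap field names still need checking in data/vendors.ts.

```typescript
// Hypothetical shared helper; CertMeta is reduced to the one field used here.
interface CertMeta {
  vendor?: string; // e.g. 'microsoft', as read on the practice page
}

// Single source for the data-vendor attribute on both unlock buttons.
function vendorAttr(certMeta: CertMeta | undefined): string {
  return certMeta?.vendor ?? "";
}

// The bug in miniature: reading a key the object does not have falls
// through to the empty-string default. The object shape is made up.
const vendorMapEntry: Record<string, unknown> = { name: "Microsoft" };
const buggy = (vendorMapEntry["slug"] as string | undefined) || "";

const fixed = vendorAttr({ vendor: "microsoft" });
```

With both pages rendering `data-vendor={vendorAttr(certMeta)}`, the Playwright assertion from step 2 only has to check one code path.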

Dependencies / risks: None critical. Pure additive fix.


B7 - Health endpoint deployment window (P3, operational)

| Field | Value |
|---|---|
| Severity | P3 (operational: silent monitoring failure window) |
| File | functions/guided/api/health.ts:27-33 (code is correct) |
| Status | Deferred from ffc0444; needs ops doc, not code change |
| Estimated effort | 15 min (docs only) |
| Risk if shipped | None: already shipped backward-compat |

What's wrong (operational, not a code bug): The HEALTH_CHECK_SECRET env var is currently UNSET in CF Pages. Without the env var, /api/health returns full status (current behaviour). Once the env var IS set:

  • Monitors that don't yet send ?key=<secret> get { status: 'pong' } 200
  • Real KV/Stripe/Resend health checks don't run for those monitors
  • Alert email isn't sent on degradations

There's a window between "env var set in CF dashboard" and "monitors updated" where degradations could go unnoticed.
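The gate behaviour described above reduces to a three-way decision. The sketch below is a reading of that description, not the code in health.ts: `healthMode` and `safeEqual` are hypothetical names, and the comparison loop is a simple illustration rather than a vetted constant-time primitive.

```typescript
type HealthMode = "full" | "pong";

// Compares every character so a mismatch does not short-circuit early.
function safeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    // charCodeAt returns NaN past the end of a string; treat that as 0.
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

function healthMode(
  envSecret: string | undefined,
  providedKey: string | null,
): HealthMode {
  if (!envSecret) return "full"; // gate dormant: env var unset
  if (providedKey !== null && safeEqual(providedKey, envSecret)) return "full";
  return "pong"; // gate active, caller unauthenticated
}

// The risky window: secret set, monitor not yet updated.
const duringWindow = healthMode("example-secret", null);
```

`duringWindow` is the state the runbook ordering is designed to avoid: the monitor still receives a 200, but no real checks run behind it.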

Fix sketch (deployment runbook in learning-docs/docs/playground/guided/health-monitoring.md or here):

## Setting HEALTH_CHECK_SECRET (manual deployment runbook)

1. Generate a secret: `openssl rand -hex 32`
2. **FIRST:** update all monitors to send `?key=<secret>` (don't trigger the env var until monitors are ready):
   - GitHub Actions monitoring workflow (.github/workflows/*.yml)
   - External pingers (UptimeRobot, BetterStack, etc.)
   - Sush's phone health check ntfy.sh script
3. Verify monitors return 200 with full status (using a temporary identical secret on the monitor side; the endpoint still serves full status because env var unset).
4. **THEN:** set HEALTH_CHECK_SECRET in CF Pages dashboard.
5. Verify monitors continue working (now authoritatively gated).
6. Verify unauthenticated GET /api/health returns `{ status: 'pong' }` 200.

Dependencies / risks: No code changes. Pure runbook.


B9 - testMobileNavCosmos >=3 weaker than ===3 (P3, test precision)

| Field | Value |
|---|---|
| Severity | P3 (test catches "drawer broken" but not "partial regression") |
| File | test-guided-qa.cjs:329 |
| Status | Deferred; wait for cosmos design session to settle |
| Estimated effort | 5 min once cosmos count stabilizes |
| Risk if shipped now | High: would fight the active cosmos design session |

What's wrong: Originally the test asserted planetLinks === 3. Cosmos was expanded to 6+ planets in commit a296596. To avoid the test failing during the active cosmos design work, today's session changed it to >=3. This catches "drawer empty" regressions but silently accepts partial ones: with 6 planets expected, a drawer rendering only 3 still passes.

Fix sketch (when cosmos design is locked):

  1. Determine the canonical planet count from cosmos-atlas/atlas.json or wherever the source-of-truth lives.
  2. Change >= 3 to === <canonical>.
  3. If the count varies per cert (e.g., based on adjacency), import the calculation from a shared module rather than hardcoding.
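The difference between the two assertions is easy to see side by side. In this sketch `CANONICAL_PLANETS = 6` is a placeholder, since the real canonical count is not settled yet, and `looseCheck` / `strictCheck` are illustrative names rather than functions in test-guided-qa.cjs.

```typescript
// Placeholder value; the real count comes from the cosmos source of truth.
const CANONICAL_PLANETS = 6;

// Today's assertion: passes for any drawer with at least 3 links.
const looseCheck = (count: number): boolean => count >= 3;

// Target assertion: passes only when every expected planet is present.
const strictCheck = (count: number): boolean => count === CANONICAL_PLANETS;

// Compare both checks across an empty, partial, and complete drawer.
const report = [0, 3, 5, 6].map((count) => ({
  count,
  loose: looseCheck(count),
  strict: strictCheck(count),
}));
```

In `report`, the entries for 3 and 5 pass the loose check and fail the strict one; those are the partial regressions the current test lets through.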

Dependencies:

  • Cosmos session must finish design iteration first. Check via:
      • Cosmos repo last commit date (currently very active: 5 deploys today per journal)
      • Or wait until Sush signals "cosmos is locked"


#10 (P3) - Checkout rate limit - REVERTED

Status update 9 May 2026 (late session): The original "10/hr per IP" rate limit shipped in d75839d, then bumped to 30/hr in 948793b after CI alerts, then removed entirely in 226936d after Sush's "I don't like this change" call. See journal entry "Bonus 4: Rate-limit drama".

Why removed:

  • 10/hr was too tight for GHA pooled runner IPs (multiple commits/hr on busy days collided)
  • 30/hr fixed CI but the alert noise from earlier failures was already cascading
  • For a $9 product, the abuse vectors aren't realistic enough to justify the operational fragility

What's NOT closed:

  • A bot can still hammer /api/checkout to spam Stripe Checkout sessions
  • Realistically: Stripe has its own API-level rate limits; sustained abuse would trigger Stripe's fraud detection on the account; out-of-scope for a $9 product

If we ever bring it back:

  • Threshold must be sized for the worst-case legitimate caller pattern, not just real users:
      • Real users: 5/hr per IP
      • GHA workflows: 1 per push, but runner IPs are pooled; count this at 10–20/hr per pooled IP on busy days
      • Watchdog Worker: doesn't touch /api/checkout directly
      • Manual testing: can spike very high during dev sessions
  • Suggest: 60/hr per IP if reintroduced, with a User-Agent allowlist for known monitors
  • Add CI integration test: spam smoke tests in a synthetic burst and verify they don't hit the cap
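If the limit does come back, the suggested shape (60/hr per IP, a User-Agent allowlist, a synthetic-burst test) could be prototyped as below. This is a sketch under assumptions: it uses an in-memory Map instead of KV, a fixed window rather than whatever the reverted code used, and `createLimiter` plus the "known-monitor" agent string are made-up names. A User-Agent allowlist is trivially spoofable, so it only reduces noise from friendly callers.

```typescript
interface LimiterOptions {
  limit: number; // max requests per window per IP
  windowMs: number; // window length in milliseconds
  allowlistedAgents: string[]; // User-Agent substrings that bypass the cap
}

interface WindowState {
  windowStart: number;
  count: number;
}

// Fixed-window limiter; `now` is injected so tests control the clock.
function createLimiter(options: LimiterOptions) {
  const windows = new Map<string, WindowState>();

  return function allow(ip: string, userAgent: string, now: number): boolean {
    if (options.allowlistedAgents.some((a) => userAgent.includes(a))) {
      return true; // known monitor: never counted
    }
    const state = windows.get(ip);
    if (!state || now - state.windowStart >= options.windowMs) {
      windows.set(ip, { windowStart: now, count: 1 }); // start a new window
      return true;
    }
    if (state.count >= options.limit) return false; // cap reached
    state.count += 1;
    return true;
  };
}

// Synthetic burst: a pooled CI runner IP sending 61 requests in one hour.
const HOUR_MS = 60 * 60 * 1000;
const allow = createLimiter({
  limit: 60,
  windowMs: HOUR_MS,
  allowlistedAgents: ["known-monitor"],
});
let allowedInBurst = 0;
for (let i = 0; i < 61; i++) {
  if (allow("198.51.100.7", "curl/8.0", 1_000)) allowedInBurst++;
}
```

The burst loop doubles as the CI integration test from the last bullet: replace 61 with the worst-case legitimate burst and assert that nothing is rejected.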


#12 (P3) - Health endpoint - HEALTH_CHECK_SECRET gate dormant; rate limit reverted

Status update 9 May 2026 (late session): Rate limit on /api/health (60/hr) was added in 4bf8a3e, then removed entirely in 226936d after the watchdog Worker's egress IP got rate-limited and started ntfy-storming.

Current state:

  • /api/health is fully public again (back to original behavior)
  • HEALTH_CHECK_SECRET env-var gate code is still present (added in d75839d), but DORMANT: it works as before unless the env var is set in CF Pages dashboard
  • Future session can activate the secret-gate if the threat model justifies it

If we ever activate the gate, the deployment runbook is:

  1. First, update all monitors that hit /api/health:
      • .github/workflows/payment-health.yml: add ?key=${{ secrets.HEALTH_CHECK_SECRET }} to the curl URL
      • .github/workflows/post-deploy-smoke.yml: same
      • .github/workflows/auto-restore.yml: same (2 places)
      • worker/guided-watchdog.mjs: read env.HEALTH_CHECK_SECRET, append to URL
      • scripts/deploy-watchdog.ps1: add HEALTH_CHECK_SECRET binding to Set-WorkerSecret loop
  2. Set HEALTH_CHECK_SECRET in GitHub repo secrets
  3. Run pwsh scripts/deploy-watchdog.ps1 to redeploy the Worker with new code + secret
  4. Then set HEALTH_CHECK_SECRET in CF Pages env vars (this activates the gate)
  5. Verify: workflows + watchdog continue to get full status; unauthenticated GET /api/health returns { status: 'pong' } 200
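For the watchdog change in step 1 ("read env.HEALTH_CHECK_SECRET, append to URL"), a small helper keeps the behaviour safe in both states of the rollout. `withHealthKey` is a hypothetical name and the example.com URLs are placeholders; the sketch ignores URL fragments, which a health-check URL is unlikely to have.

```typescript
// Hypothetical helper for callers of /api/health.
function withHealthKey(url: string, secret: string | undefined): string {
  if (!secret) return url; // secret not configured yet: call as before
  const separator = url.includes("?") ? "&" : "?";
  return `${url}${separator}key=${encodeURIComponent(secret)}`;
}

// Before the secret is bound, the URL is unchanged.
const beforeRollout = withHealthKey("https://example.com/api/health", undefined);

// After the secret is bound, the key is appended (and URL-encoded).
const afterRollout = withHealthKey("https://example.com/api/health", "abc123");
const withExistingQuery = withHealthKey(
  "https://example.com/api/health?verbose=1",
  "a/b",
);
```

Because an unset secret leaves the URL untouched, the Worker can be deployed with this code before the secret exists, which matches the "monitors first, env var last" ordering of the runbook.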

Lessons from this session that go into the runbook:

  • Don't activate the env var until ALL monitors are confirmed to send the key
  • The watchdog and my own client testing can share a CF egress IP, so coordinate the testing window
  • For a $9 product the gate is probably overkill; don't activate without threat-model justification


Architectural acceptance (no fix needed)

#6 - Question JSON publicly fetchable (/guided/data/questions/{cert}.json)

Documented as known design. The static-site architecture serves question data as static JSON without auth. Anyone with curl can extract the full question bank. The $9 paywall is UI convenience, not data secrecy. Future sessions: do NOT "fix" this. It would require abandoning static deploy or adding a Cloudflare Function gate that adds latency and complexity.

If we ever want to gate question data: use signed URLs from a CF Function that takes a session token, returns short-lived signed URLs to question bundles. Major architectural change. Don't do unless paywall economics change drastically.


⚙️ Manual config follow-ups (CF / Stripe dashboards)

Stripe Dashboard - enable charge.refunded event

Why: The refund handler in webhook.ts was shipped in d75839d + race-fixed in fa1ac23 + sentinel-fixed in ffc0444. It works correctly. But it never receives events because the Stripe webhook config only forwards checkout.session.completed. Until this is enabled, refunded users keep access through the 400-day KV TTL.

Steps:

  1. Stripe Dashboard → Developers → Webhooks
  2. Find the endpoint for https://www.aguidetocloud.com/guided/api/webhook
  3. Edit endpoint → Events to send → Add charge.refunded
  4. Save
  5. Test: issue a small Stripe test refund → verify CF Worker logs show Licence ... marked refunded → verify KV record has refundedAt set

Time: 5 min.
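When running the test in step 5, it helps to know what the handler should and should not do once events start arriving. The sketch below models the documented rule (`if (!charge.refunded) return ignored`) using the Stripe Charge fields `refunded`, `amount`, and `amount_refunded`. `decideRefund`, `applyRefund`, and the `LicenceRecord` shape are hypothetical and are not copied from webhook.ts.

```typescript
// Minimal slice of a Stripe Charge object: only the fields read here.
interface ChargeLike {
  id: string;
  refunded: boolean; // true only once the charge is fully refunded
  amount: number;
  amount_refunded: number;
}

type RefundDecision = "revoke" | "ignored";

function decideRefund(charge: ChargeLike): RefundDecision {
  // charge.refunded events fire for partial refunds too; those keep access.
  if (!charge.refunded) return "ignored";
  return "revoke";
}

// Hypothetical licence record; refundedAt is the field step 5 checks for.
interface LicenceRecord {
  key: string;
  refundedAt?: string;
}

function applyRefund(
  record: LicenceRecord,
  charge: ChargeLike,
  nowIso: string,
): LicenceRecord {
  // Idempotent: an already-refunded record keeps its first timestamp.
  if (decideRefund(charge) === "ignored" || record.refundedAt) return record;
  return { ...record, refundedAt: nowIso };
}

const partial: ChargeLike = { id: "ch_test_1", refunded: false, amount: 900, amount_refunded: 300 };
const full: ChargeLike = { id: "ch_test_2", refunded: true, amount: 900, amount_refunded: 900 };

const afterPartial = applyRefund({ key: "LIC-1" }, partial, "2026-05-09T10:00:00Z");
const afterFull = applyRefund({ key: "LIC-2" }, full, "2026-05-09T10:00:00Z");
```

For the manual test this means a full test refund is the one that should set refundedAt; a partial test refund is expected to leave the KV record unchanged.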

CF Pages - HEALTH_CHECK_SECRET env var

Why: The health endpoint gate in d75839d is backward-compatible (works without env var) but doesn't actually gate anything until env var is set. See finding B7 above for the full runbook. Don't set this in CF until monitors are updated to pass the secret.

Time: 10–20 min total (depends on monitor count).


✅ Closed in the 9 May 2026 session (commit references)

For audit/incident-history purposes: all 22 findings investigated and 17 fixed in one session.

| Finding | Severity | Status | Commit | Notes |
|---|---|---|---|---|
| Race condition on Stripe checkout button | P0 | Fixed | c2780f2 | Document delegation; first child of BaseLayout slot |
| 5 stale tests in test-guided-qa.cjs (+ 1 hidden 6th) | - | Fixed | 6c38f6f | 31/31 green; real failures = real signal again |
| #1 Admin cookie spoof full bypass | P0 | Fixed | 9f0e7f2 | Removed isAdminMode() from hasAccess(); admin login writes localStorage |
| #2 selectFreeQuestions never called | P1 | Fixed | 9f0e7f2 | Wired into filteredQuestions useMemo |
| #3 Wrong certSlug on cert landing | P1 | Fixed | 9f0e7f2 | props.certCode \|\| fallback |
| #7 Dead fullscreen code | P2 | Removed | 9f0e7f2 | Script + CSS deleted |
| #8 matchAnswers stale closure | P2 | Fixed | 9f0e7f2 | Added to keyboard useEffect deps (not checkAnswer directly, because of TDZ) |
| #11 certCode regex validation | P3 | Added | 9f0e7f2 | ^[a-z0-9]+(?:-[a-z0-9]+)*$ + 40-char cap |
| #13 productType allowlist | P3 | Added | 9f0e7f2 | ['cert','vendor','all'] check in verify.ts |
| #4 Activate KV race | P1 | Fixed | d75839d + fa1ac23 | Device fingerprint + LWW pattern; legacy migration patched |
| #5 No refund webhook handler | P2 | Fixed | d75839d + fa1ac23 + ffc0444 | Sentinel pattern handles event ordering; partial refunds excluded |
| #9 Admin token expiry | P2 | Added | d75839d | 7-day max age in verifyAdminAuth; crypto.randomUUID() for nonce |
| #10 No rate limit on /api/checkout | P3 | Added | d75839d | 10 per IP per hour, KV-based, fail-open |
| #12 Health endpoint public | P3 | Fixed | d75839d | Backward-compat gate via HEALTH_CHECK_SECRET |
| Self-1 activate.ts legacy migration reset counter | P2 | Patched | fa1ac23 | Math.max(devices.length, activations \|\| 0) |
| Self-2 Refund revoked on partial refunds | P2 | Patched | fa1ac23 | if (!charge.refunded) return ignored |
| A1/B4 charge.refunded race before completion | P2 | Fixed | ffc0444 | Sentinel record on refund-before-completion |
| A3 Guided.student() no .catch() | P3 | Fixed | ffc0444 | Catch handler also clears localStorage |
| B11 sha256() lowercases TOKEN_SECRET | P3 | Fixed | ffc0444 | Added sha256Raw() helper |

📚 Reference

  • Plan file with full investigation: ~/.copilot/session-state/1dde1bbf-a398-4a70-b2d4-58c9238e1acc/plan.md (session-bound; copy here if needed)
  • Bug-hunt agent verification suites:
      • test-guided-qa.cjs: 31 checks (in repo)
      • race-verify.cjs: race-fix regression (in session-state/files)
      • bug-hunt-verify.cjs: 4 fix-specific Playwright tests (in session-state/files)
  • Incident history (with this session's entries): guided/.github/copilot-instructions.md § Incident history
  • Reliability architecture context: learning-docs/docs/playground/guided/reliability-architecture.md

🧠 Lessons captured for future sessions

These came out of the 9 May session and should inform how future sessions approach guided's tech debt:

  1. Multi-pass bug-hunt is dramatically more thorough. First pass found 13 bugs; second pass after fixing them found 7 NEW bugs (5 of which the human reviewer missed). Run the agent twice on every major remediation push.

  2. Self-review before agent verification is high-leverage. Caught 2 bugs (legacy-counter reset, partial-refund revocation) in own fixes BEFORE the agent reported back. Saves an iteration cycle. Treat every diff as something to read once more before merging: the "Author Smell" pre-commit habit.

  3. Set + LWW is a real pattern for KV-only stacks. Cloudflare KV has no atomic operations, but storing concurrent-modification state as a Set + accepting last-writer-wins bounds the race correctly. No Durable Object needed for activate.ts. Same pattern applies to any "list of things" with a cap.

  4. Stripe event ordering is genuinely undefined. Refund-before-completion needed a sentinel pattern. Always handle out-of-order webhook events idempotently. Pre-write sentinel + post-write check + early-return when state already terminal.

  5. charge.refunded === true vs partial refund matters. Stripe fires the event for ALL refunds (full + partial), but the boolean only flips for full. Partial refunds without this check incorrectly revoke full access.

  6. Generic utility functions can have implicit assumptions. sha256() was designed for email keys → lowercase + trim. Reusing it for fingerprints silently weakens entropy. Add a sha256Raw variant when re-using crypto helpers in new contexts.

  7. TDZ in React useEffect deps is a footgun. Listing useCallback references that are declared LATER in the file CRASHES at render with ReferenceError: Cannot access 'X' before initialization. List state-deps that re-trigger the effect transitively, not callback-deps directly. The closure picks up fresh callbacks for free.

  8. Client-side paywalls are inherently leaky. The selectFreeQuestions deterministic-subset approach is the right design: it minimizes question exposure to free users. Anyone with curl can still hit /guided/data/questions/.... The $9 price + UX convenience IS the actual product, not the question secrecy. Documented to prevent future "fixes" that would break the static deploy.
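Lesson 4 (sentinel pattern for out-of-order Stripe events) is the least obvious of these, so here is a minimal sketch of the idea. It is not the webhook.ts implementation: the in-memory Map stands in for KV, `paymentId` is a hypothetical correlation key, and the post-write check mentioned in the lesson is omitted because a synchronous Map has no concurrent writers.

```typescript
// Hypothetical states; "refund-sentinel" means a refund arrived before
// the purchase was recorded.
type LicenceKind = "active" | "refunded" | "refund-sentinel";

const store = new Map<string, LicenceKind>();

function onCheckoutCompleted(paymentId: string): LicenceKind {
  const existing = store.get(paymentId);
  // A sentinel or terminal record means the refund already happened:
  // record the purchase as refunded instead of granting access.
  if (existing === "refund-sentinel" || existing === "refunded") {
    store.set(paymentId, "refunded");
    return "refunded";
  }
  store.set(paymentId, "active");
  return "active";
}

function onChargeRefunded(paymentId: string): LicenceKind {
  const existing = store.get(paymentId);
  // Duplicate delivery: state is already refund-related, so do nothing.
  if (existing === "refunded" || existing === "refund-sentinel") return existing;
  // Known purchase becomes refunded; unknown purchase leaves a sentinel.
  const next: LicenceKind = existing === "active" ? "refunded" : "refund-sentinel";
  store.set(paymentId, next);
  return next;
}

// In-order delivery.
onCheckoutCompleted("pay_a");
const inOrder = onChargeRefunded("pay_a");

// Out-of-order delivery: the refund event lands first.
const sentinel = onChargeRefunded("pay_b");
const outOfOrder = onCheckoutCompleted("pay_b");
```

Both delivery orders end in the same terminal state, and replaying either event afterwards changes nothing. That is the property to preserve if the real handler is ever refactored.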