🛡️ Reliability & Redundancy Architecture
Last updated: 03 May 2026 (post-Michelle incident, full belt-and-braces buildout)
Audience: future-Sush, future-Claude, anyone investigating "what happens when X breaks"
TL;DR: the Guided platform has 9 independent layers of detection, alerting, self-healing, and recovery. A single env-var wipe used to mean a 5-hour outage and a paid customer messaging on LinkedIn. Now: detected within 5 min, auto-healed within 8 min, customer waits at most one Stripe-retry cycle for their email.
🚨 The incident that birthed this architecture
03 May 2026 — Michelle Alexander (wrightalways09@gmail.com) paid $18 for two practice exams (AB-620 + AB-900). She got nothing — no licence keys, no email, no receipt. She found Sush via LinkedIn 5 hours later.
Three simultaneous bugs:
| # | Bug | Layer |
|---|---|---|
| 1 | certCode='guided' shipped to Stripe metadata | Code (cert-landing slug-extraction) |
| 2 | STRIPE_WEBHOOK_SECRET wiped from CF Pages env vars | Operational (CF API PATCH side-effect) |
| 3 | RESEND_API_KEY also wiped | Operational (same wipe class) |
Customer impact: $18 paid, zero received, hard reputation hit on a paid product.
Recovery on the day: env vars restored, KV manually populated with correct cert codes + idempotency locks (so Stripe retries see "already processed"), apology email sent with both keys.
Lesson burned in: every single layer of the failure was findable in advance with the right monitoring. None of it was an architectural problem. It was operational maturity debt accumulated faster than it could be paid down. The 9 layers below are the payment.
🏗️ Architecture overview
┌─────────────────────────┐
│ Sush's phone (ntfy.sh) │
└──────────▲──────────────┘
│ push
┌──────────────────┐ ┌──────────────┴──────────────┐
│ Stripe alerts │────────▶│ GH Actions cron (10 min) │──┐
│ (email, opt) │ │ /api/health probe │ │
└──────────────────┘ └─────────────────────────────┘ │
▲ │
┌──────────────────┐ │ workflow_dispatch│
│ CF Worker │────────────────────────┤ │
│ watchdog (5 min)│ │ │
└──────────────────┘ │ │ degraded
│ │
┌────────────────┴────────────────┐ │
│ UptimeRobot (5 min, external) │ │
│ → email + their app push │ │
└─────────────────────────────────┘ │
│
┌───────────────────────────────────▼┐
│ Auto-restore workflow: │
│ 1. Pre-check (skip if recovered) │
│ 2. Verify CF token from runner │
│ 3. PATCH CF Pages env vars │
│ 4. Trigger redeploy │
│ 5. Wait for deploy (max 6 min) │
│ 6. Verify health is now green │
│ 7. ntfy: 🟢 success / 🚨 failure │
└────────────────────────────────────┘
┌──────────────────┐ ┌────────────────────────────┐
│ Pages Functions │────────▶│ Sentry (when DSN set) │
│ webhook/verify/… │ │ stack traces + replay │
└──────────────────┘ └────────────────────────────┘
┌─────────────────┐ ┌────────────────────────────┐
│ GUIDED_KV │─ daily ─▶│ GH Actions artifact │
└─────────────────┘ │ gzipped JSON snapshots │
└────────────────────────────┘
Public: https://www.aguidetocloud.com/guided/status/
Synthetic checkout test: every cron tick (catches slug-bug class)
Post-deploy smoke test: after every push (catches click-flow regressions)
🔌 The 9 layers
| # | Layer | Cadence | Latency | Auto-fixes? | File |
|---|---|---|---|---|---|
| 1 | Auto-restore workflow | On workflow_run failure | ~3–4 min | ✅ Full self-heal | .github/workflows/auto-restore.yml |
| 2 | ntfy.sh phone push | On every alert event | seconds | ❌ Alert only | embedded in workflows |
| 3 | UptimeRobot | 5 min, external | ~5–10 min | ❌ Email alert only | UptimeRobot dashboard |
| 4 | Stripe webhook alerts | On Stripe-side failure | ~30s | ❌ Email alert only | Stripe Dashboard → Webhooks |
| 5 | Synthetic checkout test | Every cron tick | <1s | ❌ Detection only | payment-health.yml |
| 6 | Daily KV backup | 03:00 UTC daily | <1 min | ❌ Recovery aid | kv-backup.yml |
| 7 | Sentry error tracking | On every Function exception | seconds | ❌ Visibility only | (when DSN configured) |
| 8 | Public status page | Polls every 60s client-side | seconds | ❌ Customer trust | src/pages/status.astro |
| 9 | CF Worker watchdog | 5 min, on CF infra | ~5–8 min | ✅ Triggers auto-restore | worker/guided-watchdog.mjs |
Plus the in-code safety nets (Webhook fail-loud, Resend Idempotency-Key, Stripe 3-day retry window) covered separately below.
🩺 Detection layer: three independent crons
The whole system depends on someone noticing degraded health within 5–10 min. We have three independent watchers, each with different reliability characteristics:
9a. CF Worker watchdog (primary)
File: worker/guided-watchdog.mjs deployed to CF Workers as a separate Worker (NOT a Pages Function — Pages Functions don't support cron triggers).
- Cron: */5 * * * * (every 5 min, on CF's reliable cron infrastructure)
- What it does:
  - fetch /api/health
  - If status !== 'healthy' → ntfy push (priority=5) + GitHub API workflow_dispatch of auto-restore.yml
- Bindings (Worker secrets): NTFY_TOPIC, GH_PAT
- Bindings (plain text): GH_REPO, PROD_HEALTH
- Endpoints:
  - GET /: info
  - GET /__health: proxies prod health (debug)
  - POST /__trigger: runs full code path manually
- Manual test URL: https://guided-watchdog.susanth-ss.workers.dev/__trigger
- Deploy: pwsh C:\ssClawy\guided\scripts\deploy-watchdog.ps1
Why this exists separately from the GH Actions cron: GitHub's free-tier scheduled workflows have 5–30 min drift during busy periods. Today's stress test (03 May 2026) showed the GH cron sat for 13 min after a forced wipe before firing. CF Workers cron is reliable to within ~1 min.
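A minimal sketch of the watchdog's per-tick decision, for orientation only. The health payload shape ({ status }), the function names, and the repo/ref values are assumptions for illustration and are not taken from worker/guided-watchdog.mjs. The dispatch URL follows GitHub's documented workflow_dispatch REST route.

```typescript
// Hypothetical shape of the /api/health response; only `status` is assumed.
type Health = { status: string };

type WatchdogAction =
  | { kind: "none" }
  | { kind: "alert-and-dispatch"; ntfyPriority: 5; workflow: "auto-restore.yml" };

// Decide what one cron tick should do with the health response it fetched.
// Anything other than an explicit "healthy" counts as degraded, including a
// failed fetch or an unparseable body (passed in here as null).
function decideWatchdogAction(health: Health | null): WatchdogAction {
  if (health !== null && health.status === "healthy") {
    return { kind: "none" };
  }
  return { kind: "alert-and-dispatch", ntfyPriority: 5, workflow: "auto-restore.yml" };
}

// Build the GitHub REST request that dispatches the workflow.
// `repo` is "owner/name"; `pat` and `ref` are placeholders supplied by the caller.
function buildDispatchRequest(repo: string, pat: string, ref: string) {
  return {
    url: `https://api.github.com/repos/${repo}/actions/workflows/auto-restore.yml/dispatches`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${pat}`,
        Accept: "application/vnd.github+json",
        "User-Agent": "guided-watchdog", // GitHub rejects requests without a User-Agent
      },
      body: JSON.stringify({ ref }),
    },
  };
}
```

Treating "could not read health" the same as "degraded" is a deliberate bias in this sketch: a false alarm costs one skipped auto-restore run (its pre-check exits early), while a missed alarm costs a customer.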
9b. GitHub Actions cron (backup)
File: .github/workflows/payment-health.yml
- Cron: */10 * * * * (every 10 min)
- What it does:
  - Curls /api/health, fails the run if not "status":"healthy"
  - Stripe webhook delivery health (counts pending webhooks)
  - Synthetic checkout: POST to /api/checkout with az-900, retrieve resulting Stripe session, assert metadata.certCode === 'az-900' (catches the slug-bug class)
- On failure: ntfy push, opens GitHub issue, also triggers auto-restore.yml via workflow_run event
- Drift is ok here because the CF Worker is faster
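The synthetic checkout assertion is small enough to sketch. The version below assumes the check receives an already-retrieved Stripe Checkout Session reduced to the two fields it needs; the type and function names are hypothetical, and the real check lives in payment-health.yml.

```typescript
// Minimal subset of a Stripe Checkout Session, as assumed for this sketch.
type CheckoutSession = { id: string; metadata: Record<string, string> | null };

type CheckResult = { ok: true } | { ok: false; reason: string };

// Compare the certCode that reached Stripe metadata with the slug we asked for.
// A missing value and a wrong value are both failures: the 03 May bug shipped
// 'guided' instead of the cert slug, which only an equality check catches.
function checkCertCode(session: CheckoutSession, expected: string): CheckResult {
  const actual = session.metadata ? session.metadata["certCode"] : undefined;
  if (actual === undefined) {
    return { ok: false, reason: "metadata.certCode missing" };
  }
  if (actual !== expected) {
    return { ok: false, reason: `metadata.certCode was '${actual}', expected '${expected}'` };
  }
  return { ok: true };
}
```

Returning a reason string rather than a bare boolean keeps the ntfy push and the GitHub issue body self-explanatory.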
9c. UptimeRobot (independent third party)
External vendor, free tier, 5-min interval, 4 monitors:
| Monitor | URL | Type |
|---|---|---|
| Health endpoint | /guided/api/health | Keyword: healthy (alert if missing) |
| Cert landing | /guided/az-900/ | HTTP(s) 200 |
| Practice page | /guided/az-900/practice/ | HTTP(s) 200 |
| Questions JSON | /guided/data/questions/az-900.json | HTTP(s) 200 |
Alert contacts: Email (susanth.ss@gmail.com), optional UptimeRobot mobile app push.
Why three crons? If GH Actions itself is having an outage, the CF Worker still detects. If both are down, UptimeRobot still alerts via email. Three independent failure modes covered.
🩹 Self-healing layer: auto-restore workflow
File: .github/workflows/auto-restore.yml
Triggers:
- workflow_run on 🏥 Payment Health Monitor completion with failure conclusion
- workflow_dispatch (manual or via CF Worker watchdog API call)
Steps:
- Re-check health — avoid PATCHing on transient blips
- Verify CF API token works from this runner — diagnostic, fails fast on IP allowlist or whitespace-corrupted tokens. Sends ntfy push with the dashboard URL to fix it.
- PATCH Cloudflare Pages env vars with all 7 secrets from GH repo secrets
- Trigger redeploy so Functions reload with the new values
- Wait for deploy (max 6 min)
- Verify health is now green — fail-loud if not
- Phone push — 🟢 on success, 🚨 + GitHub issue on failure
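The control flow of the steps above can be sketched with every side effect injected. This is not the workflow itself (that is YAML in auto-restore.yml); the dependency names, the outcome labels, and the 15-second poll interval are assumptions, and the sketch folds "wait for deploy" and "verify health" into a single polling loop.

```typescript
// Hypothetical side-effect hooks; the real steps are shell commands in the workflow.
type RestoreDeps = {
  isHealthy: () => Promise<boolean>;
  verifyToken: () => Promise<boolean>;
  patchEnvVars: () => Promise<void>;
  triggerRedeploy: () => Promise<void>;
  sleep: (ms: number) => Promise<void>;
};

type RestoreOutcome = "skipped-already-healthy" | "token-invalid" | "restored" | "still-degraded";

async function autoRestore(
  deps: RestoreDeps,
  maxWaitMs: number = 6 * 60 * 1000, // "max 6 min" from the step list
  pollMs: number = 15 * 1000, // assumed poll interval
): Promise<RestoreOutcome> {
  // Re-check first so a transient blip never causes a PATCH.
  if (await deps.isHealthy()) return "skipped-already-healthy";
  // Fail fast on a broken token before touching production config.
  if (!(await deps.verifyToken())) return "token-invalid";
  await deps.patchEnvVars();
  await deps.triggerRedeploy();
  // Poll until health is green or the deadline passes.
  for (let waited = 0; waited < maxWaitMs; waited += pollMs) {
    await deps.sleep(pollMs);
    if (await deps.isHealthy()) return "restored";
  }
  return "still-degraded"; // the caller would send the failure push here
}
```

The outcome value is what a final notification step would branch on: "restored" maps to the green push, while "token-invalid" and "still-degraded" map to the failure push plus a GitHub issue.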
Mirrored secrets (GH repo secrets → CF env vars):
ADMIN_PASSWORD ← guided-admin-password
CLOUDFLARE_API_TOKEN ← cloudflare-api-token (used to PATCH itself)
RESEND_API_KEY ← resend-api-key
STRIPE_PUBLISHABLE_KEY ← stripe-live-publishable-key
STRIPE_SECRET_KEY ← stripe-live-secret-key
STRIPE_WEBHOOK_SECRET ← stripe-live-webhook-secret
TOKEN_SECRET ← guided-token-secret
NTFY_TOPIC (only used in workflows, not pushed to CF)
PERSONAL_PAT (legacy, not used after switching to github.token)
STRIPE_LIVE_SECRET_KEY (alias of STRIPE_SECRET_KEY for the webhook-pending check)
Trust model: GH repo secrets and CF env vars hold the same values. If either account is compromised, the system is exposed. Same trust surface either way; the redundancy is for availability, not security.
📦 Code-level safety nets
These are inside functions/guided/api/*.ts — they make the system resilient to env-var loss, race conditions, and transient failures.
Deterministic licence keys
File: functions/lib/utils.ts — deriveLicenceKey(sessionId, secret)
HMAC-SHA256(TOKEN_SECRET, session.id) → 12 chars from "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
→ "GD-XXXX-XXXX-XXXX"
Both the webhook and /api/verify derive the same key from the same session.id. No more split-brain (where verify mints a random key A, webhook mints random key B, customer ends up with two records).
Backward-compat: webhook checks session:${id} first; if a pre-deterministic random key already exists for that session, it respects it.
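A sketch of what a deterministic derivation along these lines could look like. Two things here are assumptions, not the contents of functions/lib/utils.ts: the byte-to-character mapping (one digest byte per character, modulo 32), and the use of node:crypto to keep the example synchronous. A Pages Function would more likely use Web Crypto (crypto.subtle).

```typescript
import { createHmac } from "node:crypto";

// Alphabet from the doc: 32 symbols, no 0/O/1/I lookalikes.
const ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";

// Same (sessionId, secret) in, same key out. No randomness anywhere.
function deriveLicenceKey(sessionId: string, secret: string): string {
  const digest = createHmac("sha256", secret).update(sessionId).digest();
  let chars = "";
  for (let i = 0; i < 12; i++) {
    // 256 is a multiple of 32, so the modulo introduces no bias.
    chars += ALPHABET.charAt(digest.readUInt8(i) % ALPHABET.length);
  }
  return `GD-${chars.slice(0, 4)}-${chars.slice(4, 8)}-${chars.slice(8, 12)}`;
}
```

Because the key is a pure function of session.id and TOKEN_SECRET, the webhook and /api/verify can race each other freely and still converge on one record. The flip side is that rotating TOKEN_SECRET changes every derived key, which is one reason to check for an existing session:${id} entry before deriving.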
Fail-loud webhook
File: functions/guided/api/webhook.ts
Returns 500 (not silent 200) on:
- Missing email in paid session
- Missing productType metadata
- RESEND_API_KEY not set
- Resend API returns non-200
- Resend fetch throws
Why 500? Stripe retries failed webhook deliveries for 3 days. Fail-loud means a customer's email path is eventually consistent with system health — the system breaks → operations notice → operations fix → Stripe's next retry succeeds → customer gets email.
The OLD behaviour was silent 200, no retry, no alert, customer silently lost. That's the Michelle case.
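The fail-loud rule reduces to a small decision table. The sketch below encodes the five conditions listed above as a pure function; the input shape and names are hypothetical, and the real handler in webhook.ts makes these checks inline as it goes rather than all at once.

```typescript
// Hypothetical summary of what the handler knows by the end of a delivery.
type FulfilmentInput = {
  email: string | null;
  productType: string | null;
  resendApiKey: string | undefined;
  resendStatus: number | null; // HTTP status from Resend, or null if the fetch threw
};

type Verdict = { status: 200 } | { status: 500; reason: string };

// Every failure path returns 500 so that Stripe keeps retrying the delivery.
function failLoudVerdict(input: FulfilmentInput): Verdict {
  if (!input.email) return { status: 500, reason: "missing email in paid session" };
  if (!input.productType) return { status: 500, reason: "missing productType metadata" };
  if (!input.resendApiKey) return { status: 500, reason: "RESEND_API_KEY not set" };
  if (input.resendStatus === null) return { status: 500, reason: "Resend fetch threw" };
  if (input.resendStatus !== 200) {
    return { status: 500, reason: `Resend returned ${input.resendStatus}` };
  }
  return { status: 200 };
}
```

The only way to reach 200 is for the email to have actually gone out, which is the property the old silent-200 behaviour lacked.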
Resend Idempotency-Key
Resend deduplicates retries with the same key. Even if the post-send resendEmailId KV write fails (network blip), Stripe retry → Resend dedup → no duplicate email to customer.
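A sketch of how such a request could be built. The key format (a fixed prefix plus the Stripe session id) is an assumption for illustration; what matters is that every retry of the same session produces the same key. The function name and the mail fields are hypothetical.

```typescript
type Mail = { from: string; to: string; subject: string; html: string };

// Build the Resend send-email request for one Stripe session.
// The Idempotency-Key depends only on the session id, so a Stripe retry of the
// same event yields a byte-identical key and Resend can drop the duplicate.
function buildResendRequest(apiKey: string, sessionId: string, mail: Mail) {
  return {
    url: "https://api.resend.com/emails",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
        "Idempotency-Key": `guided-fulfilment/${sessionId}`, // assumed key format
      },
      body: JSON.stringify(mail),
    },
  };
}
```

Deriving the key from the session id rather than storing it means the dedup still works when the KV write that would have recorded the send is the thing that failed.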
KV write order: licence first, session lookup is the commit point
1. PUT licence:KEY (record) ← ok if this fails alone — retry will recreate
2. PUT session:SID → KEY (lookup) ← THIS is the "commit" — once written, retry sees alreadyProcessed
3. PUT email:HASH → [keys] ← best-effort backfill on retry path
4. Send email ← fail-loud, 500 → Stripe retries
5. PUT licence:KEY (with resendEmailId stamped) ← observability only, not gating
If we crash anywhere, the next retry recovers cleanly because the licence key is deterministic.
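To make the ordering concrete, here is a sketch that runs the five steps against an in-memory Map standing in for GUIDED_KV. The record shape, the function name, and the rule "send only if no resendEmailId is stamped yet" are assumptions made for illustration; the real handler's retry-path behaviour may differ.

```typescript
type LicenceRecord = { key: string; sessionId: string; resendEmailId?: string };

// sendEmail returns the Resend email id, or throws (which maps to a 500).
function fulfil(
  kv: Map<string, string>,
  sessionId: string,
  licenceKey: string,
  emailHash: string,
  sendEmail: () => string,
): LicenceRecord {
  // 1. Licence record. Safe to rewrite on retry because the key is deterministic.
  const existing = kv.get(`licence:${licenceKey}`);
  const record: LicenceRecord = existing ? JSON.parse(existing) : { key: licenceKey, sessionId };
  kv.set(`licence:${licenceKey}`, JSON.stringify(record));
  // 2. Session lookup: the commit point.
  kv.set(`session:${sessionId}`, licenceKey);
  // 3. Email index, appended only once per key.
  const keys: string[] = JSON.parse(kv.get(`email:${emailHash}`) || "[]");
  if (keys.indexOf(licenceKey) === -1) keys.push(licenceKey);
  kv.set(`email:${emailHash}`, JSON.stringify(keys));
  // 4. Send. A throw here propagates, and Stripe retries the whole delivery.
  if (record.resendEmailId === undefined) {
    record.resendEmailId = sendEmail();
    // 5. Stamp for observability only.
    kv.set(`licence:${licenceKey}`, JSON.stringify(record));
  }
  return record;
}
```

Running it twice, once with a failing sendEmail and once with a working one, leaves exactly one licence, one session lookup, and one entry in the email index.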
📊 Public status page
File: src/pages/status.astro → https://www.aguidetocloud.com/guided/status/
Polls /api/health every 60s client-side. Shows colored dot per component (KV, Resend, Stripe, env vars). No backend dependency beyond the same health endpoint our monitors use.
Customer trust + at-a-glance ops view from any device.
🧪 Test layer: what catches what
| Test | Runs when | Catches |
|---|---|---|
| Pre-push: test-guided-qa.cjs | Manual before pushing PracticeQuiz changes | React hooks violations, option-text rendering, click flow, checkout flow |
| Post-deploy smoke | After every push to main (with 3-min wait for CF deploy) | dataUrl path drift, click-flow regressions, cert-unlock-btn presence |
| Synthetic checkout | Every payment-health cron tick | certCode='guided' slug bug regression |
| Health endpoint self-test | Every cron tick | Env var wipes, Stripe API failure, Resend API failure, KV failure |
| CF Worker watchdog | Every 5 min | Same as health, but with reliable cron |
💾 Recovery: when things go very wrong
Env vars wiped (cron will catch it; here's the manual one-liner)
Run scripts/restore-cf-env.ps1. It reads from ~/.copilot/secrets/, PATCHes CF Pages with all 7 secrets, triggers a redeploy, and runs the SLA smoke. Exit codes: 0 ok, 1 patch failed, 2 deploy failed, 3 SLA smoke failed.
KV corrupted / accidentally deleted
- Download latest artifact from Actions → Daily KV Backup
- Unzip → JSON file with all keys + values + expiration timestamps
- PUT each key back via CF KV API (script TBD when needed — has not happened yet)
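Since the restore script does not exist yet, here is a sketch of its core step: turning a snapshot into request bodies for Cloudflare's KV bulk-write endpoint (PUT .../storage/kv/namespaces/{namespace_id}/bulk). The snapshot entry shape is an assumption, because the backup file format is not documented here. The 10,000-pairs-per-request batch size and the 60-second minimum expiration are Cloudflare limits as I understand them; check the current API docs before relying on either.

```typescript
// Assumed snapshot entry shape: key, string value, optional absolute expiry (epoch seconds).
type SnapshotEntry = { key: string; value: string; expiration?: number };

// Split a snapshot into bulk-write batches, dropping entries that have already
// expired or expire too soon for KV to accept them.
function toBulkBatches(
  entries: SnapshotEntry[],
  nowSec: number,
  batchSize: number = 10000,
): SnapshotEntry[][] {
  const live = entries.filter(
    (e) => e.expiration === undefined || e.expiration > nowSec + 60,
  );
  const batches: SnapshotEntry[][] = [];
  for (let i = 0; i < live.length; i += batchSize) {
    batches.push(live.slice(i, i + batchSize));
  }
  return batches;
}
```

Each returned batch would be JSON-serialised as the body of one bulk PUT. Restoring into a fresh namespace first and spot-checking a few licence:KEY entries before pointing production at it seems the safer order.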
A customer paid but didn't get email (the Michelle scenario)
If the webhook is returning 500 and Stripe is in retry mode, fix the underlying issue (run restore-cf-env.ps1). Stripe will retry within its 3-day window and the email will arrive automatically.
If retries are exhausted (>3 days):
# Find the customer's Stripe session
$stripeKey = (Get-Content "$env:USERPROFILE\.copilot\secrets\stripe-live-secret-key" -Raw).Trim()
curl.exe -s -G "https://api.stripe.com/v1/checkout/sessions?limit=25" -u "${stripeKey}:" | ConvertFrom-Json
# Manually generate licence + write to KV + send email via Resend
# Pattern: see the actions taken in the 03 May 2026 incident response.
CF Worker watchdog stops firing
Check:
1. Worker is deployed: https://guided-watchdog.susanth-ss.workers.dev returns 200
2. Cron is registered: CF Dashboard → Workers → guided-watchdog → Triggers
3. Worker secrets are set: CF Dashboard → Workers → guided-watchdog → Settings → Variables
4. Test manually: curl -X POST https://guided-watchdog.susanth-ss.workers.dev/__trigger
Re-deploy: pwsh C:\ssClawy\guided\scripts\deploy-watchdog.ps1 (idempotent).
📜 Alert flow: what your phone shows
| Trigger | Latency | Phone push title |
|---|---|---|
| CF Worker watchdog detects degraded | ~5 min | 🔴 "Watchdog: Guided degraded — auto-restore dispatching" |
| GH cron detects degraded (backup) | ~10–30 min | 🔴 "Guided: payment health degraded" |
| Auto-restore starts | within 1 min of dispatch | 🔴 "Guided: degraded — auto-restoring now" |
| Auto-restore wins | ~3–4 min after start | 🟢 "Guided: auto-restored successfully" |
| Auto-restore fails | within 1 min of failure | 🚨 "Guided: AUTO-RESTORE FAILED" + GitHub issue opened |
| Post-deploy smoke fails | within 30s of detection | 🚨 "Guided: post-deploy smoke FAILED" + GitHub issue |
| KV backup fails | within 30s of detection | ⚠️ "Guided: KV backup FAILED" (non-urgent) |
| UptimeRobot detects 5xx | 5–10 min | Email to your Gmail |
| Stripe webhook delivery fails (when enabled) | ~30s | Email from Stripe |
ntfy topic: guided-alerts-mKTnVVZhHcGA (saved at ~/.copilot/secrets/guided-ntfy-topic).
🪤 Process traps to avoid (lessons in scar tissue)
Never pipe secrets via PowerShell stdin to gh CLI
# ❌ DO NOT — appends a CRLF, corrupts the stored value
$value | gh secret set NAME --body -
# ✅ DO — pass as -b argument
gh secret set NAME -b $value
Symptom when this trap fires: Authorization: Bearer xxx\r\n headers, downstream APIs return HTTP 400/401 "Authentication failed". Caught us today during the prevention buildout itself.
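A cheap guard against this class of corruption is to validate a secret's shape before it is stored or used in a header. The helper below is a sketch with a hypothetical name; it only detects whitespace damage, not a wrong value.

```typescript
// Return a description of the problem, or null if the value looks clean.
function findSecretCorruption(value: string): string | null {
  if (value.length === 0) return "empty value";
  if (/[\r\n]/.test(value)) return "contains CR or LF";
  if (value !== value.trim()) return "leading or trailing whitespace";
  return null;
}
```

For example, findSecretCorruption("abc\r\n") reports "contains CR or LF", which is exactly the shape that produced the broken Authorization headers described above.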
Auto-restore can amplify damage if GH secrets are corrupted
If the source-of-truth GH secrets are themselves bad, auto-restore PATCHes CF Pages with bad values → production goes from "env vars missing" to "env vars present-but-wrong", which in some cases is harder to detect.
Mitigation in place: auto-restore's diagnostic step (Verify CF API token works from this runner) catches token-shape issues. Post-restore health verify catches broader corruption.
Never use --body - stdin pattern when the value contains literal newlines
(Same root cause as #1 — same fix.)
Always run SLA smoke after any operational change
Even when a script reports success, verify production health responds 200 with status: healthy. Operational state can drift in subtle ways.
🔧 Daily operational reality
On a normal day, none of this fires. The cron runs every 5/10 min, sees healthy, exits silent. KV backup snapshots silently each night. Status page shows green dots.
On a bad day (the 1 May / 3 May class of incident), the chain works automatically:
T+0 Some incident happens (env var wipe, Stripe rotation, Resend outage)
T+0–5 /api/health flips to 'degraded'
T+5 CF Worker watchdog cron tick → ntfy push 🔴 → workflow_dispatch
T+5 GitHub Actions auto-restore.yml starts
T+5–8 PATCH + redeploy + verify
T+8 ntfy push 🟢 — system fully healed
T+10 Customers who paid during the window: their Stripe webhook retry succeeds, email arrives within Stripe's natural retry schedule
T+30 Worst-case customer email arrival (for someone who paid right at T+0)
Without the watchdog (GH Actions cron only):
T+0 Incident happens
T+0–5 degraded
T+5–30 GitHub free-tier cron drifts; eventually fires
T+30+ auto-restore runs
T+33+ system healed
T+60+ worst-case customer email
🔑 Key file paths (single source of truth)
| File | Purpose |
|---|---|
| functions/guided/api/webhook.ts | Stripe webhook fulfilment (fail-loud, deterministic key, idempotent) |
| functions/guided/api/verify.ts | Post-redirect verification + same deterministic key |
| functions/guided/api/health.ts | The health endpoint everyone watches |
| functions/lib/utils.ts | deriveLicenceKey, generateLicenceKey, sha256, type defs |
| worker/guided-watchdog.mjs | CF Worker watchdog (5-min cron) |
| src/pages/cc.astro | Command Centre dashboard (single-password): sales, licences, analytics, search |
| src/pages/admin.astro | Admin login (supports ?return= for post-login redirect) |
| scripts/restore-cf-env.ps1 | One-shot manual restore (when nothing else works) |
| scripts/deploy-watchdog.ps1 | Deploy Worker via CF API |
| .github/workflows/payment-health.yml | GH Actions cron (10 min, backup) |
| .github/workflows/auto-restore.yml | Self-healing workflow |
| .github/workflows/post-deploy-smoke.yml | Post-push validation |
| .github/workflows/kv-backup.yml | Daily KV snapshot |
| src/pages/status.astro | Public status page |
| OPERATIONS-RUNBOOK.md (repo root) | Quick incident response cheat sheet |
🎓 What I learned the hard way (so future-Claude doesn't repeat it)
- Operational maturity is more important than architecture for a paid side-product. All 8 incidents in the lead-up to this buildout were code or process bugs, not architectural ones. The fix was guardrails, not redesign.
- Single source of truth for secrets is a myth in practice. Secrets live in 3 places (laptop, GH, CF). Having a clean reconciliation script (restore-cf-env.ps1) and clear ownership of "who reads from where" is the realistic best.
- Free-tier crons drift. Don't bet customer-facing reliability on GitHub Actions schedule. Use a cron platform that's dedicated infra (CF Workers cron, Cloudflare Triggers).
- Fail-loud > silent skip, always, on a paid product. A 500 that triggers retries is infinitely better than a 200 that loses the customer's email.
- Webhook idempotency at the email side-effect level matters more than at the KV write level. Use Resend's Idempotency-Key. Don't re-send emails because of a downstream observability write failure.
- PowerShell stdin and gh secret set are incompatible. Use the -b $value argument form. Always.
- The diagnostic step is worth its lines. Verify CF API token works from this runner saved 30 min of head-scratching when the IP allowlist mystery hit.
- Customer-facing wait time is the only metric that matters in an incident. Detection-to-healed is internal. Purchase-to-email-arrival is what the customer actually feels. Optimise for the latter.
If this doc gets stale, it's because the architecture changed. Update it. Future-you will thank you.