🛡️ Reliability & Redundancy Architecture

Last updated: 03 May 2026 (post-Michelle incident, full belt-and-braces buildout)

Audience: future-Sush, future-Claude, anyone investigating "what happens when X breaks"

TL;DR: the Guided platform has 9 independent layers of detection, alerting, self-healing, and recovery. A single env-var wipe used to mean a 5-hour outage and a paid customer messaging on LinkedIn. Now: detected within 5 min, auto-healed within 8 min, customer waits at most one Stripe-retry cycle for their email.

🚨 The incident that birthed this architecture

03 May 2026 — Michelle Alexander (wrightalways09@gmail.com) paid $18 for two practice exams (AB-620 + AB-900). She got nothing — no licence keys, no email, no receipt. She found Sush via LinkedIn 5 hours later.

Three simultaneous bugs:

#	Bug	Layer
1	`certCode='guided'` shipped to Stripe metadata	Code (cert-landing slug-extraction)
2	`STRIPE_WEBHOOK_SECRET` wiped from CF Pages env vars	Operational (CF API PATCH side-effect)
3	`RESEND_API_KEY` also wiped	Operational (same wipe class)

Customer impact: $18 paid, zero received, hard reputation hit on a paid product.

Recovery on the day: env vars restored, KV manually populated with correct cert codes + idempotency locks (so Stripe retries see "already processed"), apology email sent with both keys.

Lesson burned in: every single layer of the failure was findable in advance with the right monitoring. None of it was an architectural problem. It was operational maturity debt accumulated faster than it could be paid down. The 9 layers below are the payment.

🏗️ Architecture overview

                                    ┌─────────────────────────┐
                                    │  Sush's phone (ntfy.sh) │
                                    └──────────▲──────────────┘
                                               │ push
   ┌──────────────────┐         ┌──────────────┴──────────────┐
   │  Stripe alerts   │────────▶│   GH Actions cron (10 min)  │──┐
   │  (email, opt)    │         │   /api/health probe         │  │
   └──────────────────┘         └─────────────────────────────┘  │
                                               ▲                  │
   ┌──────────────────┐                        │ workflow_dispatch│
   │  CF Worker       │────────────────────────┤                  │
   │  watchdog (5 min)│                        │                  │
   └──────────────────┘                        │                  │ degraded
                                               │                  │
                              ┌────────────────┴────────────────┐ │
                              │ UptimeRobot (5 min, external)   │ │
                              │ → email + their app push        │ │
                              └─────────────────────────────────┘ │
                                                                  │
                              ┌───────────────────────────────────▼┐
                              │   Auto-restore workflow:           │
                              │   1. Pre-check (skip if recovered) │
                              │   2. Verify CF token from runner   │
                              │   3. PATCH CF Pages env vars       │
                              │   4. Trigger redeploy              │
                              │   5. Wait for deploy (max 6 min)   │
                              │   6. Verify health is now green    │
                              │   7. ntfy: 🟢 success / 🚨 failure │
                              └────────────────────────────────────┘

   ┌──────────────────┐         ┌────────────────────────────┐
   │ Pages Functions  │────────▶│   Sentry (when DSN set)    │
   │ webhook/verify/… │         │   stack traces + replay    │
   └──────────────────┘         └────────────────────────────┘

   ┌─────────────────┐          ┌────────────────────────────┐
   │ GUIDED_KV       │─ daily ─▶│   GH Actions artifact      │
   └─────────────────┘          │   gzipped JSON snapshots   │
                                └────────────────────────────┘

   Public:  https://www.aguidetocloud.com/guided/status/
   Synthetic checkout test: every cron tick (catches slug-bug class)
   Post-deploy smoke test: after every push (catches click-flow regressions)

🔌 The 9 layers

#	Layer	Cadence	Latency	Auto-fixes?	File
1	Auto-restore workflow	On `workflow_run` failure or `workflow_dispatch`	~3–4 min	✅ Full self-heal	`.github/workflows/auto-restore.yml`
2	ntfy.sh phone push	On every alert event	seconds	❌ Alert only	embedded in workflows
3	UptimeRobot	5 min, external	~5–10 min	❌ Email alert only	UptimeRobot dashboard
4	Stripe webhook alerts	On Stripe-side failure	~30s	❌ Email alert only	Stripe Dashboard → Webhooks
5	Synthetic checkout test	Every cron tick	<1s	❌ Detection only	`payment-health.yml`
6	Daily KV backup	03:00 UTC daily	<1 min	❌ Recovery aid	`kv-backup.yml`
7	Sentry error tracking	On every Function exception	seconds	❌ Visibility only	(when DSN configured)
8	Public status page	Polls every 60s client-side	seconds	❌ Customer trust	`src/pages/status.astro`
9	CF Worker watchdog	5 min, on CF infra	~5–8 min	✅ Triggers auto-restore	`worker/guided-watchdog.mjs`

Plus the in-code safety nets (fail-loud webhook, Resend Idempotency-Key, Stripe's 3-day retry window), covered separately below.

🩺 Detection layer — three independent crons

The whole system depends on someone noticing degraded health within 5–10 min. We have three independent watchers, each with different reliability characteristics:

9a. CF Worker watchdog (primary)

File: worker/guided-watchdog.mjs deployed to CF Workers as a separate Worker (NOT a Pages Function — Pages Functions don't support cron triggers).

- Cron: */5 * * * * (every 5 min, on CF's reliable cron infrastructure)
- What it does:
  - fetch /api/health
  - If status !== 'healthy' → ntfy push (priority=5) + GitHub API workflow_dispatch of auto-restore.yml
- Bindings (Worker secrets): NTFY_TOPIC, GH_PAT
- Bindings (plain text): GH_REPO, PROD_HEALTH
- Endpoints:
  - GET / (info)
  - GET /__health (proxies prod health, for debugging)
  - POST /__trigger (runs the full code path manually)
- Manual test URL: https://guided-watchdog.susanth-ss.workers.dev/__trigger
- Deploy: pwsh C:\ssClawy\guided\scripts\deploy-watchdog.ps1

Why this exists separately from the GH Actions cron: GitHub's free-tier scheduled workflows drift by 5–30 min during busy periods. The stress test on 03 May 2026 showed the GH cron sitting idle for 13 min after a forced wipe before it fired. CF Workers cron is reliable to within ~1 min.
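The watchdog's decision step can be sketched as a pure function. This is a minimal illustration, not the deployed Worker code: `planWatchdogTick` and its types are hypothetical names, the `main` ref is an assumption, and treating an unreachable or unparseable health response as degraded is also an assumption. The dispatch URL follows GitHub's REST endpoint for creating a workflow dispatch event.

```typescript
// Hypothetical sketch of one watchdog cron tick; not the deployed Worker code.
interface HealthBody {
  status?: string; // the source only states that 'healthy' is the good value
}

interface WatchdogPlan {
  alert: boolean;              // whether to send the ntfy push
  dispatchUrl: string | null;  // GitHub workflow_dispatch endpoint, null when healthy
  dispatchBody: string | null; // JSON body for the dispatch call
}

// ghRepo is "owner/repo" (the GH_REPO binding). The "main" ref is an assumption.
// A null health body (probe failed or returned non-JSON) is treated as degraded.
function planWatchdogTick(
  health: HealthBody | null,
  ghRepo: string,
  ref: string = "main"
): WatchdogPlan {
  const healthy = health !== null && health.status === "healthy";
  if (healthy) {
    return { alert: false, dispatchUrl: null, dispatchBody: null };
  }
  return {
    alert: true,
    dispatchUrl:
      "https://api.github.com/repos/" + ghRepo +
      "/actions/workflows/auto-restore.yml/dispatches",
    dispatchBody: JSON.stringify({ ref: ref }),
  };
}
```

Keeping the decision separate from the network calls means the same logic can be exercised from the `/__trigger` endpoint and from a unit test without touching ntfy or GitHub.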

9b. GitHub Actions cron (backup)

File: .github/workflows/payment-health.yml

- Cron: */10 * * * * (every 10 min)
- What it does:
  - Health probe: curls /api/health and fails the run if the response lacks "status":"healthy"
  - Stripe webhook delivery health: counts pending webhooks
  - Synthetic checkout: POSTs to /api/checkout with az-900, retrieves the resulting Stripe session, and asserts metadata.certCode === 'az-900' (catches the slug-bug class)
- On failure: ntfy push, opens a GitHub issue, and triggers auto-restore.yml via the workflow_run event
- Drift is acceptable here because the CF Worker watchdog detects first
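The synthetic checkout assertion boils down to one comparison on the session metadata. A minimal sketch, assuming the retrieved Stripe session has already been parsed into an object; `checkCertCode` and `CheckoutSessionLike` are hypothetical names, not the workflow's actual code.

```typescript
// Hypothetical shape: only the field the check needs from a Stripe Checkout Session.
interface CheckoutSessionLike {
  metadata?: { [field: string]: string } | null;
}

// Returns null when the metadata is correct, otherwise a human-readable reason
// that a workflow step could print before failing the run.
function checkCertCode(session: CheckoutSessionLike, expected: string): string | null {
  const meta = session.metadata;
  if (meta === undefined || meta === null || meta["certCode"] === undefined) {
    return "metadata.certCode is missing";
  }
  if (meta["certCode"] !== expected) {
    return "metadata.certCode is '" + meta["certCode"] + "', expected '" + expected + "'";
  }
  return null;
}
```

With this shape, the `certCode='guided'` regression from the incident surfaces as a non-null reason string on the first cron tick after the bad deploy.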

9c. UptimeRobot (independent third party)

External vendor, free tier, 5-min interval, 4 monitors:

Monitor	URL	Type
Health endpoint	`/guided/api/health`	Keyword: `healthy` (alert if missing)
Cert landing	`/guided/az-900/`	HTTP(s) 200
Practice page	`/guided/az-900/practice/`	HTTP(s) 200
Questions JSON	`/guided/data/questions/az-900.json`	HTTP(s) 200

Alert contacts: Email (susanth.ss@gmail.com), optional UptimeRobot mobile app push.

Why three crons? If GH Actions itself is having an outage, the CF Worker still detects. If both are down, UptimeRobot still alerts via email. Three independent failure modes covered.

🩹 Self-healing layer — auto-restore workflow

File: .github/workflows/auto-restore.yml

Triggers:

- workflow_run on 🏥 Payment Health Monitor completion with failure conclusion
- workflow_dispatch (manual or via CF Worker watchdog API call)

Steps:

1. Re-check health, to avoid PATCHing on transient blips
2. Verify CF API token works from this runner: a diagnostic that fails fast on IP allowlist or whitespace-corrupted tokens, and sends an ntfy push with the dashboard URL to fix it
3. PATCH Cloudflare Pages env vars with all 7 secrets from GH repo secrets
4. Trigger redeploy so Functions reload with the new values
5. Wait for deploy (max 6 min)
6. Verify health is now green; fail loud if not
7. Phone push: 🟢 on success, 🚨 + GitHub issue on failure
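The gating between these steps can be sketched as a small orchestrator. This is an illustration under stated assumptions, not the workflow itself: the real steps are YAML jobs making network calls, while here each step is an injected synchronous callback, and `runRestoreSketch`, `RestoreSteps`, and the outcome names are all hypothetical.

```typescript
// Hypothetical outcome labels; the workflow itself reports via ntfy and GitHub issues.
type RestoreOutcome =
  | "skipped-recovered"      // step 1 found health already green
  | "aborted-bad-token"      // step 2 diagnostic failed, nothing was PATCHed
  | "failed-deploy-timeout"  // step 5 exceeded its time limit
  | "failed-still-degraded"  // step 6 found health still not green
  | "restored";

interface RestoreSteps {
  isHealthy: () => boolean;        // steps 1 and 6
  tokenWorks: () => boolean;       // step 2
  patchEnvVars: () => void;        // step 3
  redeployAndWait: () => boolean;  // steps 4 and 5; true if the deploy finished in time
}

function runRestoreSketch(steps: RestoreSteps): RestoreOutcome {
  if (steps.isHealthy()) {
    return "skipped-recovered"; // transient blip: do not PATCH anything
  }
  if (!steps.tokenWorks()) {
    return "aborted-bad-token"; // fail fast before touching production
  }
  steps.patchEnvVars();
  if (!steps.redeployAndWait()) {
    return "failed-deploy-timeout";
  }
  return steps.isHealthy() ? "restored" : "failed-still-degraded";
}
```

The ordering matters: the pre-check and the token diagnostic both run before the PATCH, so a transient blip or a broken token never results in a write to production env vars.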

Mirrored secrets (GH repo secrets → CF env vars):

ADMIN_PASSWORD          ← guided-admin-password
CLOUDFLARE_API_TOKEN    ← cloudflare-api-token  (used to PATCH itself)
RESEND_API_KEY          ← resend-api-key
STRIPE_PUBLISHABLE_KEY  ← stripe-live-publishable-key
STRIPE_SECRET_KEY       ← stripe-live-secret-key
STRIPE_WEBHOOK_SECRET   ← stripe-live-webhook-secret
TOKEN_SECRET            ← guided-token-secret
NTFY_TOPIC              (only used in workflows, not pushed to CF)
PERSONAL_PAT            (legacy, not used after switching to github.token)
STRIPE_LIVE_SECRET_KEY  (alias of STRIPE_SECRET_KEY for the webhook-pending check)

Trust model: GH repo secrets and CF env vars hold the same values. If either account is compromised, the system is exposed. Same trust surface either way; the redundancy is for availability, not security.

📦 Code-level safety nets

These live in functions/guided/api/*.ts and functions/lib/utils.ts. They make the system resilient to env-var loss, race conditions, and transient failures.

Deterministic licence keys

File: functions/lib/utils.ts — deriveLicenceKey(sessionId, secret)

HMAC-SHA256(TOKEN_SECRET, session.id) → 12 chars from "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
                                      → "GD-XXXX-XXXX-XXXX"

Both the webhook and /api/verify derive the same key from the same session.id. No more split-brain (where verify mints a random key A, webhook mints random key B, customer ends up with two records).

Backward-compat: webhook checks session:${id} first; if a pre-deterministic random key already exists for that session, it respects it.
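A minimal sketch of the derivation, to show why both code paths agree. The source gives the HMAC input, the alphabet, and the output format, but not how digest bytes map to characters, so the byte-modulo-32 mapping below is an assumption (it is unbiased because 256 is a multiple of 32). The sketch uses Node's `node:crypto` for brevity, whereas a Pages Function would more likely use Web Crypto; `deriveLicenceKeySketch` is a hypothetical name, not the function in functions/lib/utils.ts.

```typescript
import { createHmac } from "node:crypto";

// 32-character alphabet from the source (no I, O, 0, or 1).
const LICENCE_ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";

// Assumption: each of the first 12 digest bytes selects one character via modulo 32.
function deriveLicenceKeySketch(sessionId: string, secret: string): string {
  const digest = createHmac("sha256", secret).update(sessionId).digest(); // 32 bytes
  let chars = "";
  for (let i = 0; i < 12; i++) {
    chars += LICENCE_ALPHABET.charAt(digest[i] % LICENCE_ALPHABET.length);
  }
  return "GD-" + chars.slice(0, 4) + "-" + chars.slice(4, 8) + "-" + chars.slice(8, 12);
}
```

Because the output depends only on `session.id` and `TOKEN_SECRET`, the webhook and /api/verify can each compute the key independently and still arrive at the same value, with no coordination through KV.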

Fail-loud webhook

File: functions/guided/api/webhook.ts

Returns 500 (not silent 200) on:

- Missing email in paid session
- Missing productType metadata
- RESEND_API_KEY not set
- Resend API returns non-200
- Resend fetch throws

Why 500? Stripe retries failed webhook deliveries for 3 days. Failing loud ties the customer's email to system health: the system breaks → operations notice → operations fix → Stripe's next retry succeeds → customer gets the email.

The OLD behaviour was silent 200, no retry, no alert, customer silently lost. That's the Michelle case.
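The five failure conditions map onto a status decision like the following. This is a sketch of the rule only, assuming the facts about a delivery have already been gathered; `DeliveryFacts`, `WebhookVerdict`, and `verdictFor` are hypothetical names, not the contents of webhook.ts.

```typescript
// Hypothetical summary of what the webhook learned while handling one paid session.
interface DeliveryFacts {
  customerEmail: string | null;
  productType: string | null;
  resendApiKeySet: boolean;
  resendStatus: number | null; // HTTP status from Resend, or null if the fetch threw
}

interface WebhookVerdict {
  httpStatus: number; // 200 or 500
  reason: string;
}

function verdictFor(facts: DeliveryFacts): WebhookVerdict {
  if (!facts.customerEmail) {
    return { httpStatus: 500, reason: "missing email in paid session" };
  }
  if (!facts.productType) {
    return { httpStatus: 500, reason: "missing productType metadata" };
  }
  if (!facts.resendApiKeySet) {
    return { httpStatus: 500, reason: "RESEND_API_KEY not set" };
  }
  if (facts.resendStatus === null) {
    return { httpStatus: 500, reason: "Resend fetch threw" };
  }
  if (facts.resendStatus !== 200) {
    return { httpStatus: 500, reason: "Resend returned " + facts.resendStatus };
  }
  return { httpStatus: 200, reason: "email accepted" };
}
```

Every branch that cannot prove the email went out returns 500, so Stripe keeps the delivery in its retry queue until the underlying problem is fixed.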

Resend Idempotency-Key

'Idempotency-Key': `guided-webhook-${session.id}`

Resend deduplicates retries with the same key. Even if the post-send resendEmailId KV write fails (network blip), Stripe retry → Resend dedup → no duplicate email to customer.
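As a sketch, the header set for the Resend call might be assembled like this. Only the `Idempotency-Key` format comes from the source; the helper name `resendHeadersSketch` and the other two headers are assumptions about a typical bearer-token JSON request.

```typescript
// Hypothetical helper: builds the headers for one email send tied to a Stripe session.
function resendHeadersSketch(apiKey: string, sessionId: string): { [header: string]: string } {
  return {
    "Authorization": "Bearer " + apiKey,
    "Content-Type": "application/json",
    // Same session id on every Stripe retry, so every retry carries the same key.
    "Idempotency-Key": "guided-webhook-" + sessionId,
  };
}
```

The key is derived from `session.id` rather than generated per request, which is what makes a Stripe retry indistinguishable from the original attempt on Resend's side.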

KV write order: licence first, session lookup is the commit point

1. PUT licence:KEY (record)          ← ok if this fails alone — retry will recreate
2. PUT session:SID → KEY (lookup)    ← THIS is the "commit" — once written, retry sees alreadyProcessed
3. PUT email:HASH → [keys]           ← best-effort backfill on retry path
4. Send email                        ← fail-loud, 500 → Stripe retries
5. PUT licence:KEY (with resendEmailId stamped)  ← observability only, not gating

If we crash anywhere, the next retry recovers cleanly because the licence key is deterministic.
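The write order above can be simulated against an in-memory store to see how a retry recovers. This is a sketch under several assumptions: the real KV binding is asynchronous, `fulfilSketch` and its parameter names are hypothetical, and the rule "re-attempt the email whenever the licence record has no resendEmailId" is inferred from the retry behaviour described above rather than taken from webhook.ts.

```typescript
// In-memory stand-in for GUIDED_KV (the real binding is async).
type KvSketch = { [key: string]: string };

// Stand-in for the Resend call: returns an email id, throws on failure.
type SendEmail = (idempotencyKey: string, to: string, licenceKey: string) => string;

function fulfilSketch(
  kv: KvSketch, send: SendEmail,
  sessionId: string, derivedKey: string, emailHash: string, to: string
): { alreadyProcessed: boolean; licenceKey: string } {
  const existing: string | undefined = kv["session:" + sessionId];
  const alreadyProcessed = existing !== undefined;
  // Respect a key that is already committed for this session (backward-compat).
  const licenceKey = existing !== undefined ? existing : derivedKey;

  if (!alreadyProcessed) {
    kv["licence:" + licenceKey] = JSON.stringify({ sessionId: sessionId }); // 1. record
    kv["session:" + sessionId] = licenceKey;                                // 2. commit point
  }

  // 3. Best-effort email index, also backfilled on the retry path.
  const indexKey = "email:" + emailHash;
  const listed: string[] = JSON.parse(kv[indexKey] !== undefined ? kv[indexKey] : "[]");
  if (listed.indexOf(licenceKey) === -1) {
    kv[indexKey] = JSON.stringify(listed.concat([licenceKey]));
  }

  // 4. Send the email unless a previous attempt already stamped it (assumed rule).
  const recordKey = "licence:" + licenceKey;
  const record = JSON.parse(kv[recordKey] !== undefined ? kv[recordKey] : "{}");
  if (record.resendEmailId === undefined) {
    const emailId = send("guided-webhook-" + sessionId, to, licenceKey); // throws → 500
    record.resendEmailId = emailId;
    kv[recordKey] = JSON.stringify(record);                               // 5. observability stamp
  }
  return { alreadyProcessed: alreadyProcessed, licenceKey: licenceKey };
}
```

If `send` throws on the first delivery, steps 1 to 3 are already durable; the retry finds the session lookup, reuses the same key, and only has the email left to do.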

📊 Public status page

File: src/pages/status.astro → https://www.aguidetocloud.com/guided/status/

Polls /api/health every 60s client-side. Shows a coloured dot per component (KV, Resend, Stripe, env vars). No backend dependency beyond the same health endpoint our monitors use.

Customer trust + at-a-glance ops view from any device.

🧪 Test layer — what catches what

Test	Runs when	Catches
Pre-push: `test-guided-qa.cjs`	Manual before pushing PracticeQuiz changes	React hooks violations, option-text rendering, click flow, checkout flow
Post-deploy smoke	After every push to main (with 3-min wait for CF deploy)	dataUrl path drift, click-flow regressions, cert-unlock-btn presence
Synthetic checkout	Every payment-health cron tick	`certCode='guided'` slug bug regression
Health endpoint self-test	Every cron tick	Env var wipes, Stripe API failure, Resend API failure, KV failure
CF Worker watchdog	Every 5 min	Same as health, but with reliable cron

💾 Recovery — when things go very wrong

Env vars wiped (cron will catch it; here's the manual one-liner)

pwsh C:\ssClawy\guided\scripts\restore-cf-env.ps1

Reads from ~/.copilot/secrets/, PATCHes CF Pages with all 7 secrets, triggers redeploy, runs SLA smoke. Exit codes: 0 ok, 1 patch failed, 2 deploy failed, 3 SLA smoke failed.

KV corrupted / accidentally deleted

1. Download the latest artifact from Actions → Daily KV Backup
2. Unzip it to get a JSON file with all keys, values, and expiration timestamps
3. PUT each key back via the CF KV API (script TBD when needed; this has not happened yet)
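Until that script exists, here is a rough sketch of what step 3 could look like. The snapshot shape (`key`, `value`, optional `expiration` in Unix seconds) is an assumption, since the actual JSON layout written by kv-backup.yml is not shown here, and `buildKvRestoreRequests` is a hypothetical name. The URL follows Cloudflare's documented KV "write key-value pair" endpoint; skipping entries whose expiry has already passed is a design choice of this sketch.

```typescript
// Assumed snapshot entry shape; check it against a real backup artifact before use.
interface SnapshotEntry {
  key: string;
  value: string;
  expiration?: number; // Unix seconds (assumed field name)
}

interface KvPutRequest {
  url: string;  // target for an authenticated PUT
  body: string; // raw value to store
}

function buildKvRestoreRequests(
  entries: SnapshotEntry[], accountId: string, namespaceId: string, nowSeconds: number
): KvPutRequest[] {
  const base = "https://api.cloudflare.com/client/v4/accounts/" + accountId +
    "/storage/kv/namespaces/" + namespaceId + "/values/";
  const requests: KvPutRequest[] = [];
  for (const entry of entries) {
    // Skip keys that would have expired by now anyway.
    if (entry.expiration !== undefined && entry.expiration <= nowSeconds) {
      continue;
    }
    const query = entry.expiration !== undefined ? "?expiration=" + entry.expiration : "";
    // Keys such as "licence:GD-..." contain ':' and must be URL-encoded.
    requests.push({ url: base + encodeURIComponent(entry.key) + query, body: entry.value });
  }
  return requests;
}
```

Building the request list first, without sending anything, makes it possible to review exactly which keys would be written before running the restore against production.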

A customer paid but didn't get email (the Michelle scenario)

If the webhook is returning 500 and Stripe is in retry mode, fix the underlying issue (run restore-cf-env.ps1). Stripe will retry within its 3-day window and the email will arrive automatically.

If retries are exhausted (>3 days):

# Find the customer's Stripe session
$stripeKey = (Get-Content "$env:USERPROFILE\.copilot\secrets\stripe-live-secret-key" -Raw).Trim()
curl.exe -s -G "https://api.stripe.com/v1/checkout/sessions?limit=25" -u "${stripeKey}:" | ConvertFrom-Json

# Manually generate licence + write to KV + send email via Resend
# Pattern: see the actions taken in the 03 May 2026 incident response.

CF Worker watchdog stops firing

Check:

1. Worker is deployed: https://guided-watchdog.susanth-ss.workers.dev returns 200
2. Cron is registered: CF Dashboard → Workers → guided-watchdog → Triggers
3. Worker secrets are set: CF Dashboard → Workers → guided-watchdog → Settings → Variables
4. Test manually: curl -X POST https://guided-watchdog.susanth-ss.workers.dev/__trigger

Re-deploy: pwsh C:\ssClawy\guided\scripts\deploy-watchdog.ps1 (idempotent).

📜 Alert flow — what your phone shows

Trigger	Latency	Phone push title
CF Worker watchdog detects degraded	~5 min	🔴 "Watchdog: Guided degraded — auto-restore dispatching"
GH cron detects degraded (backup)	~10–30 min	🔴 "Guided: payment health degraded"
Auto-restore starts	within 1 min of dispatch	🔴 "Guided: degraded — auto-restoring now"
Auto-restore succeeds	~3–4 min after start	🟢 "Guided: auto-restored successfully"
Auto-restore fails	within 1 min of failure	🚨 "Guided: AUTO-RESTORE FAILED" + GitHub issue opened
Post-deploy smoke fails	within 30s of detection	🚨 "Guided: post-deploy smoke FAILED" + GitHub issue
KV backup fails	within 30s of detection	⚠️ "Guided: KV backup FAILED" (non-urgent)
UptimeRobot detects 5xx	5–10 min	Email to your Gmail
Stripe webhook delivery fails (when enabled)	~30s	Email from Stripe

ntfy topic: guided-alerts-mKTnVVZhHcGA (saved at ~/.copilot/secrets/guided-ntfy-topic).

🪤 Process traps to avoid (lessons in scar tissue)

Never pipe secrets via PowerShell stdin to gh CLI

# ❌ DO NOT — appends a CRLF, corrupts the stored value
$value | gh secret set NAME --body -

# ✅ DO — pass as -b argument
gh secret set NAME -b $value

Symptom when this trap fires: Authorization: Bearer xxx\r\n headers, downstream APIs return HTTP 400/401 "Authentication failed". Caught us today during the prevention buildout itself.
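A cheap guard against this class of corruption is to validate a secret's shape before storing or using it. A minimal sketch; `findSecretCorruption` is a hypothetical helper, not part of the existing scripts.

```typescript
// Returns null when the value looks clean, otherwise a short description of the problem.
function findSecretCorruption(value: string): string | null {
  if (value.length === 0) {
    return "empty value";
  }
  if (/[\r\n]/.test(value)) {
    return "contains CR or LF"; // the stdin-pipe symptom described above
  }
  if (value !== value.trim()) {
    return "leading or trailing whitespace";
  }
  return null;
}
```

Run against a stored token, this flags the trailing `\r\n` case before it turns into a 400/401 from a downstream API.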

Auto-restore can amplify damage if GH secrets are corrupted

If the source-of-truth GH secrets are themselves bad, auto-restore PATCHes CF Pages with bad values → production goes from "env vars missing" to "env vars present-but-wrong", which in some cases is harder to detect.

Mitigation in place: auto-restore's diagnostic step (Verify CF API token works from this runner) catches token-shape issues. Post-restore health verify catches broader corruption.

Never use `--body -` stdin pattern when the value contains literal newlines

(Same root cause as the first trap above, piping secrets via stdin; same fix.)

Always run SLA smoke after any operational change

Even when a script reports success, verify production health responds 200 with status: healthy. Operational state can drift in subtle ways.

🔧 Daily operational reality

On a normal day, none of this fires. The crons run every 5 and 10 min, see healthy, and exit silently. The KV backup snapshots each night. The status page shows green dots.

On a bad day (the 1 May / 3 May class of incident), the chain works automatically:

T+0     Some incident happens (env var wipe, Stripe rotation, Resend outage)
T+0–5   /api/health flips to 'degraded'
T+5     CF Worker watchdog cron tick → ntfy push 🔴 → workflow_dispatch
T+5     GitHub Actions auto-restore.yml starts
T+5–8   PATCH + redeploy + verify
T+8     ntfy push 🟢 — system fully healed
T+10    Customers who paid during the window: their Stripe webhook retry succeeds, email arrives within Stripe's natural retry schedule
T+30    Worst-case customer email arrival (for someone who paid right at T+0)

Without the watchdog (GH Actions cron only):

T+0     Incident happens
T+0–5   degraded
T+5–30  GitHub free-tier cron drifts; eventually fires
T+30+   auto-restore runs
T+33+   system healed
T+60+   worst-case customer email

🔑 Key file paths (single source of truth)

File	Purpose
`functions/guided/api/webhook.ts`	Stripe webhook fulfilment (fail-loud, deterministic key, idempotent)
`functions/guided/api/verify.ts`	Post-redirect verification + same deterministic key
`functions/guided/api/health.ts`	The health endpoint everyone watches
`functions/lib/utils.ts`	`deriveLicenceKey`, `generateLicenceKey`, `sha256`, type defs
`worker/guided-watchdog.mjs`	CF Worker watchdog (5-min cron)
`src/pages/cc.astro`	Command Centre dashboard (single-password) — sales, licences, analytics, search
`src/pages/admin.astro`	Admin login (supports `?return=` for post-login redirect)
`scripts/restore-cf-env.ps1`	One-shot manual restore (when nothing else works)
`scripts/deploy-watchdog.ps1`	Deploy Worker via CF API
`.github/workflows/payment-health.yml`	GH Actions cron (10 min, backup)
`.github/workflows/auto-restore.yml`	Self-healing workflow
`.github/workflows/post-deploy-smoke.yml`	Post-push validation
`.github/workflows/kv-backup.yml`	Daily KV snapshot
`src/pages/status.astro`	Public status page
`OPERATIONS-RUNBOOK.md` (repo root)	Quick incident response cheat sheet

🎓 What I learned the hard way (so future-Claude doesn't repeat it)

Operational maturity is more important than architecture for a paid side-product. All 8 incidents in the lead-up to this buildout were code or process bugs, not architectural ones. The fix was guardrails, not redesign.
Single source of truth for secrets is a myth in practice. Secrets live in 3 places (laptop, GH, CF). Having a clean reconciliation script (restore-cf-env.ps1) and clear ownership of "who reads from where" is the realistic best.
Free-tier crons drift. Don't bet customer-facing reliability on the GitHub Actions schedule. Use a cron platform that runs on dedicated infra (for example, CF Workers Cron Triggers).
Fail-loud > silent skip — always — on a paid product. A 500 that triggers retries is infinitely better than a 200 that loses the customer's email.
Webhook idempotency at the email side-effect level matters more than at the KV write level. Use Resend's Idempotency-Key. Don't re-send emails because of a downstream observability write failure.
PowerShell stdin and gh secret set are incompatible. Use the -b $value argument form. Always.
The diagnostic step is worth its lines. The "Verify CF API token works from this runner" step saved 30 min of head-scratching when the IP allowlist mystery hit.
Customer-facing wait time is the only metric that matters in an incident. Detection-to-healed is internal. Purchase-to-email-arrival is what the customer actually feels. Optimise for the latter.

If this doc gets stale, it's because the architecture changed. Update it. Future-you will thank you.