
🛡️ Reliability & Redundancy Architecture

Last updated: 03 May 2026 (post-Michelle incident, full belt-and-braces buildout)

Audience: future-Sush, future-Claude, anyone investigating "what happens when X breaks"

TL;DR: the Guided platform has 9 independent layers of detection, alerting, self-healing, and recovery. A single env-var wipe used to mean a 5-hour outage and a paid customer messaging on LinkedIn. Now: detected within 5 min, auto-healed within 8 min, customer waits at most one Stripe-retry cycle for their email.


🚨 The incident that birthed this architecture

03 May 2026 — Michelle Alexander (wrightalways09@gmail.com) paid $18 for two practice exams (AB-620 + AB-900). She got nothing — no licence keys, no email, no receipt. She found Sush via LinkedIn 5 hours later.

Three simultaneous bugs:

| # | Bug | Layer |
|---|-----|-------|
| 1 | certCode='guided' shipped to Stripe metadata | Code (cert-landing slug-extraction) |
| 2 | STRIPE_WEBHOOK_SECRET wiped from CF Pages env vars | Operational (CF API PATCH side-effect) |
| 3 | RESEND_API_KEY also wiped | Operational (same wipe class) |

Customer impact: $18 paid, zero received, hard reputation hit on a paid product.

Recovery on the day: env vars restored, KV manually populated with correct cert codes + idempotency locks (so Stripe retries see "already processed"), apology email sent with both keys.

Lesson burned in: every single layer of the failure was findable in advance with the right monitoring. None of it was an architectural problem. It was operational maturity debt accumulated faster than it could be paid down. The 9 layers below are the payment.


🏗️ Architecture overview

                                    ┌─────────────────────────┐
                                    │  Sush's phone (ntfy.sh) │
                                    └──────────▲──────────────┘
                                               │ push
   ┌──────────────────┐         ┌──────────────┴──────────────┐
   │  Stripe alerts   │────────▶│   GH Actions cron (10 min)  │──┐
   │  (email, opt)    │         │   /api/health probe         │  │
   └──────────────────┘         └─────────────────────────────┘  │
                                               ▲                  │
   ┌──────────────────┐                        │ workflow_dispatch│
   │  CF Worker       │────────────────────────┤                  │
   │  watchdog (5 min)│                        │                  │
   └──────────────────┘                        │                  │ degraded
                                               │                  │
                              ┌────────────────┴────────────────┐ │
                              │ UptimeRobot (5 min, external)   │ │
                              │ → email + their app push        │ │
                              └─────────────────────────────────┘ │
                              ┌───────────────────────────────────▼┐
                              │   Auto-restore workflow:           │
                              │   1. Pre-check (skip if recovered) │
                              │   2. Verify CF token from runner   │
                              │   3. PATCH CF Pages env vars       │
                              │   4. Trigger redeploy              │
                              │   5. Wait for deploy (max 6 min)   │
                              │   6. Verify health is now green    │
                              │   7. ntfy: 🟢 success / 🚨 failure │
                              └────────────────────────────────────┘

   ┌──────────────────┐         ┌────────────────────────────┐
   │ Pages Functions  │────────▶│   Sentry (when DSN set)    │
   │ webhook/verify/… │         │   stack traces + replay    │
   └──────────────────┘         └────────────────────────────┘

   ┌─────────────────┐          ┌────────────────────────────┐
   │ GUIDED_KV       │─ daily ─▶│   GH Actions artifact      │
   └─────────────────┘          │   gzipped JSON snapshots   │
                                └────────────────────────────┘

   Public:  https://www.aguidetocloud.com/guided/status/
   Synthetic checkout test: every cron tick (catches slug-bug class)
   Post-deploy smoke test: after every push (catches click-flow regressions)

🔌 The 9 layers

| # | Layer | Cadence | Latency | Auto-fixes? | File |
|---|-------|---------|---------|-------------|------|
| 1 | Auto-restore workflow | On workflow_run failure | ~3–4 min | ✅ Full self-heal | .github/workflows/auto-restore.yml |
| 2 | ntfy.sh phone push | On every alert event | seconds | ❌ Alert only | embedded in workflows |
| 3 | UptimeRobot | 5 min, external | ~5–10 min | ❌ Email alert only | UptimeRobot dashboard |
| 4 | Stripe webhook alerts | On Stripe-side failure | ~30s | ❌ Email alert only | Stripe Dashboard → Webhooks |
| 5 | Synthetic checkout test | Every cron tick | <1s | ❌ Detection only | payment-health.yml |
| 6 | Daily KV backup | 03:00 UTC daily | <1 min | ❌ Recovery aid | kv-backup.yml |
| 7 | Sentry error tracking | On every Function exception | seconds | ❌ Visibility only | (when DSN configured) |
| 8 | Public status page | Polls every 60s client-side | seconds | ❌ Customer trust | src/pages/status.astro |
| 9 | CF Worker watchdog | 5 min, on CF infra | ~5–8 min | ✅ Triggers auto-restore | worker/guided-watchdog.mjs |

Plus the in-code safety nets (fail-loud webhook, Resend Idempotency-Key, Stripe 3-day retry window) covered separately below.


🩺 Detection layer — three independent crons

The whole system depends on someone noticing degraded health within 5–10 min. We have three independent watchers, each with different reliability characteristics:

9a. CF Worker watchdog (primary)

File: worker/guided-watchdog.mjs deployed to CF Workers as a separate Worker (NOT a Pages Function — Pages Functions don't support cron triggers).

  • Cron: */5 * * * * (every 5 min, on CF's reliable cron infrastructure)
  • What it does:
      • fetch /api/health
      • If status !== 'healthy' → ntfy push (priority=5) + GitHub API workflow_dispatch of auto-restore.yml
  • Bindings (Worker secrets): NTFY_TOPIC, GH_PAT
  • Bindings (plain text): GH_REPO, PROD_HEALTH
  • Endpoints:
      • GET /: info
      • GET /__health: proxies prod health (debug)
      • POST /__trigger: runs full code path manually
  • Manual test URL: https://guided-watchdog.susanth-ss.workers.dev/__trigger
  • Deploy: pwsh C:\ssClawy\guided\scripts\deploy-watchdog.ps1

Why this exists separately from the GH Actions cron: GitHub's free-tier scheduled workflows have 5–30 min drift during busy periods. Today's stress test (03 May 2026) showed the GH cron sat for 13 min after a forced wipe before firing. CF Workers cron is reliable to within ~1 min.
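The watchdog's check-then-dispatch path can be sketched as two small pure functions. This is an illustration, not the contents of guided-watchdog.mjs: the function names, the `main` ref, and the User-Agent value are assumptions. The endpoint shape is GitHub's REST "create a workflow dispatch event" call, which takes the workflow file name and a `ref` in the body.

```typescript
// Hypothetical helper names; only the GitHub REST endpoint shape is standard.
interface DispatchRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Anything other than an explicit "healthy" counts as degraded,
// including a missing body or an unparseable response (passed as null).
function isDegraded(health: { status?: string } | null): boolean {
  return health?.status !== "healthy";
}

// Builds the workflow_dispatch call for auto-restore.yml.
// `repo` is "owner/name" (the GH_REPO binding), `pat` is the GH_PAT secret.
function buildAutoRestoreDispatch(repo: string, pat: string, ref = "main"): DispatchRequest {
  return {
    url: `https://api.github.com/repos/${repo}/actions/workflows/auto-restore.yml/dispatches`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${pat}`,
      Accept: "application/vnd.github+json",
      "Content-Type": "application/json",
      "User-Agent": "guided-watchdog", // GitHub rejects requests without a User-Agent
    },
    body: JSON.stringify({ ref }),
  };
}
```

In a Worker's scheduled handler, the result would be passed to `fetch(req.url, req)` only when `isDegraded` returns true, so a healthy tick makes no outbound GitHub call.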

9b. GitHub Actions cron (backup)

File: .github/workflows/payment-health.yml

  • Cron: */10 * * * * (every 10 min)
  • What it does:
      • Curls /api/health, fails the run if not "status":"healthy"
      • Stripe webhook delivery health (counts pending webhooks)
      • Synthetic checkout: POST to /api/checkout with az-900, retrieve resulting Stripe session, assert metadata.certCode === 'az-900' (catches the slug-bug class)
  • On failure: ntfy push, opens GitHub issue, also triggers auto-restore.yml via workflow_run event
  • Drift is acceptable here because the CF Worker watchdog is the primary detector and fires first
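The synthetic checkout assertion boils down to one comparison on the session's metadata. A minimal sketch, assuming the retrieved Stripe session has already been parsed into an object with an `id` and optional `metadata`; the `CheckoutSessionLike` type and `checkCertCode` name are hypothetical, not taken from payment-health.yml.

```typescript
// Only the fields this check reads; a real Stripe session has many more.
interface CheckoutSessionLike {
  id: string;
  metadata?: Record<string, string> | null;
}

// Returns null when the session carries the expected cert code,
// otherwise a human-readable failure message for the alert.
function checkCertCode(session: CheckoutSessionLike, expected: string): string | null {
  const actual = session.metadata?.certCode;
  if (actual === expected) return null;
  return `session ${session.id}: expected certCode '${expected}', got '${actual ?? "(missing)"}'`;
}
```

A session whose metadata says `certCode: 'guided'` (the 03 May bug) produces a failure message, as does a session with no metadata at all.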

9c. UptimeRobot (independent third party)

External vendor, free tier, 5-min interval, 4 monitors:

| Monitor | URL | Type |
|---------|-----|------|
| Health endpoint | /guided/api/health | Keyword: healthy (alert if missing) |
| Cert landing | /guided/az-900/ | HTTP(s) 200 |
| Practice page | /guided/az-900/practice/ | HTTP(s) 200 |
| Questions JSON | /guided/data/questions/az-900.json | HTTP(s) 200 |

Alert contacts: Email (susanth.ss@gmail.com), optional UptimeRobot mobile app push.

Why three crons? If GH Actions itself is having an outage, the CF Worker still detects. If both are down, UptimeRobot still alerts via email. Three independent failure modes covered.


🩹 Self-healing layer — auto-restore workflow

File: .github/workflows/auto-restore.yml

Triggers:

  • workflow_run on 🏥 Payment Health Monitor completion with failure conclusion
  • workflow_dispatch (manual or via CF Worker watchdog API call)

Steps:

  1. Re-check health — avoid PATCHing on transient blips
  2. Verify CF API token works from this runner — diagnostic, fails fast on IP allowlist or whitespace-corrupted tokens. Sends ntfy push with the dashboard URL to fix it.
  3. PATCH Cloudflare Pages env vars with all 7 secrets from GH repo secrets
  4. Trigger redeploy so Functions reload with the new values
  5. Wait for deploy (max 6 min)
  6. Verify health is now green — fail-loud if not
  7. Phone push — 🟢 on success, 🚨 + GitHub issue on failure
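The gating order of those steps matters: each check short-circuits the ones after it. The sketch below models only that ordering as a pure function over observations a run would collect. The type and function names are hypothetical, and the assumption that a skipped run sends no push is mine; the push titles are the ones this doc lists for auto-restore success and failure.

```typescript
// What a single auto-restore run observed (hypothetical shape).
interface RestoreObservations {
  preCheckHealthy: boolean;      // step 1: health already green again?
  tokenValid: boolean;           // step 2: CF API token works from this runner?
  deployFinishedInTime: boolean; // step 5: deploy completed within the 6-minute budget?
  postCheckHealthy: boolean;     // step 6: health green after redeploy?
}

type RestoreOutcome =
  | { result: "skipped"; push: null }
  | { result: "restored"; push: string }
  | { result: "failed"; push: string; failedStep: number };

function failedAt(step: number): RestoreOutcome {
  return { result: "failed", push: "🚨 Guided: AUTO-RESTORE FAILED", failedStep: step };
}

// Walks the checks in workflow order; the first failing check decides the outcome.
function summariseRestore(obs: RestoreObservations): RestoreOutcome {
  if (obs.preCheckHealthy) return { result: "skipped", push: null }; // transient blip, do not PATCH
  if (!obs.tokenValid) return failedAt(2);
  if (!obs.deployFinishedInTime) return failedAt(5);
  if (!obs.postCheckHealthy) return failedAt(6);
  return { result: "restored", push: "🟢 Guided: auto-restored successfully" };
}
```

Note that a bad token fails the run before any PATCH is attempted, which is what keeps a corrupted token from being written over working values.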

Mirrored secrets (GH repo secrets → CF env vars):

ADMIN_PASSWORD          ← guided-admin-password
CLOUDFLARE_API_TOKEN    ← cloudflare-api-token  (used to PATCH itself)
RESEND_API_KEY          ← resend-api-key
STRIPE_PUBLISHABLE_KEY  ← stripe-live-publishable-key
STRIPE_SECRET_KEY       ← stripe-live-secret-key
STRIPE_WEBHOOK_SECRET   ← stripe-live-webhook-secret
TOKEN_SECRET            ← guided-token-secret
NTFY_TOPIC              (only used in workflows, not pushed to CF)
PERSONAL_PAT            (legacy, not used after switching to github.token)
STRIPE_LIVE_SECRET_KEY  (alias of STRIPE_SECRET_KEY for the webhook-pending check)

Trust model: GH repo secrets and CF env vars hold the same values. If either account is compromised, the system is exposed. Same trust surface either way; the redundancy is for availability, not security.


📦 Code-level safety nets

These are inside functions/guided/api/*.ts — they make the system resilient to env-var loss, race conditions, and transient failures.

Deterministic licence keys

File: functions/lib/utils.ts → deriveLicenceKey(sessionId, secret)

HMAC-SHA256(TOKEN_SECRET, session.id)  →  12 chars from "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
                                       →  "GD-XXXX-XXXX-XXXX"

Both the webhook and /api/verify derive the same key from the same session.id. No more split-brain (where verify mints a random key A, webhook mints random key B, customer ends up with two records).

Backward-compat: webhook checks session:${id} first; if a pre-deterministic random key already exists for that session, it respects it.
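A minimal sketch of the derivation, for illustration only. It uses Node's `node:crypto` so it runs standalone; the real Pages Function would more likely use Web Crypto (`crypto.subtle`). The byte-to-character mapping (first 12 digest bytes, each taken modulo 32) is an assumption, since the doc specifies only the HMAC inputs, the alphabet, and the output format. The function is named `deriveLicenceKeySketch` to make clear it is not the code in utils.ts.

```typescript
import { createHmac } from "node:crypto";

// 32 unambiguous symbols (no I, O, 0, 1), as listed in this doc.
const LICENCE_ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";

// Same (sessionId, secret) always yields the same key, so the webhook and
// /api/verify can each derive it independently and still agree.
function deriveLicenceKeySketch(sessionId: string, secret: string): string {
  const digest = createHmac("sha256", secret).update(sessionId).digest(); // 32 bytes
  let chars = "";
  for (let i = 0; i < 12; i++) {
    // 256 is a multiple of 32, so the modulo introduces no bias.
    chars += LICENCE_ALPHABET.charAt(digest.readUInt8(i) % LICENCE_ALPHABET.length);
  }
  return `GD-${chars.slice(0, 4)}-${chars.slice(4, 8)}-${chars.slice(8, 12)}`;
}
```

Because the key depends on TOKEN_SECRET, rotating that secret changes every derived key; that is one reason the backward-compat lookup of an existing session record comes first.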

Fail-loud webhook

File: functions/guided/api/webhook.ts

Returns 500 (not silent 200) on:

  • Missing email in paid session
  • Missing productType metadata
  • RESEND_API_KEY not set
  • Resend API returns non-200
  • Resend fetch throws

Why 500? Stripe retries failed webhook deliveries for 3 days. Fail-loud means a customer's email path is eventually consistent with system health — the system breaks → operations notice → operations fix → Stripe's next retry succeeds → customer gets email.

The OLD behaviour was silent 200, no retry, no alert, customer silently lost. That's the Michelle case.
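The fail-loud rule can be expressed as a single decision function: every condition on the list maps to a 500, and only a confirmed send maps to 200. This is a sketch with hypothetical names (`FulfilmentInputs`, `webhookStatus`), not the code in webhook.ts.

```typescript
// What the handler knows by the time it must answer Stripe (hypothetical shape).
interface FulfilmentInputs {
  email?: string | null;
  productType?: string | null;
  resendApiKey?: string | null;
  // HTTP status returned by Resend, or "threw" if the fetch itself rejected.
  resendOutcome: number | "threw";
}

function webhookStatus(input: FulfilmentInputs): { status: 200 | 500; reason: string } {
  if (!input.email) return { status: 500, reason: "missing email in paid session" };
  if (!input.productType) return { status: 500, reason: "missing productType metadata" };
  if (!input.resendApiKey) return { status: 500, reason: "RESEND_API_KEY not set" };
  if (input.resendOutcome === "threw") return { status: 500, reason: "Resend fetch threw" };
  if (input.resendOutcome !== 200) {
    return { status: 500, reason: `Resend returned ${input.resendOutcome}` };
  }
  return { status: 200, reason: "email sent" };
}
```

The useful property is that there is no path to 200 that skips the email: a wiped RESEND_API_KEY now surfaces as a failed delivery in Stripe instead of a silent success.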

Resend Idempotency-Key

'Idempotency-Key': `guided-webhook-${session.id}`

Resend deduplicates retries with the same key. Even if the post-send resendEmailId KV write fails (network blip), Stripe retry → Resend dedup → no duplicate email to customer.
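As a sketch of how that header might be attached, the helper below builds the headers for the Resend send call. The helper name is hypothetical; the key format is the one shown above. Because the key is derived from session.id alone, every Stripe retry for the same session produces the same key.

```typescript
// Builds request headers for a Resend email send (hypothetical helper).
function buildResendHeaders(apiKey: string, sessionId: string): Record<string, string> {
  return {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
    // Stable per checkout session, so retried webhook deliveries reuse it.
    "Idempotency-Key": `guided-webhook-${sessionId}`,
  };
}
```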

KV write order: licence first, session lookup is the commit point

1. PUT licence:KEY (record)          ← ok if this fails alone — retry will recreate
2. PUT session:SID → KEY (lookup)    ← THIS is the "commit" — once written, retry sees alreadyProcessed
3. PUT email:HASH → [keys]           ← best-effort backfill on retry path
4. Send email                        ← fail-loud, 500 → Stripe retries
5. PUT licence:KEY (with resendEmailId stamped)  ← observability only, not gating

If we crash anywhere, the next retry recovers cleanly because the licence key is deterministic.
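The write order can be exercised with an in-memory stand-in. In this sketch a `Map` plays the role of GUIDED_KV and the `Mailer` interface stands in for Resend; both, along with the `fulfil` name and the record shapes, are assumptions for illustration. It also assumes the retry path simply re-runs every step idempotently, which may differ from how the real handler treats alreadyProcessed.

```typescript
type KvStore = Map<string, string>;

// Stand-in for the email provider: returns an email id, may throw.
interface Mailer {
  send(idempotencyKey: string, to: string): string;
}

function fulfil(
  kv: KvStore, mailer: Mailer,
  sessionId: string, licenceKey: string, emailHash: string, to: string,
): 200 | 500 {
  // 1. Licence record (kept if a previous attempt already wrote it).
  if (!kv.has(`licence:${licenceKey}`)) {
    kv.set(`licence:${licenceKey}`, JSON.stringify({ sessionId }));
  }
  // 2. Commit point: the session lookup now resolves to the key.
  kv.set(`session:${sessionId}`, licenceKey);
  // 3. Best-effort email index, deduplicated so retries do not grow it.
  const keys: string[] = JSON.parse(kv.get(`email:${emailHash}`) ?? "[]");
  if (!keys.includes(licenceKey)) {
    kv.set(`email:${emailHash}`, JSON.stringify([...keys, licenceKey]));
  }
  // 4. Send email; any failure becomes a 500 so Stripe retries.
  let emailId: string;
  try {
    emailId = mailer.send(`guided-webhook-${sessionId}`, to);
  } catch {
    return 500;
  }
  // 5. Observability stamp only; nothing gates on it.
  kv.set(`licence:${licenceKey}`, JSON.stringify({ sessionId, resendEmailId: emailId }));
  return 200;
}
```

Running `fulfil` twice, with the mailer failing the first time, leaves one licence record, one session lookup, and one index entry. That is the property the deterministic key buys.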


📊 Public status page

File: src/pages/status.astro → https://www.aguidetocloud.com/guided/status/

Polls /api/health every 60s client-side. Shows colored dot per component (KV, Resend, Stripe, env vars). No backend dependency beyond the same health endpoint our monitors use.

Customer trust + at-a-glance ops view from any device.


🧪 Test layer — what catches what

| Test | Runs when | Catches |
|------|-----------|---------|
| Pre-push: test-guided-qa.cjs | Manual before pushing PracticeQuiz changes | React hooks violations, option-text rendering, click flow, checkout flow |
| Post-deploy smoke | After every push to main (with 3-min wait for CF deploy) | dataUrl path drift, click-flow regressions, cert-unlock-btn presence |
| Synthetic checkout | Every payment-health cron tick | certCode='guided' slug bug regression |
| Health endpoint self-test | Every cron tick | Env var wipes, Stripe API failure, Resend API failure, KV failure |
| CF Worker watchdog | Every 5 min | Same as health, but with reliable cron |

💾 Recovery — when things go very wrong

Env vars wiped (cron will catch it; here's the manual one-liner)

pwsh C:\ssClawy\guided\scripts\restore-cf-env.ps1

Reads from ~/.copilot/secrets/, PATCHes CF Pages with all 7 secrets, triggers redeploy, runs SLA smoke. Exit codes: 0 ok, 1 patch failed, 2 deploy failed, 3 SLA smoke failed.

KV corrupted / accidentally deleted

  1. Download latest artifact from Actions → Daily KV Backup
  2. Unzip → JSON file with all keys + values + expiration timestamps
  3. PUT each key back via CF KV API (script TBD when needed — has not happened yet)
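Since the restore script does not exist yet, the following is only a starting sketch of its data-shaping half. It assumes each snapshot entry has `key`, `value`, and an optional `expiration` in Unix seconds (the real snapshot layout may differ), and it targets Cloudflare's KV bulk-write endpoint, which accepts an array of such items per request. The 10,000 default batch size and the 60-second minimum expiration reflect Cloudflare's documented limits as I understand them and should be checked against current docs before use.

```typescript
// Assumed snapshot entry shape; adjust to the real backup format.
interface SnapshotEntry {
  key: string;
  value: string;
  expiration?: number; // absolute Unix time in seconds
}

interface BulkItem {
  key: string;
  value: string;
  expiration?: number;
}

// Drops entries that have expired (or are about to) and splits the rest
// into batches sized for one bulk-write request each.
function toBulkBatches(entries: SnapshotEntry[], nowSeconds: number, batchSize = 10000): BulkItem[][] {
  const restorable = entries.filter(
    (e) => e.expiration === undefined || e.expiration >= nowSeconds + 60,
  );
  const batches: BulkItem[][] = [];
  for (let i = 0; i < restorable.length; i += batchSize) {
    batches.push(
      restorable.slice(i, i + batchSize).map((e) =>
        e.expiration === undefined
          ? { key: e.key, value: e.value }
          : { key: e.key, value: e.value, expiration: e.expiration },
      ),
    );
  }
  return batches;
}
```

Each batch would then be sent as the JSON body of one bulk PUT. Keeping the absolute `expiration` rather than a TTL means restored keys expire when the originals would have.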

A customer paid but didn't get email (the Michelle scenario)

If the webhook is returning 500 and Stripe is in retry mode, fix the underlying issue (run restore-cf-env.ps1). Stripe will retry within its 3-day window and the email will arrive automatically.

If retries are exhausted (>3 days):

# Find the customer's Stripe session
$stripeKey = (Get-Content "$env:USERPROFILE\.copilot\secrets\stripe-live-secret-key" -Raw).Trim()
curl.exe -s -G "https://api.stripe.com/v1/checkout/sessions?limit=25" -u "${stripeKey}:" | ConvertFrom-Json

# Manually generate licence + write to KV + send email via Resend
# Pattern: see the actions taken in the 03 May 2026 incident response.

CF Worker watchdog stops firing

Check:

  1. Worker is deployed: https://guided-watchdog.susanth-ss.workers.dev returns 200
  2. Cron is registered: CF Dashboard → Workers → guided-watchdog → Triggers
  3. Worker secrets are set: CF Dashboard → Workers → guided-watchdog → Settings → Variables
  4. Test manually: curl -X POST https://guided-watchdog.susanth-ss.workers.dev/__trigger

Re-deploy: pwsh C:\ssClawy\guided\scripts\deploy-watchdog.ps1 (idempotent).


📜 Alert flow — what your phone shows

| Trigger | Latency | Phone push title |
|---------|---------|------------------|
| CF Worker watchdog detects degraded | ~5 min | 🔴 "Watchdog: Guided degraded — auto-restore dispatching" |
| GH cron detects degraded (backup) | ~10–30 min | 🔴 "Guided: payment health degraded" |
| Auto-restore starts | within 1 min of dispatch | 🔴 "Guided: degraded — auto-restoring now" |
| Auto-restore wins | ~3–4 min after start | 🟢 "Guided: auto-restored successfully" |
| Auto-restore fails | within 1 min of failure | 🚨 "Guided: AUTO-RESTORE FAILED" + GitHub issue opened |
| Post-deploy smoke fails | within 30s of detection | 🚨 "Guided: post-deploy smoke FAILED" + GitHub issue |
| KV backup fails | within 30s of detection | ⚠️ "Guided: KV backup FAILED" (non-urgent) |
| UptimeRobot detects 5xx | 5–10 min | Email to your Gmail |
| Stripe webhook delivery fails (when enabled) | ~30s | Email from Stripe |

ntfy topic: guided-alerts-mKTnVVZhHcGA (saved at ~/.copilot/secrets/guided-ntfy-topic).


🪤 Process traps to avoid (lessons in scar tissue)

Never pipe secrets via PowerShell stdin to gh CLI

# ❌ DO NOT — appends a CRLF, corrupts the stored value
$value | gh secret set NAME --body -

# ✅ DO — pass as -b argument
gh secret set NAME -b $value

Symptom when this trap fires: Authorization: Bearer xxx\r\n headers, downstream APIs return HTTP 400/401 "Authentication failed". Caught us today during the prevention buildout itself.
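A cheap guard against this trap is to check a secret's shape before storing or using it. The sketch below is a hypothetical helper, not something the repo is known to contain: it flags the exact corruption described above (a stray CR or LF) as well as plain leading or trailing whitespace.

```typescript
// Returns null for a clean value, otherwise a short description of the problem.
function describeSecretCorruption(value: string): string | null {
  if (value.length === 0) return "empty value";
  if (/[\r\n]/.test(value)) return "contains CR or LF";
  if (value !== value.trim()) return "leading or trailing whitespace";
  return null;
}
```

A value that fails this check would end up in an Authorization header exactly as stored, which is how the `Bearer xxx\r\n` symptom arises.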

Auto-restore can amplify damage if GH secrets are corrupted

If the source-of-truth GH secrets are themselves bad, auto-restore PATCHes CF Pages with bad values → production goes from "env vars missing" to "env vars present-but-wrong", which in some cases is harder to detect.

Mitigation in place: auto-restore's diagnostic step (Verify CF API token works from this runner) catches token-shape issues. Post-restore health verify catches broader corruption.

Never use --body - stdin pattern when the value contains literal newlines

(Same root cause as the first trap above, piping secrets via stdin, and the same fix: pass the value with -b.)

Always run SLA smoke after any operational change

Even when a script reports success, verify production health responds 200 with status: healthy. Operational state can drift in subtle ways.


🔧 Daily operational reality

On a normal day, none of this fires. The crons run every 5 and 10 min, see healthy, and exit silently. The KV backup takes its snapshot each night. The status page shows green dots.

On a bad day (the 1 May / 3 May class of incident), the chain works automatically:

T+0     Some incident happens (env var wipe, Stripe rotation, Resend outage)
T+0–5   /api/health flips to 'degraded'
T+5     CF Worker watchdog cron tick → ntfy push 🔴 → workflow_dispatch
T+5     GitHub Actions auto-restore.yml starts
T+5–8   PATCH + redeploy + verify
T+8     ntfy push 🟢 — system fully healed
T+10    Customers who paid during the window: their Stripe webhook retry succeeds, email arrives within Stripe's natural retry schedule
T+30    Worst-case customer email arrival (for someone who paid right at T+0)

Without the watchdog (GH Actions cron only):

T+0     Incident happens
T+0–5   degraded
T+5–30  GitHub free-tier cron drifts; eventually fires
T+30+   auto-restore runs
T+33+   system healed
T+60+   worst-case customer email

🔑 Key file paths (single source of truth)

| File | Purpose |
|------|---------|
| functions/guided/api/webhook.ts | Stripe webhook fulfilment (fail-loud, deterministic key, idempotent) |
| functions/guided/api/verify.ts | Post-redirect verification + same deterministic key |
| functions/guided/api/health.ts | The health endpoint everyone watches |
| functions/lib/utils.ts | deriveLicenceKey, generateLicenceKey, sha256, type defs |
| worker/guided-watchdog.mjs | CF Worker watchdog (5-min cron) |
| src/pages/cc.astro | Command Centre dashboard (single-password): sales, licences, analytics, search |
| src/pages/admin.astro | Admin login (supports ?return= for post-login redirect) |
| scripts/restore-cf-env.ps1 | One-shot manual restore (when nothing else works) |
| scripts/deploy-watchdog.ps1 | Deploy Worker via CF API |
| .github/workflows/payment-health.yml | GH Actions cron (10 min, backup) |
| .github/workflows/auto-restore.yml | Self-healing workflow |
| .github/workflows/post-deploy-smoke.yml | Post-push validation |
| .github/workflows/kv-backup.yml | Daily KV snapshot |
| src/pages/status.astro | Public status page |
| OPERATIONS-RUNBOOK.md (repo root) | Quick incident response cheat sheet |

🎓 What I learned the hard way (so future-Claude doesn't repeat it)

  1. Operational maturity is more important than architecture for a paid side-product. All 8 incidents in the lead-up to this buildout were code or process bugs, not architectural ones. The fix was guardrails, not redesign.

  2. Single source of truth for secrets is a myth in practice. Secrets live in 3 places (laptop, GH, CF). Having a clean reconciliation script (restore-cf-env.ps1) and clear ownership of "who reads from where" is the realistic best.

  3. Free-tier crons drift. Don't bet customer-facing reliability on the GitHub Actions schedule. Use a cron platform that runs on dedicated infrastructure (for example CF Workers Cron Triggers).

  4. Fail-loud > silent skip — always — on a paid product. A 500 that triggers retries is infinitely better than a 200 that loses the customer's email.

  5. Webhook idempotency at the email side-effect level matters more than at the KV write level. Use Resend's Idempotency-Key. Don't re-send emails because of a downstream observability write failure.

  6. PowerShell stdin and gh secret set are incompatible. Use -b $value argument form. Always.

  7. The diagnostic step is worth its lines. Verify CF API token works from this runner saved 30 min of head-scratching when the IP allowlist mystery hit.

  8. Customer-facing wait time is the only metric that matters in an incident. Detection-to-healed is internal. Purchase-to-email-arrival is what the customer actually feels. Optimise for the latter.


If this doc gets stale, it's because the architecture changed. Update it. Future-you will thank you.