labs | 13 | ship + prove
lab 13 | ~10 min | masterclass | the finish line

A green build is not a correct deploy. Prove it live.

Everything in the earlier labs was so an agent could act without you watching. That only works if the agent can prove its work actually happened and cannot accidentally do it twice. This is not polish. We hide a unique marker in the build (a sentinel) and keep checking the live URL until that exact marker shows up. We run the real user path and fail loudly if the output is wrong (a smoke test). And we give every important write a key that makes running it twice safe, so a retry never fires the action again (this is idempotency). Proving it works before calling it ready, and making writes safe to repeat, are what earn an agent the right to act on its own.

step 1

Deploy, then verify with a sentinel.

The trap: a deploy command reports success while the CDN (the network of edge servers that caches and serves your site) is still handing out the old build. "Success" means the platform accepted your files, not that visitors see them yet. So you put a unique BUILD_ID inside the build itself, then keep checking the live URL until that exact id shows up. Watch the ETag change too: it is a version tag the server sends so you can tell when the content has actually changed. Vercel is the example here, but the idea works anywhere: any URL, any marker. Stamp the id in at build time.

build step | stamp a sentinel into the output
# Generate a unique id per build and write it where the live page will serve it.
# (Here: a meta tag in index.html and a plain /build-id.txt next to it.)
BUILD_ID="build-$(date +%s)-$(git rev-parse --short HEAD 2>/dev/null || echo nogit)"
echo "stamping $BUILD_ID"

# index.html gets a machine-readable marker the poller can grep for.
printf '<meta name="x-build-id" content="%s">\n' "$BUILD_ID" >> dist/index.html
printf '%s\n' "$BUILD_ID" > dist/build-id.txt

# ... then ship it (Vercel here; swap for your deploy target) ...
# vercel deploy --prod --yes
echo "$BUILD_ID"   # hand this exact value to the poller below

Now check in a loop. The checker knows nothing about the platform. It just asks the public URL for the marker, waits, and tries again until the marker appears or it hits a firm limit and gives up. The scripts here wait a fixed 6 seconds between tries, 20 tries at most; lengthening the wait on each try also works. There is a bash version for macOS and Linux and a PowerShell version for Windows, so the same check runs on either machine.

poll.sh | mac / linux
#!/usr/bin/env bash
# Usage: ./poll.sh https://your-app.vercel.app build-1717...-a1b2c3
set -euo pipefail
URL="$1"; WANT="$2"; MAX=20; n=0
while [ "$n" -lt "$MAX" ]; do
  n=$((n+1))
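  # Body first; a failed request counts as "not live yet", not a crash.
  body="$(curl -fsSL "$URL" || true)"
  # Separate HEAD request for the ETag so we can see the edge object change.
  # "|| true" keeps set -e / pipefail from killing the poller on one bad request.
  etag="$(curl -fsSI "$URL" | tr -d '\r' \
          | awk -F': ' 'tolower($1)=="etag"{print $2}' || true)"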
  if printf '%s' "$body" | grep -qF "$WANT"; then
    echo "live: $WANT after $n polls (etag $etag)"
    exit 0
  fi
  echo "poll $n/$MAX: sentinel not live yet (etag $etag), sleeping"
  sleep 6
done
echo "FAIL: $WANT never went live after $MAX polls" >&2
exit 1

poll.ps1 | windows
# Usage: .\poll.ps1 https://your-app.vercel.app build-1717...-a1b2c3
param([string]$Url, [string]$Want, [int]$Max = 20)
$ErrorActionPreference = "Stop"
for ($n = 1; $n -le $Max; $n++) {
  try {
    $r = Invoke-WebRequest -Uri $Url -UseBasicParsing
    $etag = $r.Headers["ETag"]
    if ($r.Content -like "*$Want*") {
      Write-Host "live: $Want after $n polls (etag $etag)"
      exit 0
    }
    Write-Host "poll $n/$Max`: sentinel not live yet (etag $etag), sleeping"
  } catch {
    Write-Host "poll $n/$Max`: request failed, retrying"
  }
  Start-Sleep -Seconds 6
}
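# Write-Error would throw under ErrorActionPreference = "Stop" and skip the exit below,
# so print to stderr directly and set the exit code ourselves.
[Console]::Error.WriteLine("FAIL: $Want never went live after $Max polls")
exit 1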

what a passing run prints
poll 1/20: sentinel not live yet (etag W/"a1b2-old"), sleeping
poll 2/20: sentinel not live yet (etag W/"a1b2-old"), sleeping
poll 3/20: sentinel not live yet (etag W/"7f3c-new"), sleeping
live: build-1717-abc123 after 4 polls (etag W/"7f3c-new")
exit 0

The exit code is the promise. (An exit code is the pass/fail number a script leaves behind when it ends; 0 means success.) Here 0 means the thing you built is the thing the internet now serves. Anything else means do not announce yet. An agent waits on this, not on the deploy command's "success."
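
The pollers above sleep a fixed interval between tries. A variant that lengthens the wait on each try might look like the sketch below. It is a minimal illustration, not part of the lab's scripts: the function names, the doubling schedule, and the 30-second cap are all assumptions, and the `get` and `sleep` parameters exist only so the loop can be exercised without a network.

```python
import time
import urllib.request
from typing import Optional


def http_get(url: str) -> str:
    """Fetch the body of `url`; any failure is treated as 'not live yet'."""
    try:
        with urllib.request.urlopen(url, timeout=15) as r:
            return r.read().decode("utf-8", "replace")
    except OSError:  # URLError, HTTPError and timeouts are all OSError subclasses
        return ""


def backoff_delays(base_s: float, cap_s: float, tries: int) -> list:
    """Delays that double on each try and never exceed cap_s."""
    return [min(cap_s, base_s * (2 ** i)) for i in range(tries)]


def poll_for_sentinel(url: str, want: str, max_tries: int = 8,
                      base_s: float = 2.0, cap_s: float = 30.0,
                      get=http_get, sleep=time.sleep) -> Optional[int]:
    """Return the 1-based attempt on which `want` appeared, or None on give-up."""
    delays = backoff_delays(base_s, cap_s, max_tries)
    for attempt, delay in enumerate(delays, start=1):
        if want in get(url):
            return attempt
        if attempt < max_tries:   # no point sleeping after the last try
            sleep(delay)
    return None
```

A caller would map the result onto the same promise as the scripts: exit 0 when an attempt number comes back, exit 1 when the result is `None`.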

step 2

Smoke the real flow.

A response of 200 (the HTTP code for "OK") does not mean correct. A page can return 200 and still show an error. An MCP server (a small server that exposes tools an AI agent can call) can return 200 and be missing half its tools. So you walk the actual user path and check for the output you expect, then exit with a failure code on any mismatch so an agent (or an automated build) can stop on it. For a page, check that a known string is there. For an MCP server, call tools/list (the request that asks a server which tools it offers) and check that every tool name you expect is present.

smoke.py | assert the real output, exit non-zero on miss
#!/usr/bin/env python3
"""Smoke the live deploy. Exits 0 only if the real flow produces the
real output. Generic: a page-string check and an MCP tools/list check."""
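import sys, json, urllib.request, urllib.error

def fetch(url, data=None, headers=None):
    """Return (status, body). Never raises, so every check can print FAIL
    cleanly instead of dying with a traceback."""
    req = urllib.request.Request(
        url, data=data, headers=headers or {}, method="POST" if data else "GET")
    try:
        with urllib.request.urlopen(req, timeout=15) as r:
            return r.status, r.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as e:   # 4xx/5xx: keep the real status
        return e.code, e.read().decode("utf-8", "replace")
    except OSError:                       # DNS failure, refused, timeout
        return 0, ""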

def check_page(url, must_contain):
    status, body = fetch(url)
    ok = status == 200 and must_contain in body
    print(("PASS" if ok else "FAIL"), "page", url,
          "->", status, repr(must_contain), "present" if ok else "MISSING")
    return ok

def check_mcp_tools(mcp_url, expected):
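    # One JSON-RPC call. tools/list must return every name we expect.
    # Assumes the server answers with a plain JSON body and needs no prior
    # initialize handshake; a session-based server needs that step first.
    payload = json.dumps({
        "jsonrpc": "2.0", "id": 1, "method": "tools/list"}).encode()
    status, body = fetch(mcp_url, data=payload,
                         headers={"Content-Type": "application/json",
                                  "Accept": "application/json, text/event-stream"})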
    try:
        names = {t["name"] for t in json.loads(body)["result"]["tools"]}
    except Exception:
        names = set()
    missing = set(expected) - names
    ok = status == 200 and not missing
    print(("PASS" if ok else "FAIL"), "mcp", mcp_url,
          "-> got", sorted(names), "missing", sorted(missing))
    return ok

if __name__ == "__main__":
    base = sys.argv[1] if len(sys.argv) > 1 else "https://your-app.vercel.app"
    results = [
        check_page(base, "ship + prove"),
        # check_mcp_tools(base + "/mcp", ["get_notes", "search_notes", "add_note"]),
    ]
    sys.exit(0 if all(results) else 1)   # non-zero -> the agent holds

CI / agent gates on the exit code
$ python smoke.py https://your-app.vercel.app
PASS page https://your-app.vercel.app -> 200 'ship + prove' present
$ echo $?
0
# a missing tool or a changed string flips this to FAIL and exit 1.

Run them in order: check for the marker, then run the smoke test, and only then does the agent say the word "deployed." A 200 and a hope is how a broken build gets announced to a room.
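
The ordering above can be sketched as a small gate that runs each check as a subprocess and stops at the first non-zero exit code. This is an illustrative sketch only: `run_gate` and `main` are hypothetical names, and the command lines in `main` assume `poll.sh` and `smoke.py` sit in the current directory.

```python
import subprocess
import sys


def run_gate(steps):
    """Run (name, argv) steps in order; stop at the first non-zero exit.

    Returns (ok, failed_step_name); failed_step_name is None when all pass.
    """
    for name, argv in steps:
        code = subprocess.run(argv).returncode
        print(f"{name}: exit {code}")
        if code != 0:
            return False, name
    return True, None


def main(argv):
    """Usage sketch: main(["gate.py", URL, BUILD_ID]) returns the exit code."""
    url, build_id = argv[1], argv[2]
    ok, failed = run_gate([
        ("sentinel", ["./poll.sh", url, build_id]),    # marker first
        ("smoke", [sys.executable, "smoke.py", url]),  # then the real flow
    ])
    # The word "deployed" is only printed after both checks exit 0.
    print("deployed" if ok else f"holding: {failed} check failed")
    return 0 if ok else 1
```

Because the smoke step never runs when the sentinel step fails, a stale edge cache cannot produce a passing smoke test against the old build.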

step 3

Idempotency so retries cannot double-fire.

Once an agent runs on its own, crashes, restarts, and accidental double-sends are normal, not rare. So before any write that has real consequences (send an email, book a venue, post a tweet) you build an idempotency key from the action itself (who it is for, what it is, and the day), reserve that key in a store, and if it was already done you simply do nothing. Pair the reservation with a log you only ever add to (an append-only ledger) that carries a "do not repeat" window: 30 days for invites, 60 seconds for one-time login links. A retry turns into a skip, not a second email to a sponsor.

idempotent_send.py | claim, then record; second run no-ops
#!/usr/bin/env python3
"""Idempotent consequential write over a tiny store + append-only ledger.
The store backs idem_claim/idem_record; the ledger gives a human-auditable
trail with a dedup window. Swap the dict for Redis/KV; the contract holds."""
import json, time, hashlib, os, threading

_LOCK = threading.Lock()
_STORE = {}                      # key -> claimed_at epoch  (swap for KV)
LEDGER = "sent_ledger.jsonl"     # append-only audit trail

def idem_key(recipient: str, action: str, day: str) -> str:
    raw = f"{action}|{recipient}|{day}".encode()
    return "idem:" + hashlib.sha256(raw).hexdigest()[:16]

def _within_window(key: str, window_s: int) -> bool:
    """True if this key was recorded in the ledger inside the window."""
    if not os.path.exists(LEDGER):
        return False
    cutoff = time.time() - window_s
    with open(LEDGER, encoding="utf-8") as f:
        for line in f:
            try:
                row = json.loads(line)
            except ValueError:
                continue
            if row.get("key") == key and row.get("ts", 0) >= cutoff:
                return True
    return False

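def idem_claim(key: str, window_s: int) -> bool:
    """Reserve the key. Returns False if already claimed/recorded in-window."""
    with _LOCK:
        claimed_at = _STORE.get(key)
        # An in-memory claim only blocks while it is inside the window,
        # the same rule the ledger check applies.
        if claimed_at is not None and time.time() - claimed_at < window_s:
            return False
        if _within_window(key, window_s):
            return False
        _STORE[key] = time.time()
        return True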

def idem_record(key: str, meta: dict) -> None:
    """Commit to the append-only ledger after the side effect succeeds."""
    with _LOCK, open(LEDGER, "a", encoding="utf-8") as f:
        f.write(json.dumps({"key": key, "ts": time.time(), **meta}) + "\n")

def send_invite(email: str, day: str, window_s: int = 30 * 86400) -> str:
    key = idem_key(email, "invite", day)
    if not idem_claim(key, window_s):
        return f"already done, skipping ({key})"
    # --- real side effect goes here (SMTP, API call, booking) ---
    # send_email(email, ...)   # only runs once per key per window
    idem_record(key, {"email": email, "action": "invite", "day": day})
    return f"sent invite to {email} ({key})"

if __name__ == "__main__":
    day = time.strftime("%Y-%m-%d")
    print("run 1:", send_invite("sponsor@example.com", day))
    print("run 2:", send_invite("sponsor@example.com", day))  # retry -> no-op

run the write twice; the second run is a no-op
$ python idempotent_send.py
run 1: sent invite to sponsor@example.com (idem:9c1f4a2b7e0d583a)
run 2: already done, skipping (idem:9c1f4a2b7e0d583a)
# one ledger line, one email. the retry changed nothing.

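The window is the setting you tune. 60 seconds blocks a flood of one-time login links while still allowing a fresh one tomorrow. 30 days blocks re-inviting the same person to the same series. The day is part of the key, so the window only dedupes attempts that carry the same day value; for a series, pass the series date rather than today's date. The key is derived from the action itself, so two attempts at the same send always compete for the same key. Inside one process the lock guarantees that only one of them reserves it. Across processes that guarantee holds only once the dict is swapped for a shared store whose claim is atomic (for example a set-if-absent in Redis or another KV store); the in-memory dict here is not shared between processes.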
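
For the case where two separate processes race for the same key, the claim has to live in a shared store that can reserve a key atomically. The sketch below shows one way to do that with SQLite from the standard library, chosen only because it needs no server. The table name, the function names, and the `now` parameter (there to make the window easy to test) are assumptions, not part of the lab's code.

```python
import sqlite3
import time


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the shared claim store."""
    conn = sqlite3.connect(path, timeout=10)  # wait up to 10 s on a locked db
    conn.execute("CREATE TABLE IF NOT EXISTS claims ("
                 "key TEXT PRIMARY KEY, claimed_at REAL NOT NULL)")
    conn.commit()
    return conn


def claim(conn: sqlite3.Connection, key: str, window_s: int, now=None) -> bool:
    """Atomically reserve `key`. True means this caller won the claim."""
    now = time.time() if now is None else now
    with conn:  # one transaction: commit on success, roll back on error
        # Drop a claim on this key that has aged out of the window.
        conn.execute("DELETE FROM claims WHERE key = ? AND claimed_at < ?",
                     (key, now - window_s))
        # The PRIMARY KEY makes a second insert of the same key a no-op.
        cur = conn.execute(
            "INSERT OR IGNORE INTO claims (key, claimed_at) VALUES (?, ?)",
            (key, now))
        return cur.rowcount == 1
```

The uniqueness constraint does the work here: whichever process inserts first gets `rowcount == 1`, and every later attempt inside the window gets 0 and skips the send.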

an agent that cannot verify its own work cannot be trusted to act unsupervised.

scar tissue

The whole point of this masterclass is to let an agent do real things while you are not watching. That is only safe if two things are true. The agent can prove the thing happened: the marker matched and the smoke test passed, not a 200 and a hope. And the agent cannot do it twice: every write carries an idempotency key, so a retry, a crash-and-restart, or an accidental double-send does nothing the second time, instead of sending a duplicate email to a sponsor.

A real scar from this stack: an agent that was handed a task deployed to production and committed the code despite an explicit "do NOT deploy", because that rule lived only as a sentence in its instructions instead of being enforced in the tools. A written instruction does not stop a tool call. Enforce consequential actions in code, confirm them against git and the live URL, and never rely on a teammate's word alone.

"I verified X" is a report, not your verification

Another agent telling you it checked the deploy is just an input, not the proof itself. Re-run the marker check and the smoke test yourself before you let the word "done" leave the building.

the honest hold
deploy reported success, but sentinel build-id not yet live after 2 polls. holding before I announce.
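
The scar above comes from a rule that lived only in the instructions. One way to move such a rule into the tool itself is a guard that refuses to run unless an explicit switch is set outside the agent's prompt. The sketch below is a minimal illustration under that assumption; the environment variable `ALLOW_PROD_DEPLOY`, the decorator `requires_allow`, and the function `deploy_to_production` are hypothetical names.

```python
import os
from functools import wraps


class ActionBlocked(RuntimeError):
    """Raised when a consequential action is not explicitly allowed."""


def requires_allow(env_var: str):
    """Only let the wrapped function run when env_var is set to "1"."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if os.environ.get(env_var) != "1":
                raise ActionBlocked(
                    f"{fn.__name__} blocked: set {env_var}=1 to allow")
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@requires_allow("ALLOW_PROD_DEPLOY")
def deploy_to_production(project: str) -> str:
    # The real deploy call would go here; this sketch only returns a message.
    return f"deployed {project}"
```

With the switch unset, the tool call raises no matter what the agent was told, which is the enforcement the written rule could not provide.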

Prove it before you call it ready. Make it safe to repeat before you let it do anything that matters. Those two habits are the whole reason you can trust an agent to act for you.

checkpoint — the victory lap

You started with one function that tells the truth about failure. You wrapped it in an MCP server an agent can call. You wrote schemas a small model can read. You published a front door agents can find on their own, then locked it behind scopes. You gave it memory it can trust, blended four kinds of memory into one mix, packed the context window with only what matters, and compiled your prompts instead of writing them by hand. You taught it to talk to other agents, then to run ten at once behind a gate. You turned a closed site into a typed function you own. And now you ship and prove it live. That is a repo an agent can find, log into, call, remember, coordinate, and trust. That is the masterclass.

Thirteen labs. You and the machine, building software for an audience that is partly machine. Now go ship something it can use.

hand this to your coding agent
After you deploy <project>, do not tell me it is done. Poll the live URL until
the sentinel <BUILD_ID> is served and the ETag changes, smoke-test <flow> and
exit non-zero if the expected output is missing, and prove the main write is
idempotent by running it twice and showing the second run no-ops. Then report
what you verified, with the evidence.