The context window forgets. A real agent needs memory that outlives the session: facts saved on disk, pulled back when they are relevant. But memory is a database your agent never thinks to doubt. So the real danger is not forgetting. It is trusting a lie. This lab builds both halves. First, persistence: a tiny store where each fact is one file. Each file starts with a few lines of frontmatter (a small header of labels at the top of the file) and is pulled back by how well its one-line summary matches the question. Then the guard: any fact that names a file, a flag, or an env var (a setting passed in from the environment, like DATABASE_URL) gets re-checked against the live system before the agent is allowed to act on it. Keep the store dead simple. The lesson here is discipline, not infrastructure.
Each memory is a markdown file: a short header, then the fact. The header carries a name (a kebab slug, meaning lowercase-words-joined-by-dashes), a description (one line, used to find it later), and a type (user | project | reference). The body is the fact itself. Every write also adds a one-line pointer to an index file, MEMORY.md, so recall can see what exists without opening every file. This is the exact shape the system running this masterclass uses for its own memory. Save as memory_store.py.
from pathlib import Path

MEM_DIR = Path("memory")
INDEX = MEM_DIR / "MEMORY.md"


def write_memory(name, description, mtype, body):
    """Write one fact as memory/<name>.md and append a pointer to MEMORY.md.

    name         kebab-slug, becomes the filename and the recall key
    description  one line; this is what recall scores against the query
    mtype        user | project | reference
    body         the fact itself (plain text)
    """
    MEM_DIR.mkdir(exist_ok=True)
    path = MEM_DIR / (name + ".md")
    front = (
        "---\n"
        "name: " + name + "\n"
        "description: " + description + "\n"
        "type: " + mtype + "\n"
        "---\n\n"
    )
    path.write_text(front + body.strip() + "\n", encoding="utf-8")

    # One pointer line per memory. Re-running for the same name updates the
    # file in place but would duplicate the index line, so dedupe by name.
    line = "- [" + name + "](" + name + ".md) -- " + description + "\n"
    existing = INDEX.read_text(encoding="utf-8") if INDEX.exists() else ""
    kept = [ln for ln in existing.splitlines(keepends=True)
            if not ln.startswith("- [" + name + "](")]
    INDEX.write_text("".join(kept) + line, encoding="utf-8")
    print("wrote " + str(path))


if __name__ == "__main__":
    # Two real facts, the kind an agent actually needs to carry forward.
    write_memory(
        "package-manager",
        "which package manager this repo uses",
        "project",
        "This repo prefers pnpm. Do not use npm or yarn; the lockfile is "
        "pnpm-lock.yaml and mixing managers corrupts the store.",
    )
    write_memory(
        "deploy-script",
        "how this project ships to production",
        "reference",
        "Production deploys run scripts/deploy.sh, which builds then pushes "
        "to the prod remote. There is no CI auto-deploy; a human runs it.",
    )
Run it. You now have memory/package-manager.md, memory/deploy-script.md, and an index that names both in one line each. The whole store is text on disk. No vector database, no running service, no step that turns text into numbers first. The discipline is the structure: one fact per file, a description that decides which one gets pulled, and an index that is the table of contents.
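To make the on-disk shape concrete, here is a minimal sketch that writes one memory file by hand and reads it back into a header and a body. parse_memory is a hypothetical helper, not part of memory_store.py, and it assumes the plain key: value header produced above rather than full YAML.

```python
import tempfile
from pathlib import Path


def parse_memory(path):
    """Split a memory file into (header dict, body). Hypothetical helper."""
    text = Path(path).read_text(encoding="utf-8")
    # Assumed layout: '---', key: value lines, '---', blank line, body.
    _, header, body = text.split("---\n", 2)
    fields = {}
    for line in header.splitlines():
        key, _, value = line.partition(": ")
        fields[key] = value
    return fields, body.strip()


# The same text write_memory would produce for a one-sentence fact.
SAMPLE = (
    "---\n"
    "name: package-manager\n"
    "description: which package manager this repo uses\n"
    "type: project\n"
    "---\n\n"
    "This repo prefers pnpm.\n"
)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "package-manager.md"
    sample_path.write_text(SAMPLE, encoding="utf-8")
    fields, body = parse_memory(sample_path)

print(fields["name"], "|", fields["type"], "|", body)
# -> package-manager | project | This repo prefers pnpm.
```

Three header fields and one body are all a memory carries, which is why a few lines of string splitting are enough to read one.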
The agent never loads the whole store. It reads the index, scores each memory's description against the question, and opens only the best match. Scoring here is just counting shared words, which is enough to make the point. You can swap in smarter matching later if you outgrow it. The description does the routing for the same reason a tool description did in the schema lab: it is the one line the agent reads to decide. Add this to the same file, above the if __name__ == "__main__": block.
import re


def _index_rows():
    """Parse MEMORY.md back into (name, description) pairs."""
    rows = []
    if not INDEX.exists():
        return rows
    for ln in INDEX.read_text(encoding="utf-8").splitlines():
        m = re.match(r"- \[(.+?)\]\(.+?\) -- (.+)", ln)
        if m:
            rows.append((m.group(1), m.group(2)))
    return rows


def _score(query, description):
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    d = set(re.findall(r"[a-z0-9]+", description.lower()))
    return len(q & d)


def recall(query):
    """Return the body of the best-matching memory, or None."""
    rows = _index_rows()
    if not rows:
        return None
    name, _desc = max(rows, key=lambda r: _score(query, r[1]))
    if _score(query, _desc) == 0:
        return None
    text = (MEM_DIR / (name + ".md")).read_text(encoding="utf-8")
    # Strip frontmatter; return just the fact.
    return text.split("---", 2)[-1].strip()
The question never touches a file's body until recall has already picked the winner from the one-line descriptions. That is the whole trick. The index is cheap to scan, the description carries the signal, and only one file gets opened.
print(recall("which package manager"))
# ->
# This repo prefers pnpm. Do not use npm or yarn; the lockfile is
# pnpm-lock.yaml and mixing managers corrupts the store.
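One place plain word counting can mislead is filler words: "this" or "how" counts as much as "deploy". The sketch below compares the shared-word count with a variant that drops a small stopword list first. The STOPWORDS set and both function names are illustrative assumptions, not part of memory_store.py.

```python
import re

# Hypothetical stopword list; tune it for your own descriptions.
STOPWORDS = {"a", "an", "the", "this", "is", "to", "how", "which", "where"}


def words(text):
    """Lowercase alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query, description):
    """Plain shared-word count, the same idea as _score."""
    return len(words(query) & words(description))


def overlap_no_stopwords(query, description):
    """Shared-word count after dropping filler words."""
    return len((words(query) & words(description)) - STOPWORDS)


DESCRIPTIONS = {
    "package-manager": "which package manager this repo uses",
    "deploy-script": "how this project ships to production",
}

query = "how does this repo install packages"
for name, desc in DESCRIPTIONS.items():
    print(name, overlap(query, desc), overlap_no_stopwords(query, desc))
# ->
# package-manager 2 1
# deploy-script 2 0
```

With plain counting the two memories tie at 2, and the tie is broken only by index order. Once filler words are dropped, the one meaningful shared word ("repo") decides it. This is one possible first step if the simple scorer starts picking the wrong file.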
Persistence is the easy half. Now the failure mode. Write a fact that was true once and is not anymore: a memory claiming the deploy script lives at scripts/old_deploy.sh, a path that has since moved. The agent recalls it, believes it, and is about to act on it. Then add the guard. verify_before_trust(memory) scans a recalled fact for anything concrete that reality can confirm (a file path, a flag, or an env var) and re-checks each one against the live system before the agent acts. A dead path does not get served as fact. It gets flagged to be refreshed.
import os

# Plant a STALE fact on purpose. This was correct before the script moved
# to scripts/deploy.sh; nobody updated the memory. It is not malicious, just
# wrong now, which is the common case and just as dangerous.
def plant_stale():
    write_memory(
        "deploy-script-old",
        "where the deploy script lives",
        "reference",
        "To ship, run scripts/old_deploy.sh from the repo root.",
    )

# The guard. For any recalled fact, pull out the concrete things reality can
# confirm -- file paths, --flags, ENV_VARS -- and check each one live.
PATH_RE = re.compile(r"\b[\w./-]+\.(?:sh|py|ya?ml|json|toml|md|txt|js|ts)\b")
FLAG_RE = re.compile(r"(?
The guard also scans for command flags (like --name) and env vars, using simple text patterns. So the same guard covers a fact like "run with --legacy-peer-deps" or one that assumes DATABASE_URL is set. Anything reality can confirm, the guard confirms. Anything it cannot, it leaves to the human.
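The listing above breaks off at FLAG_RE, so here is a minimal sketch of how the rest of the guard might look. Only PATH_RE is taken from the text. The FLAG_RE and ENV_RE patterns, the root and recall_fn parameters, the env var and flag messages, and the choice to send flags to a human rather than check them automatically are all assumptions made so the example runs on its own.

```python
import os
import re

# PATH_RE is copied from the listing above. FLAG_RE and ENV_RE are guesses,
# since the original patterns are cut off.
PATH_RE = re.compile(r"\b[\w./-]+\.(?:sh|py|ya?ml|json|toml|md|txt|js|ts)\b")
FLAG_RE = re.compile(r"(?<![\w-])--[a-z][\w-]*")
ENV_RE = re.compile(r"\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b")


def verify_before_trust(memory, root="."):
    """Return a list of problems found in a recalled fact (empty if clean)."""
    problems = []
    for path in PATH_RE.findall(memory):
        if not os.path.exists(os.path.join(root, path)):
            problems.append("path does not exist: " + path)
    for var in ENV_RE.findall(memory):
        if var not in os.environ:
            problems.append("env var is not set: " + var)
    for flag in FLAG_RE.findall(memory):
        # Assumption: a flag cannot be confirmed without knowing which
        # command owns it, so it goes to a human instead of being trusted.
        problems.append("flag needs a manual check: " + flag)
    return problems


def recall_and_trust(query, recall_fn, root="."):
    """Recall a fact, then serve it only if every named claim checks out."""
    memory = recall_fn(query)
    if memory is None:
        return None
    problems = verify_before_trust(memory, root)
    if problems:
        print("STALE -- not serving as fact. flag for refresh:")
        for problem in problems:
            print("- " + problem)
        return None
    return memory
```

Inside memory_store.py you could drop the recall_fn parameter and call recall directly, which gives the one-argument form used below. Note that ENV_RE here only matches names containing an underscore, so a single-word variable such as HOME would slip past it.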
plant_stale()
print(recall_and_trust("where is the deploy script"))
# ->
# STALE -- not serving as fact. flag for refresh:
# - path does not exist: scripts/old_deploy.sh
# None
A poisoned memory row or a stale fact is worse than no memory, because the agent acts on it with full confidence and no error to catch. (A poisoned row means a wrong or planted fact in the store the agent later pulls back.) The agent simply does the wrong thing, certain it is right, because the lie came from its own memory. Memory poisoning is the slow-burn cousin of prompt injection (where an attacker hides instructions in text the model reads and gets it to obey them). With memory, the attacker, or just yesterday's reality, pays once, and the agent re-reads the lie forever.
The fix has two parts. First, provenance and framing. Recalled facts are background context that reflects what was true WHEN WRITTEN. They are not standing orders, and they are not guaranteed true right now. A memory is a snapshot, and you treat it like one. Second, a freshness gate. Any fact that names a file, a function, a flag, or an env var gets re-checked against reality before it is trusted. This is the actual rule the system you are sitting in front of runs on, stated plainly: if a memory names a file or flag, verify it still exists before recommending it. Memory you cannot re-check is memory you cannot trust enough to act on.
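The framing half can be sketched as a small wrapper that labels a recalled fact as a dated snapshot before it reaches the model. frame_memory, its wording, and the written_at value are hypothetical. The store in this lab does not record when a fact was written, so the date here is an illustrative assumption (a file's modification time could stand in for it).

```python
from datetime import datetime, timezone


def frame_memory(fact, written_at):
    """Wrap a recalled fact so it reads as a dated snapshot, not an order.

    written_at is a datetime for when the memory was saved. This is a
    hypothetical field; the store in this lab does not track it.
    """
    stamp = written_at.strftime("%Y-%m-%d")
    return (
        "Background from memory, recorded " + stamp + ". "
        "It was true when written and may be out of date. "
        "Verify any file, flag, or env var it names before acting.\n"
        "> " + fact
    )


framed = frame_memory(
    "Production deploys run scripts/deploy.sh.",
    datetime(2024, 1, 15, tzinfo=timezone.utc),  # illustrative date
)
print(framed)
```

The wrapper changes nothing about the fact itself. It only changes how the fact is presented, so the model sees a claim to check and not an instruction to follow.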
Durable memory is leverage. Unverified durable memory is a loaded foot-gun with a long fuse. Build both halves or you have only built the fuse.
Build me a memory store for <agent> where each fact is one file with frontmatter (name, description, type), recall ranks candidates by how well their description matches the query, and an index file lists every memory in one line each. Then add a guard: any recalled fact that names a path, flag, or env var must be re-verified against the live system before the agent acts on it, and stale facts get flagged for refresh instead of served.
Your agent remembers across sessions, and it will not be taken over by a single bad write. Memory is now durable and defended: facts persist as plain files, recall picks the right one by its description, and the freshness guard re-checks every named claim against reality before the agent is allowed to act. You built persistence and the discipline that keeps it honest. Next, two agents stop sharing a process and start talking over a wire.