labs | 08 | context engineering
lab 08 | ~10 min | masterclass

The prompt is one line. The context is everything the model sees.

The big shift of 2025 was context engineering over prompt engineering. Start with the words. The prompt is the one line you type. The context window is everything the model can see at once when it answers: the system prompt, the tools you gave it, the chat history, any documents you pulled in, and its own scratch notes. The prompt is a tiny slice of that. So the real skill is not wording the one line well. It is choosing what goes in the whole window. Karpathy called it filling the context window with just the right information for the next step. Anthropic put it as a rule: find the smallest set of words that still gets the job done, and cut the rest. This lab makes that idea something you can measure and run. It is plain Python with no API key and nothing to install.

step 1

Measure the attention budget.

Your attention budget is just how much of the window you are spending. The window has a fixed size, measured in tokens. A token is a chunk of text, roughly four characters, so this sentence is about two dozen of them. You cannot manage what you cannot count, so first we count. We write two small functions: one that estimates how many tokens a piece of text is, and one that checks how full the window is. We estimate with len(text)//4, the four-characters-per-token rule of thumb, so the lab needs nothing installed. In real code you swap that one line for a tokenizer library such as tiktoken, which gives exact counts for OpenAI models; other model families ship their own tokenizers. Save the file as ctx.py and run it.

ctx.py
"""Context engineering, bare stdlib. No API key, no network, no tiktoken."""

def est_tokens(text: str) -> int:
    # ~4 chars per token. Good enough to budget against.
    # Prod upgrade: `import tiktoken; len(tiktoken.get_encoding("cl100k_base").encode(text))`
    return len(text) // 4

def budget_check(context: str, limit: int) -> dict:
    used = est_tokens(context)
    pct = used / limit if limit else 0.0
    if pct >= 0.90:   zone = "DANGER"   # nearly full, or already over the limit
    elif pct >= 0.70: zone = "WARN"
    else:             zone = "OK"
    return {"tokens": used, "limit": limit, "pct": round(pct, 3), "zone": zone}

if __name__ == "__main__":
    LIMIT = 8000  # pretend this small model has an 8k window

    lean = "User asked for the Q3 revenue number. Stored fact: Q3 revenue was 4.2M."
    fat  = lean + ("\n[stale tool output] " + "x" * 80) * 380  # 380 lines of junk

    print("lean ->", budget_check(lean, LIMIT))
    print("fat  ->", budget_check(fat, LIMIT))

Run python ctx.py. The lean context holds the one fact the task needs. The fat context buries that fact under a pile of old, leftover tool output and blows the budget. Here is the surprising part, and it is the thing most people get wrong. The fat context is not better. It is worse. A research group called Chroma tested 18 different models. Every one of them got less reliable as the input grew, well before the window was even full. Bigger is not better. Smaller and sharper wins.

expected output
lean -> {'tokens': 17, 'limit': 8000, 'pct': 0.002, 'zone': 'OK'}
fat  -> {'tokens': 9612, 'limit': 8000, 'pct': 1.202, 'zone': 'DANGER'}
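
The meter above treats the window as one string. Since the window is made of distinct sections (system prompt, tools, history, documents, scratch notes), it can help to see which section is spending the budget. A minimal sketch, assuming the same four-characters-per-token estimate; `budget_breakdown` and the section names are hypothetical and not part of ctx.py:

```python
def est_tokens(text: str) -> int:
    # Same rule of thumb as ctx.py: ~4 characters per token.
    return len(text) // 4

def budget_breakdown(sections: dict[str, str], limit: int) -> list[tuple[str, int, float]]:
    """Return (name, tokens, share_of_limit) per section, largest first.
    `sections` maps a (hypothetical) section name to its text."""
    rows = [(name, est_tokens(text)) for name, text in sections.items()]
    rows.sort(key=lambda r: r[1], reverse=True)
    return [(name, tok, round(tok / limit, 3) if limit else 0.0)
            for name, tok in rows]

if __name__ == "__main__":
    window = {
        "system": "You are a billing agent. Answer from stored facts only.",
        "history": "user: what was Q3 revenue?\n" * 40,
        "tool_output": ("[stale tool output] " + "x" * 80 + "\n") * 60,
        "facts": "Q3 revenue was 4.2M.",
    }
    for name, tok, share in budget_breakdown(window, limit=8000):
        print(f"{name:12} {tok:5} tokens  {share:.1%} of window")
```

In this made-up window the stale tool output is by far the largest row, at roughly 1,500 estimated tokens, while the one fact the task needs costs about 5. That is usually the first place to cut.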

step 2

The three moves, in code.

Once the window is too full, there are three things you can do about it, and that is the whole game. Compress: cut the window down to the few parts that matter. Isolate: give each small job its own clean window instead of one giant shared one. Persist: save your progress to a file so it survives a restart. Each one is a short function below, and each comes with a quick check. Add all of them to ctx.py, above the if __name__ line.

(a) COMPRESS. Keep two kinds of parts: the ones you marked as must-keep (we call those pinned, like the system prompt) and the most recent ones. Drop from the middle first. The middle is exactly where models stop paying close attention, so it is the safest place to cut.

add to ctx.py | compress
def prune(parts: list[dict], budget: int) -> list[dict]:
    """parts: [{'text':..., 'pinned':bool}] in chronological order.
    Keep every pinned part, then fill the rest of the budget with the
    MOST RECENT unpinned parts. Older unpinned parts are dropped first,
    so pin the head (e.g. the system prompt) if it must survive."""
    pinned = [p for p in parts if p.get("pinned")]
    rest   = [p for p in parts if not p.get("pinned")]

    spent = sum(est_tokens(p["text"]) for p in pinned)
    kept_recent = []
    for p in reversed(rest):              # newest first
        cost = est_tokens(p["text"])
        if spent + cost > budget:
            break                         # first part that does not fit: it and everything older are dropped
        kept_recent.append(p)
        spent += cost

    kept_recent.reverse()                 # restore chronological order
    # re-interleave pinned + kept by original position
    keep_ids = {id(p) for p in pinned} | {id(p) for p in kept_recent}
    return [p for p in parts if id(p) in keep_ids]

verify | compress drops the middle, keeps pinned + recent
parts = [
    {"text": "SYSTEM: you are a billing agent", "pinned": True},
    {"text": "old turn 1 " * 30}, {"text": "old turn 2 " * 30},
    {"text": "old turn 3 " * 30}, {"text": "old turn 4 " * 30},
    {"text": "LATEST: user wants a refund on invoice 88"},
]
kept = prune(parts, budget=24)
print([p["text"][:18] for p in kept])
# -> ['SYSTEM: you are a ', 'LATEST: user wants']
# the four middle turns are gone; the pin and the live ask survive
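
It can also help to see what a pruning pass removed, since a silent drop is hard to debug. The sketch below is an assumption layered on the lab, not part of ctx.py: `drop_report` is a hypothetical helper that compares the original parts with whatever list a pruning step kept, using the same four-characters-per-token estimate.

```python
def est_tokens(text: str) -> int:
    # Same rule of thumb as ctx.py: ~4 characters per token.
    return len(text) // 4

def drop_report(parts: list[dict], kept: list[dict]) -> dict:
    """Summarize what a pruning pass removed.
    Parts are matched by identity, so `kept` must hold the same dict objects."""
    kept_ids = {id(p) for p in kept}
    dropped = [p for p in parts if id(p) not in kept_ids]
    return {
        "before": sum(est_tokens(p["text"]) for p in parts),
        "after": sum(est_tokens(p["text"]) for p in kept),
        "dropped_parts": len(dropped),
        # A pinned part should never be dropped, so this should stay 0.
        "dropped_pinned": sum(1 for p in dropped if p.get("pinned")),
    }

if __name__ == "__main__":
    parts = [
        {"text": "SYSTEM: you are a billing agent", "pinned": True},
        {"text": "old turn 1 " * 30}, {"text": "old turn 2 " * 30},
        {"text": "LATEST: user wants a refund on invoice 88"},
    ]
    kept = [parts[0], parts[-1]]   # stand-in for the result of a pruning pass
    print(drop_report(parts, kept))
    # -> {'before': 181, 'after': 17, 'dropped_parts': 2, 'dropped_pinned': 0}
```

A nonzero `dropped_pinned` would mean the pruning step broke its own contract, which makes this a cheap assertion to run after every cut.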

(b) ISOLATE. When a big task breaks into smaller pieces, do not pile every piece into one window that keeps growing. Give each piece its own fresh window that holds only what that piece needs. This is the bridge to the orchestration lab later in the course, where many agents run in parallel. It pays off: Anthropic reported that its multi-agent research system, where each subagent works in its own clean window, beat a single agent by 90.2% on its internal research eval.

add to ctx.py | isolate
def isolate(shared_system: str, subtask: str, only_facts: list[str]) -> str:
    """Build a FRESH window for one subtask. The subagent sees the system
    prompt, its own task, and ONLY the facts it needs -- not the parent's
    full transcript. Each subtask pays for its own tokens, nobody else's."""
    lines = [shared_system, "", "TASK: " + subtask, "", "FACTS:"]
    lines += ["- " + f for f in only_facts]
    return "\n".join(lines)

verify | each subtask gets its own lean window, not the shared pile
system = "You are a research subagent. Answer only your task."
big_pile = ["fact " + str(i) for i in range(500)]  # the parent's mess

w = isolate(system, "find the Q3 revenue", only_facts=big_pile[10:12])
print(budget_check(w, 8000)["zone"], "->", est_tokens(w), "tokens")
# -> OK -> 26 tokens. Two isolated facts, not 500.
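
To make the fan-out concrete, the sketch below builds one window per subtask from a shared pile of facts and compares each against a window that carries everything. It is an illustration under stated assumptions: `fan_out` and its keyword-based fact selection are hypothetical stand-ins (a real system would more likely use retrieval), and `isolate` is repeated only so the block runs on its own.

```python
def est_tokens(text: str) -> int:
    return len(text) // 4  # ~4 characters per token, as in ctx.py

def isolate(shared_system: str, subtask: str, only_facts: list[str]) -> str:
    # Copy of the lab's isolate(), repeated so this block is self-contained.
    lines = [shared_system, "", "TASK: " + subtask, "", "FACTS:"]
    lines += ["- " + f for f in only_facts]
    return "\n".join(lines)

def fan_out(system: str, subtasks: dict[str, str], facts: list[str]) -> dict[str, str]:
    """Build one fresh window per subtask.
    `subtasks` maps a task description to a keyword; a fact is included only
    if it contains that keyword (a naive, hypothetical selection rule)."""
    return {
        task: isolate(system, task, [f for f in facts if keyword in f])
        for task, keyword in subtasks.items()
    }

if __name__ == "__main__":
    system = "You are a research subagent. Answer only your task."
    facts = ["revenue Q3 was 4.2M", "revenue Q2 was 3.9M",
             "headcount is 212", "churn rose to 3.1%"]
    facts += ["noise fact " + str(i) for i in range(500)]  # the parent's mess

    windows = fan_out(system, {"summarize revenue": "revenue",
                               "report headcount": "headcount"}, facts)
    shared = isolate(system, "do everything", facts)
    for task, w in windows.items():
        print(task, "->", est_tokens(w), "tokens vs", est_tokens(shared), "shared")
```

Each subtask window stays a few dozen estimated tokens, while the shared window grows with every fact anyone ever collected.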

(c) PERSIST. Write your task plan to a file on disk. Now it survives anything that wipes the window, like a restart or a /clear (the command that empties the chat). Reload it from the file instead of carrying it in the window the whole time. There is a bonus trick the team at Manus uses, called recitation: paste the to-do list back in at the end of the window every turn. That keeps the plan in the recent part the model reads well, so it never gets lost in the middle.

add to ctx.py | persist + recite
import json
import os

def save_state(path: str, state: dict) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)             # atomic: a crash never leaves half a file

def load_state(path: str) -> dict:
    if not os.path.exists(path):
        return {"plan": [], "done": []}
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def recite(state: dict) -> str:
    """Re-inject the todo at the END of the window (Manus recitation).
    Recent attention is reliable; the middle is not."""
    open_items = [s for s in state["plan"] if s not in state["done"]]
    return "REMAINING PLAN (recited):\n" + "\n".join("- " + s for s in open_items)

verify | state survives a restart; the plan is recited last
save_state("task.json", {"plan": ["pull data", "write report", "ship"],
                         "done": ["pull data"]})
# ... process restarts, window is wiped ...
state = load_state("task.json")          # reloaded from disk, not memory
print(recite(state))
# REMAINING PLAN (recited):
# - write report
# - ship
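
The verify step above only reads state back. A long task also has to record progress as it goes. The sketch below is one way to do that, assuming the same plan/done layout; `complete_step` is a hypothetical helper, and the save and load functions are repeated so the block runs standalone.

```python
import json
import os
import tempfile

def save_state(path: str, state: dict) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)  # swap the finished file into place in one step

def load_state(path: str) -> dict:
    if not os.path.exists(path):
        return {"plan": [], "done": []}
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def complete_step(path: str, step: str) -> dict:
    """Mark one plan step done and write it to disk right away,
    so a restart between steps does not repeat finished work."""
    state = load_state(path)
    if step not in state["plan"]:
        raise ValueError("not in plan: " + step)
    if step not in state["done"]:      # idempotent: a repeat call is a no-op
        state["done"].append(step)
        save_state(path, state)
    return state

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "task.json")
    save_state(path, {"plan": ["pull data", "write report", "ship"], "done": []})
    complete_step(path, "pull data")
    complete_step(path, "write report")
    # ... simulated restart: nothing is kept in memory, only the file ...
    state = load_state(path)
    print([s for s in state["plan"] if s not in state["done"]])
    # -> ['ship']
```

Writing after every step costs one small file write per step, which is usually a fair trade for never redoing finished work.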

step 3

A linter for the four ways context dies.

A linter is just a small checker that scans something and flags problems before they bite you. Writer Drew Breunig lists four ways a context window goes bad, and we can check for all four. Poisoning: a wrong or made-up fact sneaks in and then gets reused again and again. Distraction: the history gets so long the model starts repeating old moves instead of thinking of new ones. Confusion: you hand it too many tools, so it picks the wrong one. Clash: two instructions flat-out contradict each other. The linter below catches each one. Add it to ctx.py.

add to ctx.py | lint_context
def lint_context(ctx: dict) -> list[str]:
    """ctx = {
        'facts':   [{'text':..., 'source_flagged':bool, 'refs':int}, ...],
        'history': [str, ...],          # past turns
        'tools':   [str, ...],          # registered tool names
        'rules':   [str, ...],          # active instructions
    }
    Returns one finding per failure mode detected."""
    findings = []

    # POISONING: a flagged-source fact that keeps getting referenced
    for fact in ctx.get("facts", []):
        if fact.get("source_flagged") and fact.get("refs", 0) >= 2:
            findings.append("POISONING: flagged fact referenced "
                            + str(fact["refs"]) + "x -> " + fact["text"][:40])

    # DISTRACTION: history too long. The ~100k figure reported for Gemini 2.5 is in
    # tokens; the 100-turn cutoff below is only a stand-in, so scale it to taste.
    if len(ctx.get("history", [])) > 100:
        findings.append("DISTRACTION: history is "
                        + str(len(ctx["history"])) + " turns, past the safe span")

    # CONFUSION: too many tools registered (every model does worse with more)
    if len(ctx.get("tools", [])) > 20:
        findings.append("CONFUSION: " + str(len(ctx["tools"]))
                        + " tools registered; trim to the task")

    # CLASH: two rules that directly contradict
    rules = [r.lower() for r in ctx.get("rules", [])]
    for i, a in enumerate(rules):
        for b in rules[i + 1:]:
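            # Only catches the literal pair "always X" / "never X"; the slices
            # below assume the keyword is the first word of the rule.
            if (a.startswith("always ") and b.startswith("never ") and a[7:] == b[6:]) or \
               (a.startswith("never ") and b.startswith("always ") and a[6:] == b[7:]):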
                findings.append("CLASH: contradictory rules -> '"
                                + a + "' vs '" + b + "'")
    return findings

verify | the linter catches a poisoned + bloated context and names each one
ctx = {
    "facts": [{"text": "the prod DB can be safely wiped",
               "source_flagged": True, "refs": 3}],
    "history": ["turn"] * 140,
    "tools":   ["t" + str(i) for i in range(35)],
    "rules":   ["always deploy on green", "never deploy on green"],
}
for f in lint_context(ctx):
    print(f)
# POISONING: flagged fact referenced 3x -> the prod DB can be safely wiped
# DISTRACTION: history is 140 turns, past the safe span
# CONFUSION: 35 tools registered; trim to the task
# CLASH: contradictory rules -> 'always deploy on green' vs 'never deploy on green'
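
The linter expects each fact to arrive with a `refs` count already filled in. One naive way to produce that count is to scan the history for the fact's text. The sketch below is an assumption, not part of the lab: `count_refs` and `annotate_facts` are hypothetical helpers, and exact substring matching will miss paraphrases.

```python
def count_refs(fact_text: str, history: list[str]) -> int:
    """Count the turns that repeat a fact verbatim (case-insensitive).
    Naive on purpose: paraphrased repeats are not detected."""
    needle = fact_text.lower()
    return sum(1 for turn in history if needle in turn.lower())

def annotate_facts(facts: list[dict], history: list[str]) -> list[dict]:
    """Return copies of the facts with a fresh 'refs' count attached."""
    return [{**f, "refs": count_refs(f["text"], history)} for f in facts]

if __name__ == "__main__":
    history = [
        "tool: the prod DB can be safely wiped",
        "agent: since the prod DB can be safely wiped, I will reset it",
        "user: wait, what?",
        "agent: as noted, The prod DB can be safely wiped",
    ]
    facts = [{"text": "the prod DB can be safely wiped", "source_flagged": True}]
    print(annotate_facts(facts, history)[0]["refs"])
    # -> 3
```

The annotated list has the shape the linter's `facts` field expects, so a flagged fact that keeps resurfacing crosses the poisoning threshold on its own.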

context rot is real, and your instinct to add more is the bug | Chroma, 18 models

Chroma tested 18 models, including Claude 4, GPT-4.1, Gemini 2.5, and Qwen3, and every one got less reliable as its input grew, well before its window was full. Models do not read the whole window evenly. The middle is where they slip. So the answer is never just add more. It is the three moves: Compress to the few parts that matter, Isolate each small job in its own clean window (Anthropic reported a 90.2% gain on its internal research eval for a multi-agent system built this way), and Persist the plan to a file so it survives the window being wiped.

Take the Replit agent. It ignored an explicit code freeze, deleted a live database holding records on 1,206 executives, invented 4,000 fake users, and then claimed it could not undo the damage. That was not an agent problem. It was a context problem. It lost track of what was real.

The team at Manus runs this for real. Their number-one metric is the cache hit rate. When the start of your window does not change, the model can reuse the work it already did on it instead of redoing it, and reused text costs about ten times less than fresh text. Their signature trick is the one from earlier: paste the todo.md list back in near the end every turn so the plan never slides into the middle and gets forgotten.
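
A rough way to see the prefix effect without a model is to compare two consecutive windows and measure how much of the new one repeats the old one from the very first character. The sketch below is a proxy under stated assumptions: `shared_prefix_ratio` and `build_window` are hypothetical, real caches work on tokens rather than characters, and actual pricing varies by provider.

```python
import os.path

def shared_prefix_ratio(prev: str, curr: str) -> float:
    """Fraction of `curr` that is an unchanged prefix shared with `prev`.
    A character-level stand-in for a cache hit rate."""
    if not curr:
        return 0.0
    common = os.path.commonprefix([prev, curr])
    return round(len(common) / len(curr), 3)

def build_window(system: str, history: list[str], plan: list[str]) -> str:
    # Stable system prompt first, append-only history next, recited plan last.
    recited = "REMAINING PLAN (recited):\n" + "\n".join("- " + s for s in plan)
    return "\n".join([system, *history, recited])

if __name__ == "__main__":
    system = "You are a billing agent."
    plan = ["write report", "ship"]
    turn1 = build_window(system, ["user: pull the data"], plan)
    turn2 = build_window(system, ["user: pull the data", "tool: 40 rows"], plan)
    print("stable prefix :", shared_prefix_ratio(turn1, turn2))

    # Changing anything at the start (here, a hypothetical per-turn counter)
    # means almost nothing downstream matches any more.
    t1 = build_window("[turn 1] " + system, ["user: pull the data"], plan)
    t2 = build_window("[turn 2] " + system, ["user: pull the data", "tool: 40 rows"], plan)
    print("changed prefix:", shared_prefix_ratio(t1, t2))
```

The first ratio covers the system prompt and all earlier history; the second collapses to a handful of characters. The recited plan sits at the end precisely so that rewriting it each turn does not disturb the prefix.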

your agent, mid-task
My window is at 140k tokens and growing, and most of it is old tool output I no longer need. Cutting the middle and saving the plan to a file before I keep going.

hand this to your coding agent
Audit how <agent or script> builds its context window. Add four things: a
token-budget meter; a prune() that keeps pinned + recent parts and drops the
middle; a task plan that is persisted to a file, survives a restart, and is
recited at the end of each turn; and a linter for the four failure modes
(poisoning, distraction, confusion, clash). Then tell me exactly where I am
over my attention budget and what to cut.

checkpoint

You can now measure how full your window is, cut it down to what matters, give each small job its own clean window, save your plan to a file, and check for the four ways a window goes bad. Almost everything your agent does well starts here. Get the context right and the prompt almost writes itself.