labs | 09 | meta prompting
lab 09 | ~10 min | masterclass

Stop hand-crafting prompts. Write a prompt that writes the prompt.

Most people write a prompt by hand, tweak it until it works on a few examples, and ship it. The problem: you stopped the moment it worked, so you have no idea if a much better prompt was one edit away. Meta prompting means letting code write and improve the prompt for you. The loop is simple. Make several candidate prompts. Score each one against a test you define. Keep the best. Look at what it still gets wrong, fix that, and go again. Big tools already work this way: Anthropic's Prompt Generator, OpenAI's Generate button, and a library called DSPy. This whole lab runs on your laptop with plain Python 3.11 and nothing else. There is no real AI model in it. A small Python function stands in for the model, so the full loop actually runs with no API key and no internet. We will mark the one spot where you would swap in a real model later.

step 1

The prompt that writes prompts.

Here is the smallest version of the idea. You give a function one line describing your task, like "classify support tickets." It hands back a full, well-structured prompt. That prompt has four parts: a role line that tells the model who it is, a list of rules it must follow, an output schema (the exact shape of the answer you want back, written as JSON), and an example slot for you to fill in. The function below, build_prompt, just assembles those parts from a template, so it needs no AI and no internet. Save it as meta_prompt.py and run it.

meta_prompt.py
# meta_prompt.py -- the prompt that writes prompts.
# Pure Python 3.11 standard library. No API key, no network.
# A meta-prompt is just a function that takes a one-line task and EMITS a
# strong, structured prompt: a role line, explicit constraints, an output
# schema, and labeled few-shot slots. This deterministic version runs offline.

def build_prompt(task: str) -> str:
    """Turn a one-line task into a structured, model-ready prompt."""
    task = task.strip().rstrip(".")
    role = "You are a precise " + task + " assistant."
    constraints = [
        "Return ONLY a JSON object. No prose, no markdown fence.",
        "Every field below is required. Never omit a key.",
        "If you are unsure of a value, use null, never an empty string.",
    ]
    schema = {
        "label": "string -- the chosen category",
        "confidence": "number -- 0.0 to 1.0",
        "reason": "string -- one short sentence",
    }
    lines = []
    lines.append("### ROLE")
    lines.append(role)
    lines.append("")
    lines.append("### TASK")
    lines.append("Given an input, " + task + ".")
    lines.append("")
    lines.append("### CONSTRAINTS")
    for i, c in enumerate(constraints, 1):
        lines.append(str(i) + ". " + c)
    lines.append("")
    lines.append("### OUTPUT SCHEMA (JSON)")
    lines.append("{")
    for k, v in schema.items():
        lines.append('  "' + k + '": ' + v)
    lines.append("}")
    lines.append("")
    lines.append("### EXAMPLES")
    lines.append("input: <example input here>")
    lines.append('output: {"label": "...", "confidence": 0.0, "reason": "..."}')
    lines.append("")
    lines.append("### INPUT")
    lines.append("{input}")
    return "\n".join(lines)


# --- OPTIONAL, key-requiring path (NOT needed to run this lab) --------------
# In production you hand the SAME one-line task to a model behind a published
# meta-prompt and let it write a richer prompt than any template. This is the
# Anthropic Prompt Generator. That path needs an API key:
#
#   from anthropic import Anthropic           # needs ANTHROPIC_API_KEY
#   client = Anthropic()
#   improved = client.messages.create(...)    # model writes the prompt
#
# Everything in this lab runs with zero API access.

if __name__ == "__main__":
    task = "classify support tickets"
    print("=== one-line task ===")
    print(task)
    print()
    print("=== prompt build_prompt() emits ===")
    print(build_prompt(task))
run it
python meta_prompt.py
one line in, a rich structured prompt out
=== one-line task ===
classify support tickets

=== prompt build_prompt() emits ===
### ROLE
You are a precise classify support tickets assistant.

### TASK
Given an input, classify support tickets.

### CONSTRAINTS
1. Return ONLY a JSON object. No prose, no markdown fence.
2. Every field below is required. Never omit a key.
3. If you are unsure of a value, use null, never an empty string.

### OUTPUT SCHEMA (JSON)
{
  "label": string -- the chosen category
  "confidence": number -- 0.0 to 1.0
  "reason": string -- one short sentence
}
...

You typed three words and got back a role, three rules, a typed answer shape, and an example slot. That is the easy win: the structure is generated for you instead of retyped by hand every time. In real use you can go one step further and send your one-line task to an actual AI model, which can write a richer prompt than a fixed template. That is what Anthropic's Prompt Generator does. It needs an API key, so we marked it in the comments above as optional. Everything in this lab works without it.
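One detail the script leaves open is how the final {input} slot gets filled before the prompt is sent anywhere. Here is a minimal sketch, assuming you substitute the text yourself. The helper name fill_prompt and the shortened TEMPLATE are illustrative and not part of meta_prompt.py. It uses str.replace rather than str.format because the schema section contains literal braces, which str.format would try to read as replacement fields.

```python
# Hypothetical helper, not part of meta_prompt.py: fill the {input} slot.
# TEMPLATE is a shortened stand-in for what build_prompt() returns.
TEMPLATE = "\n".join([
    "### OUTPUT SCHEMA (JSON)",
    "{",
    '  "label": string -- the chosen category',
    "}",
    "",
    "### INPUT",
    "{input}",
])


def fill_prompt(template: str, text: str) -> str:
    """Swap the {input} placeholder for the real input text."""
    # str.replace touches only the exact marker, so the literal braces
    # in the schema section pass through unchanged.
    return template.replace("{input}", text)


if __name__ == "__main__":
    print(fill_prompt(TEMPLATE, "I was double charged this month"))
    try:
        TEMPLATE.format(input="I was double charged this month")
    except (KeyError, ValueError, IndexError) as exc:
        # str.format treats the schema's braces as a replacement field.
        print("str.format fails:", type(exc).__name__)
```

The same call works on the full string returned by build_prompt, since that string ends with the same {input} marker.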

step 2

Let code pick the best prompt, not you.

This is the idea behind DSPy, a library from Stanford that improves prompts automatically. We will build a tiny version by hand so you can see every part. You need four things. A metric is just a function that scores one answer and returns a number from 0 to 1, where higher is better. A set of test cases is a handful of example inputs paired with the right answer for each. A stand-in model is a small Python function that plays the role of the AI: it reads a prompt and returns an answer, so the whole thing runs with no API key. And the loop itself: it writes several candidate prompts, scores each one across all the test cases, and keeps the highest scorer. You do not pick the winner. The score does. Save this as optimize.py.

optimize.py
# optimize.py -- compile a prompt, do not hand-tune it (the DSPy idea, runnable).
# Signature (inputs -> output), a metric that scores an output, a LOCAL stub
# model so the loop runs offline, then an optimizer that generates variants,
# scores each across cases, and keeps the best.
import json

# signature: what goes in, what must come out
SIGNATURE = {"inputs": ["ticket"], "output": "json:{label,confidence,reason}"}
REQUIRED_KEYS = ("label", "confidence", "reason")

# the cases we optimize against (input + the correct label)
CASES = [
    {"ticket": "I was double charged this month", "gold": "billing"},
    {"ticket": "The app crashes on every login", "gold": "bug"},
    {"ticket": "Please add a dark mode option", "gold": "feature"},
    {"ticket": "How do I export my data", "gold": "howto"},
    # The hard one: "cancel" reads like billing to a naive model, but a
    # cancellation is its own bucket. The best variant still misses it.
    # Step 3's reflection is what fixes it.
    {"ticket": "I want to cancel my subscription", "gold": "account"},
]

# candidate prompt variants. Quality varies ON PURPOSE: some name the schema
# and the JSON requirement, some are vague. A real optimizer mutates these
# automatically (DSPy's MIPROv2 / GEPA); here we enumerate to keep it readable.
VARIANTS = [
    (1, "Classify the ticket."),
    (2, "Classify the ticket into a category. Reply in plain words."),
    (3, "Classify the ticket. Return JSON."),
    (4, "Classify into billing / bug / feature / howto / account. Return "
        "ONLY JSON with keys label, confidence, reason."),
    (5, "Categorize the message somehow and give me back some JSON."),
]

def _route(ticket, rules=""):
    t = ticket.lower()
    # A reflected instruction can inject a disambiguation rule (Step 3).
    if "cancel" in t and "cancel-is-account" in rules:
        return "account"
    if "charg" in t or "refund" in t or "cancel" in t:
        return "billing"
    if "crash" in t or "login" in t:
        return "bug"
    if "add" in t or "dark mode" in t:
        return "feature"
    return "howto"

# LOCAL stub model. Stands in for a real LLM so this runs with NO key. It reads
# the instruction the way a model would: if the prompt asks for JSON AND names
# the required keys, it emits well-formed JSON; weaker prompts make it emit
# prose or a partial object -- exactly how a small model degrades on a vague
# prompt. >>> SWAP THIS for client.messages.create(...) for the real run (that
# path needs ANTHROPIC_API_KEY); the optimizer around it does not change.
def run_model(prompt, case):
    text = prompt.lower()
    wants_json = "json" in text
    names_keys = all(k in text for k in REQUIRED_KEYS)
    label = _route(case["ticket"], rules=text)
    if wants_json and names_keys:
        return json.dumps({"label": label, "confidence": 0.9, "reason": "kw"})
    if wants_json:                       # asked for JSON but under-specified
        return json.dumps({"label": label})        # missing required keys
    return "This looks like a " + label + " ticket."   # prose, not parseable

# metric: deterministic, no model needed to grade. Valid JSON + all required
# keys + correct label. Returns 0.0 .. 1.0. THIS is what you tune toward.
def metric(raw, case):
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(obj, dict):
        return 0.0                       # valid JSON, but not an object
    if not all(k in obj for k in REQUIRED_KEYS):
        return 0.4                       # parseable but schema-incomplete
    return 1.0 if obj.get("label") == case["gold"] else 0.6

def score_prompt(prompt):
    return sum(metric(run_model(prompt, c), c) for c in CASES) / len(CASES)

# optimizer: score every variant across all cases, keep the best.
def optimize():
    rows = []
    for vid, instr in VARIANTS:
        prompt = instr + "\n\nTicket: {ticket}"
        rows.append((vid, round(score_prompt(prompt), 2), instr))
    rows.sort(key=lambda r: (-r[1], r[0]))   # best score first
    return rows

if __name__ == "__main__":
    rows = optimize()
    print("variant  score  instruction")
    for vid, score, instr in rows:
        short = instr if len(instr) <= 48 else instr[:45] + "..."
        print("   " + str(vid) + "      " + ("%.2f" % score) + "   " + short)
    bid, bscore, binstr = rows[0]
    print("-> selected variant " + str(bid) + ", score " + ("%.2f" % bscore))
run it
python optimize.py
the loop scored every candidate prompt and picked the winner
variant  score  instruction
   4      0.92   Classify into billing / bug / feature / howto...
   3      0.40   Classify the ticket. Return JSON.
   5      0.40   Categorize the message somehow and give me ba...
   1      0.00   Classify the ticket.
   2      0.00   Classify the ticket into a category. Reply in...
-> selected variant 4, score 0.92

You did not pick the winner. The score did. Variant 4 wins because it asks for JSON and names the three required keys, so the stand-in model returns a complete object. Variants 3 and 5 ask for JSON but never name the keys, so they get back a partial object and score 0.40. Variants 1 and 2 score 0.00 because the stand-in model replies in plain sentences instead of JSON, and the metric cannot parse that, which is roughly how a weaker real model would fail on a vague prompt. Notice variant 4 scored 0.92, not a perfect 1.00. Four cases scored 1.0 and one scored 0.6, so it still gets one test case wrong. Step 3 hunts down that one case.
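To see where each score comes from, it helps to grade one reply of each kind by hand. The sketch below repeats the lab's scoring tiers in a shortened metric that takes the gold label directly, so it runs on its own. The sample replies are made up for illustration.

```python
import json

REQUIRED_KEYS = ("label", "confidence", "reason")


def metric(raw, gold):
    """Shortened copy of the lab's metric: same tiers, gold label passed in."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return 0.0                        # not JSON at all
    if not all(k in obj for k in REQUIRED_KEYS):
        return 0.4                        # JSON, but keys are missing
    return 1.0 if obj.get("label") == gold else 0.6


# One illustrative reply per tier: (name, raw reply, gold label).
FULL = '{"label": "%s", "confidence": 0.9, "reason": "kw"}'
SAMPLES = [
    ("prose", "This looks like a billing ticket.", "billing"),
    ("label only", '{"label": "billing"}', "billing"),
    ("full, wrong label", FULL % "billing", "account"),
    ("full, right label", FULL % "bug", "bug"),
]

for name, raw, gold in SAMPLES:
    print(name, metric(raw, gold))

# Variant 4: four fully correct cases plus one well-formed but mislabeled case.
variant_4 = (4 * 1.0 + 1 * 0.6) / 5
print(round(variant_4, 2))
```

The middle tiers are a design choice: a reply that parses but has the wrong label still earns partial credit, so the ranking separates format failures from classification mistakes.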

step 3

Look at what still fails, then fix it.

The winning prompt scored 0.92, so one test case is still wrong. The last move is to look at that failure and write one extra instruction aimed straight at it. The function below, reflect, does that: it finds the cases the winner missed and adds a sentence to the prompt to cover them. Here we add the sentence with a simple rule. A tool called TextGrad does the same thing with a real AI model reading the failures and writing the fix itself. They call that written-out fix a textual gradient, which is just a fancy way of saying "a note in plain English telling the prompt how to improve." Add this code to optimize.py and run it.

add to optimize.py
# --- reflect: turn the failing cases into a better prompt --------------------
# Rule-based locally; the LLM version is TextGrad's "textual gradient" -- a
# model reads the failures and writes the patch. Here we derive it ourselves.

def failures(prompt):
    return [c for c in CASES if metric(run_model(prompt, c), c) < 1.0]

def reflect(prompt, failed):
    note = ""
    if any("cancel" in c["ticket"].lower() for c in failed):
        # the gradient: a sentence aimed exactly at the observed mistake
        note = ("\nRule: cancelling or closing an account is "
                "label=account, not billing. [cancel-is-account]")
    return prompt + note

def reflect_demo():
    winner = "Classify into billing / bug / feature / howto / account. Return " \
             "ONLY JSON with keys label, confidence, reason.\n\nTicket: {ticket}"
    fails = failures(winner)
    names = ", ".join('"' + c["ticket"] + '"' for c in fails)
    print("winner fails " + str(len(fails)) + " case(s): " + names)
    v6 = reflect(winner, fails)                  # the 6th, reflected prompt
    print("reflected variant 6 scores " + ("%.2f" % score_prompt(v6)))
    print("appended gradient: " + v6.splitlines()[-1])

# Run me:  python -c "import optimize; optimize.reflect_demo()"
run it
python -c "import optimize; optimize.reflect_demo()"
the 6th, reflected prompt fixes the case the winner missed
winner fails 1 case(s): "I want to cancel my subscription"
reflected variant 6 scores 1.00
appended gradient: Rule: cancelling or closing an account is label=account, not billing. [cancel-is-account]

The loop found a 0.92 winner. Then it read the one case that winner still missed, wrote a single sentence aimed at it, and the new prompt scored a perfect 1.00. You never edited a prompt by guesswork. You made candidates, scored them, kept the best, and fixed what was left, and the whole thing ran itself.
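The lab runs the reflect step once. Going again can be sketched as a loop that keeps patching until the score stops improving. This is a minimal sketch under assumptions: improve, toy_score, and toy_reflect are hypothetical names, and the toy scorer only counts which rule tags a prompt contains. In optimize.py you would pass score_prompt and a small wrapper around failures and reflect instead.

```python
from typing import Callable


def improve(prompt: str,
            score: Callable[[str], float],
            reflect: Callable[[str], str],
            max_rounds: int = 5) -> tuple[str, float]:
    """Repeat reflect-and-rescore; keep a patch only if the score goes up."""
    best, best_score = prompt, score(prompt)
    for _ in range(max_rounds):
        if best_score >= 1.0:
            break                          # nothing left to fix
        candidate = reflect(best)
        candidate_score = score(candidate)
        if candidate_score <= best_score:
            break                          # the patch did not help; stop
        best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins so the sketch runs on its own (hypothetical, not from the lab).
RULES = ["[cancel-is-account]", "[refund-is-billing]"]


def toy_score(prompt: str) -> float:
    # Pretend each rule tag present in the prompt fixes one failing case.
    return 0.5 + 0.25 * sum(tag in prompt for tag in RULES)


def toy_reflect(prompt: str) -> str:
    # Append the first rule the prompt does not contain yet.
    for tag in RULES:
        if tag not in prompt:
            return prompt + "\nRule " + tag
    return prompt


if __name__ == "__main__":
    final, final_score = improve("Classify the ticket.", toy_score, toy_reflect)
    print(final_score)
    print(final)
```

Keeping a patch only when the score rises means a bad reflection cannot make the prompt worse, and max_rounds caps the work if the score never reaches 1.0.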

a prompt you stopped tuning is rarely the best one | DSPy / TextGrad / Anthropic

Here is the trap most people are still in. You write a prompt, it works on your three test cases, and you ship it. The danger is that you stopped the second it worked. A much better prompt could have been one small edit away, and you will never know. Meta prompting fixes this by treating the prompt the way you treat code: make several versions, score each one, keep the best, look at what it still gets wrong, and improve it. Then do it again.

This is not theory. Anthropic's Prompt Generator cut prompt-tuning time at ZoomInfo by 80 percent. DSPy, the Stanford library whose motto is "programming, not prompting," has built-in tuners (named MIPROv2 and GEPA) that improve your prompts for you by having an AI model review and rewrite them. Tobi Lütke, the CEO of Shopify, calls DSPy his go-to tool for this. A related tool, TextGrad, has an AI model read the failures and write the fix in plain English, then loop, the same way training nudges a neural network toward a better answer. The takeaway in one line: stop writing prompts by hand and let code improve them.

Here is what it sounds like when your coding agent does this for you instead of you doing it by hand:

your coding agent, improving a prompt
hand-tuned prompt passes 3 of 5 cases; generated 8 variants, scored them, variant 6 passes 5 of 5, shipping that one.

You just ran that exact loop by hand in three short steps. In real projects the only thing that changes is scale: a real AI model writes the new versions and the fixes instead of a template, and your metric grades against your real data instead of five toy tickets. The loop is the same one you just built.
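One way to keep that swap narrow is to pass the model in as a function. This is a minimal sketch, assuming a callable named complete that takes the finished prompt text and returns the reply text. make_run_model, complete, and fake_complete are hypothetical names. With a real provider, complete would wrap that provider's client call; nothing here calls a real API.

```python
import json
from typing import Callable


def make_run_model(complete: Callable[[str], str]):
    """Build a run_model(prompt, case) with the same shape as the lab's stub."""
    def run_model(prompt: str, case: dict) -> str:
        # Fill the {ticket} slot, then hand the finished text to the model.
        return complete(prompt.replace("{ticket}", case["ticket"]))
    return run_model


def fake_complete(text: str) -> str:
    # Offline stand-in for a real model call (hypothetical).
    label = "billing" if "charged" in text.lower() else "howto"
    return json.dumps({"label": label, "confidence": 0.9, "reason": "kw"})


if __name__ == "__main__":
    run_model = make_run_model(fake_complete)
    case = {"ticket": "I was double charged this month", "gold": "billing"}
    print(run_model("Classify the ticket.\n\nTicket: {ticket}", case))
```

Because the returned function keeps the (prompt, case) signature, the metric and the scoring loop around it would not need to change.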


Hand this to your coding agent.

You ran the loop by hand so you could see every moving part. Now have your coding agent build the same thing for a task you actually care about. Paste the text below exactly as it is, then read the code it writes to see how it set up the scoring.

prompt | paste into your coding agent
Write me a meta-prompt that turns a one-line task into a structured prompt
with a role, explicit constraints, and an output schema. Then build a tiny
optimizer: generate 5 prompt variants, score each against a metric I define,
keep the best, and reflect on the cases it still fails to propose a 6th. Show
me the winning prompt and the score that made it win.

checkpoint

You stopped writing prompts by hand and let code improve them instead: make candidates, score them, keep the winner, fix what is left, repeat. This is the same loop the DSPy library runs for you, and you just did it yourself. The prompt is no longer a guess you walked away from the moment it worked. It now has a score attached, so it can keep getting better without you typing a single new instruction.