Every time you describe what you want an LLM to build, something is lost. Not because the LLM is dumb, but because English is vague by design and code is precise by necessity. The gap between those two things is where bugs are born—and we’ve been here before, just not quite as confidently. Mid-2000s software development knew we couldn’t get this right and reached for a toolbox of workarounds: ERDs, DFDs, pseudocode, all attempts to close the gap between what we meant and what got built. LLMs can generate every one of them for you now. The gap didn’t close—the artifact just got cheaper.

TL;DR: LLMs perform statistical alchemy—expanding vague English into precise code. But the expansion is a guess. Without a way to persist intention outside the code itself, validation, debugging, and scaling AI coding pipelines all suffer from the same root problem: you can’t audit what you can’t intend.

Statistical alchemy

When you write a prompt, the LLM doesn’t execute your intention. It makes a statistical prediction about what you probably meant. This distinction matters, but we don’t consciously acknowledge it.

“Statistically, you intend to do X.” That’s the contract. The model is confident, the code looks right, and the tests may even pass. But the code reflects a probabilistic approximation of your intention—not the intention itself.

I think of this as statistical alchemy: expanding a few words into many, filling in the gaps with the most likely interpretation. The output is sophisticated and often correct. But it’s not derived from truth. It’s derived from pattern.

You see this in the wild every time an AI answer at the top of a Google search is confidently wrong—or gives a different answer tomorrow than it did today. It’s not a lookup. It’s a guess. I’ve been trying to explain this to my aging parents: agents are non-deterministic, so the answer changes every time. The same thing happens with code.
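A toy illustration of that non-determinism, with invented candidate answers and invented probabilities: generation is sampling from a distribution, not a lookup.

```python
import random

# Toy sketch: an "answer" is drawn from a weighted distribution, not retrieved.
# The candidates and weights are made up; only the sampling behavior matters.
def answer(prompt: str, seed=None) -> str:
    rng = random.Random(seed)  # no seed: a different draw every run
    candidates = ["answer A", "answer B", "answer C"]
    weights = [0.6, 0.3, 0.1]  # stand-ins for the model's learned preferences
    return rng.choices(candidates, weights=weights)[0]

# Same prompt, two calls, no guarantee the answers match:
print(answer("what does this function do?"))
print(answer("what does this function do?"))
```

Pin the seed and the answer repeats; leave it unset, as real inference effectively does, and the same question can come back different.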

The precision gap

Describing a program precisely in English has always been hard. That difficulty is actually why we have programming languages. Every step closer to natural language is a step away from precision.

The higher the precision of your description, the closer it looks to code. If you could describe exactly what you wanted with zero ambiguity, you’d essentially be writing code. Programming errors aren’t bugs in the machine—they’re precision failures. The code does exactly what it was told. The gap was in the telling.
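A contrived but concrete example: even an instruction as simple as “sort the users by name” under-specifies the behavior. Both programs below are faithful expansions of that English, and they disagree.

```python
# "Sort the users by name" never says whether the sort is case-sensitive.
# Both of these are "correct" readings of the same instruction.
users = ["alice", "Bob", "carol", "Dave"]

case_sensitive = sorted(users)                   # reading 1: raw byte order
case_insensitive = sorted(users, key=str.lower)  # reading 2: human order

print(case_sensitive)    # ['Bob', 'Dave', 'alice', 'carol']
print(case_insensitive)  # ['alice', 'Bob', 'carol', 'Dave']
```

An LLM picks one reading silently. Which one you get is a statistical accident, not a decision you made.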

This doesn’t get easier with LLMs. It gets more insidious. The model fills precision gaps silently, with statistical guesses. A developer writing incorrect code at least spent time with it, let it marinate, and had to understand what they wrote. An LLM generating incorrect code makes the error invisible until you look closely—and sometimes not even then. Ironically, feeding your code back to an LLM with a bit of context about what you were trying to accomplish may result in it catching bugs it originally authored.

The validation paradox

To validate that generated code is correct, you need to understand the original intention.

But the original intention was expressed in English—which is vague. The code is a statistical expansion of that vagueness. So when you go to validate, you’re auditing an approximation of an approximation.

This is what I think of as hallucinated intention. The code appears to have a coherent purpose. It’s internally consistent. It compiles. The LLM “intended” something—but what it intended is a probabilistic reconstruction of what you meant, not what you actually meant.

Reading code to verify correctness requires you to reconstruct the original intention from the expansion. You’re trying to work backwards through detail that was never there.

Why three LLMs don’t help

The natural instinct is to add more AI to the validation problem. A pipeline of:

  1. One LLM to write the code
  2. One LLM to fix the code
  3. One LLM to operate and verify it works

This can work—but only if you can give each stage a precise description of intention. The problem is that each one starts from the same imprecise English specification as the last.

The third LLM verifying that code “works correctly” is measuring against its own statistical reconstruction of what “correctly” means. If all three are working from a lossy English description, you have three different probabilistic approximations of the same intention with no way to know if any of them match the original.
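Structurally, the pipeline looks something like the sketch below, where `call_llm` is a hypothetical stand-in for any real model API. The point is the shape: every stage receives only `spec`, the same lossy English description.

```python
# Hypothetical sketch of the three-stage pipeline. `call_llm` is a placeholder
# for any model API; the structure is the point, not the call.
def call_llm(role: str, spec: str, artifact: str = "") -> str:
    return f"[{role} output, derived from spec: {spec!r}]"  # placeholder

spec = "build a function that cleans up user records"  # vague by design

code = call_llm("writer", spec)              # guess #1 at the intention
fixed = call_llm("fixer", spec, code)        # guess #2, from the same spec
verdict = call_llm("verifier", spec, fixed)  # guess #3 judges guess #2

# Nothing in this pipeline ever touches the actual intention; that exists
# only in the head of whoever wrote `spec`.
```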

Validation loops

One of the techniques I’ve been playing with is validation loops—giving an LLM a way to validate that the code it wrote actually works. A.k.a. Test-Driven Development.

My workflow looks like:

  1. Research
  2. Plan
  3. Work with AI to write and validate tests
  4. Break up the plan into tasks and fan out to multiple agents that run tests as they’re building
  5. Have a second LLM with a high-level understanding of the context audit the code

It’s promising. But at the end of the day, it’s still English. I’m still extrapolating specificity from vagueness, which means I’m still reviewing the code critically myself. I don’t want to subject my coworkers to AI slop.
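A minimal sketch of what step 3 buys you, using a made-up `slugify` task: the test is written and audited first, so generated candidates are checked against frozen intention rather than against themselves.

```python
# Human-audited test, written before any code is generated. These assertions
# are the frozen intention for a hypothetical `slugify` function.
def passes_spec(fn) -> bool:
    return (
        fn("Hello World") == "hello-world"
        and fn("  spaced  out  ") == "spaced-out"
        and fn("already-slugged") == "already-slugged"
    )

# Two candidate implementations, as if from different agent runs:
def candidate_a(s):  # plausible-looking, but mishandles extra whitespace
    return s.lower().replace(" ", "-")

def candidate_b(s):
    return "-".join(s.lower().split())

print("a:", "pass" if passes_spec(candidate_a) else "fail")  # a: fail
print("b:", "pass" if passes_spec(candidate_b) else "fail")  # b: pass
```

The loop only catches what the assertions pin down—which is exactly why the tests themselves need a human audit.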

What we actually need

I don’t have a complete answer. But the shape of the solution requires two things:

A log of intention and decisions. Not just specs—a durable record of why decisions were made. What options were considered and rejected. What constraints shaped the implementation. Something agents can consult when validating or extending code—not just “what should it do” but “why does it do it this way.”
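One possible shape for such a record, sketched as a data structure. The field names are illustrative, not any standard.

```python
from dataclasses import dataclass, field

# Illustrative shape for a decision-log entry; no standard implied.
# The value is that "rejected" and "constraints" survive alongside the code,
# so a later agent or human can validate against the why, not just the what.
@dataclass
class Decision:
    question: str        # what was being decided
    chosen: str          # the option that won
    rejected: list[str]  # options considered and ruled out
    constraints: list[str] = field(default_factory=list)  # what shaped the choice

log = [
    Decision(
        question="How should duplicate user records be merged?",
        chosen="keep the most recently updated record",
        rejected=["keep the oldest record", "merge field-by-field"],
        constraints=["downstream billing assumes exactly one row per user"],
    ),
]
```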

Good tests that are audited and maintained. Tests are the closest thing we have to frozen intention. A test says: at this moment in time, a human decided this behavior was correct. That’s enormously valuable—but only if the tests are meaningful. Not generated and forgotten, not passing for the wrong reasons.

A decision log combined with a maintained test suite is as close as we can currently get to persisted intention. It’s not elegant. It’s not fully automatable. It requires effort and a human in the loop.

That might be the point.

The intention debt

We’re accumulating intention debt the same way we accumulate technical debt—and we’re doing it faster. LLMs write code quickly; intention documentation accrues slowly, if at all. And context windows are limited, so an LLM can only ever operate on a small piece of the system at a time.

Technical debt is recoverable because the code still exists. Intention debt is harder to recover because the intention often lives only in the head of the person who wrote the prompt—and that person has already moved on to the next task. With less time spent deep in the code and more projects in flight than ever, the half-life of that knowledge shrinks from six months to two.

Until we have better tooling for capturing and persisting intention, the “AI writes, AI validates, AI operates” pipeline will remain fragile at exactly the seams where intention needs to transfer between systems.

Code is now the easy part. The hard part is operation and carrying the intention forward.