LAISI - I built Agentic Workflows before Anthropic did

2026-06-03

Earlier this year I was building myself a knowledge pipeline. The setup was simple in description and stubborn in practice. Follow a watchlist of YouTube channels, blogs, company posts, news sites. Pull the new content. Extract the ideas worth keeping. Turn those ideas into structured notes I could then connect, query, and reuse.

The hard part was the middle step. I needed the LLM to produce the structured output. A specific shape, every time, so the downstream code could rely on it.

The AI ignored the shape.

Not loudly. Not with a clear error. It just generated something that looked like the template, declared itself done, and moved on. When I read what came out, half the structure was missing and the rest was renamed to whatever the model thought was nicer. The session said “completed.” The artifact said otherwise.

That was the day I stopped trusting the receipt.

I built a small tool that spring. I called it LAISI. It is not finished. It has been vibe-coded in long evenings and not shipped to anyone. But the idea inside it survived a few months and a major release from Anthropic, and that is what this post is about. Not the artifact. The idea.

AI Code Hallucinations

The behavior I just described goes by several names now. The label I keep coming back to is hallucination, even though that word usually gets used for invented facts. What I saw was structural hallucination. The agent claimed to have followed a template it did not actually follow.

The model is not lying on purpose. It does not know the difference. It has been trained to produce confident, well-formed output. When the output is not quite what you asked for, the model still presents it as if it were. Confidence is a sampling property, not a truth signal.

For short tasks this is fine. You read the output, you correct it, you move on. For longer tasks, where step three is supposed to consume the structured output of step two, the lie metastasizes. Step two produces something that looks like a schema but is not one. Step three reads it, generates more confident-looking output on top of that wrong foundation, and now the whole pipeline is built on quicksand.

The thing I needed was not a better prompt. It was a way to refuse the lie at the door.

Context Pollution

The other half of the problem only shows up if you stay in one session long enough.

I would start a fresh conversation, lay out the architecture, give the agent the conventions, work through the first three subtasks beautifully. Then somewhere around the fifth or sixth turn, the agent would forget the conventions. The variable names would drift. The architectural rules would loosen. The same model that nailed task one would start improvising on task four.

It was not malice. The context window had filled with our conversation, our intermediate outputs, our debates. The instructions from the start of the session got pushed further from the model’s working attention. The signal got diluted by noise we ourselves generated.

I read later that the major labs have names for the failure modes that emerge from this. Agentic laziness. Goal drift. Self-preferential bias. All of them are downstream of the same physical fact. A single conversation in a single context window is a finite room, and the longer you stay in it, the louder the room gets.

Verification Asymmetry

This is the insight that gave me a place to stand.

You cannot make a probabilistic model more honest by asking it to be more honest. You can only make it cheaper to verify what it produced. Generating a passable XML document costs the model a few thousand tokens. Validating that same document against a schema costs zero tokens. The validator is a small piece of code that does its job in milliseconds and never talks itself out of its judgment.

The economics here are not subtle. If verification is roughly free and generation is roughly expensive, the right architecture is to generate once and verify deterministically. Not to generate, then ask the model to grade itself. That second step is the same probabilistic process that produced the lie in the first place.

Verification is the asymmetry. Schemas are the cheapest verifier we have.

Schema-Validated Workflow Step

LAISI stands for “Let AI Supervise Itself,” which is the entire thesis in one sentence. The tool itself is a small TypeScript CLI on Node. It reads a YAML file. The YAML defines a workflow as a list of steps. The unit pattern is the schema-validated workflow step. Every LLM step declares two things in the YAML. A prompt template, and an XSD schema.

The runner sends the prompt to the model. The model produces XML. The runner parses the XML and validates it against the XSD. If validation passes, the validated XML becomes the input to the next step.

If validation fails, LAISI does not stop. It loops. The same step is called again, this time with the failed XML appended to the prompt along with the exact validation error, and an instruction to correct it. The model tries again. The schema checks again. Pass, and the step is done. Fail, and the loop repeats, up to a per-workflow retry budget.

That inner loop is borrowed straight from Geoffrey Huntley’s Ralph Loop pattern: let the model iterate against its own output until convergence. LAISI is, in effect, lots of mini Ralph Loops chained by a YAML, each one bounded by a schema instead of a human judgment call about “good enough.” The schema is the convergence test; the Ralph-shaped loop is how convergence gets reached.

The model never tells me the step worked. The schema does. And when the schema disagrees, the model gets another swing, with the schema’s complaint in its hands.
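The loop described above can be sketched in a few lines of TypeScript. This is a minimal illustration, not LAISI's actual code: `runValidatedStep`, `Validator`, `Model`, and `StepResult` are hypothetical names, the model call is synchronous for brevity where a real one would be asynchronous, and `standInValidate` is a regex stand-in for a real XSD validator. It also assumes the retry budget means one initial attempt plus up to `maxRetries` corrections.

```typescript
// Sketch of a schema-gated retry loop. All names are hypothetical.

// A validator returns null when the output is valid, or an error message.
type Validator = (xml: string) => string | null;

// A model call takes a prompt and returns raw text (synchronous for brevity).
type Model = (prompt: string) => string;

interface StepResult {
  ok: boolean;
  attempts: number;
  output: string | null; // validated XML, or null if the budget ran out
  lastError: string | null;
}

function runValidatedStep(
  basePrompt: string,
  model: Model,
  validate: Validator,
  maxRetries: number,
): StepResult {
  let prompt = basePrompt;
  let lastError: string | null = null;

  // Assumption: one initial attempt plus up to maxRetries corrections.
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    const xml = model(prompt);
    const error = validate(xml);
    if (error === null) {
      return { ok: true, attempts: attempt, output: xml, lastError: null };
    }
    lastError = error;
    // Feed the failed output and the exact error back into the next attempt.
    prompt = [
      basePrompt,
      "Your previous output failed validation:",
      xml,
      "Validation error: " + error,
      "Return corrected XML only.",
    ].join("\n\n");
  }
  return { ok: false, attempts: maxRetries + 1, output: null, lastError: lastError };
}

// Stand-in validator: checks only for a <concepts> root with at least one
// <concept> child. A real XSD validator would replace this function.
const standInValidate: Validator = (xml) => {
  if (!/^\s*<concepts>[\s\S]*<\/concepts>\s*$/.test(xml)) {
    return "root element must be <concepts>";
  }
  if (!/<concept>[\s\S]*?<\/concept>/.test(xml)) {
    return "expected at least one <concept> element";
  }
  return null;
};

// Stub model: wrong shape on the first call, corrected once it sees the error.
const stubModel: Model = (prompt) =>
  prompt.includes("Validation error")
    ? "<concepts><concept>Verification asymmetry</concept></concepts>"
    : "<ideas><idea>Verification asymmetry</idea></ideas>";

const result = runValidatedStep("Extract concepts.", stubModel, standInValidate, 3);
```

With the stub model, the first attempt fails on the root element and the second passes, so `result.ok` is `true` and `result.attempts` is `2`. The caller never asks the model whether the step worked. It reads the validator's verdict.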

workflow: knowledge-pipeline
description: "Extract structured concepts from a watchlist of sources"
max_retries: 3

steps:
  - id: extract_concepts
    description: "Pull concepts from the day's sources"
    prompt: extract_concepts.md
    schema: extract_concepts.xsd

  - id: persist_concepts
    description: "Save extracted concepts to the vault"
    predecessor: extract_concepts
    script: scripts/save_concepts.sh

That is the entire mental model. Prompt in. XML out. Schema decides if the step happened. The next step only runs if the previous step’s output cleared the gate.

Polyglot Workflow Step

The second design choice in LAISI is that script steps and LLM steps are equal citizens.

Some work is just code. Fetching a file, converting a format, normalizing a path, running a converter. Forcing that work through a language model is wasteful. It costs tokens, it introduces a failure mode that did not need to exist, and it makes the workflow harder to reason about. So scripts and LLM steps live at the same level of the YAML, with the same step interface. Inputs, outputs, identity. The runner does not care which kind a step is.

This sounds obvious but it is not how most agentic frameworks are organized. The dominant pattern wraps shell calls as second-class “tools” available to an LLM-centric runtime. The LLM is in charge. The script is a thing the LLM may decide to use.

LAISI inverts that. The workflow is in charge. The LLM is a step type. The script is a step type. They both produce structured outputs. They both feed the next step. The orchestrator is a deterministic state machine reading YAML, not a conversational agent improvising about which tool to use next.

That is the polyglot workflow step. Code where code belongs, models where models belong, both addressed the same way.
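One way to picture "both addressed the same way" is a shared step type plus a runner that only looks at ids and predecessors. The sketch below is an illustration under assumptions, not LAISI's real implementation. The field names `id`, `description`, `predecessor`, `prompt`, `schema`, and `script` come from the YAML shown earlier. The `kind` discriminator, the `Executors` interface, and `runWorkflow` are hypothetical, and the executors are stubs rather than real model or shell calls.

```typescript
// Sketch of a uniform step interface. The kind field and the runner are
// hypothetical; only the YAML field names come from the workflow file.

interface BaseStep {
  id: string;
  description?: string;
  predecessor?: string; // id of the step whose output this step consumes
}

interface LlmStep extends BaseStep {
  kind: "llm";
  prompt: string; // path to the prompt template
  schema: string; // path to the XSD the output must satisfy
}

interface ScriptStep extends BaseStep {
  kind: "script";
  script: string; // path to the script to execute
}

type Step = LlmStep | ScriptStep;

// Executors are injected so the runner itself stays deterministic.
interface Executors {
  llm: (step: LlmStep, input: string | null) => string;
  script: (step: ScriptStep, input: string | null) => string;
}

// Runs steps in listed order. A step with a predecessor only runs if that
// predecessor has already produced an output.
function runWorkflow(steps: Step[], exec: Executors): Map<string, string> {
  const outputs = new Map<string, string>();
  for (const step of steps) {
    let input: string | null = null;
    if (step.predecessor !== undefined) {
      const prev = outputs.get(step.predecessor);
      if (prev === undefined) {
        throw new Error(
          `step "${step.id}" needs output of "${step.predecessor}", which has not run`,
        );
      }
      input = prev;
    }
    // The runner dispatches on kind and treats both results identically.
    const output =
      step.kind === "llm" ? exec.llm(step, input) : exec.script(step, input);
    outputs.set(step.id, output);
  }
  return outputs;
}

const steps: Step[] = [
  {
    kind: "llm",
    id: "extract_concepts",
    prompt: "extract_concepts.md",
    schema: "extract_concepts.xsd",
  },
  {
    kind: "script",
    id: "persist_concepts",
    predecessor: "extract_concepts",
    script: "scripts/save_concepts.sh",
  },
];

// Stub executors standing in for a real model call and a real shell call.
const stubExecutors: Executors = {
  llm: () => "<concepts><concept>example</concept></concepts>",
  script: (step, input) =>
    `${step.script} received ${input === null ? 0 : input.length} characters`,
};

const outputs = runWorkflow(steps, stubExecutors);
```

The design choice this is meant to show is that nothing in `runWorkflow` asks a model what to do next. The order comes from the step list, and the only branching is the dispatch on `kind`.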

XSD as Output Contract

The world picked JSON Schema. Pydantic, Zod, Instructor, BAML. They all converged on the same family of contracts and they did it for good reasons. JSON is everywhere. JSON Schema integrates with every modern stack.

I picked XSD anyway.

There are four reasons, and the order matters because I picked XSD first for the most practical one.

The first reason is that LLMs generate XML more reliably than JSON. XML is older. There is more XML in the training corpus, more parsers in more languages, more decades of well-formed XML to imitate. When I needed the model to produce something conforming to a precise structure, the older format had a measurably higher hit rate. Paired with the validation loop, this matters even more: the model has a real chance of converging on a valid output, whereas with JSON it kept producing close-but-wrong shapes that look right and aren’t.

The second reason is that XSD predates the AI hype by two decades. The tooling is mature, the spec is stable, validators exist in every language I am likely to use. The bet I was making was that ecosystem maturity would matter more than ecosystem fashion. A few months later I am still betting on that.

The third reason is types. XSD has a real type system. Dates are dates. Decimals are decimals. Enums are enums. Cardinalities are explicit. You can express “exactly two of these” or “at least one and no more than five” without writing custom validation code. JSON Schema can do most of this too, but XSD does it as a first-class concern, not as an extension.

The fourth reason is namespaces. When workflows compose, when one schema imports another, when you want to evolve a format without breaking the workflows that depend on the old one, namespaces matter. JSON has nothing equivalent with the same weight of tooling behind it.

The contrarian move was deliberate. I might be wrong. But the math worked out for me in February and it still works out now.

Dynamic Workflow

On May 28, Anthropic announced dynamic workflows in Claude Code. I read the post that afternoon.

The shape was familiar. Claude writes a JavaScript orchestration script. The script spawns subagents. The subagents do work in parallel, in fresh contexts, with their own model assignments and isolated worktrees. Results get checked before being integrated. The whole orchestration can resume if it crashes halfway through.

A few days later they shipped a longer write-up. A harness for every task. They cataloged the patterns. Fan-out and synthesize. Adversarial verification. Loop-until-done. Tournaments. The same handful of shapes I had been hand-coding in YAML, now formalized inside a bigger and better engineered system, addressing the same failure modes that motivated my own tool.

I am not going to pretend this was uncomplicated to read. There was the small ego sting of “they did the larger version of my idea.” There was the larger relief of “the pattern is real and now it is mainstream.” Both at once.

Mostly there was recognition. The shape was right. The mechanism was right. They are doing this for a reason and the reason is the same reason I started doing it.

Output Validation Gap

Here is the piece they did not ship.

The verification inside Anthropic’s dynamic workflows is Claude grading Claude. A subagent produces a finding, another subagent is asked to refute it. That is fine. It is better than nothing. For frontier models with deep context, model-on-model verification catches a lot.

But it is still probabilistic. The verifier and the generator come from the same distribution. They make correlated mistakes. A finding the generator was confident about is one the verifier will sometimes also be confident about, even when both are wrong. The hallucination is not eliminated by adding another participant drawn from the same population.
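A little arithmetic makes the correlation point concrete. The sketch below uses the standard identity for the joint probability of two correlated yes/no events. The function name `bothWrong` and the error rates are illustrative assumptions, not measurements of any model.

```typescript
// Probability that two binary checks are wrong at the same time, given each
// one's error rate (p, q) and the correlation (rho) between their errors.
// Identity for two Bernoulli variables:
//   P(A and B) = p*q + rho * sqrt(p*(1-p) * q*(1-q))
// The error rates used below are illustrative only.
function bothWrong(p: number, q: number, rho: number): number {
  return p * q + rho * Math.sqrt(p * (1 - p) * q * (1 - q));
}

const independent = bothWrong(0.1, 0.1, 0); // about 0.01
const correlated = bothWrong(0.1, 0.1, 0.5); // about 0.055
const identical = bothWrong(0.1, 0.1, 1); // about 0.1
```

With independent errors, a second checker that is wrong 10% of the time cuts the rate of undetected mistakes to about 1%. At a correlation of 0.5 it is about 5.5%, and with fully correlated errors the second checker adds nothing.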

What is missing is a deterministic contract on each step’s output. A schema that says: this output must conform to this structure, with these types, in this namespace, or the run fails. Not “another model thinks it looks right.” Not “the orchestrator’s checker subagent agrees.” A piece of code that cannot be talked out of its judgment.

For most users, with frontier models and frontier budgets, the model-on-model verification is good enough. For the corner of the world I have come to care about, it is not.

Small Model Supremacy

The frontier-model story keeps getting better. Opus 4.7 ships with a million-token context window. The room that filled up in February is bigger now. The original pressure that made me build LAISI is less acute on big models.

But not every model is a big model.

A 7B model running on a workstation has a small context window and a thin reliability margin. A local model running offline is doing real work in places where uploading customer data to a cloud API is not an option. A self-hosted model running behind a firewall in a regulated industry has no luxury to spare. In all of those places, every reliability lever matters more, not less.

Schema validation gives a small model a structural floor it cannot fall through. The model might produce something half-right. The validator catches it before the next step builds on top of it. That is not glamorous. It is also the difference between a workflow that small-model teams can actually deploy and a workflow that only works on top of a model none of them are allowed to use.

This is the corner LAISI was built for. I did not know it in February. I know it now.

Convergent Tool Building

A few months earlier, I had built something else.

I was sitting at my Mac, talking to Claude, and tired of typing. I wanted to dictate. macOS has built-in voice input. I tried it. It was awkward enough that I went back to the keyboard within a day.

So I built a small tool. I called it voice-type. A macOS app that captures voice input the way I wanted it to work. Push a key, talk, get the text where I wanted it. I wrote it entirely with Claude Code. I never looked at the code that came out. I tested whether it worked. It did, and I used it every day for months.

A few weeks ago Claude Code shipped its own voice mode. It is built in now. I use it every day. I have not opened voice-type since.

That is the second time this year it has happened to me. I build a small thing for myself because nothing on the shelf fits. A major lab ships a polished version. The small thing becomes obsolete. The pattern survives, just not the artifact.

I am not the only person this happens to. When you build for a specific friction in a specific moment, you are often a few months ahead of the teams building for the general case. Sometimes they catch up and replace what you made. That is not a failure. That is convergent tool building, and it is its own kind of evidence. The thing was worth building. You just were not the one with the engineering team to finish it.

Agentic Engineering vs Vibe Coding

I should be honest about the artifacts.

LAISI is heavily vibe-coded. So is voice-type. I never read the code in either of them. I tested whether the application behaved correctly and relied on that signal alone.

By the strict definition of agentic engineering, meaning reusable patterns refined through deliberate practice, neither tool qualifies. They are artifacts of someone who saw a problem and threw a working solution at it.

And yet the ideas inside them are engineering ideas. Schemas and deterministic checkpoints in LAISI. A focused voice-to-text affordance in voice-type. Those are not vibe-coding values. They are engineering values that I happened to express in vibe-coded artifacts.

That is the useful split. Vibe coding is a fine way to find the idea worth promoting to engineering later. The mistake would be to treat the messy artifact as the point. The point is the idea, and in both cases the idea was worth keeping.

Bet on the Boring Layer

Here is the part I keep coming back to.

Orchestration is the exciting layer. Everyone is building orchestrators. There are new agent frameworks every week, new ways to fan out, new patterns for multi-agent coordination. The orchestration layer is where the attention is.

Verification is the boring layer. Schemas, validators, contracts, format specifications. They do not trend. They do not get keynote announcements. They are not what gets the next round of funding.

But the part of a stack that ages best is usually the boring layer underneath. XSD will outlive my opinions about it. Schema validation will outlive whichever frontier model I am using this quarter. The orchestrator that wins this year will probably be replaced next year. The validation contracts the orchestrator was built around will still be there.

If I keep working on LAISI, that is the bet. Not “better workflow tool than the big labs.” Not “another orchestrator in the orchestrator pile.” But the validation layer that should sit underneath whatever workflow tool wins.

The orchestration layer is going to keep churning. The boring layer underneath will still be there. Schemas do not negotiate.

Ramp Me Up, Scotty!