The demo-to-deploy gap
A working demo and a working system are not the same species. Most AI builders are optimizing for the wrong one.
The forty-second AI agent demo has become its own little art form. An agent books a meeting, sends a Slack message, updates a CRM row. All in one take. The caption says "this is the future." The replies agree. The like count climbs.
Some time later I am on a call with that same founder, screen-shared into that same agent, and the thing is dead. Not broken-with-an-error dead, which would at least be honest. Something worse. The agent ran, returned a confident-looking result, and the CRM row never updated. Nobody noticed for days.
This is not a rare failure. It is the modal outcome of AI agents shipped in 2026, and the shape of it has become so predictable that I have stopped being surprised. The gap between "works in the demo" and "works on a Tuesday at 3am when nobody is watching" is where almost all the real engineering in an AI project lives. Almost nobody writes about it, because the work does not film well.
The last ten percent
Here is the thesis, and once you see it you will see it in every AI project you audit: the last ten percent of an AI system is ninety percent of the work. The first ninety is the part you can film. It is getting cheaper every month whether you work on it or not, because the frontier models keep improving and the scaffolding keeps getting more common. The part that is not getting cheaper is the part that decides whether the thing actually ships. That is the part the market has no idea how to applaud yet.
Most builders are optimizing for the first ninety. This is the entire reason most AI agents never reach production. It is also why the founders who do reach production tend to look, from the outside, like they "just got lucky." They did not get lucky. They did the unfilmable work.
Let me be specific about what "the last ten percent" actually contains, because "production-ready" is one of those phrases that sounds like it means something and mostly doesn't.
It starts with auth. Auth seems like plumbing until you realize that a real system talks to a real CRM belonging to a real person, which means OAuth, which means token refresh, which means figuring out what happens when the token expires at 2am on a Saturday and the retry queue quietly fills up with 403s until someone notices on Monday morning. Your demo used an API key hardcoded in a .env file. Your customer cannot do that. Every real integration you ship is going to be a short lesson in how many ways a token refresh can go wrong when nobody told the model to care about it.
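The 2am failure mode above can be made concrete. Here is a minimal sketch of the proactive refresh a demo never needs; `TokenStore`, `refresh_fn`, and the five-minute skew are all assumptions for illustration, not a real OAuth client:

```python
import time

class TokenStore:
    """Sketch: refresh the access token *before* it expires, and treat
    a failed refresh as an incident, not as a quiet stream of 403s."""

    def __init__(self, refresh_fn, skew_seconds=300):
        self._refresh_fn = refresh_fn    # hypothetical: returns (token, expires_at_epoch)
        self._skew = skew_seconds        # refresh this many seconds before expiry
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Refresh ahead of time so calls never go out with a dead token.
        if self._token is None or time.time() >= self._expires_at - self._skew:
            try:
                self._token, self._expires_at = self._refresh_fn()
            except Exception as exc:
                # Do not swallow this. A revoked refresh token at 2am on a
                # Saturday should page someone, not fill a queue with 403s.
                raise RuntimeError("token refresh failed; alert a human") from exc
        return self._token
```

The design choice that matters is the second branch: the demo version catches the exception and moves on, and that is exactly the version that fails silently until Monday.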
Then there is state. The demo ran once, in a clean room, against a fresh database that nobody else was touching. The real system runs a thousand times a day against data that real humans are editing at the same time, and half the interesting bugs only appear when two things happen in the wrong order. If you do not know what "idempotent" means in the context of your own tool calls, you have not shipped yet. I do not mean that as an insult; I mean it as a diagnostic.
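What "idempotent" means for your own tool calls fits in a few lines. This is a sketch, not a library: in production the result store would be a database table with a unique constraint, not a dict.

```python
class IdempotentExecutor:
    """Run each logical action at most once, keyed by an idempotency key.
    A retry with the same key returns the recorded result instead of
    re-applying the side effect."""

    def __init__(self):
        self._results = {}   # production: a DB table with a unique key column

    def run(self, key, action):
        if key in self._results:
            return self._results[key]    # already applied; do not repeat it
        result = action()                # the side effect happens exactly once
        self._results[key] = result
        return result
```

The key has to be derived from the intent ("update row 42 for run 9f3"), not from the attempt, or retries defeat the whole point.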
Then retries. The LLM will return malformed JSON. The tool call will time out. The third-party API will rate-limit you, or return a 200 with an error buried in the body, or return the right shape with the wrong field name because someone shipped a breaking change on a Thursday and forgot to bump the version. Your agent needs to know what to do then. "It throws" is not an answer; "it retries forever" is a worse one.
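A bounded retry loop with response validation covers most of that paragraph. A sketch with hypothetical `call` and `validate` hooks; the validation step is what catches the 200 with an error buried in the body:

```python
import random
import time

def call_with_retries(call, validate, max_attempts=4, base_delay=0.5):
    """Retry transient failures with exponential backoff and jitter.
    Crucially: validate even 'successful' responses, and stop after
    max_attempts instead of retrying forever."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = call()
            validate(resp)   # raises if the 200 hides an error in the body
            return resp
        except Exception:
            if attempt == max_attempts:
                raise        # out of budget: escalate, do not loop forever
            time.sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5))
```

"It throws" becomes the last resort after a bounded budget, which is the only honest place for it.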
Then observability, which is the category nobody builds until after they have already lost a customer. A deterministic bug is easy: you read the stack trace, you find the line, you fix it. An LLM bug is a probability distribution. It happens four percent of the time on Tuesdays, in a way you cannot reproduce locally, because the word "schedule" in one email triggered a subtly different chain of tool calls than the word "booking" did in the next. You cannot fix what you cannot see. "Check the logs" stops working when the logs are a thousand JSON blobs per hour and the bug is in the shape of the model's reasoning rather than in any one line of it.
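Seeing that kind of bug requires structured records you can query, not prose logs. A minimal sketch; the field names and the `sink` callable are assumptions, not a particular tracing product:

```python
import time
import uuid

class RunLogger:
    """One structured record per tool call, grouped under a run id, so
    'four percent of Tuesdays' becomes a query instead of a log-grep."""

    def __init__(self, sink):
        self._sink = sink               # any callable taking a dict, e.g. a DB insert
        self.run_id = str(uuid.uuid4())
        self._step = 0

    def tool_call(self, tool, args, outcome, ok):
        self._step += 1
        self._sink({
            "run_id": self.run_id,      # ties the whole chain of calls together
            "step": self._step,
            "tool": tool,
            "args": args,               # redact secrets before this point
            "outcome": outcome,
            "ok": ok,
            "ts": time.time(),
        })
```

With records like these, "the word schedule triggered a different chain than booking" is one filter on args plus a group-by on run_id, instead of an evening of scrolling.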
Then the human-in-the-loop question, in its honest form. Not "should a human approve this," because everyone says yes to that. The real question is which step, with what UI, with what SLA, owned by whom, and what happens when that human is on vacation. If you cannot answer all five, you do not have a system. You have a cron job with a language model stapled to the front.
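Those answers can be forced into code, which is a decent test of whether you actually have them. A sketch with hypothetical names; the point is that the owner, the fallback, and the SLA are fields, not vibes:

```python
from dataclasses import dataclass

@dataclass
class ApprovalStep:
    """The honest human-in-the-loop spec: which step pauses, who owns
    the approval, what the SLA is, and who covers a vacation."""
    step: str                 # which step pauses for a human
    owner: str                # who approves it
    fallback: str             # who approves when the owner is away
    sla_seconds: int          # how long a run may wait before escalating

    def approver(self, owner_available: bool) -> str:
        return self.owner if owner_available else self.fallback

    def breached(self, waited_seconds: float) -> bool:
        return waited_seconds > self.sla_seconds
```

If filling in those four fields for a real step feels hard, that is the diagnostic working as intended.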
And then the edge cases you will not think of until a customer ships you a CSV with an emoji in a column header. I am not making that up. I have debugged that exact bug more than once, and the reason it took more than once is that I had not written the fix down the first time.
None of this is in the forty-second clip. The clip is the happy path. The last ten percent is everything else, and everything else is where the whole game lives.
Demos and systems are different species
Here is the reframe, and it is the only one that has actually helped me ship things. A demo and a system are not the same thing at different levels of maturity. They are different species. Different selection pressures. Built by people with different instincts.
A demo is a performance. Its job is to convince a human, in forty seconds, that something is possible. The skills that make a great demo are the skills of a magician: surprise, narrative, a clean happy path, a steady hand with the curtain. Demos reward novelty.
A system is an operation. Its job is to do something, correctly, while no one is looking. The skills that make a great system are the skills of a plumber: tolerance, boredom, relentless instrumentation, and a deep suspicion of anything that worked on the first try. Systems reward boredom.
Most people who are great at the first are mediocre at the second, because the feedback loops reward completely different instincts. I have watched founders ship a gorgeous demo in an afternoon and then spend six months failing to get that same demo stable enough to sell. Not because they are incompetent. Because the instincts that got them to the demo (speed, flair, refusing to sweat the edge cases) are exactly the wrong instincts for production. The demo-building self and the system-building self are in tension. The people who ship are the ones who learn to switch between them.
The v1 that reports success while silently failing
This happens more often than anyone admits. An agent ships, logs a confident summary and a 200 status, and moves on. The actual action, the part the customer is paying for, silently fails. Nobody notices until the customer emails. Politely. Which is worse than if they had yelled.
The fix is real observability: per-step assertions, a dead-letter queue for failed tool calls, a customer-facing UI that lets them see every run and its outcome. The version without instrumentation lasts days. The version with it runs indefinitely.
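The per-step assertion and the dead-letter queue fit in a dozen lines. A sketch, not the implementation from any particular project; `verify` is the assertion that the action actually landed:

```python
class DeadLetters:
    """A failed or unverified step lands in a dead-letter list for
    inspection and replay, instead of vanishing into a log line."""

    def __init__(self):
        self.items = []

    def run_step(self, name, step, verify, payload):
        try:
            result = step(payload)
            if not verify(result):   # per-step assertion: did it actually land?
                raise AssertionError(f"postcondition failed for {name}")
            return result
        except Exception as exc:
            self.items.append({"step": name, "payload": payload, "error": str(exc)})
            return None              # the run continues; the failure is on record
```

The customer-facing runs UI is then largely a view over this list plus the successful results, which is why building it first is cheap and building it later is a rescue operation.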
The embarrassing part is never that v1 broke. Everything breaks. The embarrassing part is shipping without the instrumentation that would have told you it was broken — building a demo and calling it a system. Filming a clip of it working and convincing yourself the clip is the product.
Actionable conclusions
If you want to close the demo-to-deploy gap on your own project, here is the shortest list I can defend. I have tested each of these on enough real projects to know they work.
Treat the demo as a hypothesis, not a proof. A working demo means the model can do the thing, in isolation, on a good day. It does not mean the system can do the thing reliably, at scale, inside a real customer's world. Those are different claims, and conflating them is the most expensive mistake being made in the field right now.
Build the observability layer before you build the agent. I wrote a whole separate essay on this one. The short version: a table of every run, with nested tool calls, searchable, visible to you and to the customer, is not the last thing you add. It is the first thing you build, and nothing else you ship will repay itself as quickly.
Budget your time backwards. If the project takes six weeks, allocate one week to the demo and five to the last ten percent. Most builders do the opposite ratio. That is also why most builders do not ship.
Write the kill criteria before you start. "If I cannot get this above 95% reliability on the benchmark set I will define this week, I will kill the project." You should be able to say that sentence on day one. If you cannot, you are committing to a project whose failure mode is "we keep working on it forever", the most expensive outcome of all, worse than failing outright and much worse than succeeding.
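The kill criterion is checkable in three lines, which is exactly why it works. A sketch; the 95% threshold and the benchmark set are whatever you committed to on day one, not values this code can choose for you:

```python
def should_kill(benchmark_results, threshold=0.95):
    """benchmark_results: one bool per benchmark run (True = passed).
    Returns True when measured reliability falls below the threshold
    you committed to, i.e. when the day-one sentence now applies."""
    reliability = sum(benchmark_results) / len(benchmark_results)
    return reliability < threshold
```

The function is trivial on purpose. The hard part is running it weekly and honoring the answer.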
Listen to what broke, not to what worked. Every week of an AI project should end with a review of the failed runs, not the successful ones. Your instincts will pull you toward the good runs because they feel like validation. The bad runs are the ones with the information in them.
And when a founder tells you their demo worked, congratulate them, and then ask how they would know if it stopped working tomorrow. If the answer is not specific, you have just found the next thing they need to build.
So: which one are you building? If you are honest and the answer is "a demo that I am calling a system," that is fine. It is where almost everyone starts, and it is where I started. The question is whether you know it, and whether you are willing to put in the unglamorous week that closes the gap, or whether you are going to keep shipping the ninety percent and quietly wondering why the revenue never shows up.
The demo-to-deploy gap is not a technical problem. It is a discipline problem. The discipline is cheap; the people who practice it are rare. Be one of them.