The single-agent trap
"One agent that does everything" is not the goal. It is the bug. Here's the math on why, and what to do instead.
Every founder I talk to wants the same thing, described in roughly the same sentence. "One agent. It reads the email, figures out what the customer wants, checks the database, updates the record, sends the reply, logs the ticket, schedules the follow-up. One agent, one prompt, one call."
And every time I have built it that way, for myself, for clients, for demos I thought were going to be easy, it has failed. Not spectacularly. Not with an error. It fails in the particular way that AI systems fail when nobody is looking: it does four of the seven things, confidently reports success, and the other three silently do not happen.
The one-agent-to-rule-them-all dream is seductive because it matches how you would describe the task to a human employee. "Hey, can you handle the support inbox?" The human knows how to decompose that sentence into sub-tasks, how to notice when something needs more information, when to stop and ask, when to escalate, when to try again. They do all of it implicitly, without you having to specify any of it.
An LLM in a single prompt does not do this. It does something that looks like it. It generates a plausible token stream that resembles what a competent human would write if they were doing the task, and the gap between "resembles doing the task" and "does the task" is exactly where the silent failures live.
The failure mode, stated concretely
Here is what happens. You give the agent a prompt that says "read the email, check the database, update the record, send the reply," and four tools to do it with. You run it. Seven times out of ten it works. The other three times, it reads the email, decides the database check is not strictly necessary, skips it, writes a plausible reply based on what it thinks the answer probably is, and marks the task complete.
The customer gets a confidently wrong response. You find out on Twitter, if you are lucky. In a chargeback, if you are not.
This is not a prompting problem. You cannot solve it by adding "always check the database first" to the prompt. You can reduce the frequency, but you cannot eliminate it, because the same mechanism that makes the model flexible also makes it capable of skipping steps when it judges them unnecessary. The model is not broken. It is doing what you asked, which is to decide. The fix is to stop asking it to decide.
Language decisions versus control-flow decisions
Here is the rule I use, and I will defend it against anyone: the model should make language decisions, not control-flow decisions. Anywhere you can pull control flow out of the prompt and put it into code, do it. The prompt should be the smallest, most focused, most specific unit of work you can give it. "Extract these five fields from this email as JSON" is a great prompt. "Handle the customer's request" is not a prompt; it is a hope.
Break the task into a pipeline of smaller agents. Each has one job. Each has a tightly scoped prompt. Each hands off deterministically. Agent one reads the email and extracts structured fields, and that is all it does. Agent two queries the database and returns the result. Agent three drafts a reply from that result. Agent four checks the draft against a set of rules. And then a deterministic function, not a model, decides whether to send, escalate, or retry.
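That decomposition fits in a page of code. The sketch below uses deterministic stubs in place of real model calls (every function body and field name here is illustrative, not a real API) so the shape of the wiring is visible: each agent does one narrow job, and the send-or-escalate decision is an if-statement, not a model.

```python
# Illustrative stand-ins: in a real system each "agent" is one narrow LLM call.
# Here they are deterministic stubs so the wiring itself can be run end to end.

def extract_fields(email: str) -> dict:
    # Agent 1: language decision only -- prose in, structured fields out.
    # Real version: one tightly scoped prompt ("Extract these fields as JSON").
    return {"customer_id": "c-42", "intent": "refund_status"}

def lookup_record(db: dict, customer_id: str) -> dict:
    # Plain code, not a model. The database check can no longer be "skipped".
    return db[customer_id]

def draft_reply(record: dict) -> str:
    # Agent 3: language decision only -- structured record in, prose out.
    return f"Your refund is {record['refund_state']}."

def check_draft(draft: str, record: dict) -> bool:
    # Agent 4: rule check against the record the reply is supposed to reflect.
    return record["refund_state"] in draft

def handle_email(email: str, db: dict) -> str:
    fields = extract_fields(email)                     # agent 1
    record = lookup_record(db, fields["customer_id"])  # deterministic step
    draft = draft_reply(record)                        # agent 3
    if check_draft(draft, record):                     # decision made in code
        return f"SEND: {draft}"
    return "ESCALATE"

db = {"c-42": {"refund_state": "approved"}}
print(handle_email("Where is my refund?", db))  # SEND: Your refund is approved.
```

Notice what the model never gets to decide: whether the lookup happens, and whether the draft goes out. Those live in `handle_email`, where you can read them, test them, and step through them in a debugger.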
Yes, this is more code. Yes, it is less elegant. Yes, it feels like you are fighting the frontier. You are not fighting the frontier; you are using it correctly.
There is a version of this argument that ends in "so do not use agents, use functions." That version is wrong. Agents are real, valuable, and the right answer for lots of problems, especially the ones where the input is unstructured natural language and the output requires judgment. My argument ends differently: use many small agents, wire them together with code, and do not let any single model call carry more judgment than it can reliably deliver.
How do you know how much judgment a single call can reliably deliver? You measure. Run the same prompt a hundred times on real inputs and count how often it produces the same structured output. Below 95%, the prompt is doing too much; split it. Above 95%, you have found the right unit of work, and you can trust it within its narrow scope, the way you trust any other function in your codebase.
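The measurement is a small harness, not a framework. Here is one way to sketch it, assuming you wrap each prompt in a `run_once(input) -> str` callable that returns the model's structured output serialized to a hashable string (the function name and scoring choice are mine, not a standard):

```python
from collections import Counter

def structural_consistency(run_once, inputs, trials=100):
    """For each real input, run the same prompt `trials` times and measure how
    often the modal (most common) output appears. Outputs must be hashable,
    e.g. canonical JSON strings. Returns the average modal share across inputs
    -- a rough per-step reliability score to compare against the 95% bar."""
    scores = []
    for x in inputs:
        outputs = [run_once(x) for _ in range(trials)]
        modal_count = Counter(outputs).most_common(1)[0][1]
        scores.append(modal_count / trials)
    return sum(scores) / len(scores)
```

A perfectly stable prompt scores 1.0; one that flips between two answers scores around 0.5 and is telling you, loudly, to split it.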
The math
Now the math. The math is the thing that makes this non-negotiable, and I have found that founders who do not believe the argument do believe the arithmetic.
Most single-agent systems I audit are trying to carry 70% reliability across seven sub-tasks. Do the multiplication. 0.7 to the seventh power is about 0.08. That is an 8% end-to-end success rate. The founder sees 70% per step and thinks the system is "mostly working." It is not. It is 8% working, and the other 92% is silent drift that nobody notices until a customer complains.
Now the same math for a pipeline of small agents at 98% each. Splitting seven fuzzy sub-tasks typically yields ten or so narrow steps, and 0.98 to the tenth power is about 0.82: an 82% end-to-end success rate, from agents that individually look only modestly more reliable than the one big agent. The difference between 8% and 82% is not a rounding error. It is the difference between a system that ships and a system that embarrasses you in public.
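If you want to see the arithmetic rather than take my word for it, it is three lines:

```python
# End-to-end success is the product of per-step success rates.
monolith = 0.7 ** 7    # one big agent: ~70% per sub-task, seven sub-tasks
pipeline = 0.98 ** 10  # ten narrow agents at ~98% each
print(f"{monolith:.2f} vs {pipeline:.2f}")  # 0.08 vs 0.82
```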
"A chain is only as strong as its weakest link." An English proverb, rediscovered monthly by every AI engineer.
The chain does not just break at the weakest link. It multiplies the weaknesses of every link along the way, and that multiplication is what makes single-agent systems collapse at scale.
Actionable conclusions
Audit your system for control flow buried in prompts. Every time you see a sentence like "decide whether to…" or "if X, then…" inside a prompt, that is a control-flow decision the LLM is being asked to make. Move it to code, even if the code is an if-statement. The model will be more reliable for having less to do.
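To make the audit concrete, here is a sketch of the before-and-after for one such sentence. The thresholds and field names are invented for illustration; the point is that the model's only job becomes producing `fields`, and the branching lives where you can test it:

```python
# Before: control flow buried in the prompt.
#   "Read the email and decide whether to escalate. If the order value is
#    over $500 or the customer sounds angry, loop in a human; otherwise
#    reply directly."
#
# After: a narrow extraction agent produces structured fields, and an
# if-statement makes the decision.

def route(fields: dict) -> str:
    # `fields` comes from the extraction agent, e.g.
    # {"sentiment": "angry", "order_value": 620.0}
    if fields["order_value"] > 500 or fields["sentiment"] == "angry":
        return "escalate"
    return "reply"

print(route({"sentiment": "neutral", "order_value": 80.0}))  # reply
print(route({"sentiment": "angry", "order_value": 80.0}))    # escalate
```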
Measure per-step reliability, not end-to-end accuracy. Run every prompt a hundred times on real inputs and record the output distribution. Anything below 95% structural consistency is a candidate for splitting; anything below 80% is a candidate for rebuilding entirely.
Wire sub-agents with deterministic code, not another agent. The temptation is to have a "planner" agent that decides which sub-agent runs next. Resist. A planner is just another big agent in disguise, and it will have all the same failure modes as the monolith you are trying to escape. Use a state machine, or a DAG, or a simple for-loop with if-statements. Boring wins.
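A state machine for this sounds heavier than it is. One minimal shape, with stubbed sub-agents standing in for real model calls (all step names and results here are illustrative): a dict of steps, a dict of transition functions, and a while-loop. The transition table, not a planner agent, picks what runs next.

```python
def run_pipeline(steps, next_state, ctx, start):
    """steps:      state name -> callable(ctx) -> result
       next_state: state name -> callable(result) -> next state name
       Runs until a transition returns "done". All routing is deterministic."""
    state = start
    while state != "done":
        result = steps[state](ctx)
        state = next_state[state](result)
    return ctx

# Stubbed wiring for the support pipeline. dict.update returns None,
# so `... or "ok"` yields a simple success marker.
steps = {
    "extract": lambda ctx: ctx.update(fields={"id": "c-42"}) or "ok",
    "lookup":  lambda ctx: ctx.update(record={"state": "approved"}) or "ok",
    "draft":   lambda ctx: ctx.update(draft="Refund approved.") or "ok",
    "check":   lambda ctx: "pass" if "approved" in ctx["draft"] else "fail",
}
next_state = {
    "extract": lambda r: "lookup",
    "lookup":  lambda r: "draft",
    "draft":   lambda r: "check",
    "check":   lambda r: "done" if r == "pass" else "draft",  # retry on failure
}

ctx = {}
run_pipeline(steps, next_state, ctx, "extract")
print(ctx["draft"])  # Refund approved.
```

In a real system you would cap the retry loop with a counter in `ctx`, but the shape is the same: every possible path through the pipeline is readable in the transition table.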
Log every sub-agent's input and output separately, so you can find out which link in the chain broke. If all you have is "the whole pipeline failed," you are back in the dark. Per-step logging turns a five-hour debugging session into a five-minute one.
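Per-step logging is cheap to retrofit with a decorator. A sketch, assuming each sub-agent takes a trace id and a payload (the decorator name and log shape are mine; swap the `print` for your log store):

```python
import functools
import json

def traced(name):
    """Wrap a sub-agent so its input and output are logged as one JSON line
    per step, keyed by trace id. When a run fails, the logs show exactly
    which link in the chain broke and what it saw."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(trace_id, payload):
            record = {"trace": trace_id, "step": name, "in": payload}
            try:
                record["out"] = fn(payload)
                return record["out"]
            finally:
                # Logs even on exceptions: a record with no "out" key marks
                # the step that blew up.
                print(json.dumps(record, default=str))
        return wrapper
    return decorate

@traced("extract")
def extract(email):
    return {"customer_id": "c-42"}

extract("trace-001", "Where is my refund?")
```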
So when a founder tells me they want "one agent that does the whole thing," I tell them what I am telling you. That is not the goal; it is the bug. The goal is a pipeline of small, boring, reliable agents that each do one thing, wired together with code you can actually debug and a trace dashboard you can actually read. Split your agent. Your success rate will thank you. Your customers will too, even if they never know why.