What I Learned Optimising a Multi-Turn AI Agent (the Hard Way)
Lessons from building multi-turn AI agents across Zapier Agents, Notion Custom Agents, and Relevance AI — covering context window management, system prompt bloat, sub-agent trade-offs, and context rot.
Over the course of 2026, I've been building multi-turn AI agents for clients and for my own use across three platforms: Zapier Agents, Notion Custom Agents, and Relevance AI. While each platform has its own nuances, much of what I've learned is consistent across all of them: the way context windows actually work — and what we anthropomorphise as "memory" — has profound implications for cost, performance, and accuracy. But once you understand that, you can do something about it.
This feels especially relevant right now. Notion recently launched Custom Agents, and for a lot of users it's their first real experience building agentic workflows and bumping up against the cost and performance trade-offs that come with them. It's a fantastic product — I've been building with it extensively — but with the introduction of usage-based pricing, many users are watching Notion AI credits pile up without fully understanding why they scale the way they do. (Until May 4, these credits are free, but they’ll cost $10 per 1,000 credits after that).
Lots of Notion users are looking at this screen in horror before credits stop being free in May
The agents I've been building across these platforms orchestrate multi-step workflows, like pulling data from CRMs, retrieving strategy documents, walking users through confirmation steps, and producing structured outputs. And regardless of the platform, I kept running into the same walls:
Ballooning token bills
Degrading output quality
Agents that gradually stopped following their own instructions
Multi-agent workflows where one agent "forgot" what the other one was working on
If any of that sounds familiar, here's what I've learned the hard way, so you (hopefully) don't have to.
The "memory" illusion
This is super important: the conversational "memory" of an LLM is an illusion.
Every time you send a message, the model isn't continuing a conversation. What it’s actually doing is starting an entirely new conversation, with the full history of your prior exchanges loaded into the context window.
To us humans, it feels like a natural back-and-forth conversation. It’s a brilliantly designed illusion, but it has important implications for cost and performance. The incremental token cost of each successive message isn't just the new message itself; it's the entire conversation history replayed alongside it. This means that the token cost compounds with every turn. But cost is only part of the story: as context grows, the model's ability to follow instructions, recall earlier details, and produce accurate outputs degrades too. Every token in that window competes for the model's finite attention.
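The compounding is easy to see with a toy calculation. This is not any specific vendor's API — just a sketch of the replay mechanic, with hypothetical token counts:

```python
# Toy illustration: each "turn" re-sends the entire conversation history,
# so the tokens the model processes per call grow with every exchange.
# The 200-tokens-per-message figure is an assumption for illustration.

def tokens_processed_per_turn(message_tokens):
    """Given the token size of each new message, return how many tokens
    the model actually processes on each call (full history replayed)."""
    history = 0
    per_turn = []
    for msg in message_tokens:
        history += msg            # the new message joins the history
        per_turn.append(history)  # the whole history is sent on every call
    return per_turn

# Ten turns of ~200-token messages:
turns = tokens_processed_per_turn([200] * 10)
print(turns[-1])   # the 10th call processes 2000 tokens on its own...
print(sum(turns))  # ...and 11000 tokens are processed across all ten calls
```

Ten short messages that total 2,000 tokens end up costing 11,000 tokens of processing, because the early messages are re-sent again and again.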
This is something I understood conceptually but hadn’t fully internalised before I started designing agentic workflows. In a simple chatbot, where conversations tend to be short and the model isn’t interacting with a bunch of different systems and tools, it's manageable. But in a multi-turn agent with 8–10 workflow steps, each requiring user input and tool calls, the compounding hits you on three fronts: escalating token costs, declining output quality, and increasingly unreliable instruction-following.
I've seen this play out identically across Zapier Agents, Relevance AI, and Notion Custom Agents — it's a function of the underlying technology of LLMs, regardless of the platform sitting on top.
The document-in-context trap
One of the Relevance AI agents I built needed to retrieve a strategy document early in its workflow, then reference it across several subsequent steps. The original implementation fetched the full document via an API call and loaded it into the agent's context window.
The problem: that document — typically 8,000 to 10,000 tokens — didn't just occupy context once, as I'd initially assumed. Because every turn replays the full conversation history, those tokens were re-sent on every single subsequent interaction. Across 6–10 remaining workflow steps, that's 50,000–100,000 tokens of redundant context, all from a single document fetch, when all I actually needed from it was a few bullet points.
Once someone on the Relevance team pointed this out, the fix was straightforward: I built an intermediate extraction tool that pulls only the specific data the agent actually needs from the document (roughly 500–600 tokens), and returns only that to the agent. The raw document never enters the agent's context window. This single change was the largest cost reduction of the entire optimisation effort.
The broader principle: any time a tool returns data into an agent's context, that data persists for the life of the conversation. Design your tools to return the minimum viable information, not the full payload.
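Here's a minimal sketch of that extraction-layer pattern. Everything in it is hypothetical — `fetch_document` stands in for the real API call, and in practice the extraction step might itself be simple parsing or a cheap one-shot LLM call. The point is the shape: the raw payload stays on the tool side, and only the distilled fields reach the agent.

```python
# Sketch of an extraction layer, assuming a hypothetical fetch_document()
# that returns the full 8-10k-token strategy document as a raw payload.

def fetch_document(doc_id):
    # Stand-in for the real API call; returns a large raw document.
    return {
        "id": doc_id,
        "body": "…thousands of tokens of strategy text…",
        "key_points": [
            "Target segment: mid-market SaaS",
            "Tone: direct, no jargon",
            "CTA: book a discovery call",
        ],
    }

def extract_for_agent(doc_id):
    """The tool the agent actually calls. Returns only the few hundred
    tokens the workflow needs; the raw body never enters agent context."""
    doc = fetch_document(doc_id)
    return {"doc_id": doc["id"], "key_points": doc["key_points"]}

summary = extract_for_agent("strategy-doc")
```

Because `summary` is all the agent ever sees, it's also all that gets replayed on every subsequent turn.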
Why bloated system prompts cost more than you think
Another agent’s system prompt had grown to 644 lines over several iterations of feature additions. It had accumulated long-winded confirmation scripts, convoluted fallback logic for retrieving files, debugging notes that were never cleaned up, and dozens of all-caps warning labels (CRITICAL, MANDATORY, WAIT STATE) that all boiled down to the same instruction repeated in different ways.
This applies everywhere. Zapier Agent instructions, Notion Custom Agent instruction pages, Relevance AI system prompts — they all load on every single turn, just like conversation history. A 644-line system prompt doesn't just cost tokens once; it costs tokens on every interaction, compounding in exactly the same way as the conversation history.
With help from a separate AI agent that I spun up, I was able to cut the system prompt to 308 lines (basically in half) without any noticeable impact on output quality.
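A back-of-envelope calculation shows why this matters more than a one-time trim would suggest. The ~15 tokens-per-line figure below is an assumption, not a measurement, but the shape of the saving holds regardless:

```python
# Rough cost of a system prompt that reloads on every turn.
# TOKENS_PER_LINE is an assumed average, not a measured value.

TOKENS_PER_LINE = 15

def prompt_tokens_over_conversation(lines, turns):
    # The system prompt is part of every call, so multiply by turn count.
    return lines * TOKENS_PER_LINE * turns

before = prompt_tokens_over_conversation(644, turns=10)
after = prompt_tokens_over_conversation(308, turns=10)
print(before - after)  # 50400 tokens saved in a single 10-turn session
```

Halving the prompt doesn't save its token count once; it saves it on every turn of every conversation the agent ever has.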
Not every agent should be a sub-agent
On another build, I tried to incorporate a multi-turn workflow agent as a sub-agent within a larger routing agent — a "master dispatcher" that would receive user requests and route them to the appropriate specialist.
The idea made architectural sense on paper: one front door, so that end users didn’t have to remember which agent to “talk” to for which tasks. Just talk to one main agent, and let it figure out which sub-agents to invite to the party. Eliminating this friction is really valuable, and one of the reasons Claude skills are a vast improvement over Custom GPTs in ChatGPT.
For simple, one-shot tasks (data lookups, status checks, record creation), this construct works well. The router dispatches, the sub-agent executes, and the result comes back in a single round-trip.
But for a multi-turn agent with its own prescribed sequence of steps, the pattern broke down in two ways.
First, the routing agent had no visibility into the sub-agent's internal workflow state. It could relay user input and pass back responses, but it couldn't “know” that the sub-agent’s next step was going to be an assumptions confirmation before generating an outline, or that the sub-agent was going to need to ask some follow-up questions before moving on. The result was the router skipping steps, jumping ahead in the sequence, and producing lower-quality outputs.
Second, the token cost doubled. With stateful exchanges enabled between the router and the sub-agent, both agents carried full conversation history in parallel context windows. Every turn was processed twice: once by the router, once by the sub-agent. This is fundamentally different from one-way handoff agents where the overhead is a single round-trip.
If we'd tried to fix the sequencing problem by embedding the sub-agent's full instructions into the router, we'd have effectively merged the two agents, defeating the purpose of the separation and creating an unwieldy mega-prompt that would accelerate context rot.
The solution was to keep the routing agent as a dispatcher for fire-and-forget tasks, and route users directly to the multi-turn agent for complex workflows. It means users need to know which agent to go to, but that's a lighter onboarding problem than persistent performance and cost issues. I suspect that eventually, multi-agent workflows will be largely or entirely replaced by skills, which is something that Relevance and Notion are already working on, but where Zapier Agents is currently lagging behind.
Context rot is real and it's not just about cost
The concept of "context rot" — where LLM performance degrades as more tokens accumulate in the context window — is well-documented but easy to underestimate in practice. It's not just that longer contexts cost more. The model's ability to accurately recall and follow instructions measurably decreases as the context grows.
Anthropic's research describes context as a finite resource with diminishing returns: every new token depletes an "attention budget." This has practical implications for agent design. Redundant emphasis markers in your system prompt don't just waste tokens; they actively compete for the model's attention with the instructions that actually matter, diluting its focus on the user's actual request.
The transformer architecture processes context through an attention mechanism where every token attends to every other token — an n² relationship. As context grows, the model's ability to maintain these relationships gets stretched thin. This isn't a theoretical concern; I've observed it directly across all three platforms in the form of skipped workflow steps and looser instruction-following as conversations grew longer.
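The quadratic relationship is worth making concrete, since it's the reason context bloat degrades quality faster than intuition suggests:

```python
# Why attention cost scales quadratically: in a standard transformer,
# every token attends to every other token, so the number of pairwise
# interactions grows with the square of the context length.

def attention_pairs(n_tokens):
    return n_tokens * n_tokens

print(attention_pairs(1_000))   # 1000000 pairwise interactions
print(attention_pairs(10_000))  # 100000000 -- 100x the work for 10x the tokens
```

A context that grows 10x doesn't make the model's job 10x harder; it makes it roughly 100x harder, which is why instruction-following frays long before you hit the context limit.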
After a year of building across these platforms, I've distilled a few principles that I apply to every agent build.
Treat every token in the context window as recurring cost, not one-time cost. System prompts, tool outputs, and conversation history all compound across turns. Design accordingly.
Build extraction layers between your data sources and your agents. Never let a raw API response or full document enter an agent's context window if you can extract just the relevant data first.
Keep system prompts lean and audit them regularly. Feature additions accumulate. Debugging artefacts linger. Verbose templates that "make things clearer" actually make them more expensive and, counterintuitively, less clear.
Match the agent pattern to the interaction pattern. Routing agents work well for dispatching one-shot tasks. Multi-turn workflow agents with prescribed sequences should operate standalone, not as sub-agents behind a router.
Respect the attention budget. Context rot means that the cost of a bloated context window is both financial and functional. A leaner context produces better outputs.
And finally: if you find all of this overwhelming, you could always just bring in work.flowers to build it for you.