The demo works.
An AI agent reads an inbound lead, researches the company, drafts a personalized email, sends it, and queues a follow-up if nobody replies. The CTO shows it internally. People get excited. Budget gets discussed. Someone says, "Let's get this live in 6 to 8 weeks."
Then the project slows down.
Not because the team is weak. Not because the model suddenly got worse. And usually not because the prototype was fake in a malicious sense.
It slows down because the prototype answered the easy question: can the model produce a useful output on a clean happy path? Production asks a much harder question: can the system do it repeatedly, safely, with state, auditability, retries, and real-world messiness when nobody is watching?
That gap is where most agent projects actually die.
I know this pattern because I have lived both sides of it. I spent 7 years at Oracle working on Eloqua, where the hard part was never just generating output. The hard part was always state, scheduling, reliability, bad inputs, and operational correctness at scale. I now run CoEdify, where we build agentic systems for real business workflows. The same lesson keeps repeating: the model is only one layer. The system around the model is the product.
This is also where a lot of current agent discourse is still too shallow. Even the platform guidance is moving toward runs, guardrails, orchestration, and durable execution as first-class concerns. That is not accidental. Once an agent can call tools and wait on real-world events, prompt quality stops being the bottleneck. Systems design becomes the bottleneck.
What the prototype hides
Here is what a typical agentic prototype looks like. Say it is an SDR agent:
Agent receives lead -> researches lead -> drafts email -> sends email
You run it on 10 hand-picked leads. It works. The emails are good. The follow-ups land. You demo it. Everyone is impressed.
Here is what the prototype is quietly assuming:
- The input is clean. In production, some leads have missing fields, wrong emails, stale data, or conflicting records.
- The tool calls behave. In production, APIs time out, rate-limit, return partial results, or silently drift.
- The workflow is fresh. In production, that same person may already be in another campaign, another stage, or another thread.
- The conversation is linear. In production, real threads fork. People reply late, forward emails, loop in teammates, or answer on a different channel.
- The run is one-shot. In production, the system has to remember what happened yesterday and decide what to do next week.
- A human is watching. In production, failures surface after the fact, often through customer confusion rather than clean error messages.
The prototype does not handle any of this because it was never designed to. A prototype proves the happy path. Production is everything that happens around the happy path, and that is where most of the engineering lives.
The six things that kill agentic systems in production
After building and shipping agentic systems, we see the same failure modes repeatedly. Different use case, same pattern.
1. State amnesia
The prototype does not remember anything between runs. It processes each lead in isolation. In production, that is catastrophic.
An email to a lead is drafted on Monday. A different agent thread picks up the same lead on Tuesday and drafts a completely different email. Two emails go out. The lead is confused. You look unprofessional.
Or worse: a lead replies "not interested." The agent does not register the reply because it has no memory. It sends a follow-up. Then another. You have just spammed someone who already said no.
What production requires: A state management layer that records what happened, when, and what the outcome was. Before the agent acts, it reads the state. After it acts, it writes the state. Every interaction is an append-only log.
If you do not solve this first, every other improvement is fragile.
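To make the read-before-act, write-after-act pattern concrete, here is a minimal sketch. It uses SQLite as a stand-in for whatever state store you run in production, and the table, event names, and helper functions are illustrative, not our actual implementation:

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative append-only interaction log; schema and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE interactions (
        lead_id TEXT NOT NULL,
        event   TEXT NOT NULL,   -- e.g. 'drafted', 'sent', 'replied'
        detail  TEXT,
        at      TEXT NOT NULL    -- ISO-8601 timestamp
    )
""")

def record(lead_id: str, event: str, detail: str = "") -> None:
    """Write state after acting: append one event, never update or delete."""
    conn.execute(
        "INSERT INTO interactions VALUES (?, ?, ?, ?)",
        (lead_id, event, detail, datetime.now(timezone.utc).isoformat()),
    )

def should_contact(lead_id: str) -> bool:
    """Read state before acting: never follow up after a reply or opt-out."""
    rows = conn.execute(
        "SELECT event FROM interactions WHERE lead_id = ? ORDER BY at",
        (lead_id,),
    ).fetchall()
    events = {e for (e,) in rows}
    return "replied" not in events and "opted_out" not in events

record("L-4521", "drafted")
record("L-4521", "sent")
record("L-4521", "replied", "not interested")
# The Tuesday run reads the log first and skips this lead.
```

Because the log is append-only, you also get an audit trail for free: the same rows that stop the duplicate follow-up answer "what exactly did the agent do?"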
2. The model returns garbage and nobody notices
In the prototype, the model always returns a well-formed email. In production, sometimes it returns:
- An email addressed to the wrong person
- A 2000-word essay instead of a short note
- A message that hallucinates a product feature you do not have
- An empty string
- A JSON object instead of plain text
The prototype does not validate outputs. It assumes the model will be reasonable. Production cannot afford that assumption. You need structured validation on every model output before it reaches a customer, a candidate, or a downstream system.
from pydantic import BaseModel, field_validator

class AgentOutput(BaseModel):
    subject: str
    body: str
    cta: str | None = None

    @field_validator("body")
    @classmethod
    def body_must_be_reasonable(cls, v: str) -> str:
        if len(v) < 20:
            raise ValueError("Body too short or empty")
        if len(v) > 500:
            raise ValueError(f"Body exceeds 500 chars (got {len(v)})")
        if "we offer" in v.lower():
            raise ValueError("Output contains unverified product claim")
        return v

    @field_validator("subject")
    @classmethod
    def subject_must_exist(cls, v: str) -> str:
        if not v or len(v) < 5:
            raise ValueError("Subject missing or too short")
        return v
This is not prompt engineering. This is production engineering. You do not solve it by asking the model more politely. You solve it by validating the output against a contract, just like any other untrusted external response.
3. Context window starvation
The prototype sends the model a clean, curated context: the lead's name, company, and a short prompt. The model handles it beautifully.
Production reality looks different. The agent may need the interaction history, current stage, prior replies, research notes, workflow rules, permission scope, and brand constraints. Teams dump all of it into the context window and call that "memory." It is not memory. It is a context dump.
We saw this exact problem in our own systems. The output was becoming generic for the most important contacts. The issue was not the model. The issue was context assembly. We were sending too much and structuring too little, so the model focused on nothing.
The fix is not a bigger context window. It is smarter context assembly.
We built a three-layer context model for Cortex, our memory service:
Layer 1: Research profile -> Written once, read often
Layer 2: Workflow context -> Mutable, compressed
Layer 3: Cross-workflow memory -> Accumulated insights
Before the agent acts, it gets one API call that assembles the right slice from all three layers. Not everything. The relevant slice.
import httpx

def get_agent_context(tenant_id: str, contact_id: str, workflow_id: str) -> dict:
    """One call, three layers, curated slice."""
    resp = httpx.get(
        "https://cortex.api/v1/context",
        params={
            "tenant_id": tenant_id,
            "contact_id": contact_id,
            "workflow_id": workflow_id,
            "layers": "profile,workflow,memory",
        },
    )
    resp.raise_for_status()
    return resp.json()

# Returns a compact, high-signal context bundle
# instead of a raw dump of everything we know
4. No one knows when it breaks
The prototype breaks, and you see it break. You are watching. Production breaks, and nobody notices until a customer complains - or worse, until you realize the agent has been sending broken emails for days.
This is the observability gap. Most agentic prototypes have no monitoring, no alerting, and no structured trace of the decision path. When something goes wrong in production, the first serious question is always the same: "What exactly did the agent do?" If you cannot answer that quickly, you do not yet have an operational system.
What production requires:
[2026-04-14T02:31:07Z] INFO agent.run.start lead_id=L-4521 workflow=outbound_v2 stage=drafting
[2026-04-14T02:31:08Z] INFO cortex.context.read lead_id=L-4521 layers=[profile,workflow,memory] tokens=847
[2026-04-14T02:31:12Z] INFO model.call.complete model=claude-3.5 tokens_in=1247 tokens_out=312 latency_ms=4100
[2026-04-14T02:31:12Z] WARN output.validation lead_id=L-4521 issue=body_exceeds_max_length chars=687
[2026-04-14T02:31:12Z] INFO output.truncated lead_id=L-4521 original=687 truncated_to=450
[2026-04-14T02:31:13Z] INFO cortex.state.write lead_id=L-4521 stage=drafted->queued_for_review
[2026-04-14T02:31:13Z] INFO agent.run.complete lead_id=L-4521 duration_ms=6100 status=success
Every run should leave a trail: when it started, what context it read, what the model returned, whether validation passed, what state changed, and how long it took. When something breaks, you reconstruct the run instead of guessing.
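You do not need a heavy observability stack to start. A trail like the one above can be emitted with nothing but the standard library; the field names mirror the excerpt, but the helper itself is an illustrative sketch, not our implementation:

```python
import logging
import time

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)
log = logging.getLogger("agent")

def fmt_event(name: str, **fields) -> str:
    """Format one event as 'name key=value key=value' for grep-able logs."""
    return name + " " + " ".join(f"{k}={v}" for k, v in fields.items())

def event(name: str, **fields) -> None:
    """Emit one structured log line per agent decision point."""
    log.info(fmt_event(name, **fields))

# Usage inside a run:
start = time.monotonic()
event("agent.run.start", lead_id="L-4521", workflow="outbound_v2",
      stage="drafting")
# ... context read, model call, validation, state write ...
event("agent.run.complete", lead_id="L-4521",
      duration_ms=int((time.monotonic() - start) * 1000), status="success")
```

The key=value convention matters more than the tooling: once every line is machine-parseable, you can filter a week of runs down to one lead's history in seconds.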
5. The agent that runs forever
Prototypes are one-shot. You run them, they produce output, you stop. In production, agents are long-running. They wait for replies. They send follow-ups after delays. They retry on failure.
This means you need durable execution. If the server restarts, the agent needs to pick up where it left off, not re-process the workflow from the beginning. If a timer is set for "follow up in 3 days," that timer has to survive deploys, crashes, and outages.
We use Temporal for this. Every agent workflow is a durable function. Timers, retries, and state persistence are handled by the execution layer, not buried inside agent code.
import asyncio
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class OutboundWorkflow:
    @workflow.run
    async def run(self, contact_id: str) -> None:
        # Step 1: Draft
        draft = await workflow.execute_activity(
            draft_message,
            contact_id,
            start_to_close_timeout=timedelta(seconds=30),
        )

        # Step 2: Wait for human review
        approval = await workflow.execute_activity(
            wait_for_approval,
            contact_id,
            start_to_close_timeout=timedelta(minutes=30),
        )
        if approval == "REJECTED":
            await workflow.execute_activity(
                log_rejection,
                contact_id,
                start_to_close_timeout=timedelta(seconds=10),
            )
            return

        # Step 3: Send (multi-argument activities take args=[...])
        await workflow.execute_activity(
            send_message,
            args=[contact_id, draft],
            start_to_close_timeout=timedelta(seconds=15),
        )

        # Step 4: Wait 3 days for a reply (a durable Temporal timer)
        await asyncio.sleep(timedelta(days=3).total_seconds())

        # Step 5: Check for reply, follow up if needed
        has_reply = await workflow.execute_activity(
            check_reply,
            contact_id,
            start_to_close_timeout=timedelta(seconds=10),
        )
        if not has_reply:
            followup = await workflow.execute_activity(
                draft_followup,
                contact_id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            await workflow.execute_activity(
                send_message,
                args=[contact_id, followup],
                start_to_close_timeout=timedelta(seconds=15),
            )
If the server crashes at step 3, the workflow resumes from the last completed step. The lead does not get a duplicate email. The agent does not lose its place.
Prototypes do not need this. Production cannot survive without it.
6. Multi-tenant isolation
The prototype runs for one user. One set of leads. One brand voice. One workflow.
Production runs for every customer. Customer A's agent cannot see Customer B's data. Customer A's workflows cannot affect Customer B's pipeline state. This is not a nice-to-have. It is a hard requirement.
We enforce multi-tenant isolation at the database level using composite foreign keys. Every query is scoped to the tenant. Not just at the application layer. At the schema layer. The agent cannot accidentally cross tenant boundaries because the data model itself prevents it.
CREATE TABLE pipeline_stages (
    tenant_id  UUID NOT NULL,
    stage_id   UUID NOT NULL,
    contact_id UUID NOT NULL,
    stage_name VARCHAR(50),
    entered_at TIMESTAMP,
    PRIMARY KEY (tenant_id, stage_id),
    FOREIGN KEY (tenant_id, contact_id) REFERENCES contacts(tenant_id, id)
);
Cross-tenant data leakage is not a bug you want to find in production. It is an architecture you prevent upfront.
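You can watch a composite foreign key do its job without a full Postgres setup. The sketch below uses SQLite for portability (Postgres enforces the same constraint), with a trimmed-down version of the schema above; the table contents are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

conn.executescript("""
    CREATE TABLE contacts (
        tenant_id TEXT NOT NULL,
        id        TEXT NOT NULL,
        PRIMARY KEY (tenant_id, id)
    );
    CREATE TABLE pipeline_stages (
        tenant_id  TEXT NOT NULL,
        stage_id   TEXT NOT NULL,
        contact_id TEXT NOT NULL,
        PRIMARY KEY (tenant_id, stage_id),
        FOREIGN KEY (tenant_id, contact_id)
            REFERENCES contacts(tenant_id, id)
    );
""")

conn.execute("INSERT INTO contacts VALUES ('tenant-a', 'c-1')")

# Same tenant: the composite FK matches a contacts row, so this succeeds.
conn.execute("INSERT INTO pipeline_stages VALUES ('tenant-a', 's-1', 'c-1')")

# Cross-tenant: tenant-b has no contact c-1, so the insert is rejected
# by the schema itself, before any application code runs.
try:
    conn.execute("INSERT INTO pipeline_stages VALUES ('tenant-b', 's-2', 'c-1')")
    leaked = True
except sqlite3.IntegrityError:
    leaked = False
```

The point is that `leaked` can never be `True`: a bug in the application layer cannot produce a row that references another tenant's data.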
The production readiness checklist
Before any agentic system goes to production, we run through this checklist. Not because we like process. Because each item exists to prevent a failure mode we have already seen in practice.
- State management: Does the agent read state before acting and write state after acting?
- Output validation: Does every model output pass through a structured validator before reaching the customer?
- Context assembly: Does the agent receive a curated context slice, not a dump of all data?
- Observability: Can you reconstruct any agent decision from structured logs?
- Durable execution: Does the agent survive server restarts without duplicating work?
- Multi-tenant isolation: Is tenant scoping enforced at the database level?
- Error handling: Does the agent have a defined behavior for API timeouts, empty responses, and malformed data?
- Human-in-the-loop: Is there a review gate before customer-facing actions?
- Rate limiting: Does the agent respect API rate limits and model token limits?
- Idempotency: If the workflow runs twice on the same input, does it avoid duplicating external side effects?
If several of these are still unresolved, you do not have a production system yet. You have a promising demo.
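The idempotency item deserves a concrete sketch, because it is the one teams most often hand-wave. One common pattern is to derive a deterministic key from the workflow, contact, and step, and claim it before performing the side effect. Names here are illustrative, and the in-memory set stands in for what should be a unique-keyed database insert:

```python
import hashlib

_sent: set[str] = set()  # stand-in for a DB table with a UNIQUE constraint

def idempotency_key(workflow_id: str, contact_id: str, step: str) -> str:
    """Deterministic key: the same input yields the same key on every run."""
    raw = f"{workflow_id}:{contact_id}:{step}"
    return hashlib.sha256(raw.encode()).hexdigest()

def send_once(workflow_id: str, contact_id: str, step: str, send) -> bool:
    """Perform the side effect only if this key was never claimed before.

    Returns True if the send actually happened on this call.
    """
    key = idempotency_key(workflow_id, contact_id, step)
    if key in _sent:  # in production: an INSERT that fails on duplicates
        return False
    _sent.add(key)
    send()
    return True

sent_emails = []
first = send_once("outbound_v2", "L-4521", "initial_email",
                  lambda: sent_emails.append("email"))
second = send_once("outbound_v2", "L-4521", "initial_email",
                   lambda: sent_emails.append("email"))
# first is True, second is False; only one email goes out.
```

Combined with durable execution, this is what lets a retried or replayed workflow touch the outside world exactly once.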
The real cost of the prototype lie
The most expensive thing about the prototype gap is not re-engineering. It is the trust gap.
The team demoed something that worked. Leadership expected it to ship. When it does not ship - when month after month passes with "we are still working on edge cases" - leadership stops believing the team can deliver. The AI initiative gets labeled experimental. Budget gets cut. Momentum dies.
The problem is building the prototype as if production were a deployment step instead of an architectural requirement.
Production is not something you bolt on after the demo. It changes how you design state, context, workflow execution, permissions, review gates, and observability from day one.
The teams that move faster in the long run are usually not the teams with the flashiest demo. They are the teams that treat the agent as one component inside a disciplined system.
The prototype proves the model can do the task once. The architecture proves the system can do the task repeatedly, safely, and with accountability. That is the difference that matters.
At CoEdify, we build agentic systems for real business workflows. That means state, context, validation, durability, and operational guardrails are part of the system from the start, not phase two. [coedify.com]