Most RAG systems do well in demos and disappoint in live workflows.
The demo question is usually clean:
"What does the documentation say about X?"
The production question is not:
"What do we know about this account, what policy applies, what changed last week, and what should we do next?"
That is where many teams get stuck.
They built a document retrieval system, then expected it to behave like a working memory layer.
Those are not the same thing.
At CoEdify, we ran into this across different systems. The pattern became obvious very quickly: flat RAG works for reference material, but it breaks once the agent also needs live context and accumulated history.
Where RAG actually works
RAG is useful when the question is fundamentally about reference material:
- product documentation
- policy pages
- process guides
- specs
- contracts
In those cases, semantic retrieval is a good fit.
You have a corpus of relatively stable text. The user asks a question. The system retrieves the most relevant passages. The model answers from those passages.
That is a valid pattern.
One place where this works well for us is DevSko.
In DevSko, a job description can be used to retrieve relevant skill areas and role-specific question patterns for the assessment flow. That is a good RAG use case because the system is matching a JD against a curated reference layer of skills, question types, and role expectations.
The retrieval problem there is:
- what skills matter for this role
- which questions best test those skills
- which assessment pattern fits this job family
That is reference retrieval.
It is not the same as asking:
- how this candidate performed three steps ago
- what signals already appeared in the interview
- what follow-up should happen next in this assessment
Those are state and workflow questions, not RAG questions.
The mistake is assuming the same pattern should also handle:
- current workflow state
- recent interactions
- account-specific context
- evolving relationship history
- decisions that were made outside the document corpus
That is where disappointment begins.
Why production RAG feels unreliable
The issue is usually not that the embedding model is weak or the vector database is bad.
The issue is that teams put different kinds of knowledge into one retrieval path and expect the model to sort it out.
In practice, the failures usually look like this.
1. The system retrieves the right topic but the wrong version
A query about refund policy, pricing terms, or approval rules may return an older document because it is semantically similar and richly worded.
The retrieval engine sees similarity.
It does not automatically understand which version is current unless you explicitly model freshness and status.
2. The system retrieves general knowledge when the question is specific
The user asks about one account, one workflow, or one open issue.
The system returns a broad document because that is what exists in the vector index.
That answer may be relevant in theme and still wrong in practice.
3. The system misses the most important knowledge entirely
A lot of operational truth does not live in documents.
It lives in:
- interaction logs
- stage changes
- call summaries
- decisions captured in notes
- approvals recorded in workflow state
If those facts are not in the retrieval design, the model answers from incomplete evidence.
4. The answer ignores what the system already learned earlier
The last call changed the direction of the account.
The latest approval changed the next action.
The customer's objection pattern is already visible from previous interactions.
But the answer is generated from static chunks, not from the latest state plus memory.
That is why it sounds plausible and still feels wrong.
The core mistake: flattening knowledge
Most disappointing RAG systems have one architectural habit in common:
they flatten everything into a vector store.
That means:
- documents
- workflow state
- interaction summaries
- customer-specific context
- accumulated relationship knowledge
all end up being treated like equivalent retrieval objects.
They are not equivalent.
They have different lifecycles, different scopes, and different retrieval needs.
When you flatten them, the model gets semantically related text but not the right operational context.
That is why the answer sounds informed but still misses the situation.
The fix: separate the knowledge layers
The pattern that worked for us was not "better RAG."
It was separating the knowledge system into three layers and giving each layer the right retrieval pattern.
Layer 1: Reference knowledge
This is the classic RAG layer:
- policies
- documentation
- product specs
- standard agreements
- internal playbooks
This layer changes relatively slowly and benefits from semantic retrieval, metadata filtering, and clear versioning.
RAG belongs here.
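A minimal sketch of what freshness-aware retrieval at this layer can look like, assuming documents carry `status` metadata. The records, field names, and `retrieve_current` helper are illustrative, not a specific library's API:

```python
from datetime import date

# Hypothetical candidate set: in a real system these rows would come back
# from the vector store with metadata attached. Field names are illustrative.
CANDIDATES = [
    {"id": "refund-policy-v1", "status": "superseded",
     "updated_at": date(2023, 1, 10), "score": 0.92},
    {"id": "refund-policy-v2", "status": "current",
     "updated_at": date(2024, 6, 1), "score": 0.88},
]

def retrieve_current(candidates):
    """Filter on version status before ranking by similarity.

    Pure similarity would pick refund-policy-v1 (higher score); the
    status filter keeps only documents marked current.
    """
    fresh = [d for d in candidates if d["status"] == "current"]
    return sorted(fresh, key=lambda d: d["score"], reverse=True)

print(retrieve_current(CANDIDATES)[0]["id"])  # refund-policy-v2
```

The point is that currency is modeled explicitly in metadata, not inferred from embedding similarity.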
Layer 2: Engagement context
This is the live state around a specific entity or workflow:
- current stage
- recent interactions
- latest notes
- pending actions
- scoped account context
This is not a semantic search problem.
This is a structured retrieval problem.
If the agent knows the contact_id, workflow_id, account_id, or ticket_id, it should fetch the relevant state directly.
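A sketch of that direct fetch, assuming engagement state is keyed by IDs. The store and the `get_engagement_context` helper are hypothetical stand-ins for whatever database or workflow API holds the state:

```python
# Hypothetical in-memory engagement store keyed by exact IDs. In
# production this would be a database or workflow-engine API call.
ENGAGEMENTS = {
    ("acct-42", "wf-7"): {
        "stage": "negotiation",
        "pending_actions": ["send revised quote"],
        "last_interaction": "2024-06-03 call summary",
    },
}

def get_engagement_context(account_id, workflow_id):
    """Fetch live state by key, not by semantic search.

    The IDs identify exactly one record, so a direct lookup is cheaper
    and more reliable than embedding and ranking anything.
    """
    return ENGAGEMENTS.get((account_id, workflow_id), {})

print(get_engagement_context("acct-42", "wf-7")["stage"])  # negotiation
```

Nothing here needs a vector index: the question "what is the current stage of this workflow" has exactly one right answer, addressable by key.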
Layer 3: Accumulated memory
This is the compressed history that helps an agent act with continuity:
- how this person usually responds
- what objections keep repeating
- what constraints matter in this relationship
- what the system has learned across prior interactions
This should not be reconstructed from raw history every time.
It should be maintained as a compact memory layer that is easy to read and cheap to use.
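One way to sketch that maintenance, assuming each interaction is folded into a small bounded record rather than appended as raw history. The fields and the `update_memory` helper are illustrative assumptions:

```python
def update_memory(memory, interaction):
    """Fold one interaction summary into a compact memory record.

    Instead of storing raw transcripts, keep bounded fields the agent
    can read cheaply: repeated objections and a rolling interaction count.
    """
    objection = interaction.get("objection")
    if objection:
        counts = memory.setdefault("objection_counts", {})
        counts[objection] = counts.get(objection, 0) + 1
    memory["interactions_seen"] = memory.get("interactions_seen", 0) + 1
    return memory

mem = {}
for event in [{"objection": "price"}, {"objection": "price"}, {"objection": "timing"}]:
    mem = update_memory(mem, event)

print(mem["objection_counts"])  # {'price': 2, 'timing': 1}
```

The agent then reads one compact record per relationship instead of re-deriving patterns from the full interaction log on every request.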
What this looks like in practice
The retrieval path should match the kind of knowledge being requested.
If the agent needs policy guidance, query the reference layer.
If DevSko needs to turn a JD into a role-relevant assessment shape, query the reference layer.
If the agent needs the current state of a deal or ticket, fetch the engagement layer.
If the agent needs continuity across prior interactions, read the memory layer.
A simplified pattern looks like this:
def build_agent_context(query, contact_id, workflow_id):
    return {
        "reference": search_reference_knowledge(
            query=query,
            filters={"status": "current"},
        ),
        "engagement": cortex_get_context(
            contact_id=contact_id,
            workflow_id=workflow_id,
        ),
        "memory": cortex_get_memory(
            contact_id=contact_id,
        ),
    }
The point is not the function names.
The point is that one retrieval method should not pretend to solve all three problems.
DevSko is a good example of that boundary.
RAG can help identify which skills and question families are relevant to a given JD.
But once a candidate is in an active assessment flow, the system also needs structured state:
- which stage the candidate is in
- which answers were already given
- which skills have already been tested
- what follow-up is still missing
If all of that gets flattened into one retrieval layer, the assessment experience becomes inconsistent for the same reason most production RAG systems do.
Why this matters to CTOs
If your team is building agents for support, sales, operations, hiring, or internal workflows, flat RAG creates a reliability ceiling.
The model may look fluent.
The system will still fail in situations where:
- freshness matters
- scope matters
- state changes quickly
- prior interactions should influence the response
That is why many teams keep trying to improve chunking, embeddings, and reranking while the user experience barely improves.
They are optimizing the wrong layer.
The bigger architectural question is:
what knowledge belongs in retrieval, what belongs in state, and what belongs in memory?
Once that is clear, the rest of the design becomes much simpler.
A practical diagnostic
If your RAG pipeline keeps underperforming, check these in order.
1. Are you using RAG for a state problem?
If the question depends on live workflow state or recent interactions, start there. Do not expect vector search to recover system state cleanly.
2. Are stale and current documents mixed together?
If yes, your retrieval layer needs explicit freshness and version controls.
3. Are account-specific facts mixed with general documents?
If yes, separate scoped context from reference knowledge.
4. Does the model have access to compact memory?
If every answer has to reconstruct history from raw records, the system will be expensive and inconsistent.
5. Are you evaluating the retrieval layer, not just the final answer?
When the answer is weak, many teams blame the model first. Often the real problem is that the evidence bundle was incomplete, stale, or scoped incorrectly.
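A small sketch of auditing the evidence bundle itself, under the assumption that retrieved documents carry `updated_at` and `scope` fields. The `audit_evidence` helper and its field names are illustrative:

```python
from datetime import date

def audit_evidence(bundle, account_id, max_age_days=180):
    """Flag common evidence problems before blaming the model.

    Returns a list of issues: an empty bundle, stale documents, or
    documents scoped to a different account.
    """
    issues = []
    if not bundle:
        issues.append("empty bundle")
    for doc in bundle:
        age = (date.today() - doc["updated_at"]).days
        if age > max_age_days:
            issues.append(f"stale: {doc['id']} ({age} days old)")
        if doc.get("scope") not in (None, "global", account_id):
            issues.append(f"wrong scope: {doc['id']}")
    return issues

bundle = [
    {"id": "policy-old", "updated_at": date(2020, 1, 1), "scope": "global"},
    {"id": "notes-other", "updated_at": date.today(), "scope": "acct-99"},
]
print(audit_evidence(bundle, account_id="acct-42"))
```

Running a check like this per answer makes it obvious when the failure is in the evidence bundle rather than in the model.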
The real takeaway
RAG is not broken.
It is just being asked to do too much.
Use it for what it is good at: retrieving reference knowledge from stable text.
Do not force it to carry live workflow state and relationship memory at the same time.
If you separate those concerns, the system gets sharper very quickly:
- better answers
- cleaner retrieval
- less prompt bloat
- more reliable agent behavior
That is the lesson we took from building Cortex.
The breakthrough was not a smarter embedding model.
It was recognizing that a production knowledge system needs reference retrieval, structured context, and memory as distinct layers.
At CoEdify, we build production knowledge systems for AI workflows. RAG is one layer, not the whole architecture. The systems that work in production separate retrieval, state, and memory from the start. [coedify.com]