
RAG in Production: The Complete Engineering Guide

April 2025
22 min read

Most teams ship a RAG prototype in a weekend. A working demo, a satisfying "it answered correctly" moment, and a feeling that production is just around the corner.

Then they hit production.

Queries that worked perfectly in testing start returning wrong answers. The retrieval that felt solid on 100 documents falls apart on 50,000. Users ask questions the embedding model never saw coming. And suddenly the thing everyone called "just a wrapper around an LLM" turns out to be a distributed systems problem in disguise.

This post is the guide I wish existed when I first built a RAG system for production. Every section answers a question that actually matters — not "what is RAG" but "what breaks, why it breaks, and how the teams that got it right actually fixed it."

- 80% — of RAG prototypes that fail in production fail at the retrieval layer, not the LLM (per Morgan Stanley's own internal audit)
- ~12x — ingestion-lag reduction Notion achieved, from 24+ hrs down to ~2 hrs, using Kafka + Hudi CDC pipelines
- 4 — distinct failure modes: retrieval recall, precision, grounding, and staleness — each needs a separate fix
- $0 — extra model cost for hybrid BM25+vector search — yet it's the single highest-ROI retrieval upgrade you can make

The Two Pipelines Every RAG System Has

Here is the first thing most tutorials skip: a production RAG system is not one pipeline. It is two completely separate workflows with different latency requirements, different failure modes, and different optimization strategies.

The offline indexing pipeline runs asynchronously. Its job is to take raw data — PDFs, HTML, database rows, API responses — and turn it into a searchable vector index. Latency here does not matter. Correctness does.

The online query pipeline runs synchronously on every user request. Its job is to take a query, find the most relevant context, and hand it to an LLM. Latency here matters enormously. Every extra step you add — reranking, HyDE, query expansion — adds milliseconds that compound at scale.

If you confuse the two, you make terrible tradeoffs. You optimize the wrong thing, add latency to the wrong step, and debug the wrong layer when something breaks.

Offline Workflow — Indexing Pipeline (runs async; latency doesn't matter)
1. Ingest — PDFs, HTML, code, DB rows, APIs
2. Parse & Clean — preserve tables, headers, lists; don't flatten
3. Chunk — structure-aware: respect semantic boundaries
4. Enrich — prepend section title + doc metadata to every chunk
5. Embed — same model you'll use at query time, always
6. Store — vector DB + BM25 index in parallel

Online Workflow — Query Pipeline (latency-critical; every ms counts)
1. Query In — raw user query arrives
2. Query Rewrite — HyDE / multi-query expansion (optional)
3. Hybrid Retrieve — dense vector + BM25 sparse in parallel
4. Rerank — cross-encoder reranker over top-K candidates
5. Context Pack — deduplicate, truncate, inject metadata
6. Generate — LLM with citation-grounded system prompt

💡 Key Insight: Treat the indexing pipeline and the query pipeline as separate services from day one. Different SLAs, different error handling, different monitoring. Building them as one monolith makes both harder to debug and scale.

The Chunking Problem — And Why Fixed-Size is Always Wrong

Chunking is where most RAG systems die quietly. You pick a chunk size, maybe 512 tokens, maybe 1000 characters, and everything seems fine until someone asks a question whose answer spans a chunk boundary.

Fixed-size chunking does not respect meaning. It splits mid-sentence, separates a table from its header, and severs a code example from the comment that explains it. The retriever then finds a chunk that contains half an answer and confidently passes it to the LLM.

The LLM, unable to resist, fills in the other half from parametric memory. That is how you get confident wrong answers that pass retrieval metrics and fail on every real question.

What actually works:

Structure-aware chunking respects the document hierarchy. If a PDF has sections with H2 headings, chunk at the section boundary. If it is a code repository, chunk at function and class boundaries — not at 512-character intervals. Cursor does exactly this: code is chunked by function, class, and file, and every chunk is annotated with its file path and parent scope.
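A minimal sketch of structure-aware chunking for markdown-style documents — split at heading boundaries and carry the document title and section heading into every chunk. The function name and chunk schema are illustrative, not from any particular library:

```python
import re

def chunk_by_headings(doc_title: str, text: str) -> list[dict]:
    """Split a markdown-style document at heading boundaries.

    Each chunk keeps its section heading and the document title,
    so it stays meaningful when retrieved in isolation.
    """
    chunks = []
    current_heading = doc_title
    buffer: list[str] = []

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({
                "heading": current_heading,
                # Prepend title + heading so the chunk is self-describing
                "text": f"{doc_title} — {current_heading}\n{body}",
            })

    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            flush()           # close out the previous section
            buffer = []
            current_heading = match.group(2).strip()
        else:
            buffer.append(line)
    flush()                   # close out the final section
    return chunks
```

For real PDFs you would swap the regex for a proper parser, but the principle — chunk at the section boundary the author already drew — stays the same.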

The parent-child pattern solves a different problem. Store large semantic units as "parent" chunks but index small, precise "child" chunks for retrieval. When you retrieve a child chunk, you return its parent to the LLM. This gives the retriever precision and the LLM context. It is the highest-ROI chunking change you can make on an existing system.
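The parent-child pattern can be sketched in a few lines. The term-overlap scorer below is a toy stand-in for real vector search, and all names are illustrative — the point is the indirection: search the children, return the parents:

```python
def build_parent_child_index(parents: list[str], child_size: int = 100) -> list[dict]:
    """Index small child chunks, each pointing back to its parent.

    Children give the retriever precision; parents give the LLM context.
    """
    index = []
    for pid, parent in enumerate(parents):
        for start in range(0, len(parent), child_size):
            index.append({"parent_id": pid, "child": parent[start:start + child_size]})
    return index

def retrieve_parents(index: list[dict], parents: list[str],
                     query_terms: list[str], k: int = 2) -> list[str]:
    """Score children (toy term overlap standing in for vector search),
    then return the *parents* of the top-k children, deduplicated."""
    scored = sorted(
        index,
        key=lambda c: sum(t in c["child"].lower() for t in query_terms),
        reverse=True,
    )
    seen, result = set(), []
    for child in scored[:k]:
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            result.append(parents[child["parent_id"]])
    return result
```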

One rule applies everywhere: prepend the document title and section heading to every chunk. A chunk that reads "The maximum value is 4096" is useless without knowing it came from "PostgreSQL configuration limits — max_connections." Always provide that context.

Chunking Strategy Comparison

Strategy            | Best For                   | Trade-off                            | Prod Ready?
Fixed-size          | Never in prod              | Severs mid-sentence facts            | ❌
Recursive Character | General text               | Better, but still blind to meaning   | ⚠️
Structure-Aware     | Docs with headings/tables  | Requires good parsers                | ✅
Semantic            | Long mixed-topic docs      | Slower; embedding cost at index time | ✅
Parent-Child        | Precision + context needed | More complex retrieval logic         | ✅✅

Retrieval: Why Vector Search Alone Fails in Production

Vector search is semantically powerful and terrible at exact matches.

Ask a pure vector search system "what is the MAX_CONNECTIONS limit in PostgreSQL 16?" and it may return a conceptually similar chunk about database connection pooling — scoring it as more relevant than the actual documentation page that contains the exact term "MAX_CONNECTIONS."

This is not a model bug. It is a fundamental property of dense embeddings. They compress meaning into a fixed-size vector, trading exact-match precision for semantic flexibility. For most queries, this is fine. For queries involving version numbers, product names, configuration keys, error codes, or any domain-specific terminology, it is quietly catastrophic.

The fix is hybrid search: run BM25 (sparse keyword search) in parallel with dense vector search and merge the results using Reciprocal Rank Fusion (RRF). BM25 handles exact matches. Dense search handles paraphrase and semantic intent. RRF merges them without requiring you to tune weights by hand.
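RRF itself is small enough to sketch in full: each document scores the sum of 1/(k + rank) over every ranked list it appears in, with k conventionally set to 60 to damp any single top rank:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge ranked lists without hand-tuned weights.

    A document's score is sum(1 / (k + rank)) across all lists it appears in,
    so agreement between BM25 and dense search pushes a result to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Feed it the BM25 ranking and the dense-vector ranking for the same query; a document ranked second in both will beat one ranked first in only one list.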

Perplexity builds its entire retrieval tier on this pattern. Weaviate ships it built-in. If you are building on Postgres with pgvector, you need to add a BM25 implementation separately — but the tradeoff is worth it.

After retrieval, add a cross-encoder reranker. The initial retrieval returns the top-K candidates efficiently but imprecisely. The reranker reads the full query and each candidate chunk together and produces a precise relevance score. Morgan Stanley's team found this single addition was the primary driver of their recall improvement from 20% to 80%.
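The rerank stage slots in after retrieval like this. The `score` function below is a deliberately naive term-overlap stand-in; in production it would be a cross-encoder (e.g. a sentence-transformers CrossEncoder model) that reads the full (query, chunk) pair jointly:

```python
def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Re-score top-K retrieval candidates with a slower, more precise scorer.

    `score` is a toy lexical-overlap stand-in for a real cross-encoder;
    the surrounding logic — score all candidates, keep the best top_n —
    is the same either way.
    """
    def score(q: str, chunk: str) -> float:
        q_terms = set(q.lower().split())
        c_terms = set(chunk.lower().split())
        return len(q_terms & c_terms) / max(len(q_terms), 1)

    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:top_n]
```

The key design point: the reranker only ever sees the top-K candidates from the fast retriever, so its per-pair cost stays bounded no matter how large the corpus is.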

Picking a Vector Database — The Honest Comparison

Every vector database comparison you will find online is written by someone who works at one of these companies. Here is a neutral breakdown based on what actually matters in production.

Vector Database Comparison

Database | Hosting         | Scale      | Cost                  | Best For                     | Avg Latency
pgvector | Self / Postgres | < 1M vecs  | Free                  | You already run Postgres     | ~10 ms
Chroma   | Self-hosted     | < 5M vecs  | Free (OSS)            | Local dev & prototypes       | ~5 ms
Pinecone | Managed SaaS    | Billions   | $70+/mo               | Zero-ops, enterprise scale   | ~20 ms
Weaviate | Self / Cloud    | 100M+ vecs | Free OSS / $25+ cloud | Hybrid BM25+vector built-in  | ~8 ms
Qdrant   | Self / Cloud    | 100M+ vecs | Free OSS / $25+ cloud | High-perf, Rust-native       | ~4 ms

⚠️ The pgvector trap: pgvector is a great choice if you already run Postgres and your corpus is under a million vectors. Past that, it struggles with ANN (Approximate Nearest Neighbor) performance and concurrent writes to the index. The mistake teams make is starting with pgvector at small scale and not planning for when they will need to migrate.

The 4 Failure Modes — And How to Debug Each One

A RAG system in production fails in four distinct ways. Most engineers treat all of them as "the LLM is hallucinating." They are not. Each requires a different fix.

Retrieval Failures
- Low Recall — add BM25 hybrid search; semantic search misses exact terms
- Low Precision — add a cross-encoder reranker after initial top-K retrieval
- Bad Chunking — switch to structure-aware or parent-child chunking
- Embedding Drift — re-index when you change embedding models, always

Generation Failures
- Citation Hallucination — force the model to quote the exact chunk; if it can't, output "insufficient context"
- Context Bias — the model over-reads the first/last chunk; reorder by relevance, not retrieval order
- Best-Guess Fill-in — add an explicit system prompt: "If the answer is not in the context, say so"
- Multi-hop Failure — use an agentic loop that retrieves iteratively rather than one-shot retrieval

Operational Failures
- Stale Index — set up a CDC (Change Data Capture) pipeline; Notion uses Debezium + Kafka
- No Observability — log every step: query → retrieved chunks → reranked → prompt → response
- No Evaluation — build a golden dataset of 50+ queries before shipping any v1
- Permission Leaks — filter retrieved chunks by user permissions at query time, not just index time

The most important debugging tool you can build is a trace log that captures, for every query: the original query, the rewritten query (if you use HyDE or multi-query), the retrieved chunks with their scores, the reranked order, the assembled prompt, and the generated response. Without this, you are debugging in the dark.

Langfuse is the best open-source option for this in 2025. LangSmith if you are already in the LangChain ecosystem. Both support the "LLM-as-judge" evaluation pattern where you sample production traces and score them automatically.

Advanced Retrieval: HyDE and When to Actually Use It

HyDE (Hypothetical Document Embeddings) solves a specific problem: the semantic gap between a short user query and the long documents that contain the answer.

Standard retrieval embeds the query directly and searches for similar vectors. But the query "what causes memory leaks in Go?" is semantically distant from a documentation page that explains it — because the documentation is written as statements, not questions.

HyDE inverts this. Instead of embedding the query, you use an LLM to generate a hypothetical answer to the query, then embed that. Now you are searching answer-to-answer instead of question-to-answer. The semantic match improves significantly.

The cost is an extra LLM call before every retrieval, adding 300-800ms of latency. Use a small, fast model (GPT-4o-mini or a local Ollama model) specifically for the HyDE step. Cache hypothetical embeddings for repeated queries.
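The HyDE flow, sketched with placeholders: `generate_hypothetical` stands in for a call to your small, fast LLM and `embed` for your embedding model (which must be the same one used at index time). The cache handles repeated queries:

```python
from functools import lru_cache

def generate_hypothetical(query: str) -> str:
    """Placeholder for a call to a small, fast LLM that drafts a short
    hypothetical answer to the query."""
    return f"A plausible documentation-style answer to: {query}"

def embed(text: str) -> list[float]:
    """Placeholder for your embedding model. Toy stand-in here."""
    return [float(ord(c) % 7) for c in text[:8]]

@lru_cache(maxsize=4096)
def hyde_embedding(query: str) -> tuple[float, ...]:
    """HyDE: embed a hypothetical *answer* rather than the query itself,
    so retrieval compares answer-to-answer. Cached per query string."""
    hypothetical = generate_hypothetical(query)
    return tuple(embed(hypothetical))
```

The resulting vector replaces the plain query embedding in the dense-retrieval step; everything downstream (BM25, RRF, reranking) stays unchanged.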

When to use it: when your queries are short and vague and your documents are long and structured. When your users ask conceptual questions against technical manuals. When you have tried hybrid search and reranking and are still seeing poor retrieval precision.

When not to use it: when queries contain specific terms, product names, or version numbers. HyDE can generate a hypothetical answer that contains wrong terminology, sending your retriever in the wrong direction. Always pair it with a reranker as a safety net.

How to Actually Evaluate a RAG System

Most teams evaluate their RAG system by asking it ten questions and checking if the answers look right. This is not evaluation. This is vibes.

A real evaluation pipeline has three layers.

Layer      | Metric            | Question it answers                          | How it's computed                  | Tool
Retrieval  | Context Recall    | Did we retrieve all needed chunks?           | Retrieved relevant / All relevant  | RAGAS
Retrieval  | Context Precision | Were retrieved chunks actually useful?       | Relevant in top-K / K              | RAGAS
Generation | Faithfulness      | Does the answer stay inside the context?     | LLM-as-judge vs retrieved chunks   | RAGAS / Langfuse
Generation | Answer Relevancy  | Does the answer address the actual question? | Cosine(query, generated_answer)    | RAGAS
Retrieval  | MRR               | Is the most relevant chunk at the top?       | 1 / rank of first relevant result  | Custom eval
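The retrieval metrics above are simple enough to compute yourself against a golden dataset, without any framework — here per single query, with illustrative chunk-id inputs:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Of the top-K retrieved chunks, what fraction were actually relevant?"""
    if not retrieved:
        return 0.0
    return sum(c in relevant for c in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Of all chunks needed to answer, what fraction did we retrieve?"""
    if not relevant:
        return 1.0
    return sum(c in relevant for c in set(retrieved)) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank for one query: 1 / rank of the first relevant hit."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0
```

Faithfulness and answer relevancy need an LLM judge or an embedding model, which is where RAGAS earns its keep; the three functions above cover the retrieval side.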

Building your golden dataset is the first step nobody wants to do and the most important thing you can do. A golden dataset is a fixed set of queries with known correct answers and known source chunks. You need at least 50 examples, ideally 200+.

Morgan Stanley's team started with 5 test cases and iterated to hundreds. Every regression — every time a change made a previously-correct answer wrong — was caught by this dataset before it reached users. The evaluation infra was, in their words, as important as the retrieval infra.

RAGAS is the best open-source framework for automated RAG evaluation. It uses an LLM as a judge to score faithfulness (does the answer stay within the retrieved context?), context precision (were the retrieved chunks relevant?), and answer relevancy (does the answer address the question?). It supports reference-free evaluation for production, where you do not have ground-truth answers for every query.

Run regression evals before every deployment. Sample 10% of production traces daily and score them. Alert when faithfulness drops below your baseline. These three things, done consistently, will catch 90% of production degradations before users notice.

RAG vs Fine-Tuning: The Decision Nobody Makes Correctly

The most common mistake in enterprise AI in 2025 is fine-tuning a model to learn facts. Fine-tuning is not for facts. It is for behavior.

If you need the model to know about your company's products, use RAG. The data changes, the model cannot be retrained every week, and you need citations for compliance.

If you need the model to always respond in a specific JSON schema, always use a specific tone, or always follow a specific clinical reporting format — fine-tune. That is a behavioral change, not a knowledge change. RAG cannot reliably enforce output format on its own.

The 2025 consensus from teams shipping production AI is: start with RAG. Add fine-tuning only when RAG gives you the right information but the wrong behavior.

RAG vs Fine-Tuning — Decision Matrix

When you need...          | RAG                           | Fine-Tuning
Data changes frequently?  | ✅ Yes — update KB, done       | ❌ Requires retraining
Need source citations?    | ✅ Built-in                    | ❌ Opaque — hard to trace
Need custom tone/format?  | ⚠️ Prompt engineering only     | ✅ Yes — bake into weights
Upfront cost              | Low — data engineering        | High — $10K+ per run
Latency                   | Higher (retrieval overhead)   | Lower (self-contained)
Domain jargon             | ⚠️ Depends on embedding model  | ✅ Model learns terminology
Private/proprietary data  | ✅ Data never enters model     | ⚠️ Data in training pipeline
2025 consensus: Start with RAG. Add fine-tuning only for tone/format, never for knowledge.

Real Production Systems: What Actually Happened

Theory is one thing. Here is what four production teams actually built, what they learned, and what they would do differently.

Morgan Stanley — Wealth Advisor Knowledge Base
Scale: 100,000+ internal documents
Outcome: Recall improved from 20% → 80% through iterative regression testing. Advisors get compliant, cited answers in seconds.
Stack: GPT-4 (OpenAI collab) · proprietary vector store · daily regression test suite
Key Lesson: They started with 5 test cases and iterated to hundreds. The evaluation infra was as important as the retrieval infra.

Perplexity AI — Real-time Answer Engine
Scale: Billions of web pages, live index
Outcome: Sub-second cited answers grounded entirely in retrieved web sources. The LLM is forbidden from using parametric knowledge.
Stack: Vespa AI (retrieval) · hybrid BM25 + dense · multi-tier ML reranker
Key Lesson: Perplexity treats hallucination as a retrieval bug, not a model bug. If the answer is wrong, they fix the retriever.

Notion AI — Workspace Search + Summarization
Scale: 10x data growth in 3 years
Outcome: Ingestion lag reduced from 24+ hours to ~2 hours. AI features now work on fresh data across 100M+ pages.
Stack: Apache Hudi + Kafka · Debezium CDC · Spark for processing
Key Lesson: Notion's bottleneck wasn't the LLM or the vector DB — it was getting data in fast enough. Freshness is a pipeline problem.

Cursor — Codebase-Aware Coding Assistant
Scale: Per-repo, local embedding
Outcome: @Codebase queries retrieve semantically relevant code chunks from private repos. Zero data leaves the machine.
Stack: turbopuffer (vector DB) · local embeddings · semantic code chunking
Key Lesson: Cursor chunks code differently than prose — by function, class, and file boundary, not by character count. Domain matters.

The pattern across all four is the same: the bottleneck was never the LLM. It was always the data pipeline, the retrieval quality, or the evaluation infrastructure. The teams that shipped reliable RAG systems built those three things first and treated the LLM as the last mile.

The Production Checklist

Before you ship a RAG system, these are the non-negotiables:

Data Pipeline
- Structure-aware or parent-child chunking (never fixed-size)
- Every chunk contains document title and section heading
- BM25 index built in parallel with vector index
- Change Data Capture (CDC) pipeline for index freshness
- Permission filtering at query time, not just index time

Retrieval
- Hybrid BM25 + dense vector search with RRF merge
- Cross-encoder reranker on top-K candidates
- Separate retrieval latency SLA from indexing latency

Generation
- System prompt explicitly instructs "say I don't know if the context is insufficient"
- Citation requirement in every response
- Temperature at 0.0–0.2 for factual queries

Evaluation
- Golden dataset of 50+ queries before v1 ships
- RAGAS or equivalent running on a production sample daily
- Full pipeline trace logging (query → chunks → prompt → response)
- Regression test suite blocking deployment if faithfulness drops

Observability
- Alert on retrieval score distribution shifts
- Alert on faithfulness score degradation
- User feedback loop (thumbs up/down) feeding back into the eval dataset

⚠️ The one thing most teams skip: Permission-aware retrieval. If your knowledge base contains documents with different access levels, you must filter retrieved chunks by the querying user's permissions at query time, not just at index time. Failing to do this means a junior employee can indirectly retrieve senior executive documents through the AI interface — a real compliance failure that has happened in production.
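The query-time filter is a one-liner once each chunk carries its ACL. Field names here are illustrative; `allowed_groups` would be populated from your source system's permissions at index time and re-checked on every query:

```python
def filter_by_permissions(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Drop retrieved chunks the querying user may not read.

    This runs at *query time*, after retrieval — index-time filtering
    alone goes stale the moment a document's ACL changes.
    """
    return [c for c in chunks if c["allowed_groups"] & user_groups]
```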

The Honest State of RAG in 2025

RAG works. It works at Morgan Stanley's scale, at Perplexity's real-time pace, at Notion's data volume. It is not magic and it is not simple.

The teams that fail with RAG treat it as a prompt engineering problem. They tune the system prompt and wonder why the answers are still wrong. The teams that succeed treat it as an engineering problem: data pipelines, retrieval algorithms, evaluation frameworks, and observability.

The retrieval layer is where most of the leverage is. If you fix the retriever — hybrid search, better chunking, a reranker — the LLM's answers get dramatically better without touching the model at all. Start there. Always start there.