Building a Production RAG Pipeline: Lessons Learned
The Gap Between Demo and Production
Every RAG demo looks the same: load a PDF, embed it, ask a question, get an impressively relevant answer. The demo works because you wrote the question knowing what is in the document. You tested the exact happy path.
Production RAG is different. Users ask questions you did not anticipate. Documents have inconsistent formatting. The embedding model you chose six months ago has been superseded. Retrieval latency spikes on large document sets. The LLM confidently answers questions using context that does not actually contain the answer.
I have built three RAG systems that reached production scale. This post is everything I wish someone had told me before the first one.
The Pipeline Architecture
A production RAG pipeline has two distinct phases that run at different times.
The ingestion pipeline runs once (and then incrementally as documents change). It is batch, offline, and can take minutes. The query pipeline runs on every user request. It must be fast. These two pipelines have completely different optimisation targets.
Ingestion: Where Most Mistakes Live
Chunking strategy is not a footnote
Most tutorials tell you to split every 512 tokens with 50-token overlap and move on. That is fine for demos. It is wrong for production.
The right chunking strategy depends on your document structure:
| Document type | Strategy | Reasoning |
|---|---|---|
| Structured docs (API reference) | Section-based | Respect heading hierarchy |
| Prose documents | Semantic chunking | Keep related ideas together |
| Tables / spreadsheets | Row-group chunking | Preserve row context |
| Code files | Function-level | Semantic unit is the function |
| Chat/transcript | Turn-based | Each turn is a unit |
Semantic chunking uses an embedding model to detect topic shifts and splits at natural boundaries rather than at fixed token counts. It produces better retrieval at the cost of more preprocessing time.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# Semantic chunker — splits at embedding similarity drops
splitter = SemanticChunker(
embeddings=OpenAIEmbeddings(),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95,  # Split where the distance between adjacent sentence groups exceeds the 95th percentile
)
chunks = splitter.create_documents([document_text])
Metadata is a retrieval multiplier
Every chunk should carry metadata. Not just the document title — structured metadata you can filter on during retrieval.
{
"chunk_id": "doc-abc-chunk-042",
"document_id": "doc-abc",
"document_title": "Kubernetes Networking Guide",
"section": "Network Policies",
"subsection": "Egress Rules",
"created_at": "2026-01-15",
"version": "1.4.2",
"content_type": "technical_docs",
"chunk_text": "Egress network policies restrict outbound traffic from pods..."
}
When a user asks "what changed in version 1.4?", you can pre-filter to version = "1.4.*" before running similarity search. This dramatically improves precision and reduces the cost of retrieval.
Embedding model stability
Here is the production problem nobody warns you about: embedding model drift.
You embed your document corpus with text-embedding-ada-002. Six months later you switch to a better model. The new model's vector space is incompatible with the old one. Your query embeddings no longer align with your stored document embeddings. Retrieval quality collapses silently.
Rule: tag every embedding with the model name and version. When you upgrade the embedding model, re-embed the full corpus into a new namespace and test retrieval quality before switching.
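What that playbook can look like, as a minimal sketch: embed_with(), vector_store and run_retrieval_eval() are hypothetical helpers, and the namespace naming is an assumption rather than a prescribed convention.
NEW_MODEL = "text-embedding-3-large"  # example: the model you are upgrading to
NEW_NAMESPACE = f"docs__{NEW_MODEL}"  # one namespace per embedding model
def reembed_corpus(chunks: list[dict]) -> None:
    # Write the re-embedded corpus into a fresh namespace; the old one keeps serving traffic.
    for chunk in chunks:
        vector = embed_with(NEW_MODEL, chunk["chunk_text"])
        vector_store.upsert(
            namespace=NEW_NAMESPACE,
            id=chunk["chunk_id"],
            vector=vector,
            metadata={**chunk["metadata"], "embedding_model": NEW_MODEL},
        )
def safe_to_switch(baseline_recall: float) -> bool:
    # Gate the cutover on retrieval quality measured against your golden eval set.
    return run_retrieval_eval(namespace=NEW_NAMESPACE) >= baseline_recall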
Retrieval: Getting the Right Chunks
Approximate Nearest Neighbour is not free
At 10,000 documents, ANN search is fast regardless of index settings. At 10 million chunks, index choice matters enormously.
For most enterprise RAG systems, HNSW in pgvector or Weaviate is the right choice. You get SQL familiarity, metadata filtering on the same index, and recall > 99% with properly tuned ef_search parameters.
-- pgvector: filtered similarity search with metadata
SELECT chunk_id, chunk_text, 1 - (embedding <=> $1) AS similarity
FROM document_chunks
WHERE document_type = 'technical_docs'
AND created_at > '2025-01-01'
ORDER BY embedding <=> $1
LIMIT 10;
Hybrid search beats pure semantic search
Semantic search is great for conceptual queries. Keyword search is great for exact terms (product codes, names, version numbers). Hybrid search combines both.
from pinecone import Pinecone
from rank_bm25 import BM25Okapi
def hybrid_search(query: str, k: int = 10, alpha: float = 0.7) -> list[dict]:
"""
alpha=1.0: pure semantic, alpha=0.0: pure BM25
alpha=0.7 is a good starting point
"""
# Semantic results
query_embedding = embed(query)
semantic_results = vector_store.query(vector=query_embedding, top_k=k * 2)
# BM25 keyword results
bm25_scores = bm25_index.get_scores(query.split())
bm25_results = get_top_k_by_score(bm25_scores, k=k * 2)
# Reciprocal rank fusion
return reciprocal_rank_fusion(
semantic_results, bm25_results,
weights=(alpha, 1 - alpha),
top_k=k
    )
In my experience, hybrid search with alpha=0.6–0.8 consistently outperforms pure semantic search on enterprise document sets, particularly for technical documentation with lots of specific terminology.
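The reciprocal_rank_fusion helper above is not shown; here is a minimal weighted sketch of it, assuming each result is a dict carrying a chunk_id key.
def reciprocal_rank_fusion(semantic_results: list[dict], bm25_results: list[dict],
                           weights: tuple[float, float], top_k: int, k_rrf: int = 60) -> list[dict]:
    # Weighted RRF: each result list contributes weight / (k_rrf + rank) for every chunk it ranks.
    fused: dict[str, float] = {}
    items: dict[str, dict] = {}
    for weight, results in zip(weights, (semantic_results, bm25_results)):
        for rank, item in enumerate(results, start=1):
            chunk_id = item["chunk_id"]
            fused[chunk_id] = fused.get(chunk_id, 0.0) + weight / (k_rrf + rank)
            items[chunk_id] = item
    ranked = sorted(fused, key=fused.get, reverse=True)
    return [items[chunk_id] for chunk_id in ranked[:top_k]]
The constant k_rrf=60 is the value used in the original RRF formulation; it dampens the influence of the very top ranks.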
Reranking: the quality booster you should budget for
ANN retrieval optimises for speed, not precision. A cross-encoder reranker takes the top-K chunks and re-scores them with a more expensive but more accurate model. You typically retrieve 20 chunks with ANN and rerank to the top 5.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
pairs = [(query, chunk) for chunk in chunks]
scores = reranker.predict(pairs)
ranked = sorted(zip(scores, chunks), reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]
Reranking adds 50–200ms latency. Whether it is worth it depends on your latency budget. In my testing it improves answer quality by 15–25% on complex queries. For a customer support RAG system, that difference directly impacts resolution rates.
Prompt Assembly
The prompt is not an afterthought. It is a critical system component that determines how well the LLM uses the retrieved context.
SYSTEM_PROMPT = """You are a helpful assistant answering questions based on the provided context.
Rules:
- Answer ONLY using information from the context below
- If the context does not contain enough information to answer confidently, say so explicitly
- Quote the relevant section when possible
- Never invent information not present in the context
"""
def assemble_prompt(query: str, chunks: list[dict]) -> list[dict]:
context_sections = []
for i, chunk in enumerate(chunks, 1):
context_sections.append(
f"[Source {i}: {chunk['document_title']}, {chunk['section']}]\n"
f"{chunk['chunk_text']}"
)
context = "\n\n---\n\n".join(context_sections)
return [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
Include source metadata in each context section. This allows the LLM to cite sources and gives you the ability to verify answers against the retrieved chunks.
Evaluation: How to Know If It Works
This is the most neglected part of RAG systems. You cannot ship without knowing your retrieval quality.
RAGAS is a widely used evaluation framework for this:
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset
# Your golden eval dataset
eval_data = {
"question": ["What is the maximum file size?", ...],
"answer": [rag_pipeline.answer(q) for q in questions],
"contexts": [rag_pipeline.retrieve(q) for q in questions],
"ground_truth": ["The maximum file size is 100MB", ...],
}
result = evaluate(
Dataset.from_dict(eval_data),
metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
print(result)
# faithfulness: 0.91 | answer_relevancy: 0.87 | context_recall: 0.83
Run evaluation on every significant change: new embedding model, new chunking strategy, new LLM, new prompt template. Track scores over time. Treat a drop of more than 5% as a regression.
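A small sketch of that regression gate in CI, assuming the RAGAS result has been collected into a plain dict of metric name to score and the baseline lives in a JSON file (the file name is an assumption).
import json
REGRESSION_TOLERANCE = 0.05  # treat a relative drop of more than 5% as a regression
def find_regressions(scores: dict[str, float], baseline_path: str = "ragas_baseline.json") -> list[str]:
    # Compare current evaluation scores against the stored baseline scores.
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"faithfulness": 0.91, "answer_relevancy": 0.87, ...}
    regressions = []
    for metric, baseline_score in baseline.items():
        current = scores[metric]
        if current < baseline_score * (1 - REGRESSION_TOLERANCE):
            regressions.append(f"{metric}: {baseline_score:.2f} -> {current:.2f}")
    return regressions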
Production Failure Modes
These are the things that will page you:
1. Empty context retrieval: The query finds no relevant chunks (similarity below threshold). The LLM receives empty context and either refuses to answer or hallucinates. Fix: always return a minimum number of chunks; add a fallback response for low-confidence retrievals.
2. Context window overflow: 20 retrieved chunks × 500 tokens each = 10,000 tokens. Many models have 8K context limits. You either truncate silently or hit an API error. Fix: enforce a token budget in prompt assembly, not just a chunk count (see the sketch after this list).
3. Stale embeddings: Documents updated in the source system but not re-ingested. The vector store serves outdated context. Fix: build an incremental ingestion pipeline that monitors document changes and re-embeds modified documents within SLA.
4. Embedding model rate limits: Your ingestion pipeline hammers the embedding API and gets throttled. Fix: add exponential backoff, use batched embedding calls, implement a queue with rate limiting.
5. LLM latency spikes: p99 latency of 8 seconds on a feature you promised would be "instant". Fix: set aggressive timeouts, stream responses to the UI, give users a typing indicator.
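For failure mode 2, a minimal sketch of enforcing a token budget during prompt assembly; count_tokens() is a stand-in for whatever tokenizer you use (tiktoken, for example), and the 6,000-token budget is illustrative.
MAX_CONTEXT_TOKENS = 6000  # leave headroom for the system prompt, the question and the answer
def fit_chunks_to_budget(chunks: list[dict], budget: int = MAX_CONTEXT_TOKENS) -> list[dict]:
    # Keep chunks in relevance order and stop before the budget is exceeded.
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk["chunk_text"])  # count_tokens() is a stand-in for your tokenizer
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return selected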
The Production Checklist
RAG Production Readiness
═══════════════════════════════════════════
Ingestion
☐ Chunking strategy validated against document types
☐ Metadata schema defined and populated
☐ Embedding model version tagged on every vector
☐ Incremental ingestion pipeline operational
☐ Re-ingestion playbook for model upgrades
Retrieval
☐ Index type appropriate for corpus size
☐ Hybrid search tested vs pure semantic
☐ Reranker evaluated (quality vs latency trade-off)
☐ Low-similarity fallback implemented
Generation
☐ System prompt reviewed and tested
☐ Token budget enforced in prompt assembly
☐ Source attribution included in context
☐ Streaming response implemented for UI
Evaluation
☐ Golden eval dataset of 100+ questions created
☐ RAGAS baseline scores recorded
☐ Evaluation runs in CI on model/prompt changes
☐ Faithfulness threshold for production deploy set
Operations
☐ Retrieval latency tracked (p50, p95, p99)
☐ LLM call cost tracked per query
☐ Empty context rate monitored
☐ Stale document alert configured
═══════════════════════════════════════════
RAG in production is a systems engineering problem, not just an AI problem. The models are a component. The chunking, retrieval, evaluation, and operational instrumentation are what separate a prototype that wowed the demo audience from a system that delivers value reliably, every day, without waking you up at 3am.