Building a Production RAG Pipeline: Lessons Learned
The Gap Between Demo and Production
Every RAG demo looks the same: load a PDF, embed it, ask a question, get an impressively relevant answer. The demo works because you wrote the question knowing what is in the document. You tested the exact happy path.
Production RAG is different. Users ask questions you did not anticipate. Documents have inconsistent formatting. The embedding model you chose six months ago has been superseded. Retrieval latency spikes on large document sets. The LLM confidently answers questions using context that does not actually contain the answer.
I have built three RAG systems that reached production scale. This post is everything I wish someone had told me before the first one.
The Pipeline Architecture
A production RAG pipeline has two distinct phases that run at different times.
The ingestion pipeline runs once (and then incrementally as documents change). It is batch, offline, and can take minutes. The query pipeline runs on every user request. It must be fast. These two pipelines have completely different optimisation targets.
Ingestion: Where Most Mistakes Live
Chunking strategy is not a footnote
Most tutorials tell you to split every 512 tokens with 50-token overlap and move on. That is fine for demos. It is wrong for production.
The right chunking strategy depends on your document structure:
| Document type | Strategy | Reasoning |
|---|---|---|
| Structured docs (API reference) | Section-based | Respect heading hierarchy |
| Prose documents | Semantic chunking | Keep related ideas together |
| Tables / spreadsheets | Row-group chunking | Preserve row context |
| Code files | Function-level | Semantic unit is the function |
| Chat/transcript | Turn-based | Each turn is a unit |
Semantic chunking uses an embedding model to detect topic shifts and splits at natural boundaries rather than at fixed token counts. It produces better retrieval at the cost of more preprocessing time.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# Semantic chunker — splits at embedding similarity drops
splitter = SemanticChunker(
embeddings=OpenAIEmbeddings(),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95,  # Split where the distance between adjacent sentence groups exceeds the 95th percentile
)
chunks = splitter.create_documents([document_text])
Metadata is a retrieval multiplier
Every chunk should carry metadata. Not just the document title — structured metadata you can filter on during retrieval.
{
"chunk_id": "doc-abc-chunk-042",
"document_id": "doc-abc",
"document_title": "Kubernetes Networking Guide",
"section": "Network Policies",
"subsection": "Egress Rules",
"created_at": "2026-01-15",
"version": "1.4.2",
"content_type": "technical_docs",
"chunk_text": "Egress network policies restrict outbound traffic from pods..."
}
When a user asks "what changed in version 1.4?", you can pre-filter to version = "1.4.*" before running similarity search. This dramatically improves precision and reduces the cost of retrieval.
Embedding model stability
Here is the production problem nobody warns you about: embedding model drift.
You embed your document corpus with text-embedding-ada-002. Six months later you switch to a better model. The new model's vector space is incompatible with the old one. Your query embeddings no longer align with your stored document embeddings. Retrieval quality collapses silently.
Rule: tag every embedding with the model name and version. When you upgrade the embedding model, re-embed the full corpus into a new namespace and test retrieval quality before switching.
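What that playbook can look like, as a minimal sketch: embed_with(), vector_store and run_retrieval_eval() are hypothetical helpers, and the namespace naming is an assumption rather than a prescribed convention.
NEW_MODEL = "text-embedding-3-large"  # example: the model you are upgrading to
NEW_NAMESPACE = f"docs__{NEW_MODEL}"  # one namespace per embedding model
def reembed_corpus(chunks: list[dict]) -> None:
    # Write the re-embedded corpus into a fresh namespace; the old one keeps serving traffic.
    for chunk in chunks:
        vector = embed_with(NEW_MODEL, chunk["chunk_text"])
        vector_store.upsert(
            namespace=NEW_NAMESPACE,
            id=chunk["chunk_id"],
            vector=vector,
            metadata={**chunk["metadata"], "embedding_model": NEW_MODEL},
        )
def safe_to_switch(baseline_recall: float) -> bool:
    # Gate the cutover on retrieval quality measured against your golden eval set.
    return run_retrieval_eval(namespace=NEW_NAMESPACE) >= baseline_recall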
Retrieval: Getting the Right Chunks
Approximate Nearest Neighbour is not free
At 10,000 documents, ANN search is fast regardless of index settings. At 10 million chunks, index choice matters enormously.
For most enterprise RAG systems, HNSW in pgvector or Weaviate is the right choice. You get SQL familiarity, metadata filtering on the same index, and recall > 99% with properly tuned ef_search parameters.
-- pgvector: filtered similarity search with metadata
SELECT chunk_id, chunk_text, 1 - (embedding <=> $1) AS similarity
FROM document_chunks
WHERE document_type = 'technical_docs'
AND created_at > '2025-01-01'
ORDER BY embedding <=> $1
LIMIT 10;
Hybrid search beats pure semantic search
Semantic search is great for conceptual queries. Keyword search is great for exact terms (product codes, names, version numbers). Hybrid search combines both.
from pinecone import Pinecone
from rank_bm25 import BM25Okapi
def hybrid_search(query: str, k: int = 10, alpha: float = 0.7) -> list[dict]:
"""
alpha=1.0: pure semantic, alpha=0.0: pure BM25
alpha=0.7 is a good starting point
"""
# Semantic results
query_embedding = embed(query)
semantic_results = vector_store.query(vector=query_embedding, top_k=k * 2)
# BM25 keyword results
bm25_scores = bm25_index.get_scores(query.split())
bm25_results = get_top_k_by_score(bm25_scores, k=k * 2)
# Reciprocal rank fusion
return reciprocal_rank_fusion(
semantic_results, bm25_results,
weights=(alpha, 1 - alpha),
top_k=k
    )
In my experience, hybrid search with alpha=0.6–0.8 consistently outperforms pure semantic search on enterprise document sets, particularly for technical documentation with lots of specific terminology.
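The reciprocal_rank_fusion helper above is not shown; here is a minimal weighted sketch of it, assuming each result is a dict carrying a chunk_id key.
def reciprocal_rank_fusion(semantic_results: list[dict], bm25_results: list[dict],
                           weights: tuple[float, float], top_k: int, k_rrf: int = 60) -> list[dict]:
    # Weighted RRF: each result list contributes weight / (k_rrf + rank) for every chunk it ranks.
    fused: dict[str, float] = {}
    items: dict[str, dict] = {}
    for weight, results in zip(weights, (semantic_results, bm25_results)):
        for rank, item in enumerate(results, start=1):
            chunk_id = item["chunk_id"]
            fused[chunk_id] = fused.get(chunk_id, 0.0) + weight / (k_rrf + rank)
            items[chunk_id] = item
    ranked = sorted(fused, key=fused.get, reverse=True)
    return [items[chunk_id] for chunk_id in ranked[:top_k]]
The constant k_rrf=60 is the value used in the original RRF formulation; it dampens the influence of the very top ranks.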
Reranking: the quality booster you should budget for
ANN retrieval optimises for speed, not precision. A cross-encoder reranker takes the top-K chunks and re-scores them with a more expensive but more accurate model. You typically retrieve 20 chunks with ANN and rerank to the top 5.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
pairs = [(query, chunk) for chunk in chunks]
scores = reranker.predict(pairs)
ranked = sorted(zip(scores, chunks), reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]
Reranking adds 50–200ms latency. Whether it is worth it depends on your latency budget. In my testing it improves answer quality by 15–25% on complex queries. For a customer support RAG system, that difference directly impacts resolution rates.
Prompt Assembly
The prompt is not an afterthought. It is a critical system component that determines how well the LLM uses the retrieved context.
SYSTEM_PROMPT = """You are a helpful assistant answering questions based on the provided context.
Rules:
- Answer ONLY using information from the context below
- If the context does not contain enough information to answer confidently, say so explicitly
- Quote the relevant section when possible
- Never invent information not present in the context
"""
def assemble_prompt(query: str, chunks: list[dict]) -> list[dict]:
context_sections = []
for i, chunk in enumerate(chunks, 1):
context_sections.append(
f"[Source {i}: {chunk['document_title']}, {chunk['section']}]\n"
f"{chunk['chunk_text']}"
)
context = "\n\n---\n\n".join(context_sections)
return [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
Include source metadata in each context section. This allows the LLM to cite sources and gives you the ability to verify answers against the retrieved chunks.
Evaluation: How to Know If It Works
This is the most neglected part of RAG systems. You cannot ship without knowing your retrieval quality.
RAGAS is a widely used evaluation framework for this:
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset
# Your golden eval dataset
eval_data = {
"question": ["What is the maximum file size?", ...],
"answer": [rag_pipeline.answer(q) for q in questions],
"contexts": [rag_pipeline.retrieve(q) for q in questions],
"ground_truth": ["The maximum file size is 100MB", ...],
}
result = evaluate(
Dataset.from_dict(eval_data),
metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
print(result)
# faithfulness: 0.91 | answer_relevancy: 0.87 | context_recall: 0.83
Run evaluation on every significant change: new embedding model, new chunking strategy, new LLM, new prompt template. Track scores over time. Treat a drop of more than 5% as a regression.
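A small sketch of that regression gate in CI, assuming the RAGAS result has been collected into a plain dict of metric name to score and the baseline lives in a JSON file (the file name is an assumption).
import json
REGRESSION_TOLERANCE = 0.05  # treat a relative drop of more than 5% as a regression
def find_regressions(scores: dict[str, float], baseline_path: str = "ragas_baseline.json") -> list[str]:
    # Compare current evaluation scores against the stored baseline scores.
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"faithfulness": 0.91, "answer_relevancy": 0.87, ...}
    regressions = []
    for metric, baseline_score in baseline.items():
        current = scores[metric]
        if current < baseline_score * (1 - REGRESSION_TOLERANCE):
            regressions.append(f"{metric}: {baseline_score:.2f} -> {current:.2f}")
    return regressions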
Production Failure Modes
These are the things that will page you:
1. Empty context retrieval: The query finds no relevant chunks (similarity below threshold). The LLM receives empty context and either refuses to answer or hallucinates. Fix: always return a minimum number of chunks; add a fallback response for low-confidence retrievals.
2. Context window overflow: 20 retrieved chunks × 500 tokens each = 10,000 tokens. Many models have 8K context limits. You either truncate silently or hit an API error. Fix: enforce a token budget in prompt assembly, not just a chunk count (see the sketch after this list).
3. Stale embeddings: Documents updated in the source system but not re-ingested. The vector store serves outdated context. Fix: build an incremental ingestion pipeline that monitors document changes and re-embeds modified documents within SLA.
4. Embedding model rate limits: Your ingestion pipeline hammers the embedding API and gets throttled. Fix: add exponential backoff, use batched embedding calls, implement a queue with rate limiting.
5. LLM latency spikes: p99 latency of 8 seconds on a feature you promised would be "instant". Fix: set aggressive timeouts, stream responses to the UI, give users a typing indicator.
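For failure mode 2, a minimal sketch of enforcing a token budget during prompt assembly; count_tokens() is a stand-in for whatever tokenizer you use (tiktoken, for example), and the 6,000-token budget is illustrative.
MAX_CONTEXT_TOKENS = 6000  # leave headroom for the system prompt, the question and the answer
def fit_chunks_to_budget(chunks: list[dict], budget: int = MAX_CONTEXT_TOKENS) -> list[dict]:
    # Keep chunks in relevance order and stop before the budget is exceeded.
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk["chunk_text"])  # count_tokens() is a stand-in for your tokenizer
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return selected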
The Production Checklist
RAG Production Readiness
═══════════════════════════════════════════
Ingestion
☐ Chunking strategy validated against document types
☐ Metadata schema defined and populated
☐ Embedding model version tagged on every vector
☐ Incremental ingestion pipeline operational
☐ Re-ingestion playbook for model upgrades
Retrieval
☐ Index type appropriate for corpus size
☐ Hybrid search tested vs pure semantic
☐ Reranker evaluated (quality vs latency trade-off)
☐ Low-similarity fallback implemented
Generation
☐ System prompt reviewed and tested
☐ Token budget enforced in prompt assembly
☐ Source attribution included in context
☐ Streaming response implemented for UI
Evaluation
☐ Golden eval dataset of 100+ questions created
☐ RAGAS baseline scores recorded
☐ Evaluation runs in CI on model/prompt changes
☐ Faithfulness threshold for production deploy set
Operations
☐ Retrieval latency tracked (p50, p95, p99)
☐ LLM call cost tracked per query
☐ Empty context rate monitored
☐ Stale document alert configured
═══════════════════════════════════════════
RAG in production is a systems engineering problem, not just an AI problem. The models are a component. The chunking, retrieval, evaluation, and operational instrumentation are what separate a prototype that wowed the demo audience from a system that delivers value reliably, every day, without waking you up at 3am.