Building Production RAG Pipelines: Lessons from the Field
Retrieval-Augmented Generation sounds simple in tutorials: chunk your docs, embed them, query a vector DB, feed results to an LLM. Once you ship to production with real users and real data, the gap between the tutorial and a system that actually works is enormous.
Here's what I've learned building RAG pipelines for enterprise clients.
Chunking strategy is everything
Most tutorials chunk by fixed token count. In production, this destroys context. A 512-token chunk landing mid-paragraph loses the surrounding meaning that gives it value.
What actually works better:
- Semantic chunking — split on paragraph and section boundaries, not character counts
- Overlap with context headers — prepend the document title and section heading to every chunk, so the LLM always knows where the answer came from
- Hierarchical retrieval — retrieve parent chunks alongside matched child chunks to preserve surrounding context
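A minimal sketch of the first two ideas, assuming documents arrive already split into named sections of plain text with blank lines between paragraphs. The names `Chunk`, `chunk_document`, and `max_words` are hypothetical, and the word count is only a rough stand-in for a real tokenizer.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str      # chunk body with the context header prepended
    title: str
    section: str
    index: int     # position of the chunk within the document


def _make_chunk(title: str, section: str, paragraphs: list[str], index: int) -> Chunk:
    # Prepend document title and section heading so every chunk carries its origin.
    header = f"[{title} > {section}]"
    return Chunk(header + "\n" + "\n\n".join(paragraphs), title, section, index)


def chunk_document(title: str, sections: dict[str, str], max_words: int = 200) -> list[Chunk]:
    """Pack whole paragraphs into chunks; never split inside a paragraph."""
    chunks: list[Chunk] = []
    for section, body in sections.items():
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
        current: list[str] = []
        size = 0
        for para in paragraphs:
            words = len(para.split())
            # Close the current chunk instead of cutting this paragraph in half.
            # A single paragraph longer than max_words is kept whole.
            if current and size + words > max_words:
                chunks.append(_make_chunk(title, section, current, len(chunks)))
                current, size = [], 0
            current.append(para)
            size += words
        if current:
            chunks.append(_make_chunk(title, section, current, len(chunks)))
    return chunks


# Illustrative input, not real data.
doc = {
    "Refunds": "Refunds take 5 days.\n\nContact support to start one.",
    "Shipping": "We ship worldwide.",
}
chunks = chunk_document("Store FAQ", doc, max_words=6)
```

Chunks never cross a section boundary here, which keeps the header accurate. Hierarchical retrieval would additionally need a parent ID on each chunk, which this sketch leaves out.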
Retrieval quality trumps generation quality
When a RAG answer is wrong, engineers usually blame the LLM. In my experience, roughly 90% of the time the real problem is retrieval: the right chunk simply wasn't fetched.
Before tuning prompts, fix retrieval:
- Hybrid search — combine dense vector search with sparse BM25. Dense search finds semantic matches; sparse search catches exact keywords. Neither alone is enough.
- Reranking — after retrieving top-20 chunks, run a cross-encoder reranker to select the best 3-5. The latency cost is worth it.
- Query expansion — have the LLM generate 3 alternative phrasings of the user question and run all of them. Union the results.
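One way to combine dense and sparse results is reciprocal rank fusion (RRF). That is my choice for illustration, not the only option, and the post does not prescribe a fusion method. The sketch below assumes each retriever returns a ranked list of chunk IDs. The function name and sample IDs are hypothetical, and `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes more for higher-ranked items.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c3", "c1", "c7"]    # hypothetical vector-search results
sparse_hits = ["c1", "c9", "c3"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

The same function can merge the result lists from several query phrasings, since it only needs ranks and not comparable scores. The fused list is then what you would hand to the reranker.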
Citations are non-negotiable for enterprise
Users will not trust an answer they cannot verify. Every response needs to show exactly which document and section it came from.
Implementation: store source_document, page_number, and chunk_index as metadata alongside every vector, and surface them in the response. LangChain's RetrievalQAWithSourcesChain gives you this out of the box, but I ended up writing a custom chain to control the citation format precisely.
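As a rough sketch of the metadata side, independent of any framework: the three field names come from the text above, while `ChunkMetadata`, `format_citations`, and the citation format itself are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkMetadata:
    source_document: str
    page_number: int
    chunk_index: int


def format_citations(sources: list[ChunkMetadata]) -> str:
    """Render a numbered, de-duplicated source list to append to an answer."""
    seen: set[tuple[str, int]] = set()
    lines: list[str] = []
    for meta in sources:
        key = (meta.source_document, meta.page_number)
        if key in seen:
            continue  # two chunks from the same page share one citation
        seen.add(key)
        lines.append(f"[{len(lines) + 1}] {meta.source_document}, p. {meta.page_number}")
    return "\n".join(lines)


# Illustrative metadata for the chunks that were passed to the LLM.
used = [
    ChunkMetadata("handbook.pdf", 12, 40),
    ChunkMetadata("handbook.pdf", 12, 41),
    ChunkMetadata("policy.pdf", 3, 7),
]
citations = format_citations(used)
```

Keeping `chunk_index` in the metadata even though it is not displayed makes it possible to trace a bad answer back to the exact chunk later.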
Evaluation is the hardest part
You can't improve what you don't measure. The RAG eval stack I use:
- Ragas for automated metrics — answer relevancy, context precision, faithfulness
- Golden test set — 50-100 hand-curated Q&A pairs representative of real queries
- Run evals on every prompt change and every new batch of documents
Without a test set, RAG development is flying blind.
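Alongside Ragas, a golden set also supports a very cheap retrieval-only check. The sketch below is not Ragas code. It assumes each golden entry records the ID of the chunk that should be retrieved, and that the retriever can be called as `retrieve(question, k)`. The field names and functions are all hypothetical.

```python
from typing import Callable


def retrieval_hit_rate(
    golden_set: list[dict],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of golden questions whose expected chunk appears in the top k."""
    if not golden_set:
        raise ValueError("golden set is empty")
    hits = 0
    for case in golden_set:
        retrieved_ids = retrieve(case["question"], k)
        if case["expected_chunk_id"] in retrieved_ids[:k]:
            hits += 1
    return hits / len(golden_set)


golden = [
    {"question": "How long do refunds take?", "expected_chunk_id": "faq-3"},
    {"question": "Do you ship abroad?", "expected_chunk_id": "faq-8"},
]


def fake_retrieve(question: str, k: int) -> list[str]:
    # Stand-in for the real retriever: always returns the same IDs.
    return ["faq-3", "faq-1"][:k]


score = retrieval_hit_rate(golden, fake_retrieve, k=2)
print(score)  # 0.5
```

Run in CI on every prompt change and document batch, a number like this shows whether a regression is a retrieval problem before any generation metrics are computed.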
Watch your token budgets
In production, context windows fill up fast. A retrieval that returns 5 chunks of 512 tokens each consumes 2,560 tokens of context before you've added the system prompt or conversation history.
Set hard limits. Prioritize chunks by reranker score and cut aggressively. A model that reads 3 excellent chunks beats one drowning in 10 mediocre ones.
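A hard limit can be enforced with a few lines. This sketch assumes each chunk is a dict carrying a reranker `score` and a precomputed `tokens` count. Those keys, the function name, and the sample numbers are hypothetical.

```python
def select_within_budget(chunks: list[dict], max_context_tokens: int) -> list[dict]:
    """Keep the highest-scoring chunks that fit inside a hard token limit."""
    selected: list[dict] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if used + chunk["tokens"] > max_context_tokens:
            continue  # would overflow the budget; a smaller chunk may still fit
        selected.append(chunk)
        used += chunk["tokens"]
    return selected


candidates = [
    {"id": "a", "score": 0.91, "tokens": 512},
    {"id": "b", "score": 0.40, "tokens": 512},
    {"id": "c", "score": 0.85, "tokens": 512},
    {"id": "d", "score": 0.78, "tokens": 300},
]
kept = select_within_budget(candidates, max_context_tokens=1400)
print([c["id"] for c in kept])  # ['a', 'c', 'd']
```

The budget passed in should be whatever remains after the system prompt, conversation history, and expected answer length are subtracted from the context window. A minimum score threshold is a natural addition if you want to cut mediocre chunks even when they would fit.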
The unsexy truth: most RAG production work is data cleaning, chunking strategy, and evaluation infrastructure. The LLM API call is 5 lines. Everything around it is engineering.