AI · RAG · LangChain · Python

Building Production RAG Pipelines: Lessons from the Field

March 12, 2025 · 3 min read

Retrieval-Augmented Generation sounds simple in tutorials: chunk your docs, embed them, query a vector DB, feed results to an LLM. Once you ship to production with real users and real data, the gap between the tutorial and a system that actually works is enormous.

Here's what I've learned building RAG pipelines for enterprise clients.

Chunking strategy is everything

Most tutorials chunk by fixed token count. In production, this destroys context. A 512-token chunk landing mid-paragraph loses the surrounding meaning that gives it value.

What actually works better:

Semantic chunking — split on paragraph and section boundaries, not character counts
Overlap with context headers — prepend the document title and section heading to every chunk, so the LLM always knows where the answer came from
Hierarchical retrieval — retrieve parent chunks alongside matched child chunks to preserve surrounding context
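
A minimal sketch of the first two ideas, assuming the document has already been parsed into a mapping of section heading to section text. The names Chunk, chunk_with_headers, and max_chars are hypothetical, and the bracketed header format is only one possible convention:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str     # chunk body with the context header prepended
    section: str  # heading the chunk was found under


def chunk_with_headers(
    title: str, sections: dict[str, str], max_chars: int = 1200
) -> list[Chunk]:
    """Split each section on blank lines and pack whole paragraphs into chunks."""
    chunks: list[Chunk] = []
    for heading, body in sections.items():
        # Context header: document title plus section heading.
        header = f"[{title} > {heading}]\n"
        paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
        current: list[str] = []
        size = 0
        for para in paragraphs:
            # Close the current chunk if this paragraph would push it past the limit.
            if current and size + len(para) > max_chars:
                chunks.append(Chunk(header + "\n\n".join(current), heading))
                current, size = [], 0
            current.append(para)
            size += len(para)
        if current:
            chunks.append(Chunk(header + "\n\n".join(current), heading))
    return chunks
```

Paragraphs are never split in this sketch, so a single paragraph longer than max_chars becomes its own oversized chunk. A real pipeline would need a fallback for that case, such as splitting on sentence boundaries.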

Retrieval quality trumps generation quality

When a RAG answer is wrong, engineers usually blame the LLM. In my experience, about 90% of the time the real problem is retrieval: the right chunk simply wasn't fetched.

Before tuning prompts, fix retrieval:

Hybrid search — combine dense vector search with sparse BM25. Dense search finds semantic matches; sparse search catches exact keywords. Neither alone is enough.
Reranking — after retrieving top-20 chunks, run a cross-encoder reranker to select the best 3-5. The latency cost is worth it.
Query expansion — have the LLM generate 3 alternative phrasings of the user question and run all of them. Union the results.
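
One common way to merge the dense and sparse result lists is reciprocal rank fusion (RRF). The post does not say which fusion method was used, so treat this as an illustrative choice. The function name and the constant k = 60 (a conventional default) are assumptions:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk earns more credit the higher it ranks in each list.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["a", "b", "c"]   # ids from vector search, best first
sparse_hits = ["b", "d", "a"]  # ids from BM25, best first
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

The same function can take one ranked list per query phrasing, which gives a scored union for query expansion. The top 20 fused ids would then go to the cross-encoder reranker.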

Citations are non-negotiable for enterprise

Users will not trust an answer they cannot verify. Every response needs to show exactly which document and section it came from.

Implementation: store source_document, page_number, and chunk_index as metadata alongside every vector. Surface them in the response. The LangChain RetrievalQAWithSourcesChain gives you this for free, but I ended up writing a custom chain to control the citation format precisely.
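
As a rough illustration of the metadata approach, here is a sketch that keeps the three fields on each retrieved chunk and renders a numbered source list. RetrievedChunk, format_citations, and the output format are hypothetical. This is not the custom chain described above, and it does not use any LangChain API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    text: str
    source_document: str
    page_number: int
    chunk_index: int


def format_citations(chunks: list[RetrievedChunk]) -> str:
    """Render one numbered citation per distinct (document, page) pair."""
    seen: set[tuple[str, int]] = set()
    lines: list[str] = []
    for chunk in chunks:
        key = (chunk.source_document, chunk.page_number)
        if key in seen:
            continue  # several chunks from the same page share one citation
        seen.add(key)
        lines.append(f"[{len(lines) + 1}] {chunk.source_document}, p. {chunk.page_number}")
    return "\n".join(lines)
```

Deduplicating on document and page keeps the source list short when several matched chunks come from the same page. The chunk_index stays available for deep-linking or debugging.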

Evaluation is the hardest part

You can't improve what you don't measure. The RAG eval stack I use:

Ragas for automated metrics — answer relevancy, context precision, faithfulness
Golden test set — 50-100 hand-curated Q&A pairs representative of real queries
Run evals on every prompt change and every new batch of documents

Without a test set, RAG development is flying blind.
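
A golden test set can also drive a simple retrieval check alongside the automated metrics. The sketch below is plain Python and does not use the Ragas API. GoldenCase, retrieval_hit_rate, and the retrieve callable are hypothetical names, and it assumes each golden question has one known source document:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    question: str
    expected_source: str  # document that should appear in the retrieved set


def retrieval_hit_rate(
    cases: list[GoldenCase],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of golden questions whose expected source is in the top k results."""
    if not cases:
        raise ValueError("golden test set is empty")
    hits = sum(
        1 for case in cases if case.expected_source in retrieve(case.question)[:k]
    )
    return hits / len(cases)
```

Running this on every prompt change and every new document batch, and failing the build when the rate drops below a chosen threshold, would turn the golden set into a regression test.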

Watch your token budgets

In production, context windows fill up fast. A retrieval that returns 5 chunks of 512 tokens each adds 5 × 512 = 2,560 tokens of context before you've added the system prompt or conversation history.

Set hard limits. Prioritize chunks by reranker score and cut aggressively. A model that reads 3 excellent chunks beats one drowning in 10 mediocre ones.
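
A hard limit could be enforced with something like the sketch below, which keeps the highest-scoring chunks until the budget runs out. The function names are hypothetical, and approx_token_count is a crude whitespace stand-in that should be replaced with the model's actual tokenizer:

```python
from typing import Callable


def approx_token_count(text: str) -> int:
    # Rough stand-in: counts whitespace-separated words, not real tokens.
    return len(text.split())


def select_within_budget(
    scored_chunks: list[tuple[float, str]],
    max_context_tokens: int,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> list[str]:
    """Keep chunks in descending reranker score until the token budget is spent."""
    selected: list[str] = []
    used = 0
    for _score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_context_tokens:
            break  # cut here rather than backfilling with lower-ranked chunks
        selected.append(text)
        used += cost
    return selected
```

Stopping at the first chunk that does not fit is a deliberate choice in this sketch: it never admits a weaker chunk just because it happens to be short.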

The unsexy truth: most RAG production work is data cleaning, chunking strategy, and evaluation infrastructure. The LLM API call is 5 lines. Everything around it is engineering.
