
Building Production RAG Systems: Lessons from the Trenches


RAG (Retrieval-Augmented Generation) is the most practical way to make LLMs useful with your own data. But the gap between a demo RAG system and a production one is enormous.

I've built RAG systems for three different companies — a legal tech startup, a healthcare platform, and an enterprise knowledge base. Each one taught me something the tutorials don't cover.

The Architecture That Works

After multiple iterations, here's the architecture I keep coming back to:

  • Document ingestion — Parse, clean, and chunk your documents
  • Dual embedding — Generate both dense (semantic) and sparse (keyword) embeddings
  • Hybrid retrieval — Combine semantic search with BM25 keyword matching
  • Re-ranking — Use a cross-encoder to re-rank the top results
  • Generation — Feed the best context to the LLM with a well-crafted prompt
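
In code, these stages compose into a short pipeline. Here's a minimal sketch with toy stand-ins for each stage — every name here is illustrative, and a real system would plug in an embedding index, BM25, and a cross-encoder instead of the dummies:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def hybrid_retrieve(query, chunks, k=20):
    # Toy stand-in: real systems combine dense-vector and BM25 scores.
    scored = [(sum(w in c.text.lower() for w in query.lower().split()), c)
              for c in chunks]
    scored.sort(key=lambda s: -s[0])
    return [c for _, c in scored[:k]]

def rerank(query, candidates, top_n=5):
    # Toy stand-in for a cross-encoder: keep the first top_n candidates.
    return candidates[:top_n]

def build_prompt(query, context_chunks):
    context = "\n---\n".join(c.text for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def answer(query, chunks, llm):
    # Retrieve -> re-rank -> generate, in one pass.
    candidates = hybrid_retrieve(query, chunks)
    best = rerank(query, candidates)
    return llm(build_prompt(query, best))
```

The point is the shape, not the internals: each stage is swappable, which is exactly what you want when you start measuring and iterating.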

Chunking: The Most Underrated Step

Everyone obsesses over which embedding model to use. But chunking strategy has a bigger impact on retrieval quality than any other single decision.

Here's what I've learned:

Don't use fixed-size chunks

Splitting documents every 500 characters is the default in most tutorials. It's also the worst approach for production. You'll split sentences mid-thought, separate headers from their content, and lose the logical structure of the document.

Use semantic chunking instead

Split at natural boundaries:

  • Paragraph breaks
  • Section headers
  • Topic changes (detected by embedding similarity)
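
A simple version of boundary-aware chunking — splitting at blank lines and markdown-style headers, then packing paragraphs up to a size budget — is only a few lines. This is a sketch, not a full implementation (it skips embedding-based topic detection):

```python
import re

def semantic_chunks(text, max_chars=1000):
    """Split at paragraph breaks and header lines, then pack paragraphs
    into chunks without ever cutting mid-sentence."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        starts_section = para.lstrip().startswith("#")
        # Flush the current chunk at a new section or when the budget is hit.
        if current and (starts_section or len(current) + len(para) > max_chars):
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Even this naive version keeps headers attached to their content, which fixed-size splitting routinely destroys.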

Keep metadata with your chunks

Every chunk should carry:

  • Source document title and URL
  • Section hierarchy (h1 > h2 > h3)
  • Page number or position
  • Date of the document

This metadata is gold for filtering and attribution.
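
Concretely, a chunk record might look like this (the field names and URL are illustrative, not a fixed schema):

```python
chunk = {
    "text": "Employees accrue 1.5 vacation days per month...",
    "metadata": {
        "title": "Employee Handbook",
        "url": "https://intranet.example.com/handbook",
        "section": "Benefits > Time Off > Vacation",  # h1 > h2 > h3
        "page": 12,
        "date": "2024-03-01",
    },
}

def filter_by_date(chunks, cutoff):
    # Drop stale documents before they ever reach the LLM.
    # ISO date strings compare correctly as plain strings.
    return [c for c in chunks if c["metadata"]["date"] >= cutoff]
```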

Overlap is essential

Use 10-20% overlap between chunks. Without it, you'll miss context that spans chunk boundaries.
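
One way to get that overlap is a sliding window over sentences, repeating the tail of each chunk at the head of the next (a sketch; the sizes are parameters you'd tune):

```python
def overlapping_chunks(sentences, chunk_size=6, overlap=1):
    """Slide a window over a list of sentences; each chunk repeats the
    last `overlap` sentences of the previous one."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        window = sentences[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(sentences):
            break
    return chunks
```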

Retrieval: Where Most Systems Fail

Retrieval is the bottleneck. If you don't retrieve the right context, even the best LLM will give wrong answers.

Hybrid search beats pure semantic

Semantic search is great for meaning, but terrible for exact terms. When a user asks "What's the policy for ACME-2024-Q3?", semantic search might return policies from other quarters. BM25 keyword matching catches these exact matches.

Use both and merge the results.
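
One common way to merge the two ranked lists — though not the only one — is reciprocal rank fusion, which needs no score normalization because it only looks at ranks:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of doc IDs; a doc's score is the sum of
    1/(k + rank) over every list it appears in. k=60 is a common default."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both lists get boosted to the top, which is exactly the behavior you want from hybrid search.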

Re-ranking is worth the latency

A cross-encoder re-ranker adds 200-400ms of latency. It's worth it. In my tests, re-ranking improved answer accuracy by 15-25% compared to raw retrieval.
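
The re-ranking step itself is thin — the model does the work. A sketch, with the scorer injected as a callable so the logic is testable without a model:

```python
def rerank(query, passages, score_fn, top_n=5):
    """Score (query, passage) pairs jointly and keep the best top_n.
    `score_fn` is any callable over a list of pairs returning scores."""
    pairs = [(query, p) for p in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda x: -x[1])
    return [p for p, _ in ranked[:top_n]]
```

In production you'd pass something like `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` from the sentence-transformers library as `score_fn` — that particular checkpoint is one commonly used option, not a requirement.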

Query expansion works wonders

Before searching, expand the user's query:

  • Generate 2-3 alternative phrasings
  • Extract key entities and search for them separately
  • Use the LLM to identify what information is actually needed
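
The rephrasing step can be as simple as one extra LLM call before retrieval. A sketch, with `llm` as any callable that returns one phrasing per line (the prompt wording is illustrative):

```python
def expand_query(query, llm):
    """Ask the model for alternative phrasings, then search with all of them."""
    prompt = (
        "Rewrite this search query 3 different ways, one per line, "
        f"keeping any exact identifiers unchanged:\n{query}"
    )
    alternatives = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    # Always keep the original query; the model's rewrites are additions.
    return [query] + alternatives
```

Run retrieval once per phrasing and merge the results, just like hybrid search merges its two lists.
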

Generation: Prompting Matters More Than You Think

The generation prompt is where most teams leave performance on the table.

Key principles:

  • Be explicit about the context — Tell the model exactly how to use the retrieved documents
  • Require citations — Make the model reference specific documents in its answers
  • Handle "I don't know" — The model should admit when the context doesn't contain the answer
  • Format the output — Structure the response for your specific use case
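
Those principles fold into a single prompt builder. A sketch — the exact wording is something you'd iterate on against your eval set, and the chunk fields assume the metadata structure described earlier:

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt: numbered sources, a citation
    requirement, and an explicit out-of-context escape hatch."""
    sources = "\n\n".join(
        f"[{i}] ({c['metadata']['title']}) {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources as [1], [2], etc. after each claim. "
        "If the sources do not contain the answer, say "
        "\"I don't know based on the available documents.\"\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )
```
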

Evaluation: You Can't Improve What You Don't Measure

Build an evaluation pipeline from day one:

  • Retrieval metrics — Precision@k, Recall@k, MRR (Mean Reciprocal Rank)
  • Answer quality — Faithfulness (does the answer match the context?), Relevance (does it answer the question?)
  • End-to-end — User satisfaction scores, task completion rates

Create a golden test set of 50-100 question-answer pairs. Run every change through this test set before deploying.
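
The retrieval metrics are a few lines each, which is part of why there's no excuse to skip them. A minimal implementation:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs found in the top k."""
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)

def mrr(queries):
    """Mean Reciprocal Rank over (retrieved_ids, relevant_ids) pairs:
    each query scores 1/rank of its first relevant hit, 0 if none."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

Run these over the golden test set on every change to chunking, retrieval, or prompts, and you'll catch regressions before your users do.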

The Mistakes I Made So You Don't Have To

  • Embedding model choice barely matters — text-embedding-3-large vs. text-embedding-3-small made a 2% difference. Chunking strategy made a 20% difference.
  • Don't index everything — Curate what goes into your knowledge base. Garbage in, garbage out.
  • Latency budgets are real — Users expect answers in 2-3 seconds. Plan your architecture around this constraint.
  • Versioning is essential — Track which version of your documents, embeddings, and prompts produced each answer.

What's Next for RAG

The field is moving fast. Keep an eye on:

  • Agentic RAG — Systems that can search multiple sources, refine their queries, and reason over results
  • Graph RAG — Combining knowledge graphs with vector search for structured reasoning
  • Multimodal RAG — Indexing and searching images, tables, and diagrams alongside text

RAG isn't going away. If anything, it's becoming the default way to build AI applications with proprietary data.