
Building Production RAG Systems: Lessons from the Trenches


RAG (Retrieval-Augmented Generation) is the most practical way to make LLMs useful with your own data. But the gap between a demo RAG system and a production one is enormous.

I've built RAG systems for three different companies — a legal tech startup, a healthcare platform, and an enterprise knowledge base. Each one taught me something the tutorials don't cover.

The Architecture That Works

After multiple iterations, here's the architecture I keep coming back to:

  • Document ingestion — Parse, clean, and chunk your documents
  • Dual embedding — Generate both dense (semantic) and sparse (keyword) embeddings
  • Hybrid retrieval — Combine semantic search with BM25 keyword matching
  • Re-ranking — Use a cross-encoder to re-rank the top results
  • Generation — Feed the best context to the LLM with a well-crafted prompt
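
In code, these stages compose into a short pipeline. Here's a minimal sketch with toy stand-ins for each stage — every name here is illustrative, and a real system would plug in an embedding index, BM25, and a cross-encoder instead of the dummies:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def hybrid_retrieve(query, chunks, k=20):
    # Toy stand-in: real systems combine dense-vector and BM25 scores.
    scored = [(sum(w in c.text.lower() for w in query.lower().split()), c)
              for c in chunks]
    scored.sort(key=lambda s: -s[0])
    return [c for _, c in scored[:k]]

def rerank(query, candidates, top_n=5):
    # Toy stand-in for a cross-encoder: keep the first top_n candidates.
    return candidates[:top_n]

def build_prompt(query, context_chunks):
    context = "\n---\n".join(c.text for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def answer(query, chunks, llm):
    # Retrieve -> re-rank -> generate, in one pass.
    candidates = hybrid_retrieve(query, chunks)
    best = rerank(query, candidates)
    return llm(build_prompt(query, best))
```

The point is the shape, not the internals: each stage is swappable, which is exactly what you want when you start measuring and iterating.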

Chunking: The Most Underrated Step

Everyone obsesses over which embedding model to use. But chunking strategy has a bigger impact on retrieval quality than any other single decision.

Here's what I've learned:

Don't use fixed-size chunks

Splitting documents every 500 characters is the default in most tutorials. It's also the worst approach for production. You'll split sentences mid-thought, separate headers from their content, and lose the logical structure of the document.

Use semantic chunking instead

Split at natural boundaries:

  • Paragraph breaks
  • Section headers
  • Topic changes (detected by embedding similarity)
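
A simple version of boundary-aware chunking — splitting at blank lines and markdown-style headers, then packing paragraphs up to a size budget — is only a few lines. This is a sketch, not a full implementation (it skips embedding-based topic detection):

```python
import re

def semantic_chunks(text, max_chars=1000):
    """Split at paragraph breaks and header lines, then pack paragraphs
    into chunks without ever cutting mid-sentence."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        starts_section = para.lstrip().startswith("#")
        # Flush the current chunk at a new section or when the budget is hit.
        if current and (starts_section or len(current) + len(para) > max_chars):
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Even this naive version keeps headers attached to their content, which fixed-size splitting routinely destroys.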

Keep metadata with your chunks

Every chunk should carry:

  • Source document title and URL
  • Section hierarchy (h1 > h2 > h3)
  • Page number or position
  • Date of the document

This metadata is gold for filtering and attribution.
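
Concretely, a chunk record might look like this (the field names and URL are illustrative, not a fixed schema):

```python
chunk = {
    "text": "Employees accrue 1.5 vacation days per month...",
    "metadata": {
        "title": "Employee Handbook",
        "url": "https://intranet.example.com/handbook",
        "section": "Benefits > Time Off > Vacation",  # h1 > h2 > h3
        "page": 12,
        "date": "2024-03-01",
    },
}

def filter_by_date(chunks, cutoff):
    # Drop stale documents before they ever reach the LLM.
    # ISO date strings compare correctly as plain strings.
    return [c for c in chunks if c["metadata"]["date"] >= cutoff]
```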

Overlap is essential

Use 10-20% overlap between chunks. Without it, you'll miss context that spans chunk boundaries.
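
One way to get that overlap is a sliding window over sentences, repeating the tail of each chunk at the head of the next (a sketch; the sizes are parameters you'd tune):

```python
def overlapping_chunks(sentences, chunk_size=6, overlap=1):
    """Slide a window over a list of sentences; each chunk repeats the
    last `overlap` sentences of the previous one."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        window = sentences[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(sentences):
            break
    return chunks
```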

Retrieval: Where Most Systems Fail

Retrieval is the bottleneck. If you don't retrieve the right context, even the best LLM will give wrong answers.

Hybrid search beats pure semantic

Semantic search is great for meaning, but terrible for exact terms. When a user asks "What's the policy for ACME-2024-Q3?", semantic search might return policies from other quarters. BM25 keyword matching catches these exact matches.

Use both and merge the results.
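
One common way to merge the two ranked lists — though not the only one — is reciprocal rank fusion, which needs no score normalization because it only looks at ranks:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of doc IDs; a doc's score is the sum of
    1/(k + rank) over every list it appears in. k=60 is a common default."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both lists get boosted to the top, which is exactly the behavior you want from hybrid search.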

Re-ranking is worth the latency

A cross-encoder re-ranker adds 200-400ms of latency. It's worth it. In my tests, re-ranking improved answer accuracy by 15-25% compared to raw retrieval.
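
The re-ranking step itself is thin — the model does the work. A sketch, with the scorer injected as a callable so the logic is testable without a model:

```python
def rerank(query, passages, score_fn, top_n=5):
    """Score (query, passage) pairs jointly and keep the best top_n.
    `score_fn` is any callable over a list of pairs returning scores."""
    pairs = [(query, p) for p in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda x: -x[1])
    return [p for p, _ in ranked[:top_n]]
```

In production you'd pass something like `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` from the sentence-transformers library as `score_fn` — that particular checkpoint is one commonly used option, not a requirement.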

Query expansion works wonders

Before searching, expand the user's query:

  • Generate 2-3 alternative phrasings
  • Extract key entities and search for them separately
  • Use the LLM to identify what information is actually needed
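
The rephrasing step can be as simple as one extra LLM call before retrieval. A sketch, with `llm` as any callable that returns one phrasing per line (the prompt wording is illustrative):

```python
def expand_query(query, llm):
    """Ask the model for alternative phrasings, then search with all of them."""
    prompt = (
        "Rewrite this search query 3 different ways, one per line, "
        f"keeping any exact identifiers unchanged:\n{query}"
    )
    alternatives = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    # Always keep the original query; the model's rewrites are additions.
    return [query] + alternatives
```

Run retrieval once per phrasing and merge the results, just like hybrid search merges its two lists.
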

Generation: Prompting Matters More Than You Think

The generation prompt is where most teams leave performance on the table.

Key principles:

  • Be explicit about the context — Tell the model exactly how to use the retrieved documents
  • Require citations — Make the model reference specific documents in its answers
  • Handle "I don't know" — The model should admit when the context doesn't contain the answer
  • Format the output — Structure the response for your specific use case
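
Those principles fold into a single prompt builder. A sketch — the exact wording is something you'd iterate on against your eval set, and the chunk fields assume the metadata structure described earlier:

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt: numbered sources, a citation
    requirement, and an explicit out-of-context escape hatch."""
    sources = "\n\n".join(
        f"[{i}] ({c['metadata']['title']}) {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources as [1], [2], etc. after each claim. "
        "If the sources do not contain the answer, say "
        "\"I don't know based on the available documents.\"\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )
```
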

Evaluation: You Can't Improve What You Don't Measure

Build an evaluation pipeline from day one:

  • Retrieval metrics — Precision@k, Recall@k, MRR (Mean Reciprocal Rank)
  • Answer quality — Faithfulness (does the answer match the context?), Relevance (does it answer the question?)
  • End-to-end — User satisfaction scores, task completion rates

Create a golden test set of 50-100 question-answer pairs. Run every change through this test set before deploying.
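
The retrieval metrics are a few lines each, which is part of why there's no excuse to skip them. A minimal implementation:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs found in the top k."""
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)

def mrr(queries):
    """Mean Reciprocal Rank over (retrieved_ids, relevant_ids) pairs:
    each query scores 1/rank of its first relevant hit, 0 if none."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

Run these over the golden test set on every change to chunking, retrieval, or prompts, and you'll catch regressions before your users do.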

The Mistakes I Made So You Don't Have To

  • Embedding model choice barely matters — text-embedding-3-large vs. text-embedding-3-small made a 2% difference. Chunking strategy made a 20% difference.
  • Don't index everything — Curate what goes into your knowledge base. Garbage in, garbage out.
  • Latency budgets are real — Users expect answers in 2-3 seconds. Plan your architecture around this constraint.
  • Versioning is essential — Track which version of your documents, embeddings, and prompts produced each answer.

What's Next for RAG

The field is moving fast. Keep an eye on:

  • Agentic RAG — Systems that can search multiple sources, refine their queries, and reason over results
  • Graph RAG — Combining knowledge graphs with vector search for structured reasoning
  • Multimodal RAG — Indexing and searching images, tables, and diagrams alongside text

RAG isn't going away. If anything, it's becoming the default way to build AI applications with proprietary data.