AI Engineering · November 18, 2024

What I learned building a production RAG pipeline

Retrieval-Augmented Generation is easy to demo and hard to ship. Notes from wiring RAG into sajjad.ai — chunking, embeddings, and grounding answers people actually trust.

Retrieval-Augmented Generation looks deceptively simple in a tutorial: embed some text, drop it in a vector store, stuff the top matches into a prompt. Getting it to behave in production — across English, Arabic, and Urdu — is a different exercise entirely.

Chunking is a product decision

The naive approach is to split documents every N characters. That destroys meaning at boundaries and returns half-sentences. I moved to structure-aware chunking that respects headings and paragraphs, with overlap so context is not lost at the seams. The right chunk size depends on the document type, so it became a tunable rather than a constant.
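As a rough illustration (not the exact code behind sajjad.ai), structure-aware chunking with overlap can be sketched in a few lines. The function name `chunk_document`, the `max_chars` and `overlap` parameters, and the convention that a heading is a block starting with `#` are all assumptions made for this example.

```python
import re


def chunk_document(text: str, max_chars: int = 800, overlap: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_chars.

    A heading (assumed here to be a block starting with '#') always opens a
    new chunk. Otherwise the last `overlap` paragraphs are repeated at the
    start of the next chunk so context survives the seam.
    """
    # Blank lines are paragraph boundaries; text is never cut mid-paragraph.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []

    for para in paragraphs:
        if para.startswith("#"):
            # New section: flush what we have and start clean, with no overlap.
            if current:
                chunks.append("\n\n".join(current))
            current = [para]
            continue

        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the previous chunk forward as overlap.
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)

    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "# A\n\npara one\n\npara two\n\n# B\n\npara three"
for chunk in chunk_document(doc, max_chars=20, overlap=1):
    print(repr(chunk))
```

Here `max_chars` is a soft limit: a paragraph longer than the limit is kept whole rather than split, and both parameters are meant to be tuned per document type.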

Retrieval quality beats model size

A bigger model cannot fix bad context. Most of the wins came from improving what I retrieved: better embeddings, hybrid keyword-plus-semantic search, and re-ranking the candidates before they ever reached the prompt. When retrieval is good, even a smaller model answers well.
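The post does not say how the keyword and semantic results were combined, so the sketch below swaps in one common option, reciprocal rank fusion (RRF), purely as an example. It assumes each retriever returns an ordered list of chunk ids; the function name and the constant `k = 60` (a conventional default for RRF) are choices made for this illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into a single ranking.

    Each list contributes 1 / (k + rank) for every id it contains, so ids
    ranked highly by more than one retriever rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


keyword_hits = ["a", "b", "c"]   # hypothetical output of a keyword search
semantic_hits = ["c", "a", "d"]  # hypothetical output of a vector search
print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
# → ['a', 'c', 'b', 'd']
```

The fused list would then be the candidate set handed to a re-ranker, and only the top few survivors would be placed in the prompt.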

Ground every answer, and say when you cannot

The fastest way to lose trust is a confident, wrong answer. I made the system cite the chunks it used and explicitly fall back to “I do not have that in the provided documents” when retrieval confidence is low. Users forgive “I do not know” far more readily than a fabricated fact.
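A minimal sketch of that fallback, under stated assumptions: retrieval scores are similarities in [0, 1], the threshold `min_score` is a tunable, and `generate` is a placeholder for whatever function calls the model. None of these names come from the actual system.

```python
from collections.abc import Callable
from dataclasses import dataclass

FALLBACK = "I do not have that in the provided documents."


@dataclass
class RetrievedChunk:
    chunk_id: str
    text: str
    score: float  # assumed: similarity in [0, 1], higher is better


def answer(
    question: str,
    chunks: list[RetrievedChunk],
    generate: Callable[[str], str],
    min_score: float = 0.5,
) -> str:
    """Answer only from chunks above the threshold; otherwise fall back."""
    trusted = [c for c in chunks if c.score >= min_score]
    if not trusted:
        # Nothing confident enough was retrieved, so the model is never called.
        return FALLBACK

    # Label every chunk with its id so the model can cite what it used.
    context = "\n".join(f"[{c.chunk_id}] {c.text}" for c in trusted)
    prompt = (
        "Answer using only the sources below and cite their ids in brackets.\n"
        f"If the sources do not contain the answer, reply exactly: {FALLBACK}\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```

The guard runs before the model is called, so a weak retrieval never gets the chance to produce a fluent guess. The instruction inside the prompt covers the second case, where chunks pass the threshold but still do not contain the answer.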

Takeaway

RAG is a systems problem, not a prompting problem. Treat retrieval, chunking, and grounding as first-class engineering; the model is the last and smallest part of the stack.

Written by Sajjad Arif Gul