RAG (Retrieval-Augmented Generation) is a software pattern in which you store a corpus of domain-specific documents in a vector database, retrieve the most relevant passages for a given user query, and pass those passages as context to a large language model. The LLM generates an answer grounded in the retrieved content rather than relying solely on its training data. RAG is the dominant pattern for building AI features over private corporate knowledge bases.

What is RAG? | James Henderson

The longer answer

RAG matters because LLMs alone don't know your specific business: your policies, your product catalog, your customer history, your internal documentation. Training a model on that data is expensive and slow; RAG sidesteps the problem by retrieving the relevant context at query time and stuffing it into the prompt. The result is a system that can answer questions about your business with verifiable citations back to the source documents.

How RAG actually works

At build time: take your corporate corpus (PDFs, wiki pages, support tickets, contracts, whatever), split it into chunks of 500 to 1,000 words, generate a vector embedding for each chunk (typically via OpenAI's embedding API or an open-source equivalent), and store the vectors in a vector database (Pinecone, Weaviate, pgvector on Postgres, Qdrant). At query time: take the user's question, generate an embedding for it, find the K most similar chunks in the vector database, and pass those chunks plus the question to an LLM with instructions to answer using only the provided context.
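The build-time and query-time steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and every function name here is hypothetical. The 500-word default comes from the chunk-size range mentioned above.

```python
import hashlib
import math
import re
from collections import Counter

DIM = 256  # toy embedding size; real embedding models use far more dimensions


def chunk_words(text: str, max_words: int = 500) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str) -> list[float]:
    """Hashed bag-of-words vector, L2-normalized. ASSUMPTION: stand-in for a real model."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    vec = [0.0] * DIM
    for token, n in counts.items():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += n
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(documents: list[str], max_words: int = 500) -> list[tuple[str, list[float]]]:
    """Build time: chunk every document and store (chunk, vector) pairs."""
    return [(c, toy_embed(c)) for doc in documents for c in chunk_words(doc, max_words)]


def retrieve(index: list[tuple[str, list[float]]], question: str, k: int = 3) -> list[str]:
    """Query time: rank chunks by cosine similarity to the question, keep the top k."""
    q = toy_embed(question)
    # Vectors are unit length, so the dot product equals cosine similarity.
    ranked = sorted(index, key=lambda item: sum(a * b for a, b in zip(q, item[1])), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the context-only prompt that would be sent to the LLM."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below. Cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Made-up two-document corpus, for illustration only.
docs = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
]
index = build_index(docs)
question = "When are refunds issued?"
print(build_prompt(question, retrieve(index, question, k=1)))
```

In a real system, `toy_embed` would be replaced by a call to an embedding model and the list by a vector database query; the overall shape of the flow stays the same.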

Why RAG fails (and how to make it not fail)

RAG systems fail in three predictable ways.

Retrieval failures: the right chunk isn't in the top-K results because the question phrasing doesn't match the document phrasing. Mitigation: hybrid search (vector + keyword BM25), query rewriting, reranking.

Citation failures: the LLM makes up an answer even though the relevant context was retrieved and available. Mitigation: explicit prompt instructions to cite, post-generation verification that the answer matches the cited chunks, and an eval harness that catches drift.

Cost / latency failures: a naive RAG implementation can cost $0.50 per query and take six seconds. Mitigation: caching, smaller-model routing for simple queries, batched retrieval, and reranking only the top-N candidates.
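One common way to combine vector and keyword results for hybrid search is reciprocal rank fusion (RRF). It is named here as one option, not as the method this article prescribes. The sketch below assumes each retriever has already returned chunk IDs ordered best-first; the chunk IDs and the `k = 60` constant (a widely used default) are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results for one query, ordered best-first.
vector_hits = ["c7", "c2", "c9"]   # from embedding similarity
keyword_hits = ["c2", "c4", "c7"]  # from BM25 keyword search

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
candidates_for_reranker = fused[:2]  # rerank only the top-N to limit cost and latency
print(fused)  # ['c2', 'c7', 'c4', 'c9']
```

Chunks that both retrievers rank highly rise to the top, which helps when the question's wording matches the document by keyword but not by embedding, or the reverse.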

When NOT to use RAG

If the LLM already knows the domain (general knowledge, code generation in a popular language, common-sense reasoning), RAG adds complexity without value. If the query needs reasoning over a structured database (sales numbers, customer counts, time-series), use SQL generation rather than RAG. If the corpus is small (under 50 documents) and stable, you can just include it in the prompt directly.
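These rules of thumb can be written down as a small decision function. This is only a sketch: the `UseCase` fields and the order in which the checks run are assumptions made for illustration, and the 50-document threshold is the one stated above.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    llm_already_knows_domain: bool  # general knowledge, popular-language code, etc.
    needs_structured_data: bool     # sales numbers, customer counts, time series
    document_count: int
    corpus_is_stable: bool


def choose_approach(case: UseCase, small_corpus_limit: int = 50) -> str:
    """Return a suggested approach. The check order is an assumption, not a rule."""
    if case.llm_already_knows_domain:
        return "plain LLM"
    if case.needs_structured_data:
        return "SQL generation"
    if case.document_count < small_corpus_limit and case.corpus_is_stable:
        return "full corpus in prompt"
    return "RAG"


# Hypothetical example: a large, frequently changing internal wiki.
wiki = UseCase(
    llm_already_knows_domain=False,
    needs_structured_data=False,
    document_count=12_000,
    corpus_is_stable=False,
)
print(choose_approach(wiki))  # RAG
```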

Common follow-up questions

What's the difference between RAG and fine-tuning?

RAG retrieves context at query time and feeds it to a general-purpose LLM. Fine-tuning permanently modifies the model's weights using your corpus. RAG is faster to build, easier to update, and produces verifiable citations; fine-tuning produces lower per-query latency and can capture style / format that RAG cannot. Most production AI systems use RAG; fine-tuning is reserved for specialized cases (specific output formats, domain jargon, style requirements).

Which vector database should I use?

For most Laravel / Postgres-shop applications, pgvector on Postgres is the right starting point: no new infrastructure, mature operational tooling, and sub-100ms similarity search up to millions of vectors. For larger scale or specialized use cases, consider Qdrant or Weaviate.
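To give a sense of what a pgvector lookup looks like, here is a sketch of the top-K query, written in Python for consistency with the other examples rather than in Laravel. It only builds the SQL and parameters and does not connect to a database. The `chunks` table and its columns are hypothetical, and the `%s` placeholders assume a psycopg-style driver; `<=>` is pgvector's cosine distance operator.

```python
def to_pgvector_literal(vector: list[float]) -> str:
    """Format a Python list in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


def top_k_query(query_vector: list[float], k: int = 5) -> tuple[str, tuple]:
    """Return (sql, params) for a cosine-distance nearest-neighbour search.

    ASSUMPTION: a hypothetical table chunks(id, body, embedding vector(N)).
    Smaller cosine distance means a more similar chunk.
    """
    sql = (
        "SELECT id, body, embedding <=> %s::vector AS distance "
        "FROM chunks "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    literal = to_pgvector_literal(query_vector)
    return sql, (literal, literal, k)


sql, params = top_k_query([0.1, 2, -3.5], k=3)
print(sql)
print(params)  # ('[0.1,2.0,-3.5]', '[0.1,2.0,-3.5]', 3)
```

The returned pair would typically be passed to a cursor's `execute(sql, params)`; whether the search meets a given latency target depends on the index and hardware, which this sketch does not cover.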

How much corpus is enough for RAG?

Practically: from a handful of documents up to a few million. Below 50 documents, you can include them all in the LLM context directly without retrieval. Above a few million chunks, you need real retrieval engineering (sharding, reranking, query routing). The sweet spot is thousands to hundreds of thousands of chunks.
