RAG Pipeline Overview
RAG (Retrieval-Augmented Generation) augments LLM generation with retrieval from external knowledge bases. It is the foundation for enterprise Q&A and document assistants. A typical flow is: chunk → embed → store → retrieve → rerank → inject into prompt → generate.
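The flow above can be sketched end to end. This is a minimal illustration, not any specific library's API: `embed`, `store`, `rerank`, and `llm` are hypothetical stand-ins for a real embedding model, vector store, reranker, and language model.

```python
def chunk(text, size=200, overlap=50):
    # Fixed-length chunking with an overlapping window (see Chunking below).
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def answer(question, docs, embed, store, rerank, llm, top_n=50, top_k=5):
    """End-to-end RAG sketch: chunk -> embed -> store -> retrieve ->
    rerank -> inject into prompt -> generate. All callables are
    placeholders for real components."""
    # Indexing: chunk each document, embed each chunk, store the vectors.
    for doc in docs:
        for piece in chunk(doc):
            store.add(embed(piece), piece)
    # Querying: coarse Top-N retrieval, then refine to Top-K via reranking.
    candidates = store.search(embed(question), top_n)
    context = rerank(question, candidates)[:top_k]
    # Inject the selected context into the prompt and generate.
    prompt = ("Answer using the context below.\n\n"
              + "\n---\n".join(context)
              + f"\n\nQuestion: {question}")
    return llm(prompt)
```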
Chunking
Chunking determines retrieval granularity and quality:
- Fixed length: Simple but may cut semantics; good for general text
- Semantic chunking: Split by paragraphs, sections, headings to preserve meaning
- Overlapping windows: Adjacent chunks overlap to avoid losing boundary information
- Mixed strategies: Title + content, code by function/class, tables handled separately
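The fixed-length and semantic strategies above can be sketched as follows; the sizes and the blank-line paragraph delimiter are illustrative assumptions, not prescriptions.

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-length chunks with an overlapping window, so content that
    straddles a boundary appears in both neighbouring chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def chunk_paragraphs(text: str, max_size: int = 500) -> list[str]:
    """Semantic chunking: split on blank lines, then greedily pack
    paragraphs into chunks no longer than max_size characters."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_size:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```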
Embedding
Convert text to vectors for similarity:
- Model choice: OpenAI, Cohere, open-source like BGE, m3e, E5
- Dimensions: 768–1536 is common; higher dimensions usually capture more nuance but cost more storage and compute
- Normalization: L2-normalize vectors for cosine similarity in most cases
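The normalization step can be sketched with NumPy: after L2-normalizing rows to unit length, the dot product of two vectors equals their cosine similarity.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Normalize each row vector to unit L2 norm; the clip guards
    against division by zero for all-zero rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# With normalized embeddings, cosine similarity reduces to a matrix product:
#   sims = query_vecs @ doc_vecs.T
```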
Retrieval
- Vector search: FAISS, Milvus, Pinecone, pgvector
- Hybrid retrieval: Vector + keyword (BM25) for both semantic and exact match
- Filtering: Filter by metadata (source, time, type) to narrow the search space
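One common way to combine vector and BM25 result lists is Reciprocal Rank Fusion (RRF); the sketch below merges ranked lists of document IDs, with `k = 60` as the conventional smoothing constant.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several ranked lists (e.g. one from
    vector search, one from BM25) into a single ranking. A document's
    fused score is the sum of 1 / (k + rank) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs only ranks, not raw scores, which sidesteps the problem that vector similarities and BM25 scores live on incompatible scales.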
Reranking
After initial retrieval returns the Top-N candidates (e.g., 50), use a reranker to refine them to Top-K (e.g., 5):
- Cross-encoders: BGE-reranker, Cohere Rerank, etc., score query-doc pairs
- Cost: Only a small candidate set is reranked, so latency and cost stay manageable
- Benefit: Noticeably improves accuracy of the final injected context
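The two-stage refinement can be sketched as below; `score` is a placeholder for a real cross-encoder (e.g., a BGE reranker) that scores each (query, document) pair, not an actual model call.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_k: int = 5) -> list[str]:
    """Score each (query, doc) pair with a cross-encoder-style function
    and keep only the Top-K highest-scoring candidates."""
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_k]
```

Because `score` runs on each pair individually, it is applied only to the small Top-N candidate set, which is what keeps latency and cost manageable.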
Optimization Strategies
- Query rewriting: Rephrase or expand the user query before retrieval to improve recall
- Context compression: Trim or summarize retrieved chunks so only relevant content consumes the prompt budget
Summary
RAG quality depends on how chunking, embedding, retrieval, and reranking work together. Start with the chunking strategy, choose good embedding and reranker models, then add hybrid retrieval and context compression for systematic gains in accuracy and efficiency.