RAG Pipeline Overview
RAG (Retrieval-Augmented Generation) augments LLM generation with retrieval from external knowledge bases. It is the foundation for enterprise Q&A and document assistants. A typical flow is: chunk → embed → store → retrieve → rerank → inject into prompt → generate.
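The flow above can be sketched end to end. This is a minimal illustration, not any specific library's API: `embed`, `store`, `rerank`, and `llm` are hypothetical stand-ins for a real embedding model, vector store, reranker, and language model.

```python
def chunk(text, size=200, overlap=50):
    # Fixed-length chunking with an overlapping window (see Chunking below).
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def answer(question, docs, embed, store, rerank, llm, top_n=50, top_k=5):
    """End-to-end RAG sketch: chunk -> embed -> store -> retrieve ->
    rerank -> inject into prompt -> generate. All callables are
    placeholders for real components."""
    # Indexing: chunk each document, embed each chunk, store the vectors.
    for doc in docs:
        for piece in chunk(doc):
            store.add(embed(piece), piece)
    # Querying: coarse Top-N retrieval, then refine to Top-K via reranking.
    candidates = store.search(embed(question), top_n)
    context = rerank(question, candidates)[:top_k]
    # Inject the selected context into the prompt and generate.
    prompt = ("Answer using the context below.\n\n"
              + "\n---\n".join(context)
              + f"\n\nQuestion: {question}")
    return llm(prompt)
```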
Chunking
Chunking determines retrieval granularity and quality:
- Fixed length: Simple but may cut semantics; good for general text
- Semantic chunking: Split by paragraphs, sections, headings to preserve meaning
- Overlapping windows: Adjacent chunks overlap to avoid losing boundary information
- Mixed strategies: Title + content, code by function/class, tables handled separately
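The fixed-length and semantic strategies above can be sketched as follows; the sizes and the blank-line paragraph delimiter are illustrative assumptions, not prescriptions.

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-length chunks with an overlapping window, so content that
    straddles a boundary appears in both neighbouring chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def chunk_paragraphs(text: str, max_size: int = 500) -> list[str]:
    """Semantic chunking: split on blank lines, then greedily pack
    paragraphs into chunks no longer than max_size characters."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_size:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```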
Embedding
Convert text to vectors for similarity:
- Model choice: OpenAI, Cohere, open-source like BGE, m3e, E5
- Dimensions: 768–1536 is common; higher dimensions usually capture more nuance but cost more storage and compute
- Normalization: L2-normalize vectors for cosine similarity in most cases
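The normalization step can be sketched with NumPy: after L2-normalizing rows to unit length, the dot product of two vectors equals their cosine similarity.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Normalize each row vector to unit L2 norm; the clip guards
    against division by zero for all-zero rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# With normalized embeddings, cosine similarity reduces to a matrix product:
#   sims = query_vecs @ doc_vecs.T
```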
Retrieval
- Vector search: FAISS, Milvus, Pinecone, pgvector
- Hybrid retrieval: Vector + keyword (BM25) for both semantic and exact match
- Filtering: Filter by metadata (source, time, type) to narrow the search space
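One common way to combine vector and BM25 result lists is Reciprocal Rank Fusion (RRF); the sketch below merges ranked lists of document IDs, with `k = 60` as the conventional smoothing constant.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several ranked lists (e.g. one from
    vector search, one from BM25) into a single ranking. A document's
    fused score is the sum of 1 / (k + rank) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs only ranks, not raw scores, which sidesteps the problem that vector similarities and BM25 scores live on incompatible scales.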
Reranking
After initial retrieval returns the Top-N candidates (e.g., 50), use a reranker to refine them to Top-K (e.g., 5):
- Cross-encoders: BGE-reranker, Cohere Rerank, etc., score query-doc pairs
- Cost: Only a small candidate set is reranked, so latency and cost stay manageable
- Benefit: Noticeably improves accuracy of the final injected context
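The two-stage refinement can be sketched as below; `score` is a placeholder for a real cross-encoder (e.g., a BGE reranker) that scores each (query, document) pair, not an actual model call.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_k: int = 5) -> list[str]:
    """Score each (query, doc) pair with a cross-encoder-style function
    and keep only the Top-K highest-scoring candidates."""
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_k]
```

Because `score` runs on each pair individually, it is applied only to the small Top-N candidate set, which is what keeps latency and cost manageable.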
Optimization Strategies
- Query rewriting: Rephrase or expand the user query before retrieval to improve recall
- Context compression: Trim or summarize retrieved chunks so only relevant content consumes the prompt budget
Summary
RAG quality depends on how chunking, embedding, retrieval, and reranking work together. Start with the chunking strategy, choose good embedding and reranker models, then add hybrid retrieval and context compression for systematic gains in accuracy and efficiency.