AI Learning Studio

Agent Development · 2026-03-17 · 2 min read

RAG Architecture Design & Optimization

Master RAG design, from chunking and embedding to retrieval and reranking


RAG Pipeline Overview

RAG (Retrieval-Augmented Generation) augments LLM generation with retrieval from external knowledge bases. It is the foundation for enterprise Q&A and document assistants. A typical flow is: chunk → embed → store → retrieve → rerank → inject into prompt → generate.
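The last two steps of that flow, "inject into prompt → generate", reduce to assembling retrieved chunks and the user question into a grounded prompt. A minimal sketch (the template wording is illustrative, not a standard):

```python
def build_prompt(question: str, contexts: list[str]) -> str:
    """Assemble retrieved chunks and the question into a grounded prompt.

    `contexts` holds the Top-K reranked chunks; each is numbered so the
    model (and the user) can reference its sources.
    """
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string is what gets sent to the LLM; everything before this point in the pipeline exists to make `contexts` as relevant as possible.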

Chunking

Chunking determines retrieval granularity and quality:

  • Fixed length: Simple but may cut semantics; good for general text
  • Semantic chunking: Split by paragraphs, sections, headings to preserve meaning
  • Overlapping windows: Adjacent chunks overlap to avoid losing boundary information
  • Mixed strategies: Title + content, code by function/class, tables handled separately

Recommended chunk size: 256–512 tokens, tuned by document type.
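The fixed-length and overlapping-window strategies can be combined in a few lines. This sketch counts characters for simplicity; a production chunker would count tokens with the embedding model's tokenizer:

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Fixed-length chunking with overlapping windows.

    Adjacent chunks share `overlap` characters so information at chunk
    boundaries is not lost to either neighbor.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    step = size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break  # the window has reached the end of the text
    return chunks
```

Semantic chunking would replace the fixed `step` with splits at paragraph or heading boundaries, keeping the same overlap idea for long sections.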

Embedding

Convert text to vectors for similarity:

  • Model choice: OpenAI, Cohere, open-source like BGE, m3e, E5
  • Dimensions: 768–1536 is common; higher dimensions usually retrieve better but cost more storage and compute
  • Normalization: L2-normalize vectors for cosine similarity in most cases
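The normalization point is worth making concrete: after L2 normalization, cosine similarity is just a dot product, which is why most vector stores are configured for inner-product search over normalized embeddings. A stdlib-only sketch:

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # On unit vectors, cosine similarity reduces to a plain dot product.
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
```

In practice the normalization happens once at indexing time, so queries only pay for one dot product per candidate.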

Retrieval

  • Vector search: FAISS, Milvus, Pinecone, pgvector
  • Hybrid retrieval: Vector + keyword (BM25) for both semantic and exact match
  • Filtering: Filter by metadata (source, time, type) to narrow the search space
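Hybrid retrieval boils down to merging two score sets for the same document pool. One common approach is a weighted blend; this sketch assumes both retrievers have already normalized their scores to [0, 1] (real systems often min-max scale them first):

```python
def hybrid_scores(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    alpha: float = 0.7,
) -> dict[str, float]:
    """Blend vector and keyword (e.g. BM25) scores per document id.

    `alpha` weights the semantic side; documents found by only one
    retriever get 0.0 from the other.
    """
    docs = set(vector_scores) | set(keyword_scores)
    return {
        d: alpha * vector_scores.get(d, 0.0) + (1 - alpha) * keyword_scores.get(d, 0.0)
        for d in docs
    }
```

Metadata filtering slots in before this step: restrict both retrievers to the filtered subset, then fuse.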

Reranking

After initial retrieval returns Top-N (e.g., 50), use a Reranker to refine to Top-K (e.g., 5):

  • Cross-encoders: BGE-reranker, Cohere Rerank, etc., score query-doc pairs
  • Cost: Only a small candidate set is reranked, so latency and cost stay manageable
  • Benefit: Noticeably improves accuracy of the final injected context
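The Top-N → Top-K refinement is a small amount of glue around the scoring model. In this sketch, `score_pair` is a placeholder for a real cross-encoder call (e.g. a BGE-reranker or Cohere Rerank request); the surrounding logic is what the pipeline itself owns:

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 5,
) -> list[str]:
    """Re-score Top-N candidates with a cross-encoder-style scorer
    and keep the Top-K. `score_pair(query, doc)` stands in for the
    actual model; only N (e.g. 50) pairs are scored, which is why
    reranking stays affordable."""
    ranked = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return ranked[:top_k]
```

For testing the plumbing, a toy scorer such as word overlap works; swapping in the real model changes only `score_pair`.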

Optimization Strategies

  • Query rewriting: Expand user questions into multiple sub-queries and merge results
  • HyDE: Use the LLM to generate a hypothetical answer first, then retrieve using it as the query for better semantic alignment
  • Multi-path recall: Vector + BM25 + graph retrieval, then fuse and rank
  • Context compression: Summarize or extract key sentences from retrieved long documents to reduce token usage
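For multi-path recall, the retrievers return ranked lists rather than comparable scores, so a rank-based fusion such as Reciprocal Rank Fusion (RRF) is a common way to merge them:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists from multiple retrievers (vector, BM25, graph).

    Each ranking lists doc ids best-first. A document's fused score is
    the sum of 1 / (k + rank) over every list it appears in; k=60 is
    the constant commonly used in the RRF literature.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

Because RRF only uses ranks, it needs no score normalization across retrievers, which is exactly the problem heterogeneous recall paths create.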
Summary

RAG quality depends on how chunking, embedding, retrieval, and reranking work together. Start with the chunking strategy, choose good embedding and reranker models, then add hybrid retrieval and context compression for systematic gains in accuracy and efficiency.

Flash Cards

Question: How does chunking strategy affect RAG retrieval quality?

Answer: Large chunks add noise; small chunks lose context. Common strategies: semantic-boundary chunking, overlapping sliding windows, mixed-size chunks. Adjust for the domain (e.g., code, legal docs).

Question: Why is a reranking step needed?

Answer: Vector retrieval ranks by semantic similarity and may return relevant but not the most precise documents. Reranking uses stronger cross-encoder models to re-score candidates, improving Top-K accuracy at reasonable cost.

Question: What factors matter when choosing an embedding model?

Answer: Consider dimensions, language support, domain fit (general vs. specialized), latency, and cost. For multilingual text, consider m3e or BGE; for English, OpenAI text-embedding-3; for long text, models supporting 8K+ tokens.