Introduction to RAG: Retrieval Augmented Generation

7. RAG · Intermediate · 20 minutes

Scenario: you have Apple's 800-page 10-K (financial report). A user asks:

What you will learn
  • Understand the problem RAG solves: documents too large for the prompt
  • Distinguish the 2 options: stuff vs chunk+retrieve
  • Know the RAG trade-off: complexity goes up, but scale and cost improve
  • Recognize when you need RAG vs sending the full doc

The RAG recipe

3 preprocessing steps (one-time), 4 runtime steps (per query).

┌──────────────────────────────────────────────────┐
│                                                  │
│  PREPROCESSING (one time):                       │
│     1. Chunk docs into pieces                    │
│     2. Generate embeddings for each chunk        │
│     3. Store in vector database                  │
│                                                  │
│  QUERY TIME (per request):                       │
│     4. Embed user question                       │
│     5. Search similar chunks (top K)             │
│     6. Include top K chunks in prompt            │
│     7. Send to Claude → response                 │
│                                                  │
└──────────────────────────────────────────────────┘
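
The seven steps above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not a production pipeline: the bag-of-words `embed` function stands in for a real embedding model, a plain Python list stands in for the vector database, and step 7 (sending the prompt to Claude) is left out so the example runs without any API key. All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text, size=40):
    """Step 1: split text into pieces of about `size` words (size-based chunking)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Steps 2 and 4: toy embedding, a bag-of-words count vector.
    Assumption: a real system would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs, size=40):
    """Step 3: store (chunk, embedding) pairs; a list stands in for a vector DB."""
    return [(c, embed(c)) for d in docs for c in chunk(d, size)]


def retrieve(index, question, k=3):
    """Steps 4-5: embed the question and return the top-k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]


def build_prompt(chunks, question):
    """Step 6: place the retrieved chunks in the prompt (step 7, the model call, is omitted)."""
    context = "\n\n".join(f"<chunk>{c}</chunk>" for c in chunks)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"


# Hypothetical sample documents, invented for this sketch.
docs = [
    "Employees may work from home up to three days per week with manager approval.",
    "Expense reports must be submitted within thirty days of purchase.",
    "The office is closed on public holidays and the last week of December.",
]
question = "How many days can I work from home?"

index = build_index(docs)              # preprocessing, done once
top = retrieve(index, question, k=1)   # query time, done per request
prompt = build_prompt(top, question)
print(prompt)
```

Swapping `embed` for a real embedding model and the list for a vector database changes the quality and scale of retrieval, but the shape of the pipeline stays the same.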

When should you use RAG?

✅ You need RAG when

  • Document > 100K tokens
  • Multi-doc corpus (search across all)
  • Cost-sensitive (high query volume)
  • User questions are unpredictable (can't pre-select relevant sections)
  • Long-term knowledge base (docs update, re-index)

❌ You don't need RAG when

  • Doc < 20K tokens → stuffing it into the prompt is OK
  • Static doc, few queries → the cost of stuffing is fine
  • You need a holistic view (summarize the entire doc) → stuff
  • Q&A about global structure → RAG misses the big picture

⚠️ Consider hybrid

  • Doc 100K-1M tokens: use Sonnet with 1M context + caching. RAG may not be needed.
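
The rules of thumb above can be written down as a small decision helper. This is only a sketch: the function name, parameters, and return labels are hypothetical, the thresholds are the ones listed above, and the lesson gives no rule for the 20K-100K range, so the sketch returns "depends" there rather than guessing.

```python
def recommend_strategy(total_tokens, multi_doc=False, needs_holistic_view=False):
    """Map the lesson's rules of thumb to a label: 'stuff', 'rag',
    'rag_or_long_context', or 'depends'. Thresholds are heuristics, not hard limits."""
    # Whole-document tasks (e.g. summarizing everything) need the full text,
    # as long as it can still fit in a 1M-token context window.
    if needs_holistic_view and total_tokens <= 1_000_000:
        return "stuff"
    # Multi-doc corpora and anything beyond 1M tokens call for retrieval.
    if multi_doc or total_tokens > 1_000_000:
        return "rag"
    if total_tokens < 20_000:
        return "stuff"
    if total_tokens >= 100_000:
        # 100K-1M tokens: long context plus caching may be enough.
        return "rag_or_long_context"
    # Assumption: the 20K-100K range is not covered by the rules above.
    return "depends"


print(recommend_strategy(10_000))
print(recommend_strategy(300_000))
print(recommend_strategy(2_000_000))
```

Query volume and cost sensitivity are left out of the sketch on purpose; they shift the answer toward RAG but are hard to reduce to a single threshold.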

Benefits vs Challenges

Benefits                            | Challenges
Scale to huge corpora               | Preprocessing setup
Lower cost per query                | Need a search mechanism
Faster responses                    | Retrieval quality affects the answer
Focused context → better answers    | Chunking strategy matters
Works with multi-doc corpora        | Relevant context might be missed

Example: Corporate wiki

Setup

Company wiki: 5000 articles, 2M tokens total.

Without RAG

Impossible: the corpus exceeds every context window, and cost would be 100x higher.

With RAG

  • Preprocessing: chunk 5000 articles → 20K chunks. Embed. Store.
  • Query: "What is the WFH policy?" → retrieve top 5 relevant chunks → send to Claude.
  • Cost per query: $0.01
  • Response time: 2-3s

100 users, 10 queries each per day = $10/day. Affordable.
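
The daily cost figure follows directly from the numbers in this example (100 users, 10 queries each per day, $0.01 per query). A short sketch of the arithmetic, with a hypothetical helper name, makes it easy to plug in your own traffic:

```python
def daily_cost_usd(users, queries_per_user_per_day, cost_per_query_usd):
    """Daily spend = number of queries per day times cost per query."""
    queries_per_day = users * queries_per_user_per_day
    return queries_per_day * cost_per_query_usd


# Numbers taken from the corporate wiki example.
cost = daily_cost_usd(users=100, queries_per_user_per_day=10, cost_per_query_usd=0.01)
print(f"${cost:.2f}/day")  # $10.00/day
```

The $0.01 per query is the lesson's illustrative figure; real cost depends on the model, chunk size, and top K.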

Example: Customer support KB

Scenario

500 product docs. Customer asks: "Why is my device not charging?"

RAG flow

  • Embed the question
  • Retrieve: "Troubleshooting charging issues" article (top match)
  • Claude receives article + question → response
  • Claude cites the article as its source

Result: an accurate, source-backed answer that scales across 500 docs.
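
One way to make source citation possible is to keep each retrieved article's title next to its text when building the prompt. The sketch below assumes retrieval has already happened; the `RetrievedArticle` class, the `build_support_prompt` function, the XML-style tags, and the article body are all hypothetical illustrations, and the model call itself is omitted.

```python
from dataclasses import dataclass


@dataclass
class RetrievedArticle:
    title: str
    text: str


def build_support_prompt(question, articles):
    """Wrap each article with its title so the answer can cite it by name."""
    sources = "\n\n".join(
        f'<article index="{i}" title="{a.title}">\n{a.text}\n</article>'
        for i, a in enumerate(articles, start=1)
    )
    return (
        "Answer the customer's question using only the articles below. "
        "Cite the title of each article you rely on.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )


# The article body is placeholder text invented for this sketch.
top_match = RetrievedArticle(
    title="Troubleshooting charging issues",
    text="(article text returned by retrieval goes here)",
)
support_prompt = build_support_prompt("Why is my device not charging?", [top_match])
print(support_prompt)
```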

Technical decisions in RAG

You'll decide:

  • Chunk size (500-2000 tokens typical)
  • Chunk overlap (10-20% for context preservation)
  • Chunking method (size-based, sentence, structural, semantic)
  • Embedding model (VoyageAI, OpenAI, Cohere)
  • Vector DB (Pinecone, Chroma, Qdrant, Weaviate)
  • Top K (retrieve 5-20 chunks)
  • Re-rank (use Claude or a dedicated reranker to refine the top K)
  • Hybrid search (semantic + keyword)

Each decision has a trade-off. Module 7 covers each piece.
Anti-patterns preview

❌ RAG for every task

Simple Q&A on one 10-page doc → stuff it into the prompt; no RAG needed.

❌ Chunks without overlap

Information gets cut mid-sentence → the chunk is missing context.
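
A size-based chunker with overlap shows how the problem is avoided: each chunk repeats the tail of the previous one, so text near a boundary appears in both. This is a minimal sketch that treats whitespace-separated words as tokens, which is an assumption made for simplicity; a real pipeline would count tokens with the model's tokenizer. The function name and sample sentence are hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split `tokens` into windows of `chunk_size`, each sharing `overlap`
    tokens with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


words = "the charger must be plugged in before the battery indicator turns green".split()
chunks = chunk_with_overlap(words, chunk_size=5, overlap=1)
for c in chunks:
    print(" ".join(c))
```

With `overlap=1` the word "plugged" ends the first chunk and starts the second, so neither chunk loses the boundary word.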

❌ Top K=1

Only 1 chunk → not enough context.

❌ No eval on retrieval

You can't tell whether the top K actually contains the relevant info.
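
A simple starting point for retrieval evaluation is a hit rate at K: for a small set of test questions with known relevant chunks, measure how often at least one relevant chunk appears in the top K. The sketch below assumes you already have retrieval results and hand-labeled relevant chunk ids; the function name, query strings, and chunk ids are hypothetical sample data.

```python
def hit_rate_at_k(results, relevant, k):
    """Fraction of queries whose top-k retrieved ids include at least one relevant id.

    results:  {query: [chunk ids in ranked order]}
    relevant: {query: set of chunk ids labeled as relevant}
    """
    if not results:
        return 0.0
    hits = 0
    for query, retrieved_ids in results.items():
        if set(retrieved_ids[:k]) & relevant.get(query, set()):
            hits += 1
    return hits / len(results)


# Hypothetical retrieval output and labels for two test queries.
results = {
    "wfh policy": ["c12", "c7", "c3"],
    "expense deadline": ["c9", "c2", "c5"],
}
relevant = {
    "wfh policy": {"c7"},
    "expense deadline": {"c1"},
}
print(hit_rate_at_k(results, relevant, k=1))  # 0.0
print(hit_rate_at_k(results, relevant, k=3))  # 0.5
```

A low hit rate points to a retrieval problem (chunking, embeddings, or K) rather than a generation problem, which tells you where to look first.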

The full anti-patterns are covered in each of the following lessons.

Apply it now

Exercise 1: Assess RAG need (10 minutes)

For your app:

  • Total corpus size: ___
  • Queries/day: ___
  • Model context window: ___
  • Cost per query (stuffed): ___ vs (RAG): ___

Decision: RAG yes/no? Why?

Exercise 2: Chunk sketch (10 minutes)

Take 1 sample doc. Plan a chunking strategy:

  • Method: size/sentence/structure?
  • Chunk size: ___
  • Overlap: ___

Summary

🎯 RAG = chunk docs, retrieve the relevant pieces, inject them into the prompt. The payoff is scale and cost.

🎯 Preprocess once, query at runtime. A 7-step pipeline.

🎯 Trade-off: complexity goes up; cost and scale improve dramatically.

🎯 Use RAG when: big corpus, high volume, unpredictable queries.

🎯 Module 7 covers each piece: chunking, embeddings, search, hybrid, multi-index.
