Scenario: you have Apple's 10-K (annual financial report), 800 pages long, and a user asks a question about it.
Goals for this lesson:
- Understand the problem RAG solves: a document too large for the prompt
- Distinguish the two options: stuff vs. chunk + retrieve
- Know the RAG trade-off: complexity goes up, but scale and cost improve
- Recognize when you need RAG vs. sending the full doc
The RAG recipe
Three preprocessing steps (one-time), then four runtime steps (per query).
PREPROCESSING (once):
  1. Chunk docs into pieces
  2. Generate embeddings for each chunk
  3. Store in a vector database

QUERY TIME (every request):
  4. Embed the user question
  5. Search for similar chunks (top K)
  6. Include the top K chunks in the prompt
  7. Send to Claude → response
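The seven steps above can be sketched end to end in a few lines. This is a toy illustration, not a production setup: the bag-of-words `embed` function stands in for a real embedding model, a plain Python list stands in for a vector database, chunks have no overlap for brevity, and step 7 (the actual call to Claude) is left out. All function names and sample documents are made up for the example.

```python
import math
import re
from collections import Counter

def chunk(text: str, size: int = 40) -> list[str]:
    """Step 1: split text into chunks of at most `size` words (no overlap here)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Steps 2 and 4: toy 'embedding' (word counts), a stand-in for a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(docs: list[str]) -> list[tuple[str, Counter]]:
    """Steps 1-3: chunk, embed, store (a list stands in for the vector DB)."""
    return [(c, embed(c)) for d in docs for c in chunk(d)]

def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Steps 4-5: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(chunks: list[str], question: str) -> str:
    """Step 6: place the retrieved chunks in the prompt ahead of the question."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Made-up sample documents.
docs = [
    "Employees may work from home up to three days per week with manager approval.",
    "Expense reports must be submitted within 30 days of purchase.",
]
index = build_index(docs)                      # preprocessing, done once
question = "How many days can I work from home?"
top_chunks = retrieve(index, question, k=1)    # query time, done per request
prompt = build_prompt(top_chunks, question)
print(prompt)
```

Swapping in a real embedding model and vector store changes `embed` and `build_index`, but the shape of the pipeline stays the same.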
When should you use RAG?
✅ You need RAG when
- The document exceeds 100K tokens
- The corpus spans multiple docs (search across all of them)
- You are cost-sensitive (high query volume)
- User questions are unpredictable (you can't pre-select the relevant sections)
- It is a long-term knowledge base (docs update, so you re-index)
❌ You don't need RAG when
- The doc is under 20K tokens → stuffing it into the prompt is fine
- The doc is static with few queries → the cost of stuffing is acceptable
- You need a holistic view (e.g., summarizing the entire doc) → stuff
- The questions are about the doc's global structure → RAG misses the big picture
⚠️ Consider hybrid
- Docs of 100K-1M tokens: use Sonnet with a 1M-token context window plus caching. You may not need RAG.
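The rules of thumb above can be written down as a small helper. This is only one possible reading of the checklist: the function name, the parameter names, and the "either" answer for the 20K-100K range (which the checklist does not cover) are assumptions made for the sketch.

```python
def suggest_strategy(doc_tokens: int, multi_doc: bool = False,
                     needs_whole_doc: bool = False) -> str:
    """Map the lesson's rules of thumb to a suggestion (hypothetical helper)."""
    if multi_doc or doc_tokens > 1_000_000:
        return "rag"      # multi-doc corpus, or too large even for a 1M window
    if needs_whole_doc:
        return "stuff"    # holistic tasks such as summarizing the entire doc
    if doc_tokens > 100_000:
        return "hybrid"   # 100K-1M: weigh long context + caching against RAG
    if doc_tokens < 20_000:
        return "stuff"    # small doc: put it in the prompt
    return "either"       # 20K-100K: not covered by the checklist (assumption)

print(suggest_strategy(10_000))                    # small single doc
print(suggest_strategy(500_000))                   # large single doc
print(suggest_strategy(50_000, multi_doc=True))    # multi-doc corpus
```

Query volume and how often the docs change also matter; they are left out here to keep the sketch short.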
Benefits vs Challenges
| Benefits | Challenges |
|---|---|
| Scale to huge corpora | Preprocessing setup |
| Lower cost per query | Need search mechanism |
| Faster responses | Retrieval quality affects answer |
| Focused context → better answers | Chunk strategy matters |
| Works across multiple docs | Relevant context may be missed |
Example: Corporate wiki
Setup
Company wiki: 5000 articles, 2M tokens total.
Without RAG
Not feasible: 2M tokens exceeds the context window, and stuffing would cost roughly 100x more per query than RAG.
With RAG
- Preprocessing: chunk the 5000 articles → 20K chunks. Embed. Store.
- Query: "What is the WFH policy?" → retrieve the top 5 relevant chunks → send to Claude.
- Cost per query: $0.01
- Response time: 2-3s
100 users × 10 queries each per day × $0.01 = $10/day. Affordable.
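The daily figure follows directly from the numbers in this example. A quick check of the arithmetic, with a 30-day projection added as an illustration (the 30-day window is an assumption, not a number from the example):

```python
users = 100
queries_per_user_per_day = 10
cost_per_query_usd = 0.01   # per-query RAG cost from the example

queries_per_day = users * queries_per_user_per_day
daily_cost_usd = queries_per_day * cost_per_query_usd
monthly_cost_usd = daily_cost_usd * 30   # assumed 30-day month

print(f"{queries_per_day} queries/day -> ${daily_cost_usd:.2f}/day, "
      f"${monthly_cost_usd:.2f} per 30 days")
```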
Example: Customer support KB
Scenario
500 product docs. A customer asks: "Why is my device not charging?"
RAG flow
- Embed the question
- Retrieve the "Troubleshooting charging issues" article (top match)
- Claude receives the article + question → response
- Claude cites the source article
Result: an accurate, source-backed answer that scales across 500 docs.
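For the last two steps of the flow, one simple way to make citation possible is to label each retrieved chunk with its source before it goes into the prompt. The sketch below assumes a hypothetical `RetrievedChunk` type and a made-up instruction line and article text; it only builds the prompt string and does not call any API.

```python
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    source: str  # hypothetical field: title of the article the chunk came from
    text: str

def build_cited_prompt(chunks: list[RetrievedChunk], question: str) -> str:
    """Number each chunk and show its source so the answer can cite it."""
    blocks = [
        f"[{i}] Source: {c.source}\n{c.text}"
        for i, c in enumerate(chunks, start=1)
    ]
    context = "\n\n".join(blocks)
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )

# Made-up article text for illustration.
chunks = [
    RetrievedChunk(
        source="Troubleshooting charging issues",
        text="Check the cable for damage and try a different power adapter.",
    )
]
prompt = build_cited_prompt(chunks, "Why is my device not charging?")
print(prompt)
```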
Technical decisions in RAG
You'll decide:
- Chunk size (500-2000 tokens typical)
- Chunk overlap (10-20% for context preservation)
- Chunking method (size-based, sentence, structural, semantic)
- Embedding model (Voyage AI, OpenAI, Cohere)
- Vector DB (Pinecone, Chroma, Qdrant, Weaviate)
- Top K (retrieve 5-20 chunks)
- Re-rank (use Claude or a dedicated reranker to refine the top K)
- Hybrid search (semantic + keyword)
Each decision has a trade-off. Module 7 covers each piece.
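As a taste of the hybrid search decision, one common way to merge a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). This lesson does not prescribe RRF; it is shown only as an example of how two result lists can be combined. The chunk IDs are made up, and the constant 60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], rrf_k: int = 60) -> list[str]:
    """Merge ranked lists: each list adds 1 / (rrf_k + rank) to a chunk's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest combined score first.
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

semantic_hits = ["c3", "c1", "c7"]   # made-up IDs from a vector search
keyword_hits = ["c1", "c9", "c3"]    # made-up IDs from a keyword search
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that appear in both lists (`c1`, `c3`) rise to the top, which is the point of combining the two signals.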
Anti-patterns preview
❌ RAG for every task
Simple Q&A over a single 10-page doc → stuff it into the prompt; no RAG needed.
❌ Chunks without overlap
Information gets cut mid-sentence → the chunk is missing context.
❌ Top K = 1
Only one chunk → not enough context.
❌ No eval on retrieval
You don't know whether the top K actually contains the relevant info.
Each upcoming lesson covers its anti-patterns in full.
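To avoid the "no eval on retrieval" anti-pattern, even a very small check helps. The sketch below computes a hit rate at K: the fraction of test questions for which at least one known-relevant chunk appears in the top K results. The metric choice, the function name, and the hand-labeled sample data are all assumptions for illustration.

```python
def hit_rate_at_k(retrieved: dict[str, list[str]],
                  relevant: dict[str, set[str]], k: int) -> float:
    """Fraction of questions whose top-k results include a relevant chunk."""
    if not retrieved:
        return 0.0
    hits = sum(
        1 for question, chunk_ids in retrieved.items()
        if relevant[question] & set(chunk_ids[:k])
    )
    return hits / len(retrieved)

# Made-up retrieval results and hand-labeled relevant chunks.
retrieved = {
    "q1": ["c1", "c2", "c3"],
    "q2": ["c9", "c4", "c5"],
}
relevant = {
    "q1": {"c2"},
    "q2": {"c7"},
}
print(hit_rate_at_k(retrieved, relevant, k=3))  # 0.5
print(hit_rate_at_k(retrieved, relevant, k=1))  # 0.0
```

Comparing the two calls also shows why Top K = 1 is risky: the relevant chunk for `q1` sits at rank 2 and is lost when only one chunk is kept.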
Apply it now
Exercise 1: Assess RAG need (10 minutes)
For your app:
- Total corpus size: ___
- Queries/day: ___
- Model context window: ___
- Cost per query (stuffed): ___ vs (RAG): ___
Decision: RAG yes/no? Why?
Exercise 2: Chunk sketch (10 minutes)
Take one sample doc. Plan the chunk strategy:
- Method: size/sentence/structure?
- Chunk size: ___
- Overlap: ___
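If you pick the size-based method for Exercise 2, a starting point might look like the sketch below. It treats whitespace-separated words as a rough stand-in for tokens (a real tokenizer would count differently), and the function name and sample values are made up.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Size-based chunking where consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this chunk already reached the end of the document
    return chunks

tokens = [f"t{i}" for i in range(10)]   # stand-in for a tokenized document
for piece in chunk_with_overlap(tokens, size=4, overlap=1):
    print(piece)
# ['t0', 't1', 't2', 't3']
# ['t3', 't4', 't5', 't6']
# ['t6', 't7', 't8', 't9']
```

Each chunk starts with the last token of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.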
Summary
🎯 RAG = chunk docs, retrieve the relevant chunks, inject them into the prompt. The payoff is scale and cost.
🎯 Preprocess once, query at runtime. A 7-step pipeline.
🎯 Trade-off: complexity goes up; cost and scale improve dramatically.
🎯 Use RAG when: big corpus, high volume, unpredictable queries.
🎯 Module 7 covers each piece: chunking, embeddings, search, hybrid, multi-index.