Introduction to RAG (Retrieval Augmented Generation): Building with the Claude API

Scenario: you have Apple's 800-page 10-K (annual financial report). A user asks:

What you will learn

Understand the problem RAG solves: documents too large to fit in a prompt
Distinguish the two options: stuff the whole document vs. chunk + retrieve
Know the RAG trade-off: complexity goes up, but scale and cost improve
Recognize when you need RAG vs. when to send the full document

The RAG recipe

Three preprocessing steps (one-time), four runtime steps (per query).

┌──────────────────────────────────────────────────┐
│                                                  │
│  PREPROCESSING (once):                           │
│     1. Chunk docs into pieces                    │
│     2. Generate embeddings for each chunk        │
│     3. Store in a vector database                │
│                                                  │
│  QUERY TIME (per request):                       │
│     4. Embed user question                       │
│     5. Search similar chunks (top K)             │
│     6. Include top K chunks in prompt            │
│     7. Send to Claude → response                 │
│                                                  │
└──────────────────────────────────────────────────┘
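The seven steps can be sketched end to end in a few lines. This is a minimal illustration under loud assumptions, not a production setup: `embed` is a toy word-count stand-in for a real embedding model, the Python list returned by `build_index` stands in for a vector database, and step 7 (the call to Claude) is left out so the sketch runs offline. All function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def chunk(text, size=40):
    """Step 1: split text into pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Steps 2 and 4: toy embedding (word counts); a real system calls an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(docs, size=40):
    """Step 3: store (chunk, vector) pairs; a list stands in for a vector database."""
    return [(c, embed(c)) for d in docs for c in chunk(d, size)]

def retrieve(index, question, k=3):
    """Steps 4 and 5: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

def build_prompt(chunks, question):
    """Step 6: place the retrieved chunks in the prompt ahead of the question."""
    context = "\n\n".join(f"<chunk>\n{c}\n</chunk>" for c in chunks)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"

# Hypothetical mini-corpus for illustration only.
docs = [
    "Employees may work from home up to three days per week with manager approval.",
    "Expense reports must be submitted within 30 days of purchase.",
    "The office is closed on public holidays.",
]
question = "How many days can I work from home?"
index = build_index(docs)
top = retrieve(index, question, k=1)
prompt = build_prompt(top, question)  # step 7 would send this string to Claude
```

Word-count similarity only matches exact words, which is why real pipelines use learned embeddings; the control flow, however, is the same.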

When should you use RAG?

✅ Use RAG when

Document > 100K tokens
Multi-doc corpus (search across all)
Cost-sensitive (high query volume)
User questions are unpredictable (can't pre-select relevant sections)
Long-term knowledge base (docs update, re-index)

❌ Skip RAG when

Doc < 20K tokens → stuffing it into the prompt is fine
Static doc, few queries → the cost of stuffing is acceptable
Need a holistic view (summarize the entire doc) → stuff
Questions about the document's global structure → RAG misses the big picture

⚠️ Consider a hybrid

Doc of 100K-1M tokens: use Sonnet with a 1M-token context window plus caching. You may not need RAG.

Benefits vs Challenges

Benefits	Challenges
Scale to huge corpora	Preprocessing setup
Lower cost per query	Need search mechanism
Faster responses	Retrieval quality affects answer
Focused context → better answers	Chunk strategy matters
Works across multiple docs	Relevant context might be missed

Example: Corporate wiki

Setup

Company wiki: 5000 articles, 2M tokens total.

Without RAG

Not feasible: 2M tokens exceeds every context window, and sending that much text with each query would cost roughly 100x more.

With RAG

Preprocessing: chunk 5000 articles → 20K chunks. Embed. Store.
Query: "What is the WFH policy?" → retrieve the top 5 relevant chunks → send to Claude.
Cost per query: $0.01
Response time: 2-3s

100 users × 10 queries each per day = 1000 queries/day ≈ $10/day. Affordable.
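The $10/day figure follows directly from the numbers in this example. A quick arithmetic check, using only values stated here ($0.01 per query is the example's estimate, not a measured price):

```python
users = 100
queries_per_user_per_day = 10
cost_per_query_usd = 0.01  # the example's estimate, not a measured price

queries_per_day = users * queries_per_user_per_day
daily_cost_usd = queries_per_day * cost_per_query_usd

print(f"{queries_per_day} queries/day -> ${daily_cost_usd:.2f}/day")
```

Swapping in your own user count and per-query cost gives a first-order budget estimate before you build anything.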

Example: Customer support KB

Scenario

500 product docs. Customer asks: "Why is my device not charging?"

RAG flow

Embed the question
Retrieve: "Troubleshooting charging issues" article (top match)
Claude receives article + question → response
Claude cites the source article

Result: an accurate, source-backed answer that scales across 500 docs.
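One way to get a source-backed answer is to pass the retrieved article to Claude together with its title and ask for a citation. The sketch below only builds the prompt string. The tag names, the `build_support_prompt` helper, and the placeholder article body are illustrative assumptions, not a required format.

```python
def build_support_prompt(article_title, article_text, question):
    """Wrap one retrieved article and the customer's question in a single prompt."""
    return (
        "Answer the customer's question using only the article below. "
        "Cite the article title in your answer.\n\n"
        f'<article title="{article_title}">\n{article_text}\n</article>\n\n'
        f"<question>{question}</question>"
    )

prompt = build_support_prompt(
    "Troubleshooting charging issues",
    "(retrieved article text goes here)",  # placeholder, not real article content
    "Why is my device not charging?",
)
```

Keeping the title next to the text gives the model something concrete to cite, which makes answers easier to verify against the knowledge base.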

Technical decisions in RAG

You'll decide:

Chunk size (500-2000 tokens typical)
Chunk overlap (10-20% for context preservation)
Chunking method (size-based, sentence, structural, semantic)
Embedding model (VoyageAI, OpenAI, Cohere)
Vector DB (Pinecone, Chroma, Qdrant, Weaviate)
Top K (retrieve 5-20 chunks)
Re-rank (use Claude or a dedicated reranker to refine the top K)
Hybrid search (semantic + keyword)

Every decision involves a trade-off. Module 7 covers each piece.
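To make chunk size and overlap concrete, here is a minimal sketch of size-based chunking with overlap. It assumes the text is already tokenized into a list (the demo uses made-up token labels rather than a real tokenizer), and `chunk_with_overlap` is a hypothetical helper, not a library function.

```python
def chunk_with_overlap(tokens, chunk_size=500, overlap_ratio=0.1):
    """Split a token list into fixed-size chunks; neighbors share `overlap_ratio` of a chunk."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be positive and overlap_ratio in [0, 1)")
    overlap = round(chunk_size * overlap_ratio)
    step = max(1, chunk_size - overlap)  # always advance by at least one token
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

# Made-up token labels stand in for the output of a real tokenizer.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_with_overlap(tokens, chunk_size=500, overlap_ratio=0.1)
print([len(c) for c in chunks])
```

With a 10% overlap, the last 50 tokens of one chunk reappear as the first 50 of the next, so a sentence cut at a boundary still shows up whole in one of the two chunks.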

Anti-patterns preview

❌ RAG for every task

Simple Q&A over a single 10-page doc → stuff it into the prompt; no RAG needed.

❌ Chunks without overlap

Information gets cut mid-sentence → the chunk is missing context.

❌ Top K=1

Only one chunk → not enough context.

❌ No eval on retrieval

You can't tell whether the top K actually contains the relevant information.
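A basic retrieval eval can be as simple as a hit rate: for a set of test queries with hand-labeled relevant chunks, how often does the top K contain at least one of them? The sketch below assumes you have such labels. The chunk IDs, queries, and the `hit_rate_at_k` helper are all hypothetical.

```python
def hit_rate_at_k(results, relevant, k):
    """Fraction of queries whose top-k retrieved chunk IDs include at least one relevant ID."""
    hits = 0
    for query, retrieved_ids in results.items():
        if set(retrieved_ids[:k]) & relevant[query]:
            hits += 1
    return hits / len(results) if results else 0.0

# Hypothetical labels: chunk IDs a human marked as relevant for each query.
relevant = {
    "wfh policy": {"c12"},
    "expense deadline": {"c40", "c41"},
    "holiday schedule": {"c77"},
}
# Hypothetical retriever output: ranked chunk IDs per query.
results = {
    "wfh policy": ["c12", "c03", "c55"],
    "expense deadline": ["c09", "c41", "c02"],
    "holiday schedule": ["c01", "c02", "c03"],
}

for k in (1, 3):
    print(f"hit rate @ {k}: {hit_rate_at_k(results, relevant, k):.2f}")
```

Tracking this number while you change chunk size, overlap, or top K tells you whether retrieval improved, independently of how well the model writes the final answer.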

Each upcoming lesson covers its anti-patterns in full.

Apply it now

Exercise 1: Assess RAG need (10 minutes)

For your app:

Total corpus size: ___
Queries/day: ___
Model context window: ___
Cost per query (stuffed): ___ vs (RAG): ___

Decision: RAG yes/no? Why?

Exercise 2: Chunk sketch (10 minutes)

Take one sample doc. Plan a chunking strategy:

Method: size/sentence/structure?
Chunk size: ___
Overlap: ___

Summary

🎯 RAG = chunk docs, retrieve the relevant pieces, inject them into the prompt. The payoff is scale and cost.

🎯 Preprocess once, query at runtime. A 7-step pipeline.

🎯 Trade-off: complexity goes up; cost and scale improve dramatically.

🎯 Use RAG when: big corpus, high volume, unpredictable queries.

🎯 Module 7 covers each piece: chunking, embeddings, search, hybrid, multi-index.
