- Walk through the 6 steps of a complete RAG pipeline
- Understand cosine similarity and cosine distance
- Build a prompt with retrieved context
- Ship a minimal working RAG
6-step RAG pipeline
Walk through example.
┌────────────────────────────────────────────────────┐
│                                                    │
│  PREPROCESS (once):                                │
│    1. Chunk document                               │
│    2. Embed chunks                                 │
│    3. Store in vector DB                           │
│                                                    │
│  QUERY (each request):                             │
│    4. Embed user query                             │
│    5. Find similar chunks (top K)                  │
│    6. Build prompt, call Claude                    │
│                                                    │
└────────────────────────────────────────────────────┘
Setup
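from dotenv import load_dotenv
load_dotenv()  # load API keys (VOYAGE_API_KEY, ANTHROPIC_API_KEY) from .env
import voyageai
from anthropic import Anthropic
import numpy as np
voyage = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
anthropic = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-5-20260205"
Step 1: Chunk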
Example corpus:
document = """
## Section 1: Medical Research
This year saw significant strides in our understanding of XDR-47,
a 'bug' we have not seen before. We developed a new treatment protocol
that reduced mortality by 23% in clinical trials.
## Section 2: Software Engineering
This division dedicated significant effort to studying various
infection vectors in our distributed systems. Team fixed 142 critical
bugs and improved system uptime to 99.97%.
## Section 3: Marketing
Q4 campaigns generated 3M impressions, 50K new leads.
Conversion rate improved to 4.2%.
"""
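# Chunk by section: split at every "## " heading marker.
# Anchoring on line starts also strips the marker from the first section.
import re
chunks = re.split(r"^## ", document.strip(), flags=re.MULTILINE)
chunks = [c.strip() for c in chunks if c.strip()]
# 3 chunks
Step 2: Embed chunks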
def embed(text, input_type="document"):
    result = voyage.embed([text], model="voyage-3-large", input_type=input_type)
    return result.embeddings[0]

chunk_embeddings = [embed(c, input_type="document") for c in chunks]
# 3 embeddings, 1024-dim each
Step 3: Store (simple in-memory)
class VectorStore:
    def __init__(self):
        self.docs = []
        self.embeddings = []

    def add(self, content, embedding, metadata=None):
        self.docs.append({"content": content, "metadata": metadata or {}})
        self.embeddings.append(np.array(embedding))

    def search(self, query_emb, k=3):
        query_vec = np.array(query_emb)
        # Normalize so the dot product below equals cosine similarity
        query_vec = query_vec / np.linalg.norm(query_vec)
        sims = []
        for emb in self.embeddings:
            emb_norm = emb / np.linalg.norm(emb)
            sims.append(float(np.dot(query_vec, emb_norm)))
        # Sort by similarity, highest first, and keep the top k
        top_k = sorted(enumerate(sims), key=lambda x: -x[1])[:k]
        return [(self.docs[i], score) for i, score in top_k]

store = VectorStore()
for i, (chunk, emb) in enumerate(zip(chunks, chunk_embeddings)):
    store.add(chunk, emb, metadata={"section": i + 1})
print(f"Stored {len(store.docs)} chunks")
Preprocess done. Now query time.
Step 4: Embed query
query = "How many software bugs did engineers fix?"
query_emb = embed(query, input_type="query")
Step 5: Find similar
results = store.search(query_emb, k=2)
for doc, score in results:
    print(f"Score: {score:.3f}")
    print(f"Content: {doc['content'][:100]}...")
    print()
Output:
Score: 0.873
Content: Section 2: Software Engineering... Team fixed 142 critical bugs...
Score: 0.412
Content: Section 1: Medical Research... a 'bug' we have not seen before...
The Software section ranks first (correct). The Medical section also matches on the word "bug" (a false positive), but with a much lower score.
Step 6: Build prompt, call Claude
def rag_query(query: str, store: VectorStore, k: int = 3) -> str:
    # 1. Embed
    query_emb = embed(query, input_type="query")
    # 2. Retrieve
    results = store.search(query_emb, k=k)
    # 3. Build context
    context = "\n\n---\n\n".join(
        doc["content"] for doc, _ in results
    )
    # 4. Prompt
    prompt = f"""Answer the user's question based on provided context.
<context>
{context}
</context>
<question>
{query}
</question>
Instructions:
- Answer based on context only
- If context doesn't contain answer, say "Not found in provided info"
- Cite source if possible"""
    # 5. Call Claude
    response = anthropic.messages.create(
        model=MODEL,
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Test
answer = rag_query("How many software bugs were fixed?", store)
print(answer)
Output:
The software engineering team fixed 142 critical bugs this year,
based on the Software Engineering section of the report.
Correct + source-grounded.
Cosine similarity deep dive
Example: 2D case for intuition.
Vec A: [0.9, 0.3] (mostly "topic 1")
Vec B: [0.8, 0.4] (mostly "topic 1")
Vec C: [0.1, 0.95] (mostly "topic 2")
A vs B: cosine = 0.99 (very similar)
A vs C: cosine = 0.41 (different)
B vs C: cosine = 0.54 (slightly related)
In N dimensions (N=1024) the math is the same, and the extra dimensions capture more nuanced semantic distinctions.
Cosine distance
Alternative metric: cosine distance.
Vector DBs often report distance (smaller = closer). It carries the same information on a different scale:
- 0 = identical
- 1 = unrelated
- 2 = opposite
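To check these scales numerically, here is a minimal sketch using only the standard library. The helper names (`cosine_similarity`, `cosine_distance`, `normalize`) are illustrative rather than part of any library, and the vectors are the 2D examples A, B, and C from the lesson.

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|): 1 = same direction, 0 = orthogonal, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance form: 0 = identical direction, 1 = unrelated, 2 = opposite."""
    return 1.0 - cosine_similarity(a, b)

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

vec_a, vec_b, vec_c = [0.9, 0.3], [0.8, 0.4], [0.1, 0.95]
pairs = {"A vs B": (vec_a, vec_b), "A vs C": (vec_a, vec_c), "B vs C": (vec_b, vec_c)}
for name, (u, v) in pairs.items():
    print(f"{name}: similarity={cosine_similarity(u, v):.2f}, "
          f"distance={cosine_distance(u, v):.2f}")

# Dot-product shortcut: once vectors are unit length, a plain dot product
# already equals the cosine similarity, so no division is needed per comparison.
unit_a, unit_b = normalize(vec_a), normalize(vec_b)
shortcut = sum(x * y for x, y in zip(unit_a, unit_b))
```

The same functions work unchanged on 1024-dimensional embeddings, although a vectorized library such as NumPy would be the usual choice at that size.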
cosine_distance = 1 - cosine_similarity
Testing the end-to-end
Try edge queries:
# On-topic
rag_query("What's XDR-47?", store)
# → "XDR-47 is a newly-discovered pathogen..."
# Cross-topic confusion
rag_query("Tell me about bugs.", store)
# → Retrieves both medical (XDR-47 "bug") + software bugs
# → Claude synthesizes or clarifies
# Off-topic
rag_query("What's the weather?", store)
# → "Not found in provided info"
Run an eval to verify retrieval quality.
Production improvements
Basic RAG has failure modes. Add:
1. Top K tuning
- K=1: precise but can miss
- K=5: balanced
- K=10: more context, may include noise
Test with an eval set.
2. Hybrid search (Lesson 6.56)
Combine semantic + keyword (BM25) search for robustness.
3. Re-ranking
After retrieving the top 20, use Claude to rank the 5 most relevant.
rerank_prompt = f"""Given this question: {query}
Rank these candidates by relevance (1=best):
<candidates>
{candidates}
</candidates>
"""
4. Query rewriting
Complex query → Claude rewrites:
"What about the bugs I mentioned earlier?"
→ "Software bugs fixed in Q4 2024"
Rewrite before embedding.
5. Metadata filtering
Filter before search:
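The in-memory `VectorStore` from Step 3 does not accept a filter argument, so the following is a minimal sketch of one that does. It assumes exact-match filtering on metadata keys; the class name `FilteredVectorStore` and the toy 2D embeddings are hypothetical and stand in for real model output. It uses only the standard library so it runs offline.

```python
import math

class FilteredVectorStore:
    """Hypothetical variant of the lesson's VectorStore with a metadata filter."""

    def __init__(self):
        self.docs = []
        self.embeddings = []

    @staticmethod
    def _normalize(vec):
        length = math.sqrt(sum(x * x for x in vec))
        return [x / length for x in vec]

    def add(self, content, embedding, metadata=None):
        self.docs.append({"content": content, "metadata": metadata or {}})
        # Store unit-length vectors so search only needs a dot product
        self.embeddings.append(self._normalize(embedding))

    def search(self, query_emb, k=3, filter=None):
        # `filter` shadows the builtin on purpose, to match the call used in the lesson
        query_vec = self._normalize(query_emb)
        scored = []
        for doc, emb in zip(self.docs, self.embeddings):
            # Skip chunks whose metadata does not match every key/value in the filter
            if filter and any(doc["metadata"].get(key) != value
                              for key, value in filter.items()):
                continue
            score = sum(q * e for q, e in zip(query_vec, emb))
            scored.append((doc, score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

demo_store = FilteredVectorStore()
demo_store.add("Fixed 142 critical bugs", [0.9, 0.1], metadata={"section": "Technical"})
demo_store.add("XDR-47 is a new 'bug'", [0.8, 0.2], metadata={"section": "Medical"})
demo_store.add("Uptime reached 99.97%", [0.2, 0.9], metadata={"section": "Technical"})

hits = demo_store.search([1.0, 0.0], k=5, filter={"section": "Technical"})
for doc, score in hits:
    print(f"{score:.3f}  {doc['content']}")
```

Filtering before scoring keeps off-section chunks (here the Medical one) out of the candidates entirely, rather than relying on a lower similarity score to push them down.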
# Requires a search() that accepts a metadata filter (the Step 3 VectorStore does not)
results = store.search(query_emb, k=5, filter={"section": "Technical"})
Anti-patterns
❌ Forgetting input_type
Embedding the query with the "document" input_type makes similarity scores noisy.
Fix: Query→"query", chunk→"document".
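One way to make this split hard to forget is to wrap each case in its own helper. The sketch below assumes the `embed(texts, model=..., input_type=...)` call shape used in Step 2. The names `embed_documents`, `embed_query`, and the offline `RecordingClient` stub are hypothetical; the stub returns fake vectors so the block runs without an API key.

```python
from types import SimpleNamespace

class RecordingClient:
    """Offline stand-in for the embedding client; records each input_type it receives."""

    def __init__(self):
        self.calls = []

    def embed(self, texts, model, input_type):
        self.calls.append(input_type)
        # Fake 2-D vectors, one per input text (not real embeddings)
        return SimpleNamespace(embeddings=[[float(len(t)), 1.0] for t in texts])

def embed_documents(client, texts, model="voyage-3-large"):
    """Embed corpus chunks in one batched call, always as input_type="document"."""
    return client.embed(texts, model=model, input_type="document").embeddings

def embed_query(client, text, model="voyage-3-large"):
    """Embed a single user query, always as input_type="query"."""
    return client.embed([text], model=model, input_type="query").embeddings[0]

client = RecordingClient()
doc_vectors = embed_documents(client, ["chunk one", "chunk two"])
query_vector = embed_query(client, "How many bugs were fixed?")
print(client.calls)
```

With the real client passed in place of the stub, callers never type the `input_type` string themselves, so the mismatch cannot happen by accident.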
❌ Context window overflow
Retrieving 20 chunks × 2000 tokens = 40K tokens. Claude gets slower and more expensive.
Fix: K=3-5 is enough. Keep chunks moderately sized.
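A token budget can also be enforced in code before the prompt is built. This is a rough sketch: the function name `select_within_budget` is hypothetical, and the 4-characters-per-token estimate is an assumption for illustration, not a real tokenizer.

```python
def select_within_budget(results, max_context_tokens=4000, chars_per_token=4):
    """Keep the best-scoring chunks whose estimated token total fits the budget.

    `results` is a list of (doc, score) pairs sorted best-first, in the shape
    returned by the lesson's search(). Token counts are rough estimates.
    """
    selected, used = [], 0
    for doc, score in results:
        estimated = len(doc["content"]) // chars_per_token + 1
        if used + estimated > max_context_tokens:
            break  # stop at the first chunk that would overflow the budget
        selected.append((doc, score))
        used += estimated
    return selected, used

# 20 fake chunks of 8000 characters each (about 2000 estimated tokens per chunk)
fake_results = [({"content": "x" * 8000}, 0.9 - i * 0.01) for i in range(20)]
kept, used_tokens = select_within_budget(fake_results, max_context_tokens=6500)
print(f"Kept {len(kept)} of {len(fake_results)} chunks, about {used_tokens} tokens")
```

Because the input is sorted best-first, the chunks that get dropped are always the lowest-scoring ones.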
❌ Not checking a score threshold
The top result has a score of 0.2 (very weak) but is still used.
Fix: If top score < 0.5, return "not found".
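A minimal sketch of that check, applied to the `(doc, score)` pairs that search returns. The helper names are hypothetical, and the 0.5 cutoff is the lesson's example value; a suitable threshold likely depends on the embedding model and corpus, so it is worth tuning against an eval set.

```python
NOT_FOUND = "Not found in provided info"

def filter_by_score(results, min_score=0.5):
    """Drop (doc, score) pairs whose score is below min_score."""
    return [(doc, score) for doc, score in results if score >= min_score]

def retrieve_or_none(results, min_score=0.5):
    """Return usable results, or None when even the top hit is too weak.

    Assumes `results` is sorted best-first.
    """
    if not results or results[0][1] < min_score:
        return None
    return filter_by_score(results, min_score)

weak = [({"content": "Section 3: Marketing..."}, 0.2)]
mixed = [({"content": "Section 2: Software Engineering..."}, 0.873),
         ({"content": "Section 1: Medical Research..."}, 0.412)]

for label, results in [("weak", weak), ("mixed", mixed)]:
    usable = retrieve_or_none(results)
    # Skip the model call entirely when nothing relevant was retrieved
    print(label, "->", NOT_FOUND if usable is None else [d["content"] for d, _ in usable])
```

Returning early also saves a model call for queries that are clearly off-topic.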
❌ Re-embedding the corpus on every query
Slow + expensive.
Fix: Precompute embeddings once, save to disk/DB.
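For a small corpus, a plain JSON file may be enough to persist the index. The sketch below assumes that format; the function names `save_index` and `load_index` and the file name are hypothetical, and a vector DB would replace this at larger scale.

```python
import json
import os
import tempfile

def save_index(path, docs, embeddings):
    """Write chunks and their embeddings to a JSON file."""
    payload = {
        "docs": docs,
        # Convert to plain floats so array-like embeddings serialize cleanly
        "embeddings": [[float(x) for x in emb] for emb in embeddings],
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f)

def load_index(path):
    """Read back the (docs, embeddings) pair written by save_index."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return payload["docs"], payload["embeddings"]

# Round trip with a toy 3-D embedding (real ones would be 1024-dim)
docs = [{"content": "Section 2: Software Engineering", "metadata": {"section": 2}}]
embeddings = [[0.12, -0.5, 0.33]]
index_path = os.path.join(tempfile.mkdtemp(), "rag_index.json")

save_index(index_path, docs, embeddings)
loaded_docs, loaded_embeddings = load_index(index_path)
print(f"Loaded {len(loaded_docs)} chunk(s) from {index_path}")
```

At query time only `load_index` runs, so the corpus is embedded once during preprocessing instead of on every request.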
Apply it now
Exercise 1: Run basic RAG (30 minutes)
Sample doc (report.md from Lesson 6.52). Implement end-to-end. Test 5 queries.
Exercise 2: Threshold + metadata (20 minutes)
Extend with:
- Return "not found" if top score < 0.5
- Store each chunk with metadata (file, page)
- Cite metadata in the prompt
Summary
🎯 6 steps: Chunk → Embed → Store → Query embed → Search → Prompt.
🎯 Cosine similarity drives retrieval. Normalize vectors so a plain dot product gives the cosine.
🎯 Top K=3-5 is enough for most use cases.
🎯 The prompt includes context + question, wrapped in XML tags.
🎯 Production enhancements: hybrid search, re-rank, query rewrite, metadata filter.