The full RAG flow — End-to-end example — Building with the Claude API

What you will learn

Walk through the 6 steps of a complete RAG pipeline
Understand cosine similarity and cosine distance
Build a prompt with retrieved context
Ship a minimal working RAG

6-step RAG pipeline

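The example below walks through each step in order: three preprocessing steps that run once, then three query steps that run on every request.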

┌────────────────────────────────────────────────────┐
│                                                    │
│  PREPROCESS (once):                                │
│   1. Chunk document                                │
│   2. Embed chunks                                  │
│   3. Store in vector DB                            │
│                                                    │
│  QUERY (each request):                             │
│   4. Embed user query                              │
│   5. Find similar chunks (top K)                   │
│   6. Build prompt, call Claude                     │
│                                                    │
└────────────────────────────────────────────────────┘

Setup

from dotenv import load_dotenv
load_dotenv()

import voyageai
from anthropic import Anthropic
import numpy as np

voyage = voyageai.Client()
anthropic = Anthropic()
MODEL = "claude-sonnet-5-20260205"

Step 1: Chunk

Example corpus:

document = """
## Section 1: Medical Research
This year saw significant strides in our understanding of XDR-47, 
a 'bug' we have not seen before. We developed a new treatment protocol
that reduced mortality by 23% in clinical trials.

## Section 2: Software Engineering
This division dedicated significant effort to studying various 
infection vectors in our distributed systems. Team fixed 142 critical 
bugs and improved system uptime to 99.97%.

## Section 3: Marketing
Q4 campaigns generated 3M impressions, 50K new leads. 
Conversion rate improved to 4.2%.
"""

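# Chunk by section: split at every "## " heading that starts a line.
# Anchoring with ^ (re.MULTILINE) also strips the marker from the first heading.
import re
chunks = re.split(r"^## ", document.strip(), flags=re.MULTILINE)
chunks = [c.strip() for c in chunks if c.strip()]
# 3 chunks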

Step 2: Embed chunks

def embed(text, input_type="document"):
    result = voyage.embed([text], model="voyage-3-large", input_type=input_type)
    return result.embeddings[0]


chunk_embeddings = [embed(c, input_type="document") for c in chunks]
# 3 embeddings, 1024-dim each
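
Embedding one chunk per call costs one network round trip per chunk. The embed helper above already passes a list to the client, so a whole batch can usually go out in one request. The following is a minimal sketch, assuming the client's embed method accepts a list of texts and returns an object with an .embeddings list, as in the helper above. The names embed_batch and batch_size are hypothetical, and the default of 128 is an assumption to check against the provider's current limits.

```python
def embed_batch(client, texts, model="voyage-3-large", input_type="document", batch_size=128):
    """Embed many texts with as few API calls as possible.

    `client` is assumed to behave like voyageai.Client: its embed() method
    accepts a list of strings and returns an object with `.embeddings`.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        result = client.embed(batch, model=model, input_type=input_type)
        embeddings.extend(result.embeddings)  # order matches the input order
    return embeddings
```

With the setup from this lesson, the call would look like chunk_embeddings = embed_batch(voyage, chunks).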

Step 3: Store (simple in-memory)

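Preprocessing ends with this step: a minimal in-memory store that keeps each chunk next to its embedding. Everything after it runs at query time.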

class VectorStore:
    def __init__(self):
        self.docs = []
        self.embeddings = []
    
    def add(self, content, embedding, metadata=None):
        self.docs.append({"content": content, "metadata": metadata or {}})
        self.embeddings.append(np.array(embedding))
    
    def search(self, query_emb, k=3):
        query_vec = np.array(query_emb)
        # Normalize so the dot product below equals cosine similarity
        query_vec = query_vec / np.linalg.norm(query_vec)
        
        sims = []
        for emb in self.embeddings:
            emb_norm = emb / np.linalg.norm(emb)
            sims.append(float(np.dot(query_vec, emb_norm)))
        
        top_k = sorted(enumerate(sims), key=lambda x: -x[1])[:k]
        return [(self.docs[i], score) for i, score in top_k]


store = VectorStore()
for i, (chunk, emb) in enumerate(zip(chunks, chunk_embeddings)):
    store.add(chunk, emb, metadata={"section": i + 1})

print(f"Stored {len(store.docs)} chunks")
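
The search method above re-normalizes every stored vector on each query. Because the cosine similarity of two unit-length vectors is just their dot product, that normalization can be done once, when a chunk is added. The following is a minimal sketch of that variant using plain Python lists so it runs without dependencies (the same idea carries over to numpy arrays). NormalizedStore and normalize are hypothetical names, not part of any library.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

class NormalizedStore:
    """Hypothetical variant of VectorStore that normalizes at insert time."""

    def __init__(self):
        self.docs = []
        self.unit_vectors = []

    def add(self, content, embedding, metadata=None):
        self.docs.append({"content": content, "metadata": metadata or {}})
        self.unit_vectors.append(normalize(embedding))  # done once per chunk

    def search(self, query_emb, k=3):
        q = normalize(query_emb)
        # Dot product of unit vectors equals cosine similarity.
        sims = [sum(a * b for a, b in zip(q, v)) for v in self.unit_vectors]
        ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)[:k]
        return [(self.docs[i], score) for i, score in ranked]
```

The trade-off is that the original magnitudes are discarded, which is fine when cosine similarity is the only metric in use.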

Step 4: Embed query

query = "How many software bugs did engineers fix?"
query_emb = embed(query, input_type="query")

Step 5: Find similar

results = store.search(query_emb, k=2)

for doc, score in results:
    print(f"Score: {score:.3f}")
    print(f"Content: {doc['content'][:100]}...")
    print()

Output:

Score: 0.873
Content: Section 2: Software Engineering... Team fixed 142 critical bugs...

Score: 0.412
Content: Section 1: Medical Research... a 'bug' we have not seen before...

The Software Engineering section ranks first, which is correct. The Medical Research section also matches because of the word "bug", a false positive, but with a much lower score.

Step 6: Build prompt, call Claude

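The function below wraps steps 4 to 6: embed the query, retrieve the top K chunks, build the prompt, and call Claude.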

def rag_query(query: str, store: VectorStore, k: int = 3) -> str:
    # 1. Embed
    query_emb = embed(query, input_type="query")
    
    # 2. Retrieve
    results = store.search(query_emb, k=k)
    
    # 3. Build context
    context = "\n\n---\n\n".join(
        doc["content"] for doc, _ in results
    )
    
    # 4. Prompt
    prompt = f"""Answer the user's question based on provided context.

<context>
{context}
</context>

<question>
{query}
</question>

Instructions:
- Answer based on context only
- If context doesn't contain answer, say "Not found in provided info"
- Cite source if possible"""
    
    # 5. Call Claude
    response = anthropic.messages.create(
        model=MODEL,
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.content[0].text


# Test
answer = rag_query("How many software bugs were fixed?", store)
print(answer)

Output:

The software engineering team fixed 142 critical bugs this year, 
based on the Software Engineering section of the report.

The answer is correct and grounded in the retrieved source.
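
The prompt asks Claude to cite its source, but the context string built in rag_query carries no source labels. One way to make citation possible is to tag each chunk with its metadata before joining. The following is a minimal sketch, assuming results has the (doc, score) shape returned by VectorStore.search and that each chunk was stored with a "section" metadata key as in Step 3. The helper name build_context and the chunk tag format are hypothetical.

```python
def build_context(results):
    """Join retrieved chunks, labeling each with its section number.

    `results` is a list of (doc, score) pairs where doc is a dict with
    "content" and "metadata" keys, as returned by VectorStore.search.
    """
    blocks = []
    for doc, _score in results:
        section = doc["metadata"].get("section", "unknown")
        blocks.append(f'<chunk section="{section}">\n{doc["content"]}\n</chunk>')
    return "\n\n".join(blocks)


# Example with hand-written results (no API call needed).
sample = [
    ({"content": "Team fixed 142 critical bugs.", "metadata": {"section": 2}}, 0.87),
    ({"content": "XDR-47 is a 'bug' we have not seen before.", "metadata": {"section": 1}}, 0.41),
]
print(build_context(sample))
```

The returned string can replace the context variable in rag_query, so the instruction to cite a source has something concrete to point at.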

Cosine similarity deep dive

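A 2D example builds intuition. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores magnitude.

Vec A: [0.9, 0.3]   (mostly "topic 1")
Vec B: [0.8, 0.4]   (mostly "topic 1")
Vec C: [0.1, 0.95]  (mostly "topic 2")

A vs B: cosine = 0.99 (very similar)
A vs C: cosine = 0.41 (different)
B vs C: cosine = 0.54 (somewhat related)

In N dimensions (N=1024 here) the math is the same; the extra dimensions let embeddings capture more nuanced semantic distinctions.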
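
The 2D values can be reproduced in a few lines of dependency-free Python. The function name cosine_similarity is illustrative; running the script is a quick way to check the hand-computed numbers.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

A = [0.9, 0.3]
B = [0.8, 0.4]
C = [0.1, 0.95]

print(f"A vs B: {cosine_similarity(A, B):.2f}")
print(f"A vs C: {cosine_similarity(A, C):.2f}")
print(f"B vs C: {cosine_similarity(B, C):.2f}")
```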

Cosine distance

Vector DBs often report distance instead of similarity (smaller = closer). It carries the same information on a different scale:

cosine_distance = 1 - cosine_similarity

0 = identical direction
1 = unrelated (orthogonal)
2 = opposite
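
The three reference points on the distance scale can be checked with hand-picked vectors. This is a small sketch in plain Python; cosine_distance is an illustrative helper, not a specific vector DB's API. Sorting by ascending distance gives the same ranking as sorting by descending similarity.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

same = cosine_distance([1.0, 0.0], [2.0, 0.0])        # same direction, different length
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # no shared direction
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```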

Testing the end-to-end

Try a few edge-case queries (the comments show the kind of answer to expect), then build an eval set to verify retrieval quality more systematically.

# On-topic
rag_query("What's XDR-47?", store)
# → "XDR-47 is a newly-discovered pathogen..."

# Cross-topic confusion
rag_query("Tell me about bugs.", store)
# → Retrieves both medical (XDR-47 "bug") + software bugs
# → Claude synthesizes or clarifies

# Off-topic
rag_query("What's the weather?", store)
# → "Not found in provided info"
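
Spot-checking queries by hand does not scale, so a small labeled eval set helps turn retrieval quality into a number. The sketch below computes hit rate at K: the fraction of queries whose expected section appears among the top K results. The function hit_rate_at_k, the eval_set format, and the toy keyword search are all hypothetical; the toy search stands in for embed plus store.search so the example runs without API calls.

```python
def hit_rate_at_k(eval_set, search_fn, k=3):
    """Fraction of queries whose expected section is in the top-k results.

    eval_set: list of (query, expected_section) pairs.
    search_fn: callable (query, k) -> list of (doc, score), where doc has a
               "metadata" dict with a "section" key.
    """
    if not eval_set:
        return 0.0
    hits = 0
    for query, expected in eval_set:
        sections = [doc["metadata"]["section"] for doc, _ in search_fn(query, k)]
        if expected in sections:
            hits += 1
    return hits / len(eval_set)


# Toy stand-in for embedding search: ranks documents by shared words.
DOCS = [
    {"content": "treatment protocol for XDR-47", "metadata": {"section": 1}},
    {"content": "team fixed critical software bugs", "metadata": {"section": 2}},
    {"content": "campaigns generated new leads", "metadata": {"section": 3}},
]

def toy_search(query, k):
    words = set(query.lower().split())
    scored = [(doc, len(words & set(doc["content"].lower().split()))) for doc in DOCS]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

eval_set = [
    ("software bugs fixed", 2),
    ("new leads from campaigns", 3),
    ("XDR-47 treatment", 1),
]
print(hit_rate_at_k(eval_set, toy_search, k=1))
```

Running the same eval set at several values of K shows where adding more chunks stops improving the hit rate.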

Production improvements

Basic RAG fails in predictable ways. Common additions:

1. Top K tuning

K=1: precise but can miss
K=5: balanced
K=10: more context, may include noise

Test candidate values against an eval set.

2. Hybrid search (Lesson 6.56)

Combine semantic search with keyword search (BM25) for more robust retrieval.

3. Re-ranking

After retrieving the top 20, ask Claude to rank them and keep the 5 most relevant.

rerank_prompt = f"""Given this question: {query}
Rank these candidates by relevance (1=best):
<candidates>
{candidates}
</candidates>
"""

4. Query rewriting

For a complex or context-dependent query, have Claude rewrite it before embedding:

"What about the bugs I mentioned earlier?"
→ "Software bugs fixed in Q4 2024"

5. Metadata filtering

Filter before searching (this assumes a search() extended with a filter argument):

results = store.search(query_emb, k=5, filter={"section": "Technical"})
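
VectorStore.search from Step 3 takes only query_emb and k, so a filtered search needs extra logic. The following is a minimal sketch of one way to add it: keep only the documents whose metadata matches every key in the filter, then rank the survivors by cosine similarity. The names search_with_filter and metadata_filter are hypothetical, and the section values and 2D embeddings are made up for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search_with_filter(docs, embeddings, query_emb, k=3, metadata_filter=None):
    """Rank only the docs whose metadata matches every key in metadata_filter."""
    metadata_filter = metadata_filter or {}
    candidates = [
        i for i, doc in enumerate(docs)
        if all(doc["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    scored = [(docs[i], cosine(query_emb, embeddings[i])) for i in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

docs = [
    {"content": "medical", "metadata": {"section": "Research"}},
    {"content": "software", "metadata": {"section": "Technical"}},
    {"content": "infra", "metadata": {"section": "Technical"}},
]
embeddings = [[0.9, 0.3], [0.8, 0.4], [0.1, 0.95]]

results = search_with_filter(docs, embeddings, [0.9, 0.3], k=5,
                             metadata_filter={"section": "Technical"})
print([doc["content"] for doc, _ in results])
```

Filtering first means a chunk outside the requested section is never returned, even when it is the closest match to the query.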

Anti-patterns

❌ Forgetting input_type

Embedding a query with input_type="document" makes similarity scores noisy.

Fix: embed queries with "query" and chunks with "document".

❌ Context window overflow

Retrieving 20 chunks × 2000 tokens puts 40K tokens in the prompt, which makes Claude slower and more expensive.

Fix: K=3-5 is usually enough. Keep chunks moderately sized.

❌ No score threshold

The top result is used even when its score is 0.2 (very weak).

Fix: If the top score is < 0.5, return "not found".
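
A threshold check can sit between retrieval and the Claude call. The following is a minimal sketch, assuming results has the (doc, score) shape returned by VectorStore.search with the best match first. The helper filter_by_score is hypothetical, and the 0.5 cutoff from the fix above is a starting point that should be tuned per embedding model on an eval set.

```python
def filter_by_score(results, min_score=0.5):
    """Drop weak matches; return [] if even the best one is below min_score.

    `results` is a list of (doc, score) pairs sorted best-first.
    """
    if not results or results[0][1] < min_score:
        return []
    return [(doc, score) for doc, score in results if score >= min_score]


strong = [({"content": "software"}, 0.873), ({"content": "medical"}, 0.412)]
weak = [({"content": "marketing"}, 0.2)]

print(len(filter_by_score(strong)))  # only the 0.873 match survives
print(len(filter_by_score(weak)))    # nothing survives
```

When the filtered list is empty, rag_query can return "Not found in provided info" directly and skip the Claude call.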

❌ Re-embedding the corpus on every query

Slow and expensive.

Fix: Compute embeddings once during preprocessing and save them to disk or a DB.
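
The preprocess-once pattern can be sketched with only the standard library: write chunks and their embeddings to a JSON file after the first run, then load them on later runs. The names save_index and load_index and the file layout are hypothetical; a real deployment would more likely use a vector database or a binary format.

```python
import json
import os
import tempfile

def save_index(path, chunks, embeddings):
    """Write chunks and their embeddings to a JSON file."""
    records = [{"content": c, "embedding": list(e)} for c, e in zip(chunks, embeddings)]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)

def load_index(path):
    """Read chunks and embeddings back; returns (chunks, embeddings)."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return [r["content"] for r in records], [r["embedding"] for r in records]


# Round trip through a temporary file.
path = os.path.join(tempfile.mkdtemp(), "index.json")
save_index(path, ["chunk one", "chunk two"], [[0.1, 0.2], [0.3, 0.4]])
loaded_chunks, loaded_embeddings = load_index(path)
print(len(loaded_chunks), loaded_embeddings[0])
```

At startup, the application checks whether the index file exists and calls the embedding API only when it does not.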

Apply it now

Exercise 1: Run basic RAG (30 minutes)

Take the sample doc (report.md from lesson 6.52), implement the pipeline end to end, and test it with 5 queries.

Exercise 2: Threshold + metadata (20 minutes)

Extend the pipeline to:

Return "not found" if the top score is < 0.5
Store each chunk with metadata (file, page)
Cite the metadata in the prompt

Summary

🎯 6 steps: Chunk → Embed → Store → Embed query → Search → Prompt.

🎯 Cosine similarity drives retrieval. With normalized vectors it reduces to a dot product.

🎯 Top K=3-5 is enough for most use cases.

🎯 The prompt combines context and question, separated with XML tags.

🎯 Production enhancements: hybrid search, re-ranking, query rewriting, metadata filtering.
