Real RAG systems typically have 3-5 indexes:
- Semantic (embeddings)
- BM25 (keyword)
- Title/metadata-specific
- Domain-specific (finance, legal terms)

After this section, you should be able to:
- Design a Retriever that wraps multiple SearchIndex implementations
- Add index types (keyword, domain-specific) without refactoring
- Understand the math behind Reciprocal Rank Fusion (RRF)
- Ship an extensible RAG architecture
SearchIndex protocol
Use a Python Protocol to define the interface:

from typing import Any, Dict, List, Optional, Protocol, Tuple

class SearchIndex(Protocol):
    def add_document(self, content: str, metadata: Optional[Dict[str, Any]] = None) -> None:
        ...

    def search(self, query: str, k: int = 5) -> List[Tuple[Dict[str, Any], float]]:
        ...

Any class with these two methods is a valid SearchIndex; no inheritance is required.
VectorIndex and BM25Index both conform.

Retriever: universal coordinator
from typing import Dict, List, Optional, Tuple

class Retriever:
    """Wraps multiple SearchIndex objects and fuses their results via RRF."""

    def __init__(self, *indexes: SearchIndex):
        if len(indexes) == 0:
            raise ValueError("At least one index required")
        self._indexes = list(indexes)

    def add_document(self, content: str, metadata: Optional[Dict] = None) -> None:
        """Add one document to every index."""
        for index in self._indexes:
            index.add_document(content, metadata)

    def add_documents(self, contents: List[str], metadatas: Optional[List[Dict]] = None) -> None:
        """Batch add to every index."""
        if metadatas is None:
            # One fresh dict per document; [{}] * n would share a single dict.
            metadatas = [{} for _ in contents]
        if len(metadatas) != len(contents):
            raise ValueError("contents and metadatas must have the same length")
        for content, meta in zip(contents, metadatas):
            self.add_document(content, meta)

    def search(self, query: str, k: int = 5, k_rrf: int = 60) -> List[Tuple[Dict, float]]:
        """RRF fusion across all indexes."""
        # Over-fetch from each index; fusion trims the list back to k.
        all_results = [index.search(query, k=k * 2) for index in self._indexes]

        # Accumulate RRF scores per document.
        scores: Dict[int, float] = {}
        docs_by_id: Dict[int, Dict] = {}
        for index_results in all_results:
            for rank, (doc, _) in enumerate(index_results, start=1):
                # Content hash as ID (or a metadata ID if it is consistent across indexes).
                doc_id = hash(doc["content"])
                docs_by_id[doc_id] = doc
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k_rrf + rank)

        # Sort by combined score, best first.
        sorted_ids = sorted(scores, key=scores.get, reverse=True)[:k]
        return [(docs_by_id[doc_id], scores[doc_id]) for doc_id in sorted_ids]

Usage
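The usage snippet in this section assumes VectorIndex and BM25Index are defined elsewhere. For experimenting without an embedding model or a BM25 library, a toy stand-in that satisfies the same two-method interface could look like the sketch below. KeywordOverlapIndex and its token-overlap scoring are hypothetical, introduced only for illustration; it is not a substitute for real BM25.

```python
from typing import Any, Dict, List, Optional, Tuple


class KeywordOverlapIndex:
    """Hypothetical toy index: ranks documents by shared lowercase tokens."""

    def __init__(self) -> None:
        self._docs: List[Dict[str, Any]] = []

    def add_document(self, content: str, metadata: Optional[Dict[str, Any]] = None) -> None:
        self._docs.append({"content": content, "metadata": metadata or {}})

    def search(self, query: str, k: int = 5) -> List[Tuple[Dict[str, Any], float]]:
        query_tokens = set(query.lower().split())
        scored = []
        for doc in self._docs:
            overlap = len(query_tokens & set(doc["content"].lower().split()))
            if overlap > 0:
                scored.append((doc, float(overlap)))
        # Highest overlap first; list.sort is stable, so ties keep insertion order.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


index = KeywordOverlapIndex()
index.add_document("INC-2023-Q4-011 resolution was to restart the cache", {"source": "incident.md"})
index.add_document("Quarterly planning notes", {"source": "planning.md"})
hits = index.search("INC-2023-Q4-011 resolution", k=5)
print(hits[0][0]["metadata"]["source"])
```

Because the class returns the same `(doc, score)` tuples with a `"content"` key, it can be passed to Retriever in place of either real index while testing the fusion logic.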
# Create indexes
vector_idx = VectorIndex()
bm25_idx = BM25Index()

# Wrap them in one retriever
retriever = Retriever(vector_idx, bm25_idx)

# Add documents once; they propagate to both indexes
retriever.add_documents(chunks_content, metadatas)

# Fused search
results = retriever.search("INC-2023-Q4-011 resolution", k=5)

Add new index: no refactor
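Want to add a TitleIndex that matches queries against document titles?

class TitleIndex:
    """Case-insensitive substring match against document titles."""

    def __init__(self):
        self.docs = []

    def add_document(self, content, metadata=None):
        self.docs.append({"content": content, "metadata": metadata or {}})

    def search(self, query, k=5):
        query_lower = query.lower()
        results = []
        for doc in self.docs:
            title = doc["metadata"].get("title", "").lower()
            if query_lower in title:
                # Every match gets the same score; only the rank feeds into RRF.
                results.append((doc, 1.0))
        return results[:k]

# Plug in
title_idx = TitleIndex()
# Backfill the new index directly: the other two already hold these documents.
for content, meta in zip(chunks_content, metadatas):
    title_idx.add_document(content, meta)
retriever = Retriever(vector_idx, bm25_idx, title_idx)

The Retriever code is unchanged. That is the payoff of protocol-based design.

RRF deep dive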
Formula

RRF_score(d) = Σ_i 1 / (k_rrf + rank_i(d))

Where:
- rank_i(d) = rank of document d in index i (1-indexed)
- k_rrf = constant, typically 60

Why this formula?
- Reciprocal of rank → diminishing returns. Being #1 counts for more than being #10, and each step down the list costs less than the one before it.
- Constant k_rrf → smooths the curve so a single top rank cannot dominate, and keeps the denominator well away from zero.
- Sum across indexes → documents that appear in several indexes get a higher combined score.

Example
3 indexes, document D appears:
- Index 1: rank 1 → 1/61
- Index 2: rank 3 → 1/63
- Index 3: not in top K → 0

RRF(D) = 1/61 + 1/63 + 0 ≈ 0.0323

Tuning k_rrf
- k_rrf=1: top-1 heavily weighted
- k_rrf=60: balanced
- k_rrf=100+: very smooth, ranks weighted more equally

The default of 60 works for most cases.

Hybrid search improvements
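1. Index-specific weights
Trust some indexes more than others:

# Drop-in replacement for Retriever.search, with one weight per index
def search(self, query, k=5, k_rrf=60, weights=None):
    if weights is None:
        weights = [1.0] * len(self._indexes)
    if len(weights) != len(self._indexes):
        raise ValueError("Need exactly one weight per index")
    all_results = [index.search(query, k=k * 2) for index in self._indexes]
    scores, docs_by_id = {}, {}
    for weight, results in zip(weights, all_results):
        for rank, (doc, _) in enumerate(results, start=1):
            doc_id = hash(doc["content"])
            docs_by_id[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k_rrf + rank)
    top_ids = sorted(scores, key=scores.get, reverse=True)[:k]
    return [(docs_by_id[i], scores[i]) for i in top_ids]

Example: vector=1.0, bm25=0.5 → the semantic index is trusted more.

2. Query-dependent routing
Some queries are best served by a specific index:

import re

def smart_search(query, retriever):
    # Weights follow the order the indexes were passed in: [vector, bm25, title]
    # If the query has IDs / specific codes, favor BM25
    if re.search(r"[A-Z]{2,}-\d+", query):
        return retriever.search(query, weights=[0.3, 1.0, 0.2])
    # Otherwise stay balanced
    return retriever.search(query, weights=[1.0, 1.0, 0.5])

3. Re-ranking with Claude
After retrieving a larger top K, use Claude to rerank:

import json

def claude_rerank(query, candidates, k=5):
    # candidates: doc dicts, e.g. [doc for doc, _ in retriever.search(query, k=20)]
    candidates_text = "\n\n".join(
        f"[{i}] {c['content']}" for i, c in enumerate(candidates)
    )
    prompt = f"""Rank these candidates by relevance to the query.
Query: {query}
<candidates>
{candidates_text}
</candidates>
Return JSON: {{"ranking": [top_indices]}}"""
    # `anthropic` is assumed to be an initialized client; send `prompt` as the user message.
    response = anthropic.messages.create(...)
    ranking = json.loads(response.content[0].text)
    # Ignore anything that is not a valid candidate position.
    valid = [i for i in ranking["ranking"] if isinstance(i, int) and 0 <= i < len(candidates)]
    return [candidates[i] for i in valid[:k]]

This costs one extra Claude call per query but can improve result quality.

Production architecture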
Each component is swappable and testable on its own.
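To make "swappable" concrete, each stage of this architecture can be modeled as a plain callable, with the optional stages (query rewrite, rerank) left as None when unused. This is only a sketch: the names run_pipeline, rewrite, retrieve, rerank, and answer are hypothetical, chunks are simplified to strings, and the stubs stand in for the real retriever and Claude calls.

```python
from typing import Callable, List, Optional

Chunk = str  # simplification: a chunk is just its text here


def run_pipeline(
    query: str,
    retrieve: Callable[[str], List[Chunk]],
    answer: Callable[[str, List[Chunk]], str],
    rewrite: Optional[Callable[[str], str]] = None,
    rerank: Optional[Callable[[str, List[Chunk]], List[Chunk]]] = None,
    final_k: int = 3,
) -> str:
    """Chain the stages; optional stages are skipped when not provided."""
    if rewrite is not None:
        query = rewrite(query)
    chunks = retrieve(query)
    if rerank is not None:
        chunks = rerank(query, chunks)
    return answer(query, chunks[:final_k])


# Stub stages make each piece testable without any external service.
def fake_retrieve(query: str) -> List[Chunk]:
    return ["chunk-a", "chunk-b", "chunk-c", "chunk-d"]


def fake_rerank(query: str, chunks: List[Chunk]) -> List[Chunk]:
    return list(reversed(chunks))


def fake_answer(query: str, chunks: List[Chunk]) -> str:
    return f"{query} -> {', '.join(chunks)}"


plain = run_pipeline("q", fake_retrieve, fake_answer)
reranked = run_pipeline("q", fake_retrieve, fake_answer, rerank=fake_rerank, final_k=2)
```

Swapping a stage then means passing a different callable, and each stub can be replaced by the real component one at a time.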
User query
│
▼
Query rewrite (optional, Claude)
│
▼
┌─── Retriever ──────────────────┐
│ │
│ Vector Index ← 20 chunks │
│ BM25 Index ← 20 chunks │
│ Title Index ← 20 chunks │
│ │
│ RRF fusion → top 10 │
└────────────────────────────────┘
│
▼
Claude rerank (optional)
│
▼
Top 3-5 chunks
│
▼
Prompt + Claude
│
▼
Answer + citations

Eval each component
def eval_retrieval_pipeline(queries_with_expected, retriever):
    """Measure recall@K for K in 1, 3, 5, 10."""
    correct_at_k = {k: 0 for k in [1, 3, 5, 10]}
    for test in queries_with_expected:
        results = retriever.search(test["query"], k=10)
        retrieved_sources = [doc["metadata"].get("source") for doc, _ in results]
        for k in correct_at_k:
            if test["expected_source"] in retrieved_sources[:k]:
                correct_at_k[k] += 1
    n = len(queries_with_expected)
    return {k: correct_at_k[k] / n for k in correct_at_k}

# Compare each index alone against the fused retriever
recall_vector = eval_retrieval_pipeline(queries, Retriever(vector_idx))
recall_bm25 = eval_retrieval_pipeline(queries, Retriever(bm25_idx))
recall_hybrid = eval_retrieval_pipeline(queries, Retriever(vector_idx, bm25_idx))
print(f"Vector only: recall@5 = {recall_vector[5]:.0%}")
print(f"BM25 only: recall@5 = {recall_bm25[5]:.0%}")
print(f"Hybrid: recall@5 = {recall_hybrid[5]:.0%}")

Typical results:
- Vector: 75%
- BM25: 68%
- Hybrid: 87%

Fusion gives a significant improvement: 12 points over the best single index in these numbers.

Anti-patterns
❌ Same index inside Retriever twice
Adds no information and inflates the scores of that index's results.
Fix: Pass each index only once.
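A guard against this could be as small as the sketch below. check_unique_indexes is a hypothetical helper, not part of the Retriever shown earlier, and it only catches the same object passed twice; two separately built indexes over identical data would still slip through.

```python
from typing import Iterable


def check_unique_indexes(indexes: Iterable[object]) -> None:
    """Raise ValueError if the same index object appears more than once."""
    seen_ids = set()
    for index in indexes:
        if id(index) in seen_ids:
            raise ValueError(f"Index {index!r} was passed more than once")
        seen_ids.add(id(index))


# Why it matters: a rank-1 hit in a duplicated index collects 2/61 instead of 1/61.
single_contribution = 1.0 / (60 + 1)
doubled_contribution = 2 * single_contribution

shared_index = object()
other_index = object()
check_unique_indexes([shared_index, other_index])  # distinct objects: no error

try:
    check_unique_indexes([shared_index, shared_index])
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
```

Calling such a check at the top of `Retriever.__init__` would be one place to wire it in.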
❌ Not all documents in all indexes
A doc that is in the vector index but not in BM25 can't be matched on exact terms.
Fix: Add documents through the Retriever; its add_document and add_documents methods propagate to ALL indexes.
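Propagation on add does not help an index that is attached after documents were already loaded. One possible approach, sketched here under assumptions, is to keep a log of added documents and replay it into late arrivals. IndexRegistry, add_index, and ListIndex are hypothetical names for illustration, and the in-memory log trades extra memory for simplicity.

```python
from typing import Any, Dict, List, Optional, Tuple


class ListIndex:
    """Hypothetical minimal index, used only to demonstrate backfilling."""

    def __init__(self) -> None:
        self.docs: List[Dict[str, Any]] = []

    def add_document(self, content: str, metadata: Optional[Dict[str, Any]] = None) -> None:
        self.docs.append({"content": content, "metadata": metadata or {}})


class IndexRegistry:
    """Logs every added document so late-attached indexes can catch up."""

    def __init__(self, *indexes: ListIndex) -> None:
        self._indexes = list(indexes)
        self._log: List[Tuple[str, Dict[str, Any]]] = []

    def add_document(self, content: str, metadata: Optional[Dict[str, Any]] = None) -> None:
        metadata = metadata or {}
        self._log.append((content, metadata))
        for index in self._indexes:
            index.add_document(content, metadata)

    def add_index(self, index: ListIndex) -> None:
        # Replay every earlier document so the new index is not left empty.
        for content, metadata in self._log:
            index.add_document(content, metadata)
        self._indexes.append(index)


first, late = ListIndex(), ListIndex()
registry = IndexRegistry(first)
registry.add_document("doc one", {"source": "a.md"})
registry.add_document("doc two", {"source": "b.md"})
registry.add_index(late)            # backfills the two earlier documents
registry.add_document("doc three")  # goes to both indexes from here on
print(len(first.docs), len(late.docs))
```

After the replay both indexes hold the same documents in the same order, which is the invariant this anti-pattern is about.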
❌ Scoring scales mismatch
Mixing raw scores from different indexes → biased ranking.
Fix: Use RRF (ranks), not raw scores.
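A small numeric sketch shows the bias. The scores are made up for illustration: cosine similarities bounded by 1 for the vector index, unbounded BM25 scores for the keyword index. fuse_raw and fuse_rrf are hypothetical helpers that operate on document IDs rather than on the Retriever's doc dicts.

```python
from typing import Dict, List, Tuple

Hits = List[Tuple[str, float]]

# Hypothetical raw scores, each list already sorted best-first.
vector_hits: Hits = [("doc_a", 0.92), ("doc_b", 0.55), ("doc_c", 0.30)]
bm25_hits: Hits = [("doc_c", 14.2), ("doc_a", 9.8), ("doc_b", 3.1)]


def fuse_raw(*result_lists: Hits) -> List[str]:
    """Anti-pattern: add raw scores that live on different scales."""
    totals: Dict[str, float] = {}
    for results in result_lists:
        for doc_id, score in results:
            totals[doc_id] = totals.get(doc_id, 0.0) + score
    return sorted(totals, key=totals.get, reverse=True)


def fuse_rrf(*result_lists: Hits, k_rrf: int = 60) -> List[str]:
    """RRF: only the 1-indexed rank within each list matters."""
    totals: Dict[str, float] = {}
    for results in result_lists:
        for rank, (doc_id, _) in enumerate(results, start=1):
            totals[doc_id] = totals.get(doc_id, 0.0) + 1.0 / (k_rrf + rank)
    return sorted(totals, key=totals.get, reverse=True)


raw_order = fuse_raw(vector_hits, bm25_hits)
rrf_order = fuse_rrf(vector_hits, bm25_hits)
print(raw_order)  # ['doc_c', 'doc_a', 'doc_b']
print(rrf_order)  # ['doc_a', 'doc_c', 'doc_b']
```

The raw sum simply reproduces the BM25 ordering because its scores are an order of magnitude larger, while RRF puts doc_a first since it ranks 1st and 2nd across the two lists.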
❌ Over-engineering
Start with 1 index. Add BM25 if eval shows a gap. Don't jump to 5 indexes on day 1.
Fix: Let eval results drive each addition.
Apply it now
Exercise 1: Implement Retriever (30 minutes)
Wrap vector + BM25 in a Retriever. Test that RRF fusion gives a reasonable ranking.
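One way to start on the "reasonable ranking" check is to test the scoring math in isolation before wiring up real indexes. The sketch below reproduces the worked example from the RRF deep dive (rank 1, rank 3, absent from the third index). rrf_score and top_vs_tenth are hypothetical helpers, separate from the Retriever class.

```python
import math
from typing import List, Optional


def rrf_score(ranks: List[Optional[int]], k_rrf: int = 60) -> float:
    """Sum 1 / (k_rrf + rank) over indexes; None means 'not retrieved'."""
    return sum(1.0 / (k_rrf + rank) for rank in ranks if rank is not None)


def top_vs_tenth(k_rrf: int) -> float:
    """How many times more a rank-1 hit contributes than a rank-10 hit."""
    return rrf_score([1], k_rrf) / rrf_score([10], k_rrf)


# Document D from the worked example: rank 1, rank 3, missing from index 3.
score_d = rrf_score([1, 3, None])
assert math.isclose(score_d, 1 / 61 + 1 / 63)

# Agreement across indexes beats a single top hit.
assert rrf_score([2, 2]) > rrf_score([1, None])

# Larger k_rrf flattens the gap between rank 1 and rank 10.
assert top_vs_tenth(1) > top_vs_tenth(60) > top_vs_tenth(100)
```

With these properties pinned down, the remaining work for the exercise is checking that the real Retriever returns documents in the order these scores imply.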
Exercise 2: Add TitleIndex (20 minutes)
Build a custom index for title matching. Plug it into the Retriever. Test.
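For the "Test" step, a quick structural check is possible if the protocol is marked `@runtime_checkable`, which is an assumption here since the lesson's SearchIndex is not decorated. Note that `isinstance` against a runtime-checkable protocol only verifies that the methods exist, not their signatures or return types. The Incomplete class is a hypothetical counterexample.

```python
from typing import Any, Dict, List, Optional, Protocol, Tuple, runtime_checkable


@runtime_checkable
class SearchIndex(Protocol):
    def add_document(self, content: str, metadata: Optional[Dict[str, Any]] = None) -> None: ...
    def search(self, query: str, k: int = 5) -> List[Tuple[Dict[str, Any], float]]: ...


class TitleIndex:
    """Case-insensitive substring match against document titles."""

    def __init__(self) -> None:
        self.docs: List[Dict[str, Any]] = []

    def add_document(self, content: str, metadata: Optional[Dict[str, Any]] = None) -> None:
        self.docs.append({"content": content, "metadata": metadata or {}})

    def search(self, query: str, k: int = 5) -> List[Tuple[Dict[str, Any], float]]:
        query_lower = query.lower()
        matches = [
            (doc, 1.0)
            for doc in self.docs
            if query_lower in doc["metadata"].get("title", "").lower()
        ]
        return matches[:k]


class Incomplete:
    """Hypothetical counterexample: has search() but no add_document()."""

    def search(self, query: str, k: int = 5) -> list:
        return []


title_idx = TitleIndex()
title_idx.add_document("Refunds are processed in 5 days.", {"title": "Refund Policy"})
title_idx.add_document("Shipping takes 3 days.", {"title": "Shipping Guide"})
hits = title_idx.search("refund policy")

conforms = isinstance(title_idx, SearchIndex)        # both methods present
rejected = not isinstance(Incomplete(), SearchIndex)  # add_document missing
```

A behavioral test on the search results is still needed on top of the structural check, since the protocol cannot tell whether the matching logic is right.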
Summary
🎯 The SearchIndex protocol gives every index a unified interface.
🎯 Retriever wraps multiple indexes and fuses their results with RRF.
🎯 RRF scoring uses ranks (not raw scores) → scale-invariant.
🎯 Extensible: add a new index type without refactoring.
🎯 Hybrid typically gives 10-15 percentage points higher recall than the best single index.