User search: "What happened with INC-2023-Q4-011?"
- Hiểu khi semantic search fail và BM25 cứu
- Implement BM25Index
- Kết hợp semantic + lexical qua hybrid search
- Dùng rank_bm25 library hoặc implement từ scratch
BM25 = "Best Match 25"
Algorithm from 1990s for information retrieval. Still gold standard cho lexical search.
Cơ chế
Ví dụ
Terms trong query: "INC-2023-Q4-011", "happened".
Document có "INC-2023-Q4-011" → super high BM25 score. Documents without → low.
Precise exact-match search.
- Tokenize query vào terms: ["inc-2023-q4-011", "happened"]
- Count term frequency trong corpus
- IDF weighting: rare terms weight cao
- Score documents theo (rare term match + term density)
- Return top K
- "happened" appears 50 times → low IDF weight
- "INC-2023-Q4-011" appears 1 time → very high IDF weight
Implementation — Using rank_bm25
pip install rank_bm25Implementation — Using rank_bm25 (tiếp)
from rank_bm25 import BM25Okapi
import re
from typing import List, Dict, Any, Tuple
class BM25Index:
"""BM25 lexical search index."""
def __init__(self):
self.docs: List[Dict[str, Any]] = []
self.bm25 = None
self._tokenized_corpus = []
def _tokenize(self, text: str) -> List[str]:
"""Lowercase + split, keep alphanumerics."""
return re.findall(r"[\w-]+", text.lower())
def add_document(self, content: str, metadata: Dict[str, Any] = None):
self.docs.append({"content": content, "metadata": metadata or {}})
self._tokenized_corpus.append(self._tokenize(content))
# Rebuild BM25 (simple approach)
self.bm25 = BM25Okapi(self._tokenized_corpus)
def add_documents(self, contents: List[str], metadatas: List[Dict] = None):
if metadatas is None:
metadatas = [{}] * len(contents)
for c, m in zip(contents, metadatas):
self.docs.append({"content": c, "metadata": m})
self._tokenized_corpus.append(self._tokenize(c))
self.bm25 = BM25Okapi(self._tokenized_corpus)
def search(self, query: str, k: int = 5) -> List[Tuple[Dict, float]]:
if self.bm25 is None:
return []
tokens = self._tokenize(query)
scores = self.bm25.get_scores(tokens)
top_indices = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
return [
(self.docs[i], float(scores[i]))
for i in top_indices
]Usage
Output:
bm25 = BM25Index()
chunks = [
"Medical research on XDR-47 virus. No IDs mentioned.",
"Cybersecurity incident INC-2023-Q4-011 was resolved.",
"Financial Q4 report shows revenue up 12%.",
"Software team: fixed 142 bugs, no major incidents."
]
bm25.add_documents(chunks, [{"id": i} for i in range(len(chunks))])
# Query
results = bm25.search("INC-2023-Q4-011", k=2)
for doc, score in results:
print(f"Score: {score:.2f}")
print(f"Content: {doc['content']}")
print()Usage (tiếp)
Exact match → super high score. Others near 0.
Score: 5.43
Content: Cybersecurity incident INC-2023-Q4-011 was resolved.
Score: 0.00
Content: Medical research on XDR-47 virus. ...Semantic vs BM25: complementary
Best production: Combine both.
| Semantic | BM25 | |
|---|---|---|
| Strength | Synonyms, concepts | Exact terms, IDs |
| Weakness | Miss rare technical terms | Miss paraphrases |
| Example | "car" ↔ "automobile" | "INC-2023" exact |
Hybrid search với Reciprocal Rank Fusion (RRF)
Merge results from multiple indexes:
class HybridRetriever:
def __init__(self, vector_index, bm25_index):
self.vector = vector_index
self.bm25 = bm25_index
def add_document(self, content, metadata=None):
self.vector.add_document(content, metadata)
self.bm25.add_document(content, metadata)
def search(self, query: str, k: int = 5, k_rrf: int = 60) -> List[Tuple]:
# Get results from both
vector_results = self.vector.search(query, k=k*2)
bm25_results = self.bm25.search(query, k=k*2)
# RRF scoring
scores: Dict[str, float] = {} # doc_id → score
docs_by_id: Dict[str, Dict] = {}
# Rank vector results
for rank, (doc, _) in enumerate(vector_results):
doc_id = str(doc["metadata"])
docs_by_id[doc_id] = doc
scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k_rrf + rank + 1)
# Rank BM25 results
for rank, (doc, _) in enumerate(bm25_results):
doc_id = str(doc["metadata"])
docs_by_id[doc_id] = doc
scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k_rrf + rank + 1)
# Sort
sorted_docs = sorted(scores.items(), key=lambda x: -x[1])
return [(docs_by_id[doc_id], score) for doc_id, score in sorted_docs[:k]]Why RRF?
Different scoring scales. Semantic gives [0, 1]. BM25 gives [0, ∞). Can't just average.
RRF uses rank (position), normalized:
k=60 typical. Documents ranking high in both indexes get highest combined score.
Example
Semantic ranks: A=1, B=2, C=3 BM25 ranks: C=1, A=2, B=3
RRF scores (k=60):
A top (high in both). C second. B third.
- A: 1/61 + 1/62 = 0.0325
- B: 1/62 + 1/63 = 0.0319
- C: 1/63 + 1/61 = 0.0323
score(d) = 1 / (k + rank)Full hybrid setup
Cybersecurity section top (both indexes rank it high). Software section second (semantic related).
# Build both indexes
vector_index = VectorIndex()
bm25_index = BM25Index()
retriever = HybridRetriever(vector_index, bm25_index)
# Add documents
for chunk in chunks:
retriever.add_document(chunk["content"], chunk["metadata"])
# Query
results = retriever.search("INC-2023-Q4-011 resolved how?", k=3)
for doc, score in results:
print(f"Score: {score:.4f}")
print(f"Content: {doc['content']}")
print()Tuning
k_rrf parameter
Test với eval set để tune.
Weighted merge
If one source is more reliable, weight it more:
Default equal weight OK for most.
- k_rrf=1: heavy weight to top-1 từ mỗi source
- k_rrf=60: typical, balanced
- k_rrf=100+: flatter, more equal weights
scores[doc_id] = scores.get(doc_id, 0) + 0.7 / (k_rrf + vector_rank + 1)
scores[doc_id] = scores.get(doc_id, 0) + 0.3 / (k_rrf + bm25_rank + 1)When hybrid matters most
Technical docs: codes, IDs, jargon → BM25 shine.
Legal/medical: clause references, drug codes → exact match critical.
Customer support: product SKUs, ticket IDs.
E-commerce search: brand names, model numbers.
For prose-heavy corpus (news, wikipedia), semantic alone thường đủ.
Anti-patterns
❌ Dùng chỉ BM25
Miss synonyms, paraphrases.
Fix: Hybrid always if có resources.
❌ Không tokenize query same as docs
Query tokenize khác doc tokenize → BM25 miss matches.
Fix: Same tokenization function.
❌ Không rebuild BM25 after adds
Fix: Rebuild sau adds.
self._tokenized_corpus.append(tokens)
# Forget rebuild
return self.bm25.get_scores(...) # old corpusÁp dụng ngay
Bài tập 1: Add BM25 to existing RAG (30 phút)
Take bài 6.55 VectorIndex. Add BM25Index. Implement HybridRetriever.
Test query với incident ID / specific code: compare retrieval quality.
Bài tập 2: Eval hybrid (20 phút)
20 test queries mix semantic + exact match. Compare:
Expect: Hybrid wins or ties in 90%+ cases.
- Vector only
- BM25 only
- Hybrid
Tóm tắt
🎯 Semantic + BM25 complementary. Rare technical terms need lexical.
🎯 BM25 = classic lexical algo, weight rare terms high.
🎯 Hybrid search via RRF — combine ranks, not raw scores.
🎯 rank_bm25 library = easy implementation.
🎯 Essential cho: technical docs, IDs, codes, legal, medical.