BM25 lexical search — Hybrid với keyword

7 — RAGTrung cấp20 phút

User search: "What happened with INC-2023-Q4-011?"

Bạn sẽ học được
  • Hiểu khi semantic search fail và BM25 cứu
  • Implement BM25Index
  • Kết hợp semantic + lexical qua hybrid search
  • Dùng rank_bm25 library hoặc implement từ scratch

BM25 = "Best Match 25"

Algorithm from 1990s for information retrieval. Still gold standard cho lexical search.

Cơ chế

Ví dụ

Terms trong query: "INC-2023-Q4-011", "happened".

Document có "INC-2023-Q4-011" → super high BM25 score. Documents without → low.

Precise exact-match search.

  • Tokenize query vào terms: ["inc-2023-q4-011", "happened"]
  • Count term frequency trong corpus
  • IDF weighting: rare terms weight cao
  • Score documents theo (rare term match + term density)
  • Return top K
  • "happened" appears 50 times → low IDF weight
  • "INC-2023-Q4-011" appears 1 time → very high IDF weight

Implementation — Using rank_bm25

pip install rank_bm25

Implementation — Using rank_bm25 (tiếp)

from rank_bm25 import BM25Okapi
import re
from typing import List, Dict, Any, Tuple


class BM25Index:
    """BM25 lexical search index."""
    
    def __init__(self):
        self.docs: List[Dict[str, Any]] = []
        self.bm25 = None
        self._tokenized_corpus = []
    
    def _tokenize(self, text: str) -> List[str]:
        """Lowercase + split, keep alphanumerics."""
        return re.findall(r"[\w-]+", text.lower())
    
    def add_document(self, content: str, metadata: Dict[str, Any] = None):
        self.docs.append({"content": content, "metadata": metadata or {}})
        self._tokenized_corpus.append(self._tokenize(content))
        # Rebuild BM25 (simple approach)
        self.bm25 = BM25Okapi(self._tokenized_corpus)
    
    def add_documents(self, contents: List[str], metadatas: List[Dict] = None):
        if metadatas is None:
            metadatas = [{}] * len(contents)
        for c, m in zip(contents, metadatas):
            self.docs.append({"content": c, "metadata": m})
            self._tokenized_corpus.append(self._tokenize(c))
        self.bm25 = BM25Okapi(self._tokenized_corpus)
    
    def search(self, query: str, k: int = 5) -> List[Tuple[Dict, float]]:
        if self.bm25 is None:
            return []
        
        tokens = self._tokenize(query)
        scores = self.bm25.get_scores(tokens)
        
        top_indices = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
        
        return [
            (self.docs[i], float(scores[i]))
            for i in top_indices
        ]

Usage

Output:

bm25 = BM25Index()

chunks = [
    "Medical research on XDR-47 virus. No IDs mentioned.",
    "Cybersecurity incident INC-2023-Q4-011 was resolved.",
    "Financial Q4 report shows revenue up 12%.",
    "Software team: fixed 142 bugs, no major incidents."
]

bm25.add_documents(chunks, [{"id": i} for i in range(len(chunks))])

# Query
results = bm25.search("INC-2023-Q4-011", k=2)
for doc, score in results:
    print(f"Score: {score:.2f}")
    print(f"Content: {doc['content']}")
    print()

Usage (tiếp)

Exact match → super high score. Others near 0.

Score: 5.43
Content: Cybersecurity incident INC-2023-Q4-011 was resolved.

Score: 0.00
Content: Medical research on XDR-47 virus. ...

Semantic vs BM25: complementary

Best production: Combine both.

SemanticBM25
StrengthSynonyms, conceptsExact terms, IDs
WeaknessMiss rare technical termsMiss paraphrases
Example"car" ↔ "automobile""INC-2023" exact

Hybrid search với Reciprocal Rank Fusion (RRF)

Merge results from multiple indexes:

class HybridRetriever:
    def __init__(self, vector_index, bm25_index):
        self.vector = vector_index
        self.bm25 = bm25_index
    
    def add_document(self, content, metadata=None):
        self.vector.add_document(content, metadata)
        self.bm25.add_document(content, metadata)
    
    def search(self, query: str, k: int = 5, k_rrf: int = 60) -> List[Tuple]:
        # Get results from both
        vector_results = self.vector.search(query, k=k*2)
        bm25_results = self.bm25.search(query, k=k*2)
        
        # RRF scoring
        scores: Dict[str, float] = {}  # doc_id → score
        docs_by_id: Dict[str, Dict] = {}
        
        # Rank vector results
        for rank, (doc, _) in enumerate(vector_results):
            doc_id = str(doc["metadata"])
            docs_by_id[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k_rrf + rank + 1)
        
        # Rank BM25 results
        for rank, (doc, _) in enumerate(bm25_results):
            doc_id = str(doc["metadata"])
            docs_by_id[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k_rrf + rank + 1)
        
        # Sort
        sorted_docs = sorted(scores.items(), key=lambda x: -x[1])
        return [(docs_by_id[doc_id], score) for doc_id, score in sorted_docs[:k]]

Why RRF?

Different scoring scales. Semantic gives [0, 1]. BM25 gives [0, ∞). Can't just average.

RRF uses rank (position), normalized:

k=60 typical. Documents ranking high in both indexes get highest combined score.

Example

Semantic ranks: A=1, B=2, C=3 BM25 ranks: C=1, A=2, B=3

RRF scores (k=60):

A top (high in both). C second. B third.

  • A: 1/61 + 1/62 = 0.0325
  • B: 1/62 + 1/63 = 0.0319
  • C: 1/63 + 1/61 = 0.0323
score(d) = 1 / (k + rank)

Full hybrid setup

Cybersecurity section top (both indexes rank it high). Software section second (semantic related).

# Build both indexes
vector_index = VectorIndex()
bm25_index = BM25Index()

retriever = HybridRetriever(vector_index, bm25_index)

# Add documents
for chunk in chunks:
    retriever.add_document(chunk["content"], chunk["metadata"])

# Query
results = retriever.search("INC-2023-Q4-011 resolved how?", k=3)
for doc, score in results:
    print(f"Score: {score:.4f}")
    print(f"Content: {doc['content']}")
    print()

Tuning

k_rrf parameter

Test với eval set để tune.

Weighted merge

If one source is more reliable, weight it more:

Default equal weight OK for most.

  • k_rrf=1: heavy weight to top-1 từ mỗi source
  • k_rrf=60: typical, balanced
  • k_rrf=100+: flatter, more equal weights
scores[doc_id] = scores.get(doc_id, 0) + 0.7 / (k_rrf + vector_rank + 1)
scores[doc_id] = scores.get(doc_id, 0) + 0.3 / (k_rrf + bm25_rank + 1)

When hybrid matters most

Technical docs: codes, IDs, jargon → BM25 shine.

Legal/medical: clause references, drug codes → exact match critical.

Customer support: product SKUs, ticket IDs.

E-commerce search: brand names, model numbers.

For prose-heavy corpus (news, wikipedia), semantic alone thường đủ.

Anti-patterns

❌ Dùng chỉ BM25

Miss synonyms, paraphrases.

Fix: Hybrid always if có resources.

❌ Không tokenize query same as docs

Query tokenize khác doc tokenize → BM25 miss matches.

Fix: Same tokenization function.

❌ Không rebuild BM25 after adds

Fix: Rebuild sau adds.

self._tokenized_corpus.append(tokens)
# Forget rebuild
return self.bm25.get_scores(...)  # old corpus

Áp dụng ngay

Bài tập 1: Add BM25 to existing RAG (30 phút)

Take bài 6.55 VectorIndex. Add BM25Index. Implement HybridRetriever.

Test query với incident ID / specific code: compare retrieval quality.

Bài tập 2: Eval hybrid (20 phút)

20 test queries mix semantic + exact match. Compare:

Expect: Hybrid wins or ties in 90%+ cases.

  • Vector only
  • BM25 only
  • Hybrid

Tóm tắt

🎯 Semantic + BM25 complementary. Rare technical terms need lexical.

🎯 BM25 = classic lexical algo, weight rare terms high.

🎯 Hybrid search via RRF — combine ranks, not raw scores.

🎯 rank_bm25 library = easy implementation.

🎯 Essential cho: technical docs, IDs, codes, legal, medical.

Nội dung này có hữu ích không?