Keyword search: "car" does not match "automobile", so the relevant document is missed.
- Understand what an embedding is and how semantic search works
- Use VoyageAI (recommended by Anthropic) to generate embeddings
- Compute cosine similarity between embeddings
- Compare semantic search with keyword search
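The keyword-search miss described at the top can be reproduced in a few lines. This is a minimal sketch of a naive word-overlap search; the `keyword_search` helper and the sample sentences are illustrative assumptions, not part of any library:

```python
def keyword_search(query: str, documents: list[str]) -> list[str]:
    """Return documents that share at least one lowercase word with the query."""
    query_words = set(query.lower().split())
    return [doc for doc in documents if query_words & set(doc.lower().split())]


docs = ["She drives an automobile", "My dog is friendly"]

print(keyword_search("car", docs))         # [] : no shared word, so the automobile document is missed
print(keyword_search("automobile", docs))  # ['She drives an automobile']
```

Semantic search aims to close exactly this gap by comparing meaning instead of literal words.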
What is an embedding?
Text → embedding model → list of ~1000-4000 numbers.
Each number represents "a feature" of the text. The individual values are not directly interpretable, but they are easy to compute with.
Similar meaning = similar vectors.
"I love coding" → [0.23, -0.45, 0.12, ..., 0.67] (1024 numbers)
"I enjoy programming" → [0.21, -0.44, 0.13, ..., 0.66] (similar!)
"I hate vegetables" → [-0.77, 0.33, -0.12, ..., 0.01] (very different)
VoyageAI: recommended by Anthropic
Anthropic does not currently provide its own embedding API. Its recommendation: VoyageAI.
Setup
pip install voyageai python-dotenv
A free tier is available: sign up at voyageai.com.
# .env
VOYAGE_API_KEY="your_key"
Code
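Before creating the client, it can help to fail fast when the key is missing. The sketch below assumes the key lives in the `VOYAGE_API_KEY` environment variable, as in the `.env` file above; `require_api_key` is a hypothetical helper, not part of the voyageai package:

```python
import os
from typing import Mapping, Optional


def require_api_key(
    env: Optional[Mapping[str, str]] = None,
    name: str = "VOYAGE_API_KEY",
) -> str:
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # "your_key" is the placeholder value from the sample .env file.
    if not key or key == "your_key":
        raise RuntimeError(f"{name} is not set; add it to .env or export it")
    return key


# Typical use, after load_dotenv() has run:
# api_key = require_api_key()
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.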
from dotenv import load_dotenv
import voyageai

load_dotenv()  # loads VOYAGE_API_KEY from .env
client = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def generate_embedding(text: str, input_type: str = "query") -> list[float]:
    """Generate an embedding for a single text."""
    result = client.embed(
        [text],
        model="voyage-3-large",  # recommended general-purpose model
        input_type=input_type,  # "query" or "document"
    )
    return result.embeddings[0]

# Test
vec = generate_embedding("I love coding")
print(f"Embedding dimension: {len(vec)}")  # 1024
print(f"First 5: {vec[:5]}")
Query vs Document
VoyageAI (like many embedding providers) recommends specifying input_type:
- "document": for chunks stored in the DB
- "query": for user questions
The model applies a slightly different encoding to each, which improves search quality.
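With many chunks, one way to keep the two roles separate and avoid one API call per chunk is to send documents in batches. This is a minimal sketch, assuming a hypothetical `embed_fn(texts, input_type)` callable that returns one vector per text; the batch size of 128 is an assumed value, so check VoyageAI's documented per-request limits before relying on it:

```python
from typing import Callable, Iterator

# Hypothetical signature: (texts, input_type) -> one vector per text.
EmbedFn = Callable[[list[str], str], list[list[float]]]

# A real wrapper could mirror the call used in generate_embedding:
# def voyage_embed(texts, input_type):
#     return client.embed(texts, model="voyage-3-large", input_type=input_type).embeddings


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_documents(chunks: list[str], embed_fn: EmbedFn, batch_size: int = 128) -> list[list[float]]:
    """Embed stored chunks, always with input_type='document'."""
    vectors: list[list[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_fn(batch, "document"))
    return vectors


def embed_query(query: str, embed_fn: EmbedFn) -> list[float]:
    """Embed a user question, always with input_type='query'."""
    return embed_fn([query], "query")[0]
```

Because the role is fixed inside each function, callers cannot accidentally embed a chunk as a query.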
# Preprocessing: embed chunks as documents
chunk_embeddings = [generate_embedding(c, input_type="document") for c in chunks]
# Query time: embed query
query_embedding = generate_embedding(user_query, input_type="query")
Cosine similarity
Measure the similarity between two embeddings:
import numpy as np

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors, in the range -1 to 1."""
    a = np.array(vec1)
    b = np.array(vec2)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
Result (a rough guide; exact thresholds vary by model):
- 1.0: identical meaning
- 0.9-1.0: very similar
- 0.5-0.8: somewhat related
- 0-0.4: unrelated
- <0: opposite meaning (rare)
Example
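The formula can also be checked without an API key. This is a minimal sketch in plain Python (no numpy), using made-up 4-dimensional vectors that only imitate the shape of real embeddings:

```python
import math


def cosine_similarity(vec1, vec2):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(vec1, vec2))
    norm1 = math.sqrt(sum(x * x for x in vec1))
    norm2 = math.sqrt(sum(y * y for y in vec2))
    return dot / (norm1 * norm2)


# Made-up vectors for illustration, not real model output.
coding = [0.23, -0.45, 0.12, 0.67]
programming = [0.21, -0.44, 0.13, 0.66]
vegetables = [-0.77, 0.33, -0.12, 0.01]

print(cosine_similarity(coding, programming))  # close to 1: nearly the same direction
print(cosine_similarity(coding, vegetables))   # negative: pointing away from each other
```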
vec_car = generate_embedding("I own a car", input_type="query")
vec_auto = generate_embedding("She drives an automobile", input_type="query")
vec_dog = generate_embedding("My dog is friendly", input_type="query")
print(cosine_similarity(vec_car, vec_auto))  # ~0.85 (similar)
print(cosine_similarity(vec_car, vec_dog))  # ~0.15 (unrelated)
Normalization
VoyageAI (like many embedding providers) returns vectors normalized to magnitude 1. For such vectors, cosine similarity equals the dot product, which is faster to compute.
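The equivalence is easy to verify on a small case. This is a minimal sketch in plain Python with two hand-picked 2-dimensional vectors (chosen so the lengths come out as exactly 5); the helper names are illustrative:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def normalize(a):
    """Scale a vector to length 1."""
    length = norm(a)
    return [x / length for x in a]


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


a = [3.0, 4.0]
b = [4.0, 3.0]
unit_a, unit_b = normalize(a), normalize(b)

print(cosine(a, b))         # about 0.96: full formula on the raw vectors
print(dot(unit_a, unit_b))  # about 0.96: plain dot product on the unit vectors
print(dot(a, b))            # 24.0: the dot product alone is not a similarity when vectors are not normalized
```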
def cosine_similarity_normalized(vec1, vec2):
    return np.dot(vec1, vec2)  # shorter; same result for unit-length vectors
Full search example
chunks = [
    "Medical research on XDR-47 bug virus.",
    "Software engineering fixed 100 bugs this quarter.",
    "Marketing team launched new campaign.",
]

# Preprocessing: embed all chunks
chunk_embeddings = [
    generate_embedding(c, input_type="document") for c in chunks
]

# Query
query = "How many software bugs were fixed?"
query_emb = generate_embedding(query, input_type="query")

# Find most similar
similarities = [
    cosine_similarity(query_emb, ce) for ce in chunk_embeddings
]

# Top 1
best_idx = int(np.argmax(similarities))
print(f"Best match: {chunks[best_idx]}")
print(f"Score: {similarities[best_idx]:.3f}")
# "Software engineering fixed 100 bugs this quarter." (0.87)

# Top K
top_k = sorted(enumerate(similarities), key=lambda x: -x[1])[:3]
for idx, score in top_k:
    print(f"{score:.3f} - {chunks[idx]}")
Claude would then answer based on the top chunk.
Models & Dimensions
VoyageAI options:
| Model | Dimensions | Best for |
|---|---|---|
| voyage-3-large | 1024 | General high quality |
| voyage-3 | 1024 | Cheap, fast |
| voyage-code-3 | 1024 | Code |
| voyage-finance-2 | 1024 | Finance |
| voyage-law-2 | 1024 | Legal |
Domain-specific models work better for specialized corpora.
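The table above can be turned into a small lookup so the model choice lives in one place. This is a minimal sketch; the domain labels and the `pick_model` helper are illustrative assumptions, while the model names come from the table:

```python
# Domain label -> model name, taken from the table above.
MODEL_BY_DOMAIN = {
    "general": "voyage-3-large",
    "code": "voyage-code-3",
    "finance": "voyage-finance-2",
    "legal": "voyage-law-2",
}


def pick_model(domain: str, default: str = "voyage-3-large") -> str:
    """Return the model for a domain, falling back to the general-purpose one."""
    return MODEL_BY_DOMAIN.get(domain.lower(), default)


print(pick_model("code"))     # voyage-code-3
print(pick_model("cooking"))  # voyage-3-large (no specialized model, so the default is used)
```

Whichever model is picked, the same one has to be used for both documents and queries.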
Cost
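VoyageAI pricing (check the site for current rates):
- $0.02 to $0.15 per 1M tokens, depending on the model
Example: 10K chunks × 500 tokens each = 5M tokens. One-time embedding cost: about $0.10 to $0.75 across that range (~$0.50 at $0.10 per 1M tokens).
Ongoing query cost scales with query tokens at the same per-1M-token rates. Queries are short, so it stays very affordable.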
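The cost arithmetic is just tokens times price. This is a minimal sketch using the price range quoted above; the middle rate of $0.10 per 1M tokens is an assumed example value, not a published price for a specific model:

```python
def embedding_cost(num_chunks: int, tokens_per_chunk: int, price_per_million: float) -> float:
    """One-time cost in dollars to embed a corpus."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million


# 10K chunks of 500 tokens each = 5M tokens.
for price in (0.02, 0.10, 0.15):
    cost = embedding_cost(10_000, 500, price)
    print(f"${price:.2f} per 1M tokens -> ${cost:.2f}")
# $0.02 per 1M tokens -> $0.10
# $0.10 per 1M tokens -> $0.50
# $0.15 per 1M tokens -> $0.75
```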
Vector database
Store embeddings efficiently:
Simple in-memory
class SimpleVectorStore:
    """Minimal in-memory vector store with brute-force search."""

    def __init__(self):
        self.docs = []
        self.embeddings = []

    def add(self, content, embedding, metadata=None):
        self.docs.append({"content": content, "metadata": metadata or {}})
        self.embeddings.append(embedding)

    def search(self, query_embedding, k=5):
        """Return the k most similar docs as (doc, score) pairs."""
        sims = [cosine_similarity(query_embedding, e) for e in self.embeddings]
        top_k = sorted(enumerate(sims), key=lambda x: -x[1])[:k]
        return [(self.docs[i], score) for i, score in top_k]
Fine for < 10K chunks held in memory. Beyond that, use a dedicated vector DB.
Production options
- Pinecone: managed, fast
- Chroma: open source, simple
- Qdrant: open source, scalable
- Weaviate: feature-rich
- pgvector: Postgres extension (if you already run Postgres)
Pick based on scale, budget, and infrastructure.
Anti-patterns
❌ Mixing query and document embeddings
# ❌ Wrong: inconsistent input_type
for c in chunks:
    vec = generate_embedding(c, input_type="query")  # should be "document"
Fix: use input_type="document" for stored chunks and input_type="query" for user questions, consistently.
❌ Mixing different models
Embedding documents with model A and queries with model B produces incompatible vectors.
Fix: use the same model throughout the pipeline.
❌ Not normalizing before storing
If you compute the dot product yourself on vectors that are not normalized, the similarity is wrong.
Fix: rely on library-managed normalization, or use the full cosine formula.
❌ Re-embedding documents on every query
Re-running preprocessing on every query is expensive and slow.
Fix: run preprocessing once and store the embeddings in the DB. Embed only the query at query time.
Apply it now
Exercise 1: Sign up for VoyageAI and test (15 minutes)
- Register at voyageai.com
- Get an API key and add it to .env
- Run the sample code and verify the embedding has 1024 dimensions
Exercise 2: Build a simple search (30 minutes)
Chunk a sample document (lesson 6.52). Embed all chunks. Query for something specific. Retrieve the top 3. Verify relevance.
Summary
🎯 Embeddings = text → vector of numbers. Meaning is encoded in the vector's position.
🎯 VoyageAI is recommended by Anthropic: easy and cheap.
🎯 Cosine similarity measures how similar two embeddings are (-1 to 1).
🎯 input_type matters: "query" vs "document" for better quality.
🎯 A simple in-memory store is fine for < 10K chunks. Use a vector DB at production scale.