Text embeddings: turning text into numbers

7. RAG · Intermediate · 20 minutes

Keyword search fails on synonyms: a query for "car" does not match a document that says "automobile", so the relevant result is missed.
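The miss can be reproduced with a few lines of plain Python. This is a minimal sketch of a naive keyword matcher, not a production search engine: the function name `keyword_search` and the two sample sentences are illustrative, and punctuation is not handled.

```python
def keyword_search(query: str, docs: list[str]) -> list[str]:
    """Return docs that contain every word of the query (case-insensitive)."""
    query_words = set(query.lower().split())
    return [d for d in docs if query_words <= set(d.lower().split())]


docs = ["I own a car", "She drives an automobile"]
print(keyword_search("car", docs))         # ['I own a car']
print(keyword_search("automobile", docs))  # ['She drives an automobile']
```

Neither query finds both sentences, even though they describe the same kind of object. Semantic search is meant to close exactly this gap.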

What you will learn
  • Understand what an embedding is and how semantic search works
  • Use VoyageAI (recommended by Anthropic) to generate embeddings
  • Compute cosine similarity between embeddings
  • Compare semantic search with keyword search

What is an embedding?

Text → embedding model → list of roughly 1,000 to 4,000 numbers.

Each number represents "a feature" of the text. Individual numbers are not directly interpretable, but they are fast to compute with.

Similar meaning = similar vectors.

"I love coding"  →  [0.23, -0.45, 0.12, ..., 0.67]  (1024 numbers)
"I enjoy programming"  →  [0.21, -0.44, 0.13, ..., 0.66]  (similar!)
"I hate vegetables"  →  [-0.77, 0.33, -0.12, ..., 0.01]  (very different)

VoyageAI (recommended by Anthropic)

Anthropic does not yet offer its own embedding API. Its recommendation is VoyageAI.

Setup

Free tier available: sign up at voyageai.com.

pip install voyageai

# .env
VOYAGE_API_KEY="your_key"

Code

from dotenv import load_dotenv
import voyageai

load_dotenv()
client = voyageai.Client()


def generate_embedding(text: str, input_type: str = "query") -> list[float]:
    """Generate embedding for text."""
    result = client.embed(
        [text],
        model="voyage-3-large",  # recommended general-purpose
        input_type=input_type  # "query" or "document"
    )
    return result.embeddings[0]


# Test
vec = generate_embedding("I love coding")
print(f"Embedding dimension: {len(vec)}")  # 1024
print(f"First 5: {vec[:5]}")

Query vs Document

VoyageAI (like some other embedding providers) recommends specifying input_type:

  • "document": for chunks stored in the DB
  • "query": for user questions

The model applies a slightly different encoding to each type, which gives better search quality.

# Preprocessing: embed chunks as documents
chunk_embeddings = [generate_embedding(c, input_type="document") for c in chunks]

# Query time: embed query
query_embedding = generate_embedding(user_query, input_type="query")
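The list comprehension above makes one API call per chunk. Since `client.embed` takes a list of texts, chunks can be sent in batches instead. Below is a minimal sketch of that idea. The helper name `embed_in_batches`, the injected `embed_batch` callable, and the default batch size of 128 are assumptions for illustration, so check the provider's documentation for the real per-request limit.

```python
from typing import Callable

Vector = list[float]


def embed_in_batches(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[Vector]],
    batch_size: int = 128,  # assumed limit; check the provider's documentation
) -> list[Vector]:
    """Embed texts in slices of batch_size, preserving input order."""
    vectors: list[Vector] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[start:start + batch_size]))
    return vectors


# Hypothetical wiring to the client used in this lesson (not run here):
# embed_docs = lambda batch: client.embed(
#     batch, model="voyage-3-large", input_type="document"
# ).embeddings
# chunk_embeddings = embed_in_batches(chunks, embed_docs)
```

Passing the embedding call in as a function keeps the batching logic testable without an API key.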

Cosine similarity

Measures the similarity between two embeddings:

import numpy as np

def cosine_similarity(vec1, vec2):
    a = np.array(vec1)
    b = np.array(vec2)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Result (a rough guide; actual score ranges vary by model):

  • 1.0: identical meaning
  • 0.9-1.0: very similar
  • 0.5-0.8: somewhat related
  • 0-0.4: unrelated
  • <0: opposite meaning (rare)

Example

vec_car = generate_embedding("I own a car", input_type="query")
vec_auto = generate_embedding("She drives an automobile", input_type="query")
vec_dog = generate_embedding("My dog is friendly", input_type="query")

print(cosine_similarity(vec_car, vec_auto))  # ~0.85 (similar)
print(cosine_similarity(vec_car, vec_dog))   # ~0.15 (unrelated)
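The ends of the score range can also be checked offline, without an API key. The sketch below uses hand-made 2-D vectors (toy values, not real embeddings) and a plain-Python helper named `cosine`, which is an illustrative stand-in for the NumPy version.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity in plain Python: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


east = [1.0, 0.0]
print(cosine(east, [2.0, 0.0]))   # 1.0: same direction, length is ignored
print(cosine(east, [0.0, 3.0]))   # 0.0: perpendicular
print(cosine(east, [-1.0, 0.0]))  # -1.0: opposite direction
```

The first case shows that cosine similarity depends only on direction: doubling a vector's length does not change the score.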

Normalization

VoyageAI (like many embedding providers) normalizes its vectors to magnitude 1. This means cosine similarity = dot product:

def cosine_similarity_normalized(vec1, vec2):
    return np.dot(vec1, vec2)  # shorter, same result for unit-length vectors

This skips both norm computations, so it is faster.
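If vectors come from a source that does not normalize them, they can be scaled to length 1 before storing. This is a minimal sketch on toy vectors. The helper name `normalize` is illustrative, and it assumes the input is not the zero vector.

```python
import numpy as np


def normalize(vec) -> np.ndarray:
    """Scale a vector to length 1 (assumes the vector is not all zeros)."""
    v = np.asarray(vec, dtype=float)
    return v / np.linalg.norm(v)


a = normalize([3.0, 4.0])  # direction of (3, 4), length 1
b = normalize([4.0, 3.0])  # direction of (4, 3), length 1

# Full cosine formula, for comparison with the plain dot product
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(np.linalg.norm(a), 1.0))  # True
print(np.isclose(np.dot(a, b), cosine))    # True
```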

Full search example

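The example below embeds three chunks, embeds a query, and ranks the chunks by similarity. In a RAG pipeline, Claude would then answer based on the top chunk.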

chunks = [
    "Medical research on XDR-47 bug virus.",
    "Software engineering fixed 100 bugs this quarter.",
    "Marketing team launched new campaign.",
]

# Preprocessing: embed all chunks
chunk_embeddings = [
    generate_embedding(c, input_type="document") for c in chunks
]

# Query
query = "How many software bugs were fixed?"
query_emb = generate_embedding(query, input_type="query")

# Find most similar
similarities = [
    cosine_similarity(query_emb, ce) for ce in chunk_embeddings
]

# Top 1
best_idx = np.argmax(similarities)
print(f"Best match: {chunks[best_idx]}")
print(f"Score: {similarities[best_idx]:.3f}")
# "Software engineering fixed 100 bugs this quarter." (0.87)

# Top K
top_k = sorted(enumerate(similarities), key=lambda x: -x[1])[:3]
for idx, score in top_k:
    print(f"{score:.3f} - {chunks[idx]}")
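For larger collections, the per-chunk Python loop can be replaced by a single matrix product. The sketch below assumes the vectors are already unit-length, so the dot product equals cosine similarity. The function name `top_k_matches` and the toy 2-D vectors are illustrative, not real embeddings.

```python
import numpy as np


def top_k_matches(query_vec, doc_matrix, k: int = 3) -> list[tuple[int, float]]:
    """Return (row_index, score) for the k rows most similar to query_vec.

    Assumes every vector is already unit-length, so dot product = cosine.
    """
    scores = np.asarray(doc_matrix) @ np.asarray(query_vec)  # one score per row
    best = np.argsort(scores)[::-1][:k]                      # highest score first
    return [(int(i), float(scores[i])) for i in best]


# Toy unit vectors standing in for real embeddings (illustrative values only)
doc_matrix = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query_vec = np.array([0.8, 0.6])

print(top_k_matches(query_vec, doc_matrix, k=2))  # rows 1 and 0, best first
```

Stacking embeddings into one array computes all scores in one call instead of one `cosine_similarity` call per chunk.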

Models & Dimensions

VoyageAI options:

Model             Dim   Best for
voyage-3-large    1024  General high quality
voyage-3          1024  Cheap, fast
voyage-code-3     1024  Code
voyage-finance-2  1024  Finance
voyage-law-2      1024  Legal

Domain-specific models are better for specialized corpora.

Cost

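VoyageAI pricing (check the site for current rates):

  • $0.02 - $0.15 per 1M tokens, depending on the model

For 10K chunks × 500 tokens each = 5M tokens. One-time cost: ~$0.50 (between $0.10 and $0.75 across that price range).

Ongoing queries are billed at the same per-token rates. Queries are short, so the per-query cost is tiny. Very affordable.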
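The arithmetic behind the one-time estimate can be sketched as a small helper. The function name `embedding_cost_usd` is illustrative, and the prices are the range quoted in this lesson rather than current list prices.

```python
def embedding_cost_usd(num_chunks: int, tokens_per_chunk: int,
                       price_per_million_tokens: float) -> float:
    """One-time cost of embedding a corpus at a flat per-token price."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens


# 10K chunks x 500 tokens = 5M tokens, priced at both ends of the quoted range
for price in (0.02, 0.15):
    print(f"${price:.2f}/1M tokens -> ${embedding_cost_usd(10_000, 500, price):.2f}")
# $0.02/1M tokens -> $0.10
# $0.15/1M tokens -> $0.75
```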

Vector database

Store embeddings efficiently:

Simple in-memory

class SimpleVectorStore:
    """Keeps documents and their embeddings in two parallel lists."""

    def __init__(self):
        self.docs = []
        self.embeddings = []

    def add(self, content, embedding, metadata=None):
        self.docs.append({"content": content, "metadata": metadata or {}})
        self.embeddings.append(embedding)

    def search(self, query_embedding, k=5):
        # Score every stored embedding, then return the k best (doc, score) pairs
        sims = [cosine_similarity(query_embedding, e) for e in self.embeddings]
        top_k = sorted(enumerate(sims), key=lambda x: -x[1])[:k]
        return [(self.docs[i], sims[i]) for i, _ in top_k]

OK for < 10K chunks in memory. For anything bigger, use a dedicated vector DB.

Production options

  • Pinecone: managed, fast
  • Chroma: open source, simple
  • Qdrant: open source, scalable
  • Weaviate: feature-rich
  • pgvector: Postgres extension (if you already run Postgres)

Pick based on scale, budget, and infrastructure.

Anti-patterns

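❌ Mixing up query and document embeddings

Embedding stored chunks with input_type="query" (or user questions with "document") lowers search quality.

# ❌ Wrong: inconsistent input_type
for c in chunks:
    vec = generate_embedding(c, input_type="query")  # should be "document"

Fix: Use input_type consistently: "document" for stored chunks, "query" for user questions.

❌ Mixing different models

Embedding documents with model A and queries with model B produces incompatible vectors.

Fix: Use the same model throughout the pipeline.

❌ Not normalizing before storing

If you compute similarity manually as a plain dot product and forget to normalize, the similarity scores are wrong.

Fix: Use library-managed normalization.

❌ Re-embedding on every query

Re-running preprocessing for every query is expensive and slow.

Fix: Run preprocessing once and store the results in the DB.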
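To avoid the re-embedding anti-pattern in a small project, embeddings can be cached on disk and looked up by a hash of the text. This is a minimal sketch, not a substitute for a vector DB. The function name `cached_embedding`, the JSON file layout, and the injected `embed_fn` callable are all assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_embedding(
    text: str,
    embed_fn: Callable[[str], list[float]],
    cache_path: Path,
) -> list[float]:
    """Return the stored embedding for text, calling embed_fn only on a cache miss."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = embed_fn(text)  # the only place the API would be called
        cache_path.write_text(json.dumps(cache))
    return cache[key]
```

With the lesson's helper, a call might look like `cached_embedding(chunk, lambda t: generate_embedding(t, input_type="document"), Path("embeddings.json"))`. If the model or input_type can vary, they should be part of the cache key as well. The sketch rewrites the whole file on each miss, which is only reasonable for small corpora.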

Apply it now

Exercise 1: Sign up for VoyageAI and test (15 minutes)

  • Register at voyageai.com
  • Get an API key and add it to .env
  • Run the sample code and verify that the embedding has 1024 dimensions

Exercise 2: Build a simple search (30 minutes)

Chunk a sample doc (lesson 6.52). Embed all chunks. Query for "specific thing". Retrieve the top 3. Verify relevance.

Summary

🎯 Embeddings = text → vector of numbers. Meaning is encoded in the vector's position.

🎯 VoyageAI is the provider Anthropic recommends: easy and cheap.

🎯 Cosine similarity measures how similar embeddings are (-1 to 1).

🎯 input_type matters: use "query" vs "document" for better quality.

🎯 A simple in-memory store is OK for < 10K chunks. Use a vector DB at production scale.
