The doc has 2 sections: Medical Research and Software Engineering.
- Compare 4 chunking strategies: size, structure, sentence, semantic
- Implement size-based chunking with overlap
- Implement structure-based chunking for Markdown
- Choose the right strategy for each document type
4 strategies
1. Size-based
Cut the text into equal chunks of roughly N characters or tokens.
Pros: Simple and universal (code, text, any format). Cons: Cuts land mid-sentence or even mid-word, so context is lost at the boundary.
Overlap is critical: about 10-20% of the chunk size. It preserves information that crosses chunk boundaries.

    def chunk_by_char(text, chunk_size=1500, chunk_overlap=200):
        # The overlap must be smaller than the chunk, or the window never advances.
        if not 0 <= chunk_overlap < chunk_size:
            raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
        chunks = []
        start = 0
        while start < len(text):
            end = min(start + chunk_size, len(text))
            chunks.append(text[start:end])
            # Step back by the overlap, unless this was the last chunk.
            start = end - chunk_overlap if end < len(text) else len(text)
        return chunks

2. Structure-based (Markdown, HTML)
Split on headers and sections.
Pros: Chunks carry semantic meaning and align with the author's intent. Cons: Requires structured input (MD/HTML); fails on plain text.

    import re

    def chunk_by_section(markdown_text):
        # Split on H2 headers; the "\n## " marker is consumed by the split.
        pattern = r"\n## "
        sections = re.split(pattern, markdown_text)
        # Re-add the header marker to every section except the first.
        return [sections[0]] + [f"## {s}" for s in sections[1:]]

3. Sentence-based
Split the text into sentences, then group N sentences per chunk with a sentence-level overlap.
Pros: Respects sentence boundaries. A middle ground. Cons: Sentence length can vary widely, so chunk sizes vary too.
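
    import re

    def chunk_by_sentence(text, max_sentences=5, overlap=1):
        if not 0 <= overlap < max_sentences:
            raise ValueError("overlap must be >= 0 and < max_sentences")
        # Split after ., ! or ? when followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        chunks = []
        start = 0
        while start < len(sentences):
            end = min(start + max_sentences, len(sentences))
            chunks.append(" ".join(sentences[start:end]))
            if end == len(sentences):
                break  # avoid a trailing chunk made only of overlap sentences
            start += max_sentences - overlap
        return chunks

4. Semantic chunking
Measure the similarity between sentences, then group related ones.
Pros: Most relevant chunks. Cons: Expensive (every sentence must be embedded and compared), complex.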
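
    import re
    import numpy as np
    from sentence_transformers import SentenceTransformer

    def cosine_similarity(a, b):
        # Cosine of the angle between two embedding vectors.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def chunk_semantic(text, similarity_threshold=0.7):
        sentences = re.split(r"(?<=[.!?])\s+", text)
        model = SentenceTransformer("all-MiniLM-L6-v2")
        embeddings = model.encode(sentences)
        chunks = []
        current = [sentences[0]]
        for i in range(1, len(sentences)):
            # Compare each sentence with the one right before it.
            similarity = cosine_similarity(embeddings[i - 1], embeddings[i])
            if similarity > similarity_threshold:
                current.append(sentences[i])
            else:
                # Similarity dropped: close the current chunk and start a new one.
                chunks.append(" ".join(current))
                current = [sentences[i]]
        chunks.append(" ".join(current))
        return chunks

Comparison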
| Strategy | Ease | Quality | Speed | Best for |
|---|---|---|---|---|
| Size | ⭐⭐⭐ | ⭐ | ⭐⭐⭐ | Any content, code |
| Structure | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | Markdown, HTML |
| Sentence | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ | Prose text |
| Semantic | ⭐ | ⭐⭐⭐⭐ | ⭐ | Production quality |
Practical recommendations
Starting fresh
- Default: Size-based with overlap (chunk_size=1500, overlap=200 chars)
- Know the docs are Markdown? Use structure-based for internal docs
- Production: Try size-based → evaluate → upgrade to semantic if needed
Production case studies
Legal docs (PDF): Structure-based by section/clause. Each clause = 1 chunk.
Codebase: Size-based, aligned to function boundaries. Preserve whole functions.
Customer support KB (Markdown): Structure-based per article + sub-section.
Chat transcripts: Sentence-based, in groups of 10 sentences.
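The recommendations above can be sketched as a small dispatcher that picks a strategy from the file type. This is only an illustration: `pick_strategy`, the strategy labels, and the extension sets are hypothetical names chosen for this example, and a real pipeline would likely inspect the content as well as the extension.

```python
from pathlib import Path

# Assumed extension lists; adjust them to the documents you actually ingest.
STRUCTURED_EXTENSIONS = {".md", ".markdown", ".html", ".htm"}
CODE_EXTENSIONS = {".py", ".js", ".ts", ".java", ".go"}


def pick_strategy(filename: str) -> str:
    """Return a strategy label for a file, based only on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in STRUCTURED_EXTENSIONS:
        return "structure"  # split on headers/sections
    if suffix in CODE_EXTENSIONS:
        return "function"   # keep whole functions/classes together
    return "size"           # default: size-based with overlap


print(pick_strategy("report.md"))
print(pick_strategy("notes.txt"))
```

Falling back to size-based chunking keeps the default from the list above; the evaluate-then-upgrade step stays a manual decision.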
Chunk size sweet spot
Most production setups: 1500 tokens with 200 tokens of overlap.
| Size | Pros | Cons |
|---|---|---|
| 200-500 tokens | Precise retrieval | Lose cross-paragraph context |
| 1000-2000 tokens | Balance | Typical best |
| 4000+ tokens | Lots of context | Retrieval noisy, less precision |
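To see what a given size/overlap pair means in practice, the number of chunks can be estimated with a little arithmetic. The sketch below assumes a sliding window that advances by `chunk_size - overlap` units per step (characters or tokens, as long as all three arguments use the same unit); `estimate_chunk_count` is a name made up for this example.

```python
import math


def estimate_chunk_count(doc_length: int, chunk_size: int = 1500, overlap: int = 200) -> int:
    """Estimate how many chunks a sliding window with overlap produces."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and < chunk_size")
    if doc_length <= 0:
        return 0
    if doc_length <= chunk_size:
        return 1
    # After the first chunk, each extra chunk covers (chunk_size - overlap) new units.
    stride = chunk_size - overlap
    return math.ceil((doc_length - overlap) / stride)


print(estimate_chunk_count(10_000))  # 8
```

For a 10,000-unit document, 1500/200 gives ceil(9800 / 1300) = 8 chunks. Raising the overlap shrinks the stride, so the same document produces more chunks to embed and store.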
Evaluate chunks
Always verify that the chunks make sense:

    chunks = chunk_by_char(doc_text, 1500, 200)
    # Print the first 3 chunks with their sizes for a quick visual check.
    for i, chunk in enumerate(chunks[:3]):
        print(f"=== Chunk {i} ({len(chunk)} chars) ===")
        print(chunk[:200] + "...")
        print()

Check:
- Complete sentences (mostly)
- Headers paired with their content
- No random splits mid-word
Metadata with chunks
Add metadata to each chunk to improve retrieval quality:

    chunks_with_meta = [{
        "content": chunk_text,  # the text of one chunk
        "source": "doc1.md",
        "section": "introduction",
        "page": 5,
        "chunk_index": 0,
    }]

Benefits:
- Filter by source/section before searching
- Cite the source in the response
- Debug retrieval
Anti-patterns
❌ No overlap
Information gets cut in the middle → chunks miss context.
Fix: Use 10-20% overlap.
❌ Chunks too small (< 200 tokens)
Retrieval returns lots of tiny chunks with no coherent information.
Fix: 500+ tokens minimum.
❌ Chunks too large (> 4000 tokens)
Retrieval is less precise, and Claude has to process more tokens.
Fix: Keep chunks at 1500-2000 tokens.
❌ Not re-chunking when a doc is updated
Stale chunks → outdated info.
Fix: Re-chunk + re-embed on update.
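One simple way to notice that a document needs re-chunking is to store a content hash next to its chunks and compare it on every ingest. This is a minimal sketch under that assumption; `needs_rechunk` and the in-memory `seen_hashes` dict are hypothetical stand-ins for whatever store your pipeline uses.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_rechunk(doc_id: str, text: str, seen_hashes: dict) -> bool:
    """Return True (and record the new hash) if the document is new or changed."""
    new_hash = content_hash(text)
    if seen_hashes.get(doc_id) == new_hash:
        return False  # unchanged: existing chunks and embeddings are still valid
    seen_hashes[doc_id] = new_hash
    return True


seen = {}
print(needs_rechunk("doc1.md", "version 1", seen))  # True
print(needs_rechunk("doc1.md", "version 1", seen))  # False
print(needs_rechunk("doc1.md", "version 2", seen))  # True
```

When the function returns True, the old chunks for that document should be deleted before the new ones are embedded, so stale and fresh chunks never coexist.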
❌ Size-based chunking that cuts code mid-function
A function gets cut in half → Claude cannot parse it.
Fix: Chunk code along function/class boundaries.
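For Python source, one way to respect function and class boundaries is the standard library's `ast` module. The sketch below is illustrative, not a complete solution: `chunk_python_by_definition` is a made-up name, it assumes the input parses as valid Python (3.9+), and it skips module-level statements such as imports, which a real pipeline would probably want to keep in a chunk of their own.

```python
import ast


def chunk_python_by_definition(source: str) -> list[str]:
    """Return one chunk per top-level function or class, never cut mid-body."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator (if any) so it stays with its definition.
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append("\n".join(lines[first - 1:node.end_lineno]))
    return chunks


sample = "import os\n\ndef a():\n    return 1\n\nclass B:\n    def m(self):\n        return 2\n"
for chunk in chunk_python_by_definition(sample):
    print(chunk)
    print("---")
```

Very long classes may still exceed the target chunk size; splitting them per method is a natural next step.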
Apply it now
Exercise 1: Chunk a sample doc (20 minutes)
Take a Markdown doc (e.g., report.md). Try:
- Size-based with chunk_size=1000
- Structure-based by H2
Compare the chunks. Which preserves meaning better?
Exercise 2: Test with a query (20 minutes)
For each strategy, check:
- Chunk count
- Median chunk size
- Look at 3 random chunks: are they self-contained?
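The checks in this exercise can be collected in a small helper. This is a sketch only: `summarize_chunks` is a hypothetical name, sizes are measured in characters, and the fixed seed is an assumption that keeps the random sample reproducible between runs.

```python
import random
import statistics


def summarize_chunks(chunks, sample_size=3, seed=0):
    """Return the chunk count, median size in characters, and a random sample."""
    sizes = [len(c) for c in chunks]
    rng = random.Random(seed)  # seeded so the same chunks are sampled each run
    sample = rng.sample(chunks, min(sample_size, len(chunks)))
    return {
        "count": len(chunks),
        "median_chars": statistics.median(sizes) if sizes else 0,
        "sample": sample,
    }


stats = summarize_chunks(["aa", "bbbb", "cccccc", "d"])
print(stats["count"], stats["median_chars"])  # 4 3.0
```

Run it once per strategy on the same document and read the sampled chunks by hand: the numbers show the size distribution, but only reading tells you whether a chunk is self-contained.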
Summary
🎯 4 strategies: size, structure, sentence, semantic.
🎯 Size + overlap is the default: it works anywhere.
🎯 Structure-based is best for Markdown/HTML.
🎯 Attach metadata to chunks: source, section, and page for citation.
🎯 Chunk size 1500 + overlap 200 is the sweet spot for 80% of cases.