Text chunking strategies

7: RAG · Intermediate · 20 minutes

The doc has 2 sections: Medical Research and Software Engineering.

What you will learn
  • Compare 4 chunking strategies: size, structure, sentence, semantic
  • Implement size-based chunking with overlap
  • Implement structure-based chunking for Markdown
  • Choose the right strategy for each document type

The 4 strategies

1. Size-based

Cut the text into equal chunks of roughly N characters or tokens.

Pros: Simple and universal (code, text, any format). Cons: Cuts can land mid-word or mid-sentence, so context is lost.

Overlap is critical: about 10-20% of the chunk size. It preserves information that crosses chunk boundaries.

def chunk_by_char(text, chunk_size=1500, chunk_overlap=200):
    # Overlap must be smaller than the chunk, or the window never advances.
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # last chunk reached
        # Step back so consecutive chunks share chunk_overlap characters.
        start = end - chunk_overlap
    return chunks

2. Structure-based (Markdown, HTML)

Split by headers and sections.

Pros: Chunks carry semantic meaning and align with the author's intent. Cons: Requires structured input (Markdown/HTML). Fails on plain text.

import re

def chunk_by_section(markdown_text):
    # Split at every line that starts an H2 header ("## "), including one
    # on the first line; the lookahead keeps the marker with its section.
    sections = re.split(r"(?m)^(?=## )", markdown_text)
    # Drop empty pieces; text before the first header stays as its own chunk.
    return [s.strip() for s in sections if s.strip()]

3. Sentence-based

Split the text into sentences, then group N sentences per chunk, overlapping by a few sentences.

Pros: Respects sentence boundaries. A middle ground. Cons: Sentence length varies widely, so chunk sizes do too.

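import re

def chunk_by_sentence(text, max_sentences=5, overlap=1):
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must be >= 0 and < max_sentences")
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    start = 0
    while start < len(sentences):
        end = min(start + max_sentences, len(sentences))
        chunks.append(" ".join(sentences[start:end]))
        if end == len(sentences):
            break  # avoid a trailing chunk that only repeats the overlap
        start += max_sentences - overlap
    return chunks

4. Semantic chunking

Measure the similarity between sentences → group related ones together.

Pros: Most relevant chunks. Cons: Expensive (similarity must be computed) and complex.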

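import re

import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def chunk_semantic(text, similarity_threshold=0.7):
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences)

    chunks = []
    current = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])
        if similarity > similarity_threshold:
            current.append(sentences[i])  # still on the same topic
        else:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = [sentences[i]]
    chunks.append(" ".join(current))
    return chunks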

So sánh

Strategy | Ease / Quality / Speed | Best for
Size | ⭐⭐⭐⭐⭐⭐ | Any content, code
Structure | ⭐⭐⭐⭐⭐⭐⭐⭐ | Markdown, HTML
Sentence | ⭐⭐⭐⭐⭐⭐⭐⭐ | Prose text
Semantic | ⭐⭐⭐⭐ | Production quality

Practical recommendations

Starting fresh

  • Default: size-based with overlap (chunk_size=1500, overlap=200 chars)
  • Docs in Markdown? Use structure-based for internal docs
  • Production: try size-based → evaluate → upgrade to semantic if needed

Production case studies

Legal docs (PDF): Structure-based by section/clause. Each clause = 1 chunk.

Codebase: Size-based, but cut at function boundaries so that whole functions are preserved.

Customer support KB (Markdown): Structure-based per article + sub-section.

Chat transcripts: Sentence-based, in groups of 10 sentences.
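
The recommendations and case studies in this section amount to a lookup from document type to strategy. A minimal sketch of that lookup follows, assuming the file extension is a good enough signal of the document type. The `pick_strategy` function, the `STRATEGY_BY_EXTENSION` table, and the strategy labels are hypothetical names chosen for illustration, not part of any library.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the strategy names used in this lesson.
STRATEGY_BY_EXTENSION = {
    ".md": "structure",
    ".markdown": "structure",
    ".html": "structure",
    ".htm": "structure",
    ".py": "function",   # code: cut at function/class boundaries
    ".txt": "sentence",  # plain prose
}

def pick_strategy(filename, default="size"):
    """Return a strategy label for a file; unknown types fall back to size-based."""
    return STRATEGY_BY_EXTENSION.get(Path(filename).suffix.lower(), default)

print(pick_strategy("report.md"))  # structure
print(pick_strategy("data.csv"))   # size
```

Falling back to size-based chunking mirrors the "default" recommendation: it works on any input, and the mapping can be refined after evaluation.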

Chunk size sweet spot

Most production setups: 1500 tokens + 200 overlap.

Size | Pros | Cons
200-500 tokens | Precise retrieval | Loses cross-paragraph context
1000-2000 tokens | Balance | Typical best
4000+ tokens | Lots of context | Noisy retrieval, less precision
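
The sizes in this table are in tokens, while `chunk_by_char` counts characters. The sketch below shows the same sliding window applied to tokens. It is an illustration under assumptions: `chunk_by_tokens` is a hypothetical name, and the default `tokenize`/`detokenize` pair simply splits on whitespace, which only approximates real model tokens. In practice you would pass in the tokenizer that matches your embedding model.

```python
def chunk_by_tokens(text, chunk_size=1500, chunk_overlap=200,
                    tokenize=str.split, detokenize=" ".join):
    """Sliding window over tokens instead of characters.

    tokenize/detokenize are placeholders (assumption): whitespace words by
    default, to be swapped for the tokenizer of the embedding model.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    tokens = tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(detokenize(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks
```

With whitespace tokens, a chunk of 1500 "tokens" is usually longer than 1500 model tokens, so treat the default only as a rough stand-in.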

Eval chunks

Always verify that chunks make sense:

chunks = chunk_by_char(doc_text, 1500, 200)
for i, chunk in enumerate(chunks[:3]):
    print(f"=== Chunk {i} ({len(chunk)} chars) ===")
    print(chunk[:200] + "...")
    print()

Check:

  • Complete sentences (mostly)
  • Headers paired with their content
  • No random splits mid-word
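
Part of this inspection can be automated. The sketch below flags chunks that look suspicious, using rough heuristics: a lowercase first character suggests a cut mid-sentence or mid-word, a missing closing punctuation mark suggests a cut sentence, and a Markdown header on the last line means the header lost its content. `chunk_issues` is a hypothetical helper and the rules are assumptions that will misfire on some inputs (lists, code, titles), so it complements reading the chunks rather than replacing it.

```python
import re

def chunk_issues(chunk):
    """Return heuristic warnings for one chunk (rough checks only)."""
    stripped = chunk.strip()
    if not stripped:
        return ["empty chunk"]
    issues = []
    # A lowercase first letter usually means the cut landed inside a sentence or word.
    if stripped[0].islower():
        issues.append("starts mid-sentence or mid-word")
    # No closing punctuation at the end: the last sentence was probably cut.
    if stripped[-1] not in ".!?:\"')":
        issues.append("ends mid-sentence")
    # A Markdown header on the last line has been separated from its content.
    if re.match(r"#{1,6} ", stripped.splitlines()[-1]):
        issues.append("header without content")
    return issues

for text in ["The cat sat. It slept.", "ing was done. Then we"]:
    print(repr(text), chunk_issues(text))
```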

Metadata with chunks

Add metadata to improve retrieval quality:

chunks_with_meta = [{
    "content": chunk_text,
    "source": "doc1.md",
    "section": "introduction",
    "page": 5,
    "chunk_index": 0,
}]

Benefits:

  • Filter by source/section before searching
  • Cite the source in the response
  • Debug retrieval
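
To make the "filter before search" benefit concrete, here is a small sketch that wraps raw chunks in dicts shaped like `chunks_with_meta` and then filters them by metadata. `attach_metadata` and `filter_chunks` are hypothetical helpers written for this example. A real vector store would normally apply such filters itself, so this only illustrates the idea.

```python
def attach_metadata(chunks, source, section=None):
    """Wrap raw chunk strings in dicts shaped like chunks_with_meta."""
    return [
        {"content": text, "source": source, "section": section, "chunk_index": i}
        for i, text in enumerate(chunks)
    ]

def filter_chunks(chunks_with_meta, **criteria):
    """Keep only chunks whose metadata matches every given key=value pair."""
    return [
        c for c in chunks_with_meta
        if all(c.get(key) == value for key, value in criteria.items())
    ]

store = attach_metadata(["a", "b"], "doc1.md", "introduction")
store += attach_metadata(["c"], "doc2.md")
print([c["content"] for c in filter_chunks(store, source="doc1.md")])
```

Keeping `chunk_index` alongside `source` also makes it easy to trace a retrieved chunk back to its position in the original document when debugging.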

Anti-patterns

❌ No overlap

Information gets cut in the middle → the chunk is missing context.

Fix: Overlap of 10-20%.
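
A tiny demonstration of why the overlap matters. The sentence below is made up for this example, and `windows` is a hypothetical stand-in for a fixed-size chunker. Without overlap the fact "40 mg" is split across two chunks and appears in neither; with a 4-character overlap (about 18% of the 22-character chunk) one chunk contains it whole.

```python
def windows(text, size, overlap):
    """Fixed-size character windows; consecutive windows share `overlap` characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "The dosage limit is 40 mg per day for adults."  # made-up sample sentence
fact = "40 mg"

no_overlap = windows(text, size=22, overlap=0)
with_overlap = windows(text, size=22, overlap=4)

print(any(fact in w for w in no_overlap))    # False: the fact straddles a boundary
print(any(fact in w for w in with_overlap))  # True: the overlap keeps it in one chunk
```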

❌ Chunks too small (< 200 tokens)

Retrieval returns lots of tiny chunks with no coherent information.

Fix: 500+ tokens minimum.

❌ Chunks too large (> 4000 tokens)

Retrieval is less precise, and Claude has to process more tokens.

Fix: Keep chunks at 1500-2000 tokens.

❌ Not re-chunking when a doc is updated

Stale chunks → outdated information.

Fix: Re-chunk and re-embed on every update.
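
One possible way to know when a re-chunk is due is to store a content hash per document at index time and compare it on each sync. The sketch below assumes such hashes are kept somewhere next to the index. `content_hash` and `docs_to_rechunk` are hypothetical names, and the dict-based storage is only for illustration.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_rechunk(docs, indexed_hashes):
    """docs: {doc_id: current text}; indexed_hashes: {doc_id: hash stored at index time}.

    Returns the ids of documents that are new or whose text has changed.
    """
    return [
        doc_id for doc_id, text in docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
```

Only the documents returned here need to be re-chunked and re-embedded, which keeps updates cheap when most of the corpus is unchanged.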

❌ Size-based chunking of code, mid-function

A function gets cut in two → Claude cannot parse it.

Fix: Chunk code along function/class boundaries.
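
For Python sources, one way to respect those boundaries is to let the parser find them. The sketch below uses the standard `ast` module to emit one chunk per top-level function or class. It assumes Python 3.8+ (for `end_lineno`) and syntactically valid source. `chunk_python_source` is a hypothetical name, and top-level statements such as imports are skipped here for brevity. Other languages would need their own parser.

```python
import ast

def chunk_python_source(source):
    """One chunk per top-level function or class, using the Python parser."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based; end_lineno (Python 3.8+) is inclusive.
            start = node.lineno - 1
            # Decorators sit above the def/class line, so include them.
            if node.decorator_list:
                start = min(d.lineno for d in node.decorator_list) - 1
            chunks.append("\n".join(lines[start:node.end_lineno]))
    return chunks
```

A large class still becomes a single chunk with this approach, so very long classes may need a second pass that splits them per method.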

Apply it now

Exercise 1: Chunk a sample doc (20 minutes)

Take a Markdown doc (e.g., report.md). Try:

  • Size-based with chunk_size=1000
  • Structure-based by H2

Compare the chunks. Which preserves meaning better?

Exercise 2: Test with query (20 minutes)

For each strategy, check:

  • Chunk count
  • Median chunk size
  • Look at 3 random chunks: are they self-contained?
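
The measurements for the exercises can be collected with a small helper. This is a sketch, with `chunk_report` as a hypothetical name; it measures size in characters, and the fixed seed is an assumption made so that reruns show the same sample chunks.

```python
import random
import statistics

def chunk_report(chunks, sample_size=3, seed=0):
    """Summarize one strategy's output: count, median size, random samples."""
    rng = random.Random(seed)  # fixed seed so reruns pick the same samples
    return {
        "count": len(chunks),
        "median_chars": statistics.median(len(c) for c in chunks) if chunks else 0,
        "samples": rng.sample(chunks, min(sample_size, len(chunks))),
    }
```

Run it once per strategy on the same document and compare the reports side by side. Whether the sampled chunks are self-contained still has to be judged by reading them.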

Summary

🎯 4 strategies: size, structure, sentence, semantic.

🎯 Size + overlap is the default: it works anywhere.

🎯 Structure is best for Markdown/HTML.

🎯 Metadata with chunks: source, section, page for citation.

🎯 Chunk size 1500 + overlap 200 is the sweet spot for 80% of cases.
