The old way: PDF → OCR → text → send to Claude. Many steps, and quality is lost at each one.
- Send a PDF file to Claude via base64 or URL
- Claude extracts text, images, charts, and tables itself, with no OCR preprocessing needed
- Know the limits and when to send a PDF directly vs. chunk it manually
- Case studies: contract review, financial report analysis
How to send a PDF
Base64

    import base64

    import anthropic

    # Reads the API key from the ANTHROPIC_API_KEY environment variable.
    client = anthropic.Anthropic()

    # Read the PDF and encode it as a base64 string.
    with open("earth.pdf", "rb") as f:
        file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")

    messages = [{
        "role": "user",
        "content": [
            {
                "type": "document",  # ← not "image"
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": file_bytes
                }
            },
            {
                "type": "text",
                "text": "Summarize this document in 3 bullet points."
            }
        ]
    }]

    response = client.messages.create(
        model="claude-sonnet-5-20260205",
        max_tokens=2000,
        messages=messages
    )

URL

    {
        "type": "document",
        "source": {
            "type": "url",
            "url": "https://example.com/paper.pdf"
        }
    }

Compared with image blocks
The PDF block is special: Claude pages through the whole document rather than looking at a single picture.
| | Image block | Document (PDF) block |
|---|---|---|
| Type | "image" | "document" |
| Media type | image/png, image/jpeg | application/pdf |
| Content | Single image | Multi-page document |
| Extraction | Visual content | Text + images + tables + structure |
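
The table above can be turned into a small helper that picks the block type from the file name. This is a minimal sketch, not part of any SDK: `make_block` and `IMAGE_TYPES` are hypothetical names, and the set of image media types is an assumption to verify against the current API docs.

```python
import base64
import mimetypes

# Image media types this sketch accepts (an assumption; check the current docs).
IMAGE_TYPES = {"image/png", "image/jpeg", "image/gif", "image/webp"}


def make_block(path: str, data: bytes) -> dict:
    """Build an "image" or "document" content block from raw file bytes."""
    # Guess the media type from the file extension only.
    media_type, _ = mimetypes.guess_type(path)
    if media_type == "application/pdf":
        block_type = "document"
    elif media_type in IMAGE_TYPES:
        block_type = "image"
    else:
        raise ValueError(f"Unsupported media type for {path!r}: {media_type}")
    return {
        "type": block_type,
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.standard_b64encode(data).decode("utf-8"),
        },
    }
```

The returned dict goes into the `content` list of a user message, next to a text block carrying the question.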
Limits
- Max size: 32 MB per PDF (check the current docs)
- Max pages: 100 pages
- Token cost: depends on content; text is cheap, images are expensive
Rules of thumb (rough estimates; actual counts vary with content density)
| Pages | Estimated tokens |
|---|---|
| 1 text page | ~500 tokens |
| 10-page report | ~5,000 tokens |
| 100-page paper | ~50,000 tokens |
| Pages with heavy images | 2-3x |
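
A pre-flight check can catch oversized PDFs before a request is sent. The sketch below hard-codes the limits and the rule-of-thumb figures from this section; all function names are hypothetical, 32 MB is assumed to mean 32 × 1024 × 1024 bytes, and the page count is assumed to come from whatever PDF library you already use.

```python
MAX_BYTES = 32 * 1024 * 1024   # 32 MB limit from this section (assumed binary MB)
MAX_PAGES = 100                # page limit from this section
TOKENS_PER_TEXT_PAGE = 500     # rule-of-thumb estimate, not a measured value


def check_limits(size_bytes: int, pages: int) -> list[str]:
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    if pages > MAX_PAGES:
        problems.append(f"file has {pages} pages, limit is {MAX_PAGES}")
    return problems


def estimate_tokens(text_pages: int, image_heavy_pages: int = 0) -> tuple[int, int]:
    """Return a (low, high) token estimate; image-heavy pages count 2-3x."""
    base = text_pages * TOKENS_PER_TEXT_PAGE
    low = base + image_heavy_pages * TOKENS_PER_TEXT_PAGE * 2
    high = base + image_heavy_pages * TOKENS_PER_TEXT_PAGE * 3
    return low, high


def page_ranges(total_pages: int, max_pages: int = MAX_PAGES) -> list[tuple[int, int]]:
    """Split 1-based page numbers into inclusive ranges of at most max_pages."""
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]
```

If `check_limits` reports a page-count problem, `page_ranges` gives the inclusive ranges to split the file into before sending each part separately.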
What Claude can extract
Text
Full text, with structure preserved (headers, lists, paragraphs).
Tables
Data plus relationships, for example:

    Q1 revenue: $100M (+15% YoY)
    Q2 revenue: $120M (+20% YoY)

Charts / graphs
Interprets the visual. For example, given a bar chart, Claude describes the trend and the values.
Images / diagrams
Describes pictures embedded in the PDF.
Footnotes, citations
Preserves cross-references.
Example: Financial report analysis
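Claude reads a 10-K and extracts structured data. A full 10-K can run to 300 pages, which exceeds the 100-page limit, so split it or send only the relevant sections.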

    import base64

    with open("apple_10k_2025.pdf", "rb") as f:
        pdf_bytes = base64.standard_b64encode(f.read()).decode("utf-8")

    messages = [{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_bytes
                }
            },
            {
                "type": "text",
                "text": """Analyze this 10-K report. Extract:
    1. Total revenue (current year + YoY growth)
    2. Top 3 business segments by revenue
    3. Key risks mentioned (3 most important)
    4. Cash position
    5. R&D spending
    Output as JSON."""
            }
        ]
    }]

    # chat() is not defined in this lesson; it is assumed to be a helper
    # that wraps client.messages.create.
    response = chat(messages)

Example: Contract review
Paralegals can save 4-8 hours per contract.

    prompt = """Review this contract. For each clause, rate:
    - Favorable / Neutral / Unfavorable (for our side as licensee)
    - Suggest redline if needed
    Focus on:
    1. Payment terms
    2. Termination
    3. IP ownership
    4. Liability limits
    5. Confidentiality
    """

Example: Research paper summarization
An academic tool: process the 80 papers of a literature review in about 4 hours.

    prompt = """Read this research paper. Output:
    1. One-sentence tl;dr
    2. Key contributions (3 bullets)
    3. Methodology summary (4-6 sentences)
    4. Limitations mentioned
    5. Open questions for future research
    Format: Markdown."""

Multi-PDF analysis
Powerful for literature reviews and contract comparison.

    # pdf1, pdf2, pdf3 are base64-encoded PDF strings.
    content = [
        {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": pdf1}},
        {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": pdf2}},
        {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": pdf3}},
        {"type": "text", "text": "Compare methodologies across 3 papers. Highlight contradictions."}
    ]

When to send the PDF directly vs. chunk it
PDF direct (Claude reads the whole thing):
- ✅ Small to medium PDFs (up to 100 pages)
- ✅ Need understanding of structure
- ✅ Charts/images matter
Manual chunking (RAG, Module 7):
- ✅ Large corpus (1000+ docs)
- ✅ Vector search across documents
- ✅ User queries need retrieval
Hybrid: Search (RAG) → identify the top 3 docs → send each as a PDF to Claude for deep analysis.
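
The hybrid flow can be sketched as two steps: rank documents by a retrieval score, then build one request body per top document. This is an illustrative sketch only; `top_k_docs`, `document_block`, and `build_requests` are hypothetical names, and the scores are assumed to come from whatever search step you already have.

```python
def top_k_docs(scores: dict[str, float], k: int = 3) -> list[str]:
    """Return the names of the k highest-scoring documents."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def document_block(pdf_b64: str) -> dict:
    """Wrap a base64-encoded PDF string in a document content block."""
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": pdf_b64,
        },
    }


def build_requests(
    pdfs: dict[str, str],      # file name -> base64-encoded PDF
    scores: dict[str, float],  # file name -> retrieval score (assumed given)
    question: str,
    k: int = 3,
) -> list[tuple[str, list[dict]]]:
    """Build (file name, content list) pairs, one per top-k document."""
    return [
        (name, [document_block(pdfs[name]), {"type": "text", "text": question}])
        for name in top_k_docs(scores, k)
    ]
```

Each content list then becomes the `content` of a separate user message, so every document gets its own deep-analysis call.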
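Prompt caching for PDFs
A PDF can run to ~50K tokens. When you ask several questions about the same PDF, each cache read costs about 90% less than resending those tokens.
A second call with the same PDF and a different question reads from the cache. Details are in lessons 6.47-6.49.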
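
To see what the saving looks like across a whole session, here is a small worked calculation. It is a sketch under stated assumptions: costs are in units of one base input token, a cache write is assumed to cost 1.25x and a cache read 0.1x the base price (check current pricing), every call is assumed to land within the cache lifetime, and the question and output tokens are ignored.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # assumed surcharge on the first (cache-writing) call
CACHE_READ_MULTIPLIER = 0.10   # assumed price of a cache read (the 90% saving)


def pdf_input_cost(pdf_tokens: int, queries: int, cached: bool) -> float:
    """Cost of the PDF's input tokens, in units of one base input token."""
    if queries <= 0:
        return 0.0
    if not cached:
        # Every call pays full price for the whole PDF.
        return float(pdf_tokens * queries)
    first_call = pdf_tokens * CACHE_WRITE_MULTIPLIER
    later_calls = (queries - 1) * pdf_tokens * CACHE_READ_MULTIPLIER
    return first_call + later_calls


def saving_fraction(pdf_tokens: int, queries: int) -> float:
    """Fraction of PDF input cost saved by caching, compared with no cache."""
    uncached = pdf_input_cost(pdf_tokens, queries, cached=False)
    if uncached == 0:
        return 0.0
    return 1 - pdf_input_cost(pdf_tokens, queries, cached=True) / uncached
```

Under these assumptions, a 50K-token PDF with 10 questions costs 107,500 units with caching versus 500,000 without, a saving of roughly 78% overall. A single question costs more with caching than without, because of the write surcharge.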

    messages = [{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "base64", ...},
                "cache_control": {"type": "ephemeral"}  # ← cache the PDF
            },
            {"type": "text", "text": "First question..."}
        ]
    }]

Anti-patterns
❌ Sending a poor-quality scanned PDF
The pages are low-resolution images, so Claude may hallucinate text.
Fix: Improve the scan quality, or preprocess with a good OCR tool.
❌ PDF over 100 pages
Exceeds the limit → error.
Fix: Split the PDF into chunks, or use a RAG approach.
❌ Not caching the PDF for repeated queries
Same PDF, 10 questions → 10x the cost.
Fix: Enable prompt caching.
❌ Vague prompt for a dense PDF
Asking Claude to "Summarize" a long, dense PDF → a generic summary.
Fix: Use a specific prompt: "Extract these 5 fields...", "Compare X vs Y...".
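
One way to keep prompts specific is to generate them from an explicit field list and parse the reply as JSON. The helpers below are a minimal sketch with hypothetical names (`build_extraction_prompt`, `parse_json_reply`); the parser assumes the reply contains exactly one top-level JSON object, possibly surrounded by extra text.

```python
import json


def build_extraction_prompt(fields: list[str]) -> str:
    """Turn a list of field names into a numbered, specific extraction prompt."""
    numbered = "\n".join(f"{i}. {name}" for i, name in enumerate(fields, start=1))
    return (
        "Extract the following fields from this document:\n"
        f"{numbered}\n"
        "Output a single JSON object with exactly these keys. "
        "Use null for any field the document does not state."
    )


def parse_json_reply(reply: str) -> dict:
    """Parse the JSON object in a reply, ignoring any text around it."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in reply")
    return json.loads(reply[start:end + 1])
```

The prompt string goes in the text block after the document block; `parse_json_reply` is then applied to the text of the model's response.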
Apply it now
Exercise 1: Analyze one PDF (20 minutes)
Download a research paper PDF (arxiv.org). Send it to Claude and ask for:
- A one-sentence tl;dr
- 3 key contributions
- Methodology
- Limitations
Compare the output with your own manual reading.
Exercise 2: Multi-PDF comparison (20 minutes)
Take 2 contract templates (free templates are available online). Send both and ask Claude to "find differences".
Observe how Claude analyzes them side by side.
Summary
🎯 PDFs use block type "document"; Claude reads them as a first-class input.
🎯 Max 100 pages, 32 MB. Check the current limits.
🎯 Extracts everything: text, tables, charts, images, structure.
🎯 Use for: contracts, financial reports, research papers, deep document analysis.
🎯 Use RAG (Module 7) when scaling to 1000+ docs; send the PDF directly for focused work on up to 100 pages.