PDF support: Reading documents straight from PDF (Building with the Claude API)

The old way: PDF → OCR → text → send to Claude. Many steps, and quality is lost along the way.

What you will learn

Send a PDF file to Claude via base64 or URL
Claude extracts text, images, charts, and tables itself, with no OCR preprocessing needed
Know the limits and when to send a PDF directly vs. chunking manually
Case studies: contract review, financial report analysis

How to send a PDF

Base64

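import base64

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Read the PDF and encode it as a base64 string
with open("earth.pdf", "rb") as f:
    file_bytes = base64.standard_b64encode(f.read()).decode("utf-8")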

messages = [{
    "role": "user",
    "content": [
        {
            "type": "document",  # not "image"
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": file_bytes
            }
        },
        {
            "type": "text",
            "text": "Summarize this document in 3 bullet points."
        }
    ]
}]

response = client.messages.create(
    model="claude-sonnet-5-20260205",
    max_tokens=2000,
    messages=messages
)

URL

{
    "type": "document",
    "source": {
        "type": "url",
        "url": "https://example.com/paper.pdf"
    }
}

Compared with images

The PDF block is special: Claude pages through the whole document.

	Image block	Document (PDF) block
Type	"image"	"document"
Media type	image/png, image/jpeg	application/pdf
Content	Single image	Multi-page document
Extraction	Visual content	Text + images + tables + structure

Limits

Max size: 32 MB per PDF (check the current docs)
Max pages: 100 pages
Token cost: depends on content; text is cheap, images are expensive

Rule of thumb

Pages	Estimated tokens
1 text page	~500 tokens
10-page report	~5,000 tokens
100-page paper	~50,000 tokens
Pages with heavy images	2-3x the text-only estimate
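
The limits and rule-of-thumb figures above can be turned into a quick pre-flight check. This is a rough sketch that uses only this lesson's numbers (32 MB, 100 pages, ~500 tokens per text page, 2-3x for image-heavy pages). The function names are hypothetical, and the real token count depends on the document, so treat the result as an estimate.

```python
MAX_BYTES = 32 * 1024 * 1024   # 32 MB size limit from this lesson
MAX_PAGES = 100                # page limit from this lesson
TOKENS_PER_TEXT_PAGE = 500     # rule-of-thumb figure, not an exact rate


def estimate_tokens(pages: int, image_heavy: bool = False) -> tuple[int, int]:
    """Return a (low, high) token estimate for a PDF with `pages` pages."""
    base = pages * TOKENS_PER_TEXT_PAGE
    if image_heavy:
        return base * 2, base * 3   # the 2-3x multiplier for heavy images
    return base, base


def within_limits(size_bytes: int, pages: int) -> bool:
    """True if the PDF fits both the size limit and the page limit."""
    return size_bytes <= MAX_BYTES and pages <= MAX_PAGES
```

For example, `estimate_tokens(10)` gives `(5000, 5000)`, matching the 10-page row of the table, and `estimate_tokens(100, image_heavy=True)` gives `(100000, 150000)`.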

What Claude can extract

Text

Full text, with structure preserved (headers, lists, paragraphs).

Tables

Data plus the relationships between values, for example:

Q1 revenue: $100M (+15% YoY)
Q2 revenue: $120M (+20% YoY)

Charts / graphs

Interprets the visual. For example, given a bar chart, Claude describes the trend and the values.

Images / diagrams

Describes pictures embedded in the PDF.

Footnotes, citations

Preserves cross-references.

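Example: Financial report analysis

Claude reads a 10-K filing and extracts structured data. A full 10-K can run to about 300 pages, which is above the 100-page limit, so send only the sections you need or split the file first.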

with open("apple_10k_2025.pdf", "rb") as f:
    pdf_bytes = base64.standard_b64encode(f.read()).decode("utf-8")

messages = [{
    "role": "user",
    "content": [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": pdf_bytes
            }
        },
        {
            "type": "text",
            "text": """Analyze this 10-K report. Extract:

1. Total revenue (current year + YoY growth)
2. Top 3 business segments by revenue
3. Key risks mentioned (3 most important)
4. Cash position
5. R&D spending

Output as JSON."""
        }
    ]
}]

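# Same call as in the Base64 example; `client` is an anthropic.Anthropic() instance
response = client.messages.create(
    model="claude-sonnet-5-20260205",
    max_tokens=2000,
    messages=messages
)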
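
Because the prompt asks for JSON, the reply text still has to be parsed before the extracted fields can be used. The sketch below is one possible way to do that. `parse_json_reply` is a hypothetical helper, and it assumes the reply is either bare JSON or JSON wrapped in a Markdown code fence.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence marker


def parse_json_reply(text: str):
    """Parse JSON from a model reply, with or without a surrounding code fence."""
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())  # raises json.JSONDecodeError on invalid JSON


# Typical use with the SDK response object:
# data = parse_json_reply(response.content[0].text)
```

If parsing fails, catching `json.JSONDecodeError` and retrying with a stricter prompt is usually safer than trying to repair the text by hand.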

Example: Contract review

Paralegals can save 4-8 hours per contract.

prompt = """Review this contract. For each clause, rate:
- Favorable / Neutral / Unfavorable (for our side as licensee)
- Suggest redline if needed

Focus on:
1. Payment terms
2. Termination
3. IP ownership
4. Liability limits
5. Confidentiality
"""

Example: Research paper summarization

An academic tool: process the 80 papers of a literature review in 4 hours.

prompt = """Read this research paper. Output:

1. One-sentence tl;dr
2. Key contributions (3 bullets)
3. Methodology summary (4-6 sentences)
4. Limitations mentioned
5. Open questions for future research

Format: Markdown."""

Multi-PDF analysis

Powerful for literature reviews and contract comparison.

content = [
    {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": pdf1}},
    {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": pdf2}},
    {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": pdf3}},
    {"type": "text", "text": "Compare methodologies across 3 papers. Highlight contradictions."}
]

When to send the PDF directly vs. chunking

PDF direct (Claude reads the whole thing):

✅ Small-medium PDFs (<100 pages)
✅ Need understanding of structure
✅ Charts/images matter

Manual chunking (RAG, Module 7):

✅ Large corpus (1000+ docs)
✅ Vector search across documents
✅ User queries need retrieval

Hybrid: Search (RAG) → identify top 3 docs → send each as PDF to Claude for deep analysis.

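Prompt caching for PDFs

A PDF can be around 50K tokens. When you ask several questions about the same PDF, each cache read costs about 90% less than sending those tokens again.

A second call with the same PDF and a different question reads from the cache. Details are in lessons 6.47-6.49.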

messages = [{
    "role": "user",
    "content": [
        {
            "type": "document",
            "source": {"type": "base64", ...},
            "cache_control": {"type": "ephemeral"}  # cache the PDF
        },
        {"type": "text", "text": "First question..."}
    ]
}]
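
To see what caching is worth over a whole session, here is a worked estimate for 10 questions on a 50K-token PDF. The two multipliers are assumptions to check against current pricing: a cache write billed at 1.25x the normal input price, and a cache read at 0.10x. The sketch also assumes every call arrives before the cache entry expires.

```python
PDF_TOKENS = 50_000   # the ~50K-token PDF from this lesson
QUESTIONS = 10

# Assumed multipliers relative to the normal input-token price
CACHE_WRITE = 1.25    # the first call writes the cache
CACHE_READ = 0.10     # later calls read from it


def pdf_input_cost(questions: int, cached: bool) -> float:
    """Cost of the PDF tokens, in units of one uncached input token."""
    if not cached:
        return float(questions * PDF_TOKENS)
    first_call = PDF_TOKENS * CACHE_WRITE
    later_calls = (questions - 1) * PDF_TOKENS * CACHE_READ
    return first_call + later_calls


without_cache = pdf_input_cost(QUESTIONS, cached=False)  # 10 full reads
with_cache = pdf_input_cost(QUESTIONS, cached=True)      # 1 write + 9 cheap reads
saving = 1 - with_cache / without_cache                  # fraction saved overall
```

Under these assumptions the overall saving across the 10 calls is a little under 80%, because the first call pays the write premium. The 90% figure applies to each individual cache read.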

Anti-patterns

❌ Sending a poor-quality scanned PDF

The PDF pages are low-resolution images, so Claude may hallucinate text.

Fix: Improve the scan quality, or preprocess with a good OCR tool.

❌ PDF over 100 pages

The PDF exceeds the limit, so the request returns an error.

Fix: Split the PDF into chunks, or use a RAG approach.
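
The split mentioned in the fix starts with deciding which pages go into each chunk. The sketch below only computes the page ranges. `page_ranges` is a hypothetical helper, and writing the chunks out as separate files would need a PDF library, which is not shown here.

```python
def page_ranges(total_pages: int, max_pages: int = 100) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into (first, last) ranges of at most max_pages."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]
```

For a 250-page file, `page_ranges(250)` returns `[(1, 100), (101, 200), (201, 250)]`. Keep in mind that a question spanning two chunks cannot be answered from either chunk alone, which is one reason to consider RAG instead.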

❌ Not caching the PDF for repeated queries

Same PDF, 10 questions: you pay the full PDF input cost 10 times.

Fix: Enable prompt caching.

❌ A vague prompt for a dense PDF

Asking Claude to "Summarize" a 300-page PDF produces a generic summary.

Fix: Use a specific prompt: "Extract these 5 fields...", "Compare X vs Y...".

Apply it now

Exercise 1: Analyze one PDF (20 minutes)

Download a research paper PDF (arxiv.org). Send it to Claude and ask for:

A one-sentence tl;dr
3 key contributions
Methodology
Limitations

Compare the output with what you get from reading the paper yourself.

Exercise 2: Multi-PDF compare (20 minutes)

Take 2 contract templates (free online templates). Send both and ask Claude to "find differences".

Observe how Claude analyzes them side by side.

Summary

🎯 PDFs use the block type "document"; Claude reads them as a first-class input.

🎯 Max 100 pages, 32 MB. Check the current limits.

🎯 Claude extracts everything: text, tables, charts, images, structure.

🎯 Use for: contracts, financial reports, research papers, deep document analysis.

🎯 Use RAG (Module 7) when scaling to 1000+ docs. Send the PDF directly for focused documents under 100 pages.