Ở bài 6.4, bạn đã học: mỗi request Claude làm 4 giai đoạn — Tokenize → Embed → Contextualize → Generate.
- Hiểu cơ chế prompt caching: Claude store preprocessing work
- Tính ROI: 90% cost saving cho cached portion
- Biết khi nào caching work (repeated content, 1 giờ window)
- Không nhầm caching với "Claude nhớ conversation"
Cơ chế
Request 1 (no cache):
┌────────────────────────────┐
│ Tokenize (work) │
│ Embed (work) │
│ Contextualize (work) │
│ Generate (output) │
└────────────────────────────┘
After: all work DISCARDED
Request 1 (with cache):
┌────────────────────────────┐
│ Tokenize (work) ─┐ │
│ Embed (work) ─┼─▶ CACHE
│ Contextualize (work) ─┘ │
│ Generate (output) │
└────────────────────────────┘
Request 2 (cache hit):
┌────────────────────────────┐
│ LOAD from cache ──────▶ │
│ │
│ Generate (output) │
└────────────────────────────┘
80-90% faster + cheaperROI
Pricing
Cache write: 25% more expensive than regular (one-time cost). Cache read: 90% cheaper.
Break-even
Pay 25% more on write → save 90% on reads. Break-even after ~1.5 reads.
Reads > 2 → net savings.
Example
Prompt: 10K input tokens (system + context).
No caching:
With caching:
Trên scale, 20x save hàng tháng.
- 10 requests × 10K × $3/M = $0.30
- 1 cache write: 10K × $3.75/M = $0.0375
- 9 cache reads: 9 × 10K × $0.30/M = $0.027
- Total: $0.065 → 78% saving
| Regular | Cache write | Cache read | |
|---|---|---|---|
| Cost multiplier | 1x | 1.25x | 0.1x |
1-giờ cache window
Cache TTL = 1 hour. Sau 1 giờ, cache expire.
Hệ quả:
Minimum traffic để worth: ~ 1 request / 30 phút.
- Works cho high-frequency apps (traffic đều đặn)
- Không dùng cho low-traffic (cache expire trước request tiếp theo)
Khi nào work?
✅ Great fit
⚠️ Marginal
❌ Không work
- Long system prompt (1K+ tokens) reused across requests
- Large document (50K token PDF) với multiple questions
- Tool schemas stable qua requests (5-10 tools)
- Few-shot examples reused trong eval
- Long conversation với consistent context
- System prompt < 500 tokens (savings nhỏ)
- Low traffic (< 1 request/30 phút)
- Every request has unique long content
- Content changes mỗi request
- Single request (no follow-ups)
Caching != memory
Caching không có nghĩa là Claude nhớ conversation.
Cache invisible to Claude — model không biết nó đang reuse cache.
Multi-turn conversation: bạn vẫn gửi toàn bộ history mỗi request. Cache just makes it cheaper.
- Cache = preprocessing reuse, performance optimization
- Memory = remember facts across sessions, feature khác
Usage metrics
Response có usage fields:
Second request cùng prompt:
response.usage:
input_tokens: 50 # Non-cached input
output_tokens: 200
cache_creation_input_tokens: 5000 # Cache write (first time)
cache_read_input_tokens: 0Usage metrics (tiếp)
Monitor these để verify caching work.
input_tokens: 50
cache_creation_input_tokens: 0
cache_read_input_tokens: 5000 # ← Cache hit! 90% discountMinimum cache size
Content must be ≥ 1024 tokens để cache. Below threshold ignored.
Typical:
System prompt ngắn không cache được.
- 1024 token ≈ 700-800 words
- 2-3 page PDF
- 8-10 full paragraphs
Case study: Support chatbot
Scenario
Support chatbot:
Without caching
100 requests/hour × (3000 + 2000 + 100) × $3/M = $1.53/hour input
With caching
System + tools = 5000 tokens cache.
Total input: $0.198/hour → 87% saving.
Yearly
Without: $1.53 × 24 × 365 = $13,403 With: $0.198 × 24 × 365 = $1,735
Save ~$11,600/year trên 1 chatbot.
- System prompt: 3K tokens (policy docs embedded)
- Tool schemas: 2K tokens
- User query: 100 tokens
- Answer: 500 tokens
- 1 write: 5000 × $3.75/M = $0.019
- 99 reads: 99 × 5000 × $0.30/M = $0.149
- Plus 100 × 100 (non-cached user query) × $3/M = $0.030
Anti-patterns
❌ Cache mọi thứ
Content dynamic mỗi request → cache miss every time → lại tốn 25% write cost.
Fix: Only cache truly static content.
❌ Cache content < 1024 tokens
Ignored by API. False assumption savings.
Fix: Cache combined block reaching 1024+.
❌ Cache short-lived
1 request / 2 hours → cache expire. Each request = cache write.
Fix: Check traffic pattern before enable caching.
❌ Ignore metrics
Enable caching, không check cache_read_input_tokens. Don't know if working.
Fix: Log usage mỗi request. Dashboard cache hit rate.
Áp dụng ngay
Bài tập 1: Calculate ROI cho app của bạn (15 phút)
Bài tập 2: Identify cacheable content (10 phút)
List components of your request:
Mark cacheable (static ≥ 1024 tokens).
- System prompt: static/dynamic?
- Tool schemas: static?
- RAG context: changes per user?
- User query: always dynamic
- Typical prompt token count (system + context): ___
- Requests/hour: ___
- Current cost/hour: ___
With caching:
- Cache write cost: ___ × $3.75/M
- Cache read cost: (N-1) × ___ × $0.30/M
- Total with cache: ___
Saving: ___%
Monthly savings: $___Tóm tắt
🎯 Caching = preprocessing reuse, 90% cost discount on reads.
🎯 Pay 25% more on write, save 90% on reads. Break-even 1.5 reads.
🎯 1-hour window. High-frequency apps benefit most.
🎯 Min 1024 tokens. Short prompts không cache.
🎯 Metrics: cache_read_input_tokens verify caching work.