Prompt caching — Khái niệm & ROI — Building with the Claude API

Ở bài 6.4, bạn đã học: mỗi request Claude làm 4 giai đoạn — Tokenize → Embed → Contextualize → Generate.

Bạn sẽ học được

Hiểu cơ chế prompt caching: Claude store preprocessing work
Tính ROI: 90% cost saving cho cached portion
Biết khi nào caching work (repeated content, 1 giờ window)
Không nhầm caching với "Claude nhớ conversation"

Cơ chế

Request 1 (no cache):
  ┌────────────────────────────┐
  │ Tokenize       (work)      │
  │ Embed          (work)      │
  │ Contextualize  (work)      │
  │ Generate       (output)    │
  └────────────────────────────┘
     After: all work DISCARDED

Request 1 (with cache):
  ┌────────────────────────────┐
  │ Tokenize       (work) ─┐   │
  │ Embed          (work) ─┼─▶ CACHE
  │ Contextualize  (work) ─┘   │
  │ Generate       (output)    │
  └────────────────────────────┘

Request 2 (cache hit):
  ┌────────────────────────────┐
  │ LOAD from cache ──────▶    │
  │                             │
  │ Generate       (output)    │
  └────────────────────────────┘
     80-90% faster + cheaper

ROI

Pricing

Cache write: 25% more expensive than regular (one-time cost). Cache read: 90% cheaper.

Break-even

Pay 25% more on write → save 90% on reads. Break-even after ~1.5 reads.

Reads > 2 → net savings.

Example

Prompt: 10K input tokens (system + context).

No caching:

With caching:

Trên scale, 20x save hàng tháng.

10 requests × 10K × $3/M = $0.30
1 cache write: 10K × $3.75/M = $0.0375
9 cache reads: 9 × 10K × $0.30/M = $0.027
Total: $0.065 → 78% saving

	Regular	Cache write	Cache read
Cost multiplier	1x	1.25x	0.1x

1-giờ cache window

Cache TTL = 1 hour. Sau 1 giờ, cache expire.

Hệ quả:

Minimum traffic để worth: ~ 1 request / 30 phút.

Works cho high-frequency apps (traffic đều đặn)
Không dùng cho low-traffic (cache expire trước request tiếp theo)

Khi nào work?

✅ Great fit

⚠️ Marginal

❌ Không work

Long system prompt (1K+ tokens) reused across requests
Large document (50K token PDF) với multiple questions
Tool schemas stable qua requests (5-10 tools)
Few-shot examples reused trong eval
Long conversation với consistent context
System prompt < 500 tokens (savings nhỏ)
Low traffic (< 1 request/30 phút)
Every request has unique long content
Content changes mỗi request
Single request (no follow-ups)

Caching != memory

Caching không có nghĩa là Claude nhớ conversation.

Cache invisible to Claude — model không biết nó đang reuse cache.

Multi-turn conversation: bạn vẫn gửi toàn bộ history mỗi request. Cache just makes it cheaper.

Cache = preprocessing reuse, performance optimization
Memory = remember facts across sessions, feature khác

Usage metrics

Response có usage fields:

Second request cùng prompt:

response.usage:
  input_tokens: 50          # Non-cached input
  output_tokens: 200
  cache_creation_input_tokens: 5000  # Cache write (first time)
  cache_read_input_tokens: 0

Usage metrics (tiếp)

Monitor these để verify caching work.

  input_tokens: 50
  cache_creation_input_tokens: 0
  cache_read_input_tokens: 5000  # ← Cache hit! 90% discount

Minimum cache size

Content must be ≥ 1024 tokens để cache. Below threshold ignored.

Typical:

System prompt ngắn không cache được.

1024 token ≈ 700-800 words
2-3 page PDF
8-10 full paragraphs

Case study: Support chatbot

Scenario

Support chatbot:

Without caching

100 requests/hour × (3000 + 2000 + 100) × $3/M = $1.53/hour input

With caching

System + tools = 5000 tokens cache.

Total input: $0.198/hour → 87% saving.

Yearly

Without: $1.53 × 24 × 365 = $13,403 With: $0.198 × 24 × 365 = $1,735

Save ~$11,600/year trên 1 chatbot.

System prompt: 3K tokens (policy docs embedded)
Tool schemas: 2K tokens
User query: 100 tokens
Answer: 500 tokens
1 write: 5000 × $3.75/M = $0.019
99 reads: 99 × 5000 × $0.30/M = $0.149
Plus 100 × 100 (non-cached user query) × $3/M = $0.030

Anti-patterns

❌ Cache mọi thứ

Content dynamic mỗi request → cache miss every time → lại tốn 25% write cost.

Fix: Only cache truly static content.

❌ Cache content < 1024 tokens

Ignored by API. False assumption savings.

Fix: Cache combined block reaching 1024+.

❌ Cache short-lived

1 request / 2 hours → cache expire. Each request = cache write.

Fix: Check traffic pattern before enable caching.

❌ Ignore metrics

Enable caching, không check cache_read_input_tokens. Don't know if working.

Fix: Log usage mỗi request. Dashboard cache hit rate.

Áp dụng ngay

Bài tập 1: Calculate ROI cho app của bạn (15 phút)

Bài tập 2: Identify cacheable content (10 phút)

List components of your request:

Mark cacheable (static ≥ 1024 tokens).

System prompt: static/dynamic?
Tool schemas: static?
RAG context: changes per user?
User query: always dynamic

- Typical prompt token count (system + context): ___
- Requests/hour: ___
- Current cost/hour: ___

With caching:
- Cache write cost: ___ × $3.75/M
- Cache read cost: (N-1) × ___ × $0.30/M
- Total with cache: ___

Saving: ___% 
Monthly savings: $___

Tóm tắt

🎯 Caching = preprocessing reuse, 90% cost discount on reads.

🎯 Pay 25% more on write, save 90% on reads. Break-even 1.5 reads.

🎯 1-hour window. High-frequency apps benefit most.

🎯 Min 1024 tokens. Short prompts không cache.

🎯 Metrics: cache_read_input_tokens verify caching work.

Nội dung này có hữu ích không?