Prompt caching — Khái niệm & ROI

6 — Tính năng nâng caoTrung cấp20 phút

Ở bài 6.4, bạn đã học: mỗi request Claude làm 4 giai đoạn — Tokenize → Embed → Contextualize → Generate.

Bạn sẽ học được
  • Hiểu cơ chế prompt caching: Claude store preprocessing work
  • Tính ROI: 90% cost saving cho cached portion
  • Biết khi nào caching work (repeated content, 1 giờ window)
  • Không nhầm caching với "Claude nhớ conversation"

Cơ chế

Request 1 (no cache):
  ┌────────────────────────────┐
  │ Tokenize       (work)      │
  │ Embed          (work)      │
  │ Contextualize  (work)      │
  │ Generate       (output)    │
  └────────────────────────────┘
     After: all work DISCARDED

Request 1 (with cache):
  ┌────────────────────────────┐
  │ Tokenize       (work) ─┐   │
  │ Embed          (work) ─┼─▶ CACHE
  │ Contextualize  (work) ─┘   │
  │ Generate       (output)    │
  └────────────────────────────┘

Request 2 (cache hit):
  ┌────────────────────────────┐
  │ LOAD from cache ──────▶    │
  │                             │
  │ Generate       (output)    │
  └────────────────────────────┘
     80-90% faster + cheaper

ROI

Pricing

Cache write: 25% more expensive than regular (one-time cost). Cache read: 90% cheaper.

Break-even

Pay 25% more on write → save 90% on reads. Break-even after ~1.5 reads.

Reads > 2 → net savings.

Example

Prompt: 10K input tokens (system + context).

No caching:

With caching:

Trên scale, 20x save hàng tháng.

  • 10 requests × 10K × $3/M = $0.30
  • 1 cache write: 10K × $3.75/M = $0.0375
  • 9 cache reads: 9 × 10K × $0.30/M = $0.027
  • Total: $0.065 → 78% saving
RegularCache writeCache read
Cost multiplier1x1.25x0.1x

1-giờ cache window

Cache TTL = 1 hour. Sau 1 giờ, cache expire.

Hệ quả:

Minimum traffic để worth: ~ 1 request / 30 phút.

  • Works cho high-frequency apps (traffic đều đặn)
  • Không dùng cho low-traffic (cache expire trước request tiếp theo)

Khi nào work?

✅ Great fit

⚠️ Marginal

❌ Không work

  • Long system prompt (1K+ tokens) reused across requests
  • Large document (50K token PDF) với multiple questions
  • Tool schemas stable qua requests (5-10 tools)
  • Few-shot examples reused trong eval
  • Long conversation với consistent context
  • System prompt < 500 tokens (savings nhỏ)
  • Low traffic (< 1 request/30 phút)
  • Every request has unique long content
  • Content changes mỗi request
  • Single request (no follow-ups)

Caching != memory

Caching không có nghĩa là Claude nhớ conversation.

Cache invisible to Claude — model không biết nó đang reuse cache.

Multi-turn conversation: bạn vẫn gửi toàn bộ history mỗi request. Cache just makes it cheaper.

  • Cache = preprocessing reuse, performance optimization
  • Memory = remember facts across sessions, feature khác

Usage metrics

Response có usage fields:

Second request cùng prompt:

response.usage:
  input_tokens: 50          # Non-cached input
  output_tokens: 200
  cache_creation_input_tokens: 5000  # Cache write (first time)
  cache_read_input_tokens: 0

Usage metrics (tiếp)

Monitor these để verify caching work.

  input_tokens: 50
  cache_creation_input_tokens: 0
  cache_read_input_tokens: 5000  # ← Cache hit! 90% discount

Minimum cache size

Content must be ≥ 1024 tokens để cache. Below threshold ignored.

Typical:

System prompt ngắn không cache được.

  • 1024 token ≈ 700-800 words
  • 2-3 page PDF
  • 8-10 full paragraphs

Case study: Support chatbot

Scenario

Support chatbot:

Without caching

100 requests/hour × (3000 + 2000 + 100) × $3/M = $1.53/hour input

With caching

System + tools = 5000 tokens cache.

Total input: $0.198/hour → 87% saving.

Yearly

Without: $1.53 × 24 × 365 = $13,403 With: $0.198 × 24 × 365 = $1,735

Save ~$11,600/year trên 1 chatbot.

  • System prompt: 3K tokens (policy docs embedded)
  • Tool schemas: 2K tokens
  • User query: 100 tokens
  • Answer: 500 tokens
  • 1 write: 5000 × $3.75/M = $0.019
  • 99 reads: 99 × 5000 × $0.30/M = $0.149
  • Plus 100 × 100 (non-cached user query) × $3/M = $0.030

Anti-patterns

❌ Cache mọi thứ

Content dynamic mỗi request → cache miss every time → lại tốn 25% write cost.

Fix: Only cache truly static content.

❌ Cache content < 1024 tokens

Ignored by API. False assumption savings.

Fix: Cache combined block reaching 1024+.

❌ Cache short-lived

1 request / 2 hours → cache expire. Each request = cache write.

Fix: Check traffic pattern before enable caching.

❌ Ignore metrics

Enable caching, không check cache_read_input_tokens. Don't know if working.

Fix: Log usage mỗi request. Dashboard cache hit rate.

Áp dụng ngay

Bài tập 1: Calculate ROI cho app của bạn (15 phút)

Bài tập 2: Identify cacheable content (10 phút)

List components of your request:

Mark cacheable (static ≥ 1024 tokens).

  • System prompt: static/dynamic?
  • Tool schemas: static?
  • RAG context: changes per user?
  • User query: always dynamic
- Typical prompt token count (system + context): ___
- Requests/hour: ___
- Current cost/hour: ___

With caching:
- Cache write cost: ___ × $3.75/M
- Cache read cost: (N-1) × ___ × $0.30/M
- Total with cache: ___

Saving: ___% 
Monthly savings: $___

Tóm tắt

🎯 Caching = preprocessing reuse, 90% cost discount on reads.

🎯 Pay 25% more on write, save 90% on reads. Break-even 1.5 reads.

🎯 1-hour window. High-frequency apps benefit most.

🎯 Min 1024 tokens. Short prompts không cache.

🎯 Metrics: cache_read_input_tokens verify caching work.

Nội dung này có hữu ích không?