- Wrap caching in a helper function (non-destructive)
- Use tools and system caching together
- Monitor cache_creation vs cache_read tokens
- Measure real-world ROI
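Non-destructive pattern
Do not modify the original tools/system lists. Mutating them in place risks bugs when the tools are reordered later, so the helper clones before adding cache_control.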
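To make the risk concrete, here is a small offline sketch (no API calls) comparing an in-place variant with a cloning variant. The function names and the two-field tool dicts are hypothetical, chosen only for this demo.

```python
def mark_last_in_place(tools):
    """Destructive variant (anti-pattern): mutates the caller's list."""
    tools[-1]["cache_control"] = {"type": "ephemeral"}
    return tools

def mark_last_cloned(tools):
    """Non-destructive variant: copy the list and only the last tool."""
    cloned = tools.copy()
    cloned[-1] = {**cloned[-1], "cache_control": {"type": "ephemeral"}}
    return cloned

def count_breakpoints(tools):
    """Number of tools that carry a cache_control marker."""
    return sum("cache_control" in t for t in tools)

# Hypothetical tool schemas, reduced to names for the demo
shared = [{"name": "a"}, {"name": "b"}]
mark_last_in_place(shared)
shared.append({"name": "c"})         # the tool list changes later
mark_last_in_place(shared)
leaked = count_breakpoints(shared)   # stale marker on "b" plus a new one on "c"

clean = [{"name": "a"}, {"name": "b"}]
mark_last_cloned(clean)
clean.append({"name": "c"})
sent = mark_last_cloned(clean)
kept = count_breakpoints(clean)      # the original list never gets a marker

print(leaked, count_breakpoints(sent), kept)
```

With the in-place variant, stale markers pile up in the shared list as it changes; with the cloning variant, each request carries exactly one marker and the original list stays clean.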
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# MODEL (a model ID string) is assumed to be defined earlier in the course

def chat_with_caching(messages, system=None, tools=None, **kwargs):
    params = {
        "model": MODEL,
        "max_tokens": 1000,
        "messages": messages,
    }
    # Cache tools: mark the last tool in a cloned list
    if tools:
        tools_clone = tools.copy()          # shallow copy of the list
        last_tool = tools_clone[-1].copy()  # copy only the tool we modify
        last_tool["cache_control"] = {"type": "ephemeral"}
        tools_clone[-1] = last_tool
        params["tools"] = tools_clone
    # Cache the system prompt
    if system:
        params["system"] = [{
            "type": "text",
            "text": system,
            "cache_control": {"type": "ephemeral"},
        }]
    params.update(kwargs)  # caller overrides (e.g. max_tokens) win
    return client.messages.create(**params)

Observe cache metrics
def log_cache_metrics(response, label=""):
    usage = response.usage
    # The cache fields may be None when caching is not in play, so default to 0
    cache_write = usage.cache_creation_input_tokens or 0
    cache_read = usage.cache_read_input_tokens or 0
    print(f"\n--- {label} ---")
    print(f"Input (non-cached): {usage.input_tokens}")
    print(f"Cache write: {cache_write}")
    print(f"Cache read: {cache_read}")
    print(f"Output: {usage.output_tokens}")
    total_input = usage.input_tokens + cache_write + cache_read
    cached_ratio = cache_read / total_input if total_input else 0
    print(f"Cache hit rate: {cached_ratio:.0%}")

End-to-end example
Request 2 costs roughly 80% less than Request 1; the exact figure depends on how many output tokens each reply uses.
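As a rough check of that figure, the sketch below prices both requests using the expected token counts from this example and the per-token prices used in the ROI section of this lesson. The output length of 200 tokens per reply is an assumption made only for this calculation, and `request_cost` is a hypothetical helper.

```python
# Prices in USD per million tokens, as used in this lesson's ROI section
INPUT, CACHE_WRITE, CACHE_READ, OUTPUT = 3.00, 3.75, 0.30, 15.00

def request_cost(input_tokens, cache_write, cache_read, output_tokens):
    """Cost of one request in USD."""
    return (
        input_tokens * INPUT
        + cache_write * CACHE_WRITE
        + cache_read * CACHE_READ
        + output_tokens * OUTPUT
    ) / 1_000_000

ASSUMED_OUTPUT = 200  # assumption: both replies are about 200 tokens long

cost_1 = request_cost(50, 5000, 0, ASSUMED_OUTPUT)  # Request 1: cache write
cost_2 = request_cost(45, 0, 5000, ASSUMED_OUTPUT)  # Request 2: cache read
reduction = 1 - cost_2 / cost_1

print(f"Request 1: ${cost_1:.4f}")
print(f"Request 2: ${cost_2:.4f}")
print(f"Reduction: {reduction:.0%}")
```

Under these assumptions the reduction comes out just under 80%. Shorter replies push it higher, longer replies push it lower, because output tokens are not affected by caching.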
SYSTEM_PROMPT = """You are a professional HR Assistant...
[3000 tokens of policy docs, role info, examples]
"""
tools = [
    get_employee_info_schema,
    create_leave_request_schema,
    search_policy_schema,  # last = cache breakpoint
]

# Request 1
messages = [{"role": "user", "content": "I would like to request one week of leave"}]
resp1 = chat_with_caching(messages, system=SYSTEM_PROMPT, tools=tools)
log_cache_metrics(resp1, "Request 1")
# Expected:
# Input (non-cached): 50 (user query)
# Cache write: 5000 (system + tools)
# Cache read: 0
# Hit rate: 0%

# Request 2
messages2 = [{"role": "user", "content": "What is the sick leave policy?"}]
resp2 = chat_with_caching(messages2, system=SYSTEM_PROMPT, tools=tools)
log_cache_metrics(resp2, "Request 2")
# Expected:
# Input (non-cached): 45
# Cache write: 0
# Cache read: 5000 (HIT!)
# Hit rate: 99%

Multi-turn with cached history
Within a single conversation, the cache builds up from turn to turn. The basic flow relies on the system/tools cache only.
For very long conversations, add a breakpoint on the message history as well.
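One possible way to add such a history breakpoint without touching the original list is sketched below. `with_history_breakpoint` is a hypothetical helper, and it assumes each message's content is either a plain string or a list of dict blocks (SDK response objects would need converting to dicts first).

```python
import copy

def with_history_breakpoint(messages, index):
    """Return a copy of `messages` with cache_control on the last block of messages[index]."""
    cloned = list(messages)                # new list; the original stays untouched
    target = copy.deepcopy(cloned[index])  # deep-copy only the message we change
    content = target["content"]
    if isinstance(content, str):
        # Plain-string content must become a block list to carry cache_control
        content = [{"type": "text", "text": content}]
    content[-1]["cache_control"] = {"type": "ephemeral"}
    target["content"] = content
    cloned[index] = target
    return cloned

# Demo with a fake 12-message history (alternating roles, string content)
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(12)
]
marked = with_history_breakpoint(history, 9)  # 10th message, zero-based index 9
print(marked[9]["content"])
```

The returned list can be passed as `messages` to the request, while `history` keeps growing unmodified, so the breakpoint can be moved forward later without leaving stale markers behind.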
# Basic multi-turn flow: only system and tools are cached
messages = []

# Turn 1: user
messages.append({"role": "user", "content": "First question"})
r1 = chat_with_caching(messages, system=SYSTEM_PROMPT, tools=tools)
messages.append({"role": "assistant", "content": r1.content})

# Turn 2: user
messages.append({"role": "user", "content": [
    {"type": "text", "text": "Second question"},
    # Don't add cache_control here; let the system/tools cache be enough
]})
r2 = chat_with_caching(messages, system=SYSTEM_PROMPT, tools=tools)
# Reads cache for system + tools
# Processes turn 1 + turn 2 fresh

Multi-turn with cached history (continued)
# Cache breakpoint on the 10th message (stable history)
if len(messages) > 10:
    # Add cache_control to the 10th message
    ...

Dashboard: hit rate
Monitor the hit rate over time. Target: above 70% for high-frequency apps.
total_cache_reads = 0
total_cache_writes = 0
total_input = 0

def track_request(response):
    global total_cache_reads, total_cache_writes, total_input
    u = response.usage
    total_cache_reads += u.cache_read_input_tokens or 0
    total_cache_writes += u.cache_creation_input_tokens or 0
    total_input += u.input_tokens
    # Guard against division by zero before any tokens are counted
    total = total_cache_reads + total_cache_writes + total_input
    hit_rate = total_cache_reads / total if total else 0
    print(f"Running hit rate: {hit_rate:.0%}")

Measuring ROI
Compare two runs, one without caching and one with it.
Real numbers are what should drive the deployment decision.
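Before spending API calls on the comparison, a back-of-the-envelope estimate can show what to expect. The sketch below is a simplified model, not a billing formula: it covers input cost only and assumes the first request writes the cache and every later request arrives before the cache expires. `estimate_input_cost` and the token counts are hypothetical; the price and multipliers are the ones used in this lesson.

```python
PRICE_INPUT = 3.00   # USD per million input tokens (figure used in this lesson)
WRITE_MULT = 1.25    # cache write costs 25% more than base input
READ_MULT = 0.10     # cache read costs 90% less than base input

def estimate_input_cost(prefix_tokens, dynamic_tokens, n_requests, cached):
    """Estimated input cost in USD for n_requests sharing one stable prefix."""
    per_token = PRICE_INPUT / 1_000_000
    if not cached:
        return n_requests * (prefix_tokens + dynamic_tokens) * per_token
    first = prefix_tokens * WRITE_MULT + dynamic_tokens  # one cache write
    rest = (n_requests - 1) * (prefix_tokens * READ_MULT + dynamic_tokens)
    return (first + rest) * per_token

# 5000-token system + tools prefix, about 50 tokens of user input, 100 requests
baseline = estimate_input_cost(5000, 50, 100, cached=False)
with_cache = estimate_input_cost(5000, 50, 100, cached=True)
savings = 1 - with_cache / baseline

print(f"Baseline:   ${baseline:.4f}")
print(f"With cache: ${with_cache:.4f}")
print(f"Savings:    {savings:.0%}")
```

The measured runs will usually show lower savings than this estimate, since output tokens cost the same either way and some requests may miss the cache.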
# Estimated prices: $3 per M input tokens, $15 per M output tokens
# random_question() is assumed to be a helper that returns a test question string

# Run without caching (baseline)
def run_without_cache(n_requests):
    total_cost = 0
    for _ in range(n_requests):
        r = client.messages.create(
            model=MODEL, max_tokens=500,
            system=SYSTEM_PROMPT,  # plain string, no cache_control
            tools=tools,           # same tools as the cached run, for a fair comparison
            messages=[{"role": "user", "content": random_question()}],
        )
        cost = (r.usage.input_tokens * 3 + r.usage.output_tokens * 15) / 1_000_000
        total_cost += cost
    return total_cost

# Run with caching
def run_with_cache(n_requests):
    total_cost = 0
    for _ in range(n_requests):
        r = chat_with_caching(
            [{"role": "user", "content": random_question()}],
            system=SYSTEM_PROMPT,
            tools=tools,
            max_tokens=500,  # match the baseline
        )
        cost = (
            r.usage.input_tokens * 3 +
            r.usage.cache_creation_input_tokens * 3.75 +  # 25% more than base input
            r.usage.cache_read_input_tokens * 0.3 +       # 90% less than base input
            r.usage.output_tokens * 15
        ) / 1_000_000
        total_cost += cost
    return total_cost

no_cache_cost = run_without_cache(100)
cache_cost = run_with_cache(100)
print(f"Without cache: ${no_cache_cost:.4f}")
print(f"With cache: ${cache_cost:.4f}")
print(f"Savings: {(1 - cache_cost / no_cache_cost) * 100:.0f}%")

Partial cache hit
Scenario: the system prompt changes, the tools stay the same.
Result for r2:
- Cache read: tools only (~2K tokens)
- Cache write: system_v2 (~500 tokens)
- Partial savings
Ordering matters: tools are processed first, so they are cached independently of system prompt changes.
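The prefix behaviour can be illustrated with a toy model. This is not how the service is implemented; it only mimics the idea that each breakpoint caches everything up to that point, in the order tools first, then system. All names here are hypothetical.

```python
import hashlib
import json

def prefix_keys(tools, system):
    """One key per breakpoint, in processing order: tools, then tools + system."""
    tools_blob = json.dumps(tools, sort_keys=True)
    tools_key = hashlib.sha256(tools_blob.encode()).hexdigest()
    system_key = hashlib.sha256((tools_blob + system).encode()).hexdigest()
    return [("tools", tools_key), ("tools+system", system_key)]

def simulate(cache, tools, system):
    """Report hit or miss per prefix against a toy cache (a set), then store the keys."""
    result = {}
    for name, key in prefix_keys(tools, system):
        result[name] = "hit" if key in cache else "miss"
        cache.add(key)
    return result

cache = set()
toy_tools = [{"name": "search_policy"}]

first = simulate(cache, toy_tools, "You are assistant A.")
second = simulate(cache, toy_tools, "You are assistant B.")   # system changed
third = simulate(cache, [{"name": "other_tool"}], "You are assistant B.")  # tools changed

print(first)
print(second)
print(third)
```

In the second request the tools prefix still hits while the tools + system prefix misses, which is the partial hit described above. In the third, changing the tools misses everything, because the tools sit at the front of every prefix.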
# msgs: any message list, reused unchanged for both requests

# Request 1
system_v1 = "You are assistant A."
r1 = chat_with_caching(msgs, system=system_v1, tools=tools)
# Cache write: system_v1 + tools

# Request 2: the system prompt changes
system_v2 = "You are assistant B."
r2 = chat_with_caching(msgs, system=system_v2, tools=tools)

When to invalidate
Sometimes you want a cache miss:
- Deploying a new prompt version
- A/B testing different system prompts
Just change the content and the cache miss happens automatically. There is no explicit invalidation API.
Anti-patterns revisited
❌ Monitoring nothing
Caching is enabled, but nobody knows whether it works.
Fix: Log metrics and keep a hit-rate dashboard.
❌ Caching dynamic per-user content
system = f"User: {user_id}..." creates one cache per user, with low reuse.
Fix: Keep the system prompt static and put per-user info in the user message.
❌ Slow first request alarms users
A cache-write request is slower than the no-cache version.
Fix: Warm the cache in the background at app startup, or accept a slower first request.
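The fix for the per-user anti-pattern can be sketched as follows. `build_request`, `STATIC_SYSTEM`, and the bracketed user-id prefix are hypothetical choices for illustration; the point is only that the cached block is byte-identical for every user.

```python
STATIC_SYSTEM = "You are a professional HR Assistant."  # identical for every user

def build_request(user_id, question):
    """Keep the cached system block static; put per-user data in the user message."""
    system = [{
        "type": "text",
        "text": STATIC_SYSTEM,
        "cache_control": {"type": "ephemeral"},
    }]
    messages = [{
        "role": "user",
        "content": f"[user_id: {user_id}]\n{question}",
    }]
    return {"system": system, "messages": messages}

req_a = build_request("u-001", "How many leave days do I have left?")
req_b = build_request("u-002", "How many leave days do I have left?")
print(req_a["system"] == req_b["system"])      # same cached prefix for both users
print(req_a["messages"] == req_b["messages"])  # only the user message differs
```

Because the system block never varies, all users share one cache entry, and only the short user message is processed fresh on each request.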
Apply it now
Exercise 1: Apply production caching (30 minutes)
Take the app you built in Module 2 and add caching:
- System prompt with cache_control
- Tools (if any) with cache_control on the last tool
Test with 10 requests. Log the metrics. Calculate the savings.
Exercise 2: Cache warming (15 minutes)
At app startup, send a warm-up request with the same system prompt and tools plus a dummy user message. Its only purpose is to create the cache.
Real user requests then benefit immediately.
Summary
🎯 Use a non-destructive wrapper for caching: clone before you modify.
🎯 Monitor cache_creation vs cache_read tokens on every response.
🎯 Partial hits are fine: the tools cache survives a change to the system prompt.
🎯 Warming the cache at startup reduces first-request latency.
🎯 Real-world ROI: 60-90% savings for high-frequency apps with stable system/tools.