Prompt caching in action — Production code

6 — Tính năng nâng caoNâng cao20 phút

Bạn sẽ học được
  • Wrap caching vào helper function (non-destructive)
  • Dùng tools + system caching cùng lúc
  • Monitor cache_creation vs cache_read tokens
  • Measure ROI thực tế

Pattern non-destructive

Không modify tools/system list gốc — risk bug khi reorder:

def chat_with_caching(messages, system=None, tools=None, **kwargs):
    params = {
        "model": MODEL,
        "max_tokens": 1000,
        "messages": messages,
    }
    
    # Cache tools (last tool in cloned list)
    if tools:
        tools_clone = tools.copy()
        last_tool = tools_clone[-1].copy()
        last_tool["cache_control"] = {"type": "ephemeral"}
        tools_clone[-1] = last_tool
        params["tools"] = tools_clone
    
    # Cache system prompt
    if system:
        params["system"] = [{
            "type": "text",
            "text": system,
            "cache_control": {"type": "ephemeral"}
        }]
    
    params.update(kwargs)
    return client.messages.create(**params)

Observe cache metrics

def log_cache_metrics(response, label=""):
    usage = response.usage
    print(f"\n--- {label} ---")
    print(f"Input (non-cached): {usage.input_tokens}")
    print(f"Cache write: {usage.cache_creation_input_tokens}")
    print(f"Cache read: {usage.cache_read_input_tokens}")
    print(f"Output: {usage.output_tokens}")
    
    total_input = (
        usage.input_tokens +
        usage.cache_creation_input_tokens +
        usage.cache_read_input_tokens
    )
    cached_ratio = usage.cache_read_input_tokens / total_input if total_input else 0
    print(f"Cache hit rate: {cached_ratio:.0%}")

Ví dụ end-to-end

Request 2 cost 80% lower than Request 1.

SYSTEM_PROMPT = """Bạn là HR Assistant chuyên nghiệp...
[3000 tokens of policy docs, role info, examples]
"""

tools = [
    get_employee_info_schema,
    create_leave_request_schema,
    search_policy_schema,  # last = cache breakpoint
]

# Request 1
messages = [{"role": "user", "content": "Tôi muốn xin nghỉ phép 1 tuần"}]
resp1 = chat_with_caching(messages, system=SYSTEM_PROMPT, tools=tools)
log_cache_metrics(resp1, "Request 1")

# Expected:
# Input (non-cached): 50 (user query)
# Cache write: 5000 (system + tools)
# Cache read: 0
# Hit rate: 0%

# Request 2
messages2 = [{"role": "user", "content": "Policy nghỉ ốm thế nào?"}]
resp2 = chat_with_caching(messages2, system=SYSTEM_PROMPT, tools=tools)
log_cache_metrics(resp2, "Request 2")

# Expected:
# Input (non-cached): 45
# Cache write: 0
# Cache read: 5000 (HIT!)
# Hit rate: 99%

Multi-turn with cached history

Trong 1 conversation, cache builds up:

For very long conversations, thêm breakpoint trên message history:

messages = []

# Turn 1: user
messages.append({"role": "user", "content": "First question"})
r1 = chat_with_caching(messages, system=SYSTEM_PROMPT, tools=tools)
messages.append({"role": "assistant", "content": r1.content})

# Turn 2: user
messages.append({"role": "user", "content": [
    {"type": "text", "text": "Second question"},
    # Don't add cache_control here — let system/tools cache be enough
]})

r2 = chat_with_caching(messages, system=SYSTEM_PROMPT, tools=tools)
# Reads cache for system+tools
# Processes turn 1 + turn 2 fresh

Multi-turn with cached history (tiếp)

# Cache breakpoint trên message thứ 10 (stable history)
if len(messages) > 10:
    # Add cache_control to 10th message
    ...

Dashboard: hit rate

Monitor over time. Target: > 70% hit rate cho high-frequency apps.

total_cache_reads = 0
total_cache_writes = 0
total_input = 0

def track_request(response):
    global total_cache_reads, total_cache_writes, total_input
    u = response.usage
    total_cache_reads += u.cache_read_input_tokens
    total_cache_writes += u.cache_creation_input_tokens
    total_input += u.input_tokens
    
    print(f"Running hit rate: {total_cache_reads / (total_cache_reads + total_cache_writes + total_input):.0%}")

Measuring ROI

Compare 2 runs:

Real number để decide deployment.

# Run without caching (baseline)
def run_without_cache(n_requests):
    total_cost = 0
    for _ in range(n_requests):
        r = client.messages.create(
            model=MODEL, max_tokens=500,
            system=SYSTEM_PROMPT,  # string, no cache
            messages=[{"role": "user", "content": random_question()}]
        )
        # Estimate cost: $3 per M input tokens, $15 per M output
        cost = (r.usage.input_tokens * 3 + r.usage.output_tokens * 15) / 1_000_000
        total_cost += cost
    return total_cost


# Run with caching
def run_with_cache(n_requests):
    total_cost = 0
    for _ in range(n_requests):
        r = chat_with_caching(
            [{"role": "user", "content": random_question()}],
            system=SYSTEM_PROMPT,
            tools=tools
        )
        cost = (
            r.usage.input_tokens * 3 +
            r.usage.cache_creation_input_tokens * 3.75 +  # 25% more
            r.usage.cache_read_input_tokens * 0.3 +      # 90% less
            r.usage.output_tokens * 15
        ) / 1_000_000
        total_cost += cost
    return total_cost


no_cache_cost = run_without_cache(100)
cache_cost = run_with_cache(100)
print(f"Without cache: ${no_cache_cost:.4f}")
print(f"With cache: ${cache_cost:.4f}")
print(f"Savings: {(1 - cache_cost/no_cache_cost)*100:.0f}%")

Partial cache hit

Scenario: thay đổi system prompt, giữ tools.

Result r2:

Ordering matter: Tools processed first, so they cache independently of system changes.

  • Cache read: tools only (~2K tokens)
  • Cache write: system_v2 (~500 tokens)
  • Partial savings
# Request 1
system_v1 = "You are assistant A."
r1 = chat_with_caching(msgs, system=system_v1, tools=tools)
# Cache write: system_v1 + tools

# Request 2 — system đổi
system_v2 = "You are assistant B."
r2 = chat_with_caching(msgs, system=system_v2, tools=tools)

When to invalidate

Sometimes bạn want cache miss:

Just change content → cache miss automatic. No explicit invalidate API.

  • Deploy new prompt version
  • A/B test different system

Anti-patterns revisited

❌ Monitor nothing

Enable cache, không biết work không.

Fix: Log metrics, dashboard hit rate.

❌ Cache dynamic per-user

system = f"User: {user_id}..." → per-user cache, low reuse.

Fix: Static system + per-user info in user message.

❌ First request slow alarms user

Cache write request = slower than no-cache version.

Fix: Warm cache in background at app startup. Or accept first request slower.

Áp dụng ngay

Bài tập 1: Apply production caching (30 phút)

Lấy app bạn build (Module 2). Add caching:

Test 10 requests. Log metrics. Calculate savings.

Bài tập 2: Cache warming (15 phút)

At app startup, send "warm-up" request với same system + tools + dummy user msg. Purpose: create cache.

Then real user requests benefit immediately.

  • System prompt với cache_control
  • Tools (nếu có) với cache_control on last

Tóm tắt

🎯 Non-destructive wrapper cho caching — clone before modify.

🎯 Monitor cache_creation vs cache_read tokens mỗi response.

🎯 Partial hit OK — tools cache, system cache independent.

🎯 Warm cache at startup giảm first-request latency.

🎯 Real ROI 60-90% savings cho high-frequency apps with stable system/tools.

Nội dung này có hữu ích không?
Kiểm tra kiến thức

Củng cố những gì bạn vừa học

12 câu trắc nghiệm · đạt từ 70% · câu hỏi và đáp án xáo trộn mỗi lần.