{"product_id":"prompt-caching-tối-ưu-chi-phi-claude-api","title":"Prompt Caching: Optimizing Claude API Costs","description":"\n\u003ch2\u003eWhat is Prompt Caching?\u003c\/h2\u003e\n\u003cp\u003ePrompt Caching is a Claude API feature that stores the static parts of a prompt in a cache so that subsequent requests do not have to process them again. Once part of a prompt is cached, reading it from the cache costs only 10% of the normal input price, and latency drops significantly as well.\u003c\/p\u003e\n\n\u003cp\u003eThe feature is especially useful when you have a long system prompt, many reference documents, or few-shot examples repeated in every request.\u003c\/p\u003e\n\n\u003ch2\u003eHow it works\u003c\/h2\u003e\n\u003cp\u003eYou mark the parts of the prompt to cache with \u003ccode\u003ecache_control: { type: \"ephemeral\" }\u003c\/code\u003e. Claude creates a \"breakpoint\" at that position and caches all content from the start of the prompt up to the breakpoint.\u003c\/p\u003e\n\n\u003ch3\u003eHow cache breakpoints work\u003c\/h3\u003e\n\u003cul\u003e\n  \u003cli\u003eThe cache is created at the last \u003ccode\u003ecache_control\u003c\/code\u003e marker in the prefix; the prefix is assembled in the order tools, system, messages\u003c\/li\u003e\n  \u003cli\u003eEach request can have at most \u003cstrong\u003e4 cache breakpoints\u003c\/strong\u003e\n\u003c\/li\u003e\n  \u003cli\u003eA cache entry lives for \u003cstrong\u003e5 minutes\u003c\/strong\u003e after it was last read (sliding-window TTL)\u003c\/li\u003e\n  \u003cli\u003eMinimum cacheable size: \u003cstrong\u003e1024 tokens\u003c\/strong\u003e (Sonnet 4, Opus 4) or \u003cstrong\u003e2048 tokens\u003c\/strong\u003e (Haiku 3.5)\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch2\u003ePricing: Cache vs No Cache\u003c\/h2\u003e\n\u003ctable\u003e\n  \u003cthead\u003e\n    \u003ctr\u003e\n      \u003cth\u003eToken type (price per million tokens)\u003c\/th\u003e\n      \u003cth\u003eOpus 4\u003c\/th\u003e\n      \u003cth\u003eSonnet 4\u003c\/th\u003e\n      \u003cth\u003eHaiku 3.5\u003c\/th\u003e\n    \u003c\/tr\u003e\n  
\u003c\/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003ctd\u003eRegular input\u003c\/td\u003e\n      \u003ctd\u003e$15\/M\u003c\/td\u003e\n      \u003ctd\u003e$3\/M\u003c\/td\u003e\n      \u003ctd\u003e$0.80\/M\u003c\/td\u003e\n    \u003c\/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003eCache write (first request)\u003c\/td\u003e\n      \u003ctd\u003e$18.75\/M (+25%)\u003c\/td\u003e\n      \u003ctd\u003e$3.75\/M (+25%)\u003c\/td\u003e\n      \u003ctd\u003e$1\/M (+25%)\u003c\/td\u003e\n    \u003c\/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003eCache read (later requests)\u003c\/td\u003e\n      \u003ctd\u003e$1.50\/M (-90%)\u003c\/td\u003e\n      \u003ctd\u003e$0.30\/M (-90%)\u003c\/td\u003e\n      \u003ctd\u003e$0.08\/M (-90%)\u003c\/td\u003e\n    \u003c\/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003eOutput\u003c\/td\u003e\n      \u003ctd\u003e$75\/M\u003c\/td\u003e\n      \u003ctd\u003e$15\/M\u003c\/td\u003e\n      \u003ctd\u003e$4\/M\u003c\/td\u003e\n    \u003c\/tr\u003e\n  \u003c\/tbody\u003e\n\u003c\/table\u003e\n\n\u003cp\u003eThe request that writes the cache costs 25% more than an uncached request. From the second request on, the cached portion costs 90% less. Caching therefore pays off as soon as the content is reused once within the 5-minute TTL: with caching, two requests cost 1.25 + 0.10 = 1.35 times the base input price, versus 2 times without it.\u003c\/p\u003e\n\n\u003ch2\u003eWhat can be cached\u003c\/h2\u003e\n\n\u003ch3\u003e1. 
Long system prompt\u003c\/h3\u003e\n\u003cp\u003eThe most common case is a system prompt that contains detailed instructions, a persona, or background context:\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003eimport anthropic\n\nclient = anthropic.Anthropic()\n\n# Long system prompt (2000-5000 tokens): this part is cached\nsystem_prompt = \"\"\"\n[Full detailed instructions for the AI assistant, rules of conduct,\nspecific domain knowledge, how to handle special cases...\n(2000-5000 tokens of content)]\n\"\"\"\n\n# Example dynamic input; it comes after the breakpoint and is not cached\nuser_question = \"What is your refund policy?\"\n\nresponse = client.messages.create(\n    model=\"claude-sonnet-4-5\",\n    max_tokens=1024,\n    system=[\n        {\n            \"type\": \"text\",\n            \"text\": system_prompt,\n            \"cache_control\": {\"type\": \"ephemeral\"}\n        }\n    ],\n    messages=[{\"role\": \"user\", \"content\": user_question}]\n)\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch3\u003e2. Tool definitions\u003c\/h3\u003e\n\u003cp\u003eWhen you have many tools with long descriptions, caching the tool definitions saves a significant amount:\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003eresponse = client.messages.create(\n    model=\"claude-sonnet-4-5\",\n    max_tokens=1024,\n    tools=[\n        # ... many other tools go here, before the last one\n        {\n            \"name\": \"search_database\",\n            \"description\": \"...\",  # Long description\n            \"input_schema\": {...},\n            # Caching is not automatic: mark the LAST tool so that\n            # every tool definition becomes part of the cached prefix\n            \"cache_control\": {\"type\": \"ephemeral\"}\n        }\n    ],\n    messages=[...]\n)\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch3\u003e3. 
Long context \/ reference documents\u003c\/h3\u003e\n\u003cp\u003eCache a long document so you can ask several questions about the same document:\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003e# Reuses the client created in the first example\nwith open(\"legal_document.txt\", \"r\", encoding=\"utf-8\") as f:\n    document = f.read()\n\ndef ask_about_document(question):\n    return client.messages.create(\n        model=\"claude-opus-4-20250514\",\n        max_tokens=2048,\n        messages=[\n            {\n                \"role\": \"user\",\n                \"content\": [\n                    {\n                        \"type\": \"text\",\n                        \"text\": f\"Document:\\n\\n{document}\",\n                        \"cache_control\": {\"type\": \"ephemeral\"}\n                    },\n                    {\n                        \"type\": \"text\",\n                        \"text\": f\"\\nQuestion: {question}\"\n                    }\n                ]\n            }\n        ]\n    )\n\n# Question 1: cache write (costs 25% more)\nr1 = ask_about_document(\"What are the payment terms?\")\n# Question 2: cache read (90% cheaper)\nr2 = ask_about_document(\"How long is the contract term?\")\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch3\u003e4. Few-shot examples\u003c\/h3\u003e\n\u003cp\u003eCache the sample examples in the conversation history:\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003emessages = [\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\n                \"type\": \"text\",\n                \"text\": few_shot_examples,  # 10-20 long examples\n                \"cache_control\": {\"type\": \"ephemeral\"}\n            }\n        ]\n    },\n    {\"role\": \"assistant\", \"content\": \"Understood the format. 
Ready to help.\"},\n    {\"role\": \"user\", \"content\": actual_user_request}\n]\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eChecking Cache Hits and Misses\u003c\/h2\u003e\n\u003cp\u003eCheck the cache status through the usage object in the response:\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003eresponse = client.messages.create(...)\n\nusage = response.usage\n# input_tokens counts only the uncached tokens after the last breakpoint\nprint(f\"Input tokens: {usage.input_tokens}\")\nprint(f\"Cache creation tokens: {usage.cache_creation_input_tokens}\")\nprint(f\"Cache read tokens: {usage.cache_read_input_tokens}\")\n\n# Classify the request by cache status\nif usage.cache_read_input_tokens \u0026gt; 0:\n    print(\"Cache HIT: reading from cache\")\nelif usage.cache_creation_input_tokens \u0026gt; 0:\n    print(\"Cache WRITE: creating a new cache entry\")\nelse:\n    print(\"No cache activity for this request\")\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eCost savings model\u003c\/h2\u003e\n\u003cp\u003eAn illustrative example for a customer support application with a 5000-token system prompt and an average of 100 requests\/hour (Sonnet 4):\u003c\/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eWithout cache:\u003c\/strong\u003e 100 requests × 5000 tokens × $3\/M = $1.50\/hour\u003c\/p\u003e\n\u003cp\u003e\u003cstrong\u003eWith cache\u003c\/strong\u003e (1 cache write + 99 cache reads, assuming consecutive requests are never more than 5 minutes apart):\u003c\/p\u003e\n\u003cul\u003e\n  \u003cli\u003eCache write: 5000 × $3.75\/M = $0.019\u003c\/li\u003e\n  \u003cli\u003eCache reads: 99 × 5000 × $0.30\/M = $0.149\u003c\/li\u003e\n  \u003cli\u003eTotal: ~$0.17\/hour\u003c\/li\u003e\n\u003c\/ul\u003e\n\u003cp\u003eThis saves about 89% of the input-token cost for the system prompt.\u003c\/p\u003e\n\n\u003ch2\u003eBest Practices\u003c\/h2\u003e\n\n\u003ch3\u003ePlace the cache breakpoint in the right position\u003c\/h3\u003e\n\u003cp\u003ePlace the cache breakpoint at the end of the stable (unchanging) content. 
Content that changes (user message, dynamic context) must come AFTER the breakpoint:\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003e# RIGHT: static content first, dynamic content after\nmessages = [\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"text\", \"text\": STATIC_DOC, \"cache_control\": {\"type\": \"ephemeral\"}},\n            {\"type\": \"text\", \"text\": dynamic_user_question}  # Not cached\n        ]\n    }\n]\n\n# WRONG: dynamic content before the breakpoint invalidates the cache\n# (the cache key covers all content before the breakpoint)\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch3\u003eTTL and warm-up\u003c\/h3\u003e\n\u003cul\u003e\n  \u003cli\u003eThe cache expires after 5 minutes without a read\u003c\/li\u003e\n  \u003cli\u003eWith low traffic, consider sending a periodic \"warm-up request\" to keep the cache alive; each read refreshes the TTL\u003c\/li\u003e\n  \u003cli\u003eThe cache does not persist across API restarts, so plan for cache misses\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch3\u003eCombining with Tool Use\u003c\/h3\u003e\n\u003cp\u003eCache the system prompt and tool definitions, and leave the user message and dynamic context uncached:\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003eresponse = client.messages.create(\n    model=\"claude-sonnet-4-5\",\n    max_tokens=1024,\n    system=[\n        {\"type\": \"text\", \"text\": static_instructions, \"cache_control\": {\"type\": \"ephemeral\"}}\n    ],\n    tools=large_tool_set,  # Cached too: tools precede system in the prefix\n    messages=[{\"role\": \"user\", \"content\": user_input}]\n)\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eLimitations\u003c\/h2\u003e\n\u003cul\u003e\n  \u003cli\u003eThe TTL is only 5 minutes, which makes caching unsuitable across work sessions that are far apart\u003c\/li\u003e\n  \u003cli\u003eMinimum token threshold: 1024 (Sonnet\/Opus) or 2048 (Haiku)\u003c\/li\u003e\n  \u003cli\u003eCaches are not shared between organizations\u003c\/li\u003e\n  \u003cli\u003eAt most 4 breakpoints per request\u003c\/li\u003e\n  \u003cli\u003eCached content is 
cleared completely when Anthropic deploys model updates\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch2\u003eConclusion\u003c\/h2\u003e\n\u003cp\u003ePrompt Caching is one of the most effective cost-optimization features of the Claude API. For any application with a system prompt above 2000 tokens or with many reference documents, enabling caching is close to mandatory in production. The first request (cache write) costs 25% more, but from the second request on the cached portion costs 90% less; with real traffic, the extra write cost is typically recovered within the first few requests.\u003c\/p\u003e\n","brand":"Minh Tuấn","offers":[{"title":"Default Title","offer_id":47721071771860,"sku":null,"price":0.0,"currency_code":"VND","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0821\/0264\/9044\/files\/prompt-caching-t_i-_u-chi-phi-claude-api.jpg?v=1774521677","url":"https:\/\/claude.vn\/products\/prompt-caching-t%e1%bb%91i-%c6%b0u-chi-phi-claude-api","provider":"CLAUDE.VN","version":"1.0","type":"link"}