{"product_id":"prompt-caching-tiết-kiệm-90-chi-phi-với-cache-thong-minh","title":"Prompt Caching: Save 90% on Costs with Smart Caching","description":"\n\u003cp\u003eAre you building an application with a 5,000-token system prompt, or putting an entire book into the context? On every API call you pay for all of those tokens, even though Claude has already \"read\" them hundreds of times. \u003cstrong\u003ePrompt Caching\u003c\/strong\u003e solves this: Anthropic caches the expensive part of the context, and you pay only 10% of the normal input price on cache hits.\u003c\/p\u003e\n\n\u003ch2\u003eHow it works\u003c\/h2\u003e\n\n\u003cp\u003eWhen caching is enabled, Anthropic stores the prompt prefix (usually the system prompt and long context) on its servers. Subsequent requests with the same prefix reuse the cache instead of reprocessing everything.\u003c\/p\u003e\n\n\u003ctable\u003e\n  \u003cthead\u003e\n    \u003ctr\u003e\n\u003cth\u003eToken type\u003c\/th\u003e\n\u003cth\u003ePrice relative to standard input\u003c\/th\u003e\n\u003c\/tr\u003e\n  \u003c\/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n\u003ctd\u003eCache write (first request)\u003c\/td\u003e\n\u003ctd\u003e125% (25% more)\u003c\/td\u003e\n\u003c\/tr\u003e\n    \u003ctr\u003e\n\u003ctd\u003eCache read (later requests)\u003c\/td\u003e\n\u003ctd\u003e10% (90% less)\u003c\/td\u003e\n\u003c\/tr\u003e\n    \u003ctr\u003e\n\u003ctd\u003eOutput tokens\u003c\/td\u003e\n\u003ctd\u003e100% (unchanged)\u003c\/td\u003e\n\u003c\/tr\u003e\n  \u003c\/tbody\u003e\n\u003c\/table\u003e\n\n\u003cp\u003eThe cache lives for \u003cstrong\u003e5 minutes\u003c\/strong\u003e, and each cache hit refreshes that lifetime. 
It works best for:\u003c\/p\u003e\n\u003cul\u003e\n  \u003cli\u003eLong system prompts (thousands of tokens)\u003c\/li\u003e\n  \u003cli\u003eLarge reference material (FAQs, documentation, books)\u003c\/li\u003e\n  \u003cli\u003eComplex few-shot examples\u003c\/li\u003e\n  \u003cli\u003eMulti-turn conversations with long context\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch2\u003eDemo with Pride and Prejudice (187k tokens)\u003c\/h2\u003e\n\n\u003cp\u003eThis benchmark comes from the official Anthropic Cookbook: load the entire book into the context and ask questions about it:\u003c\/p\u003e\n\n\u003cpre\u003e\u003ccode\u003eimport anthropic\nimport time\n\nclient = anthropic.Anthropic()\n\n# Load the entire book (downloaded from Project Gutenberg)\nwith open(\"pride_and_prejudice.txt\", \"r\", encoding=\"utf-8\") as f:\n    book_text = f.read()\n\nprint(f\"Book length: {len(book_text):,} characters\")\n\ndef test_no_cache(question: str):\n    \"\"\"Request without caching.\"\"\"\n    start = time.time()\n\n    response = client.messages.create(\n        model=\"claude-haiku-4-5\",\n        max_tokens=300,\n        system=f\"You are a literary expert. Here is the full text of Pride and Prejudice:\\n\\n{book_text}\",\n        messages=[{\"role\": \"user\", \"content\": question}],\n    )\n\n    elapsed = time.time() - start\n    return {\n        \"answer\": response.content[0].text,\n        \"time\": elapsed,\n        \"input_tokens\": response.usage.input_tokens,\n        \"output_tokens\": response.usage.output_tokens,\n        \"cache_read\": 0,\n        \"cache_write\": 0,\n    }\n\ndef test_with_cache(question: str):\n    \"\"\"Request with an explicit cache breakpoint.\"\"\"\n    start = time.time()\n\n    response = client.messages.create(\n        model=\"claude-haiku-4-5\",\n        max_tokens=300,\n        system=[\n            {\n                \"type\": \"text\",\n                \"text\": \"You are a literary expert. 
Here is the full text of Pride and Prejudice:\",\n            },\n            {\n                \"type\": \"text\",\n                \"text\": book_text,\n                # Cache breakpoint: the prefix up to and including this block is cached\n                \"cache_control\": {\"type\": \"ephemeral\"},\n            },\n        ],\n        messages=[{\"role\": \"user\", \"content\": question}],\n    )\n\n    elapsed = time.time() - start\n    usage = response.usage\n\n    return {\n        \"answer\": response.content[0].text,\n        \"time\": elapsed,\n        \"input_tokens\": usage.input_tokens,\n        \"output_tokens\": usage.output_tokens,\n        \"cache_read\": getattr(usage, \"cache_read_input_tokens\", 0),\n        \"cache_write\": getattr(usage, \"cache_creation_input_tokens\", 0),\n    }\n\n# Benchmark\nquestion = \"Describe Elizabeth Bennet's character and her relationship with Mr. Darcy.\"\n\nprint(\"=== Test 1: No cache ===\")\nr1 = test_no_cache(question)\nprint(f\"Time: {r1['time']:.2f}s | Input tokens: {r1['input_tokens']:,}\")\nprint(f\"Answer: {r1['answer'][:100]}...\")\n\nprint(\"\\n=== Test 2: Cache write (first call) ===\")\nr2 = test_with_cache(question)\nprint(f\"Time: {r2['time']:.2f}s | Cache write: {r2['cache_write']:,} | Input: {r2['input_tokens']:,}\")\n\nprint(\"\\n=== Test 3: Cache hit (next call) ===\")\nr3 = test_with_cache(\"Mr. 
Darcy: how does he change over the course of the story?\")\nprint(f\"Time: {r3['time']:.2f}s | Cache read: {r3['cache_read']:,} | Input: {r3['input_tokens']:,}\")\n\n# Compare. Weight each token type by its relative price:\n# uncached input 1.0x, cache read 0.1x, cache write 1.25x.\nspeedup = r1['time'] \/ r3['time']\ncost_no_cache = r1['input_tokens']\ncost_cache_hit = (r3['input_tokens']\n                  + 0.10 * (r3['cache_read'] or 0)\n                  + 1.25 * (r3['cache_write'] or 0))\ncost_reduction = 1 - cost_cache_hit \/ cost_no_cache\nprint(f\"\\nResult: {speedup:.1f}x faster, {cost_reduction:.0%} lower input cost\")\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eBenchmark results\u003c\/h2\u003e\n\n\u003ctable\u003e\n  \u003cthead\u003e\n    \u003ctr\u003e\n\u003cth\u003eScenario\u003c\/th\u003e\n\u003cth\u003eTime\u003c\/th\u003e\n\u003cth\u003eInput tokens\u003c\/th\u003e\n\u003cth\u003eCost (relative)\u003c\/th\u003e\n\u003c\/tr\u003e\n  \u003c\/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n\u003ctd\u003eNo cache\u003c\/td\u003e\n\u003ctd\u003e~18s\u003c\/td\u003e\n\u003ctd\u003e187,000\u003c\/td\u003e\n\u003ctd\u003e100%\u003c\/td\u003e\n\u003c\/tr\u003e\n    \u003ctr\u003e\n\u003ctd\u003eCache write (1st call)\u003c\/td\u003e\n\u003ctd\u003e~20s\u003c\/td\u003e\n\u003ctd\u003e187,000 (written to cache)\u003c\/td\u003e\n\u003ctd\u003e125%\u003c\/td\u003e\n\u003c\/tr\u003e\n    \u003ctr\u003e\n\u003ctd\u003eCache hit (2nd call onward)\u003c\/td\u003e\n\u003ctd\u003e~8s\u003c\/td\u003e\n\u003ctd\u003e~500 new (question only); the rest is read from cache\u003c\/td\u003e\n\u003ctd\u003e~11%\u003c\/td\u003e\n\u003c\/tr\u003e\n  \u003c\/tbody\u003e\n\u003c\/table\u003e\n\n\u003cp\u003eEvery request after the cache write costs only ~11% of an uncached request. 
Break-even point: caching already pays off after just \u003cstrong\u003e2 requests\u003c\/strong\u003e (125% + 10% = 135%, versus 200% without caching).\u003c\/p\u003e\n\n\u003ch2\u003eMulti-turn Conversation Caching\u003c\/h2\u003e\n\n\u003cp\u003eThe most practical use case is caching the conversation history:\u003c\/p\u003e\n\n\u003cpre\u003e\u003ccode\u003edef chat_with_caching(conversation_history: list, new_message: str,\n                      system_prompt: str = \"\") -\u0026gt; dict:\n    \"\"\"\n    Multi-turn chat with caching on the conversation history.\n    The full history is still sent on every call, but the prefix up to the\n    breakpoint is read from the cache instead of being reprocessed.\n    \"\"\"\n\n    # Build the messages with a cache breakpoint on the history\n    messages = []\n\n    for i, msg in enumerate(conversation_history):\n        if i == len(conversation_history) - 1:\n            # Put the cache breakpoint on the last message of the history\n            messages.append({\n                \"role\": msg[\"role\"],\n                \"content\": [\n                    {\n                        \"type\": \"text\",\n                        \"text\": msg[\"content\"],\n                        \"cache_control\": {\"type\": \"ephemeral\"},\n                    }\n                ]\n            })\n        else:\n            messages.append(msg)\n\n    # Append the new message (not cached)\n    messages.append({\"role\": \"user\", \"content\": new_message})\n\n    response = client.messages.create(\n        model=\"claude-haiku-4-5\",\n        max_tokens=500,\n        system=system_prompt,\n        messages=messages,\n    )\n\n    usage = response.usage\n    answer = response.content[0].text\n\n    # Update the history in place\n    conversation_history.append({\"role\": \"user\", \"content\": new_message})\n    conversation_history.append({\"role\": \"assistant\", \"content\": answer})\n\n    return {\n        \"answer\": answer,\n        \"cache_read\": getattr(usage, \"cache_read_input_tokens\", 0),\n        \"cache_write\": getattr(usage, \"cache_creation_input_tokens\", 0),\n        \"regular_input\": usage.input_tokens,\n    }\n\n# Example: Chatbot 
for document analysis\n# Note: a prefix shorter than the model's minimum cacheable length is not cached,\n# so a conversation this short is expected to report cache_read = 0.\nsystem = \"You are a document analysis assistant. Answer concisely and accurately.\"\nhistory = []\n\n# Turn 1 (no history yet, so nothing to cache)\nr = chat_with_caching(history, \"Hello, I need to analyze a contract.\", system)\nprint(f\"Turn 1 | Cache read: {r['cache_read']} | Answer: {r['answer'][:60]}...\")\n\n# Turn 2 (breakpoint covers the history from turn 1)\nr = chat_with_caching(history, \"Which clauses deserve the most attention?\", system)\nprint(f\"Turn 2 | Cache read: {r['cache_read']} | Answer: {r['answer'][:60]}...\")\n\n# Turn 3 (breakpoint covers the history from turns 1 and 2)\nr = chat_with_caching(history, \"What are the legal risks?\", system)\nprint(f\"Turn 3 | Cache read: {r['cache_read']} | Answer: {r['answer'][:60]}...\")\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eAutomatic Caching (claude-3-7 and newer)\u003c\/h2\u003e\n\n\u003cp\u003eWith the latest models you can enable \u003cstrong\u003eautomatic caching\u003c\/strong\u003e with a single top-level cache_control field instead of marking individual content blocks by hand. Caching remains opt-in: a request with no cache_control at all is not cached.\u003c\/p\u003e\n\n\u003cpre\u003e\u003ccode\u003e# Automatic caching: the API decides where to place the breakpoint\nresponse = client.messages.create(\n    model=\"claude-opus-4-5\",\n    max_tokens=1000,\n    # One top-level cache_control, no per-block markers.\n    # Assumption: your API\/SDK version supports automatic caching; if it does not,\n    # put cache_control on the system block as in the explicit example above.\n    cache_control={\"type\": \"ephemeral\"},\n    system=\"Your very long system prompt...\",\n    messages=[{\"role\": \"user\", \"content\": \"A short question\"}],\n)\n\n# Check whether the cache was used\nusage = response.usage\ncache_read = getattr(usage, \"cache_read_input_tokens\", 0) or 0\nif cache_read \u0026gt; 0:\n    print(f\"Cache hit! Read {cache_read:,} tokens from cache\")\nelse:\n    print(\"Cache miss (or first request)\")\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eCaching optimization strategies\u003c\/h2\u003e\n\n\u003ch3\u003e1. 
Put the cache breakpoint in the right place\u003c\/h3\u003e\n\u003cp\u003ePlace the breakpoint at the end of the part that changes least, so that everything before it is identical across requests:\u003c\/p\u003e\n\n\u003cpre\u003e\u003ccode\u003e# GOOD: cache the static instructions + static context\nmessages = [\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"text\", \"text\": \"SYSTEM: \" + static_instructions},\n            {\"type\": \"text\", \"text\": \"CONTEXT: \" + large_document,\n             \"cache_control\": {\"type\": \"ephemeral\"}},  # Breakpoint here\n            {\"type\": \"text\", \"text\": \"QUESTION: \" + dynamic_user_question},\n        ]\n    }\n]\n\n# BAD: breakpoint placed after content that changes between requests\n# The cache misses every time because the prefix up to the breakpoint is different\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch3\u003e2. Batch requests within 5 minutes\u003c\/h3\u003e\n\n\u003cpre\u003e\u003ccode\u003eimport threading\nimport time\n\nclass CachingBatcher:\n    \"\"\"Group requests inside the 5-minute cache window to maximize cache hits.\"\"\"\n\n    def __init__(self, shared_context: str, window_seconds: int = 240):\n        self.context = shared_context\n        self.window = window_seconds  # 240s leaves a margin below the 5-minute lifetime\n        self._last_request = None\n        self._lock = threading.Lock()\n\n    def should_refresh_cache(self) -\u0026gt; bool:\n        \"\"\"True if no request was sent recently enough to keep the cache warm.\"\"\"\n        if self._last_request is None:\n            return True\n        return (time.time() - self._last_request) \u0026gt; self.window\n\n    def query(self, question: str) -\u0026gt; str:\n        # The lock only guards the timestamp; API calls run concurrently\n        with self._lock:\n            self._last_request = time.time()\n\n        response = client.messages.create(\n            model=\"claude-haiku-4-5\",\n            max_tokens=500,\n            messages=[{\n                \"role\": \"user\",\n                \"content\": [\n                    {\n                        \"type\": \"text\",\n                        \"text\": self.context,\n                        \"cache_control\": {\"type\": \"ephemeral\"},\n                    },\n                    {\"type\": \"text\", \"text\": 
question},\n                ]\n            }],\n        )\n        return response.content[0].text\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eCalculating the ROI of Prompt Caching\u003c\/h2\u003e\n\n\u003cp\u003eA realistic example: a chatbot with a 10,000-token system prompt, 1,000 users\/day, and 5 questions per user:\u003c\/p\u003e\n\n\u003cul\u003e\n  \u003cli\u003e\n\u003cstrong\u003eNo cache:\u003c\/strong\u003e 1,000 × 5 × 10,000 = 50M tokens\/day\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eWith cache\u003c\/strong\u003e (assuming one cache write per user, then four reads): 1,000 × 10,000 × 1.25 (write) + 4,000 × 10,000 × 0.10 (read) = 12.5M + 4M = 16.5M effective tokens\/day\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eSavings: ~67%\u003c\/strong\u003e of the input-token cost\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003cp\u003ePrompt Caching is one of the highest-ROI optimizations you can make for a Claude application. Combine it with \u003ca href=\"\/collections\/nang-cao\"\u003eBatch Processing\u003c\/a\u003e to reduce operating costs further.\u003c\/p\u003e\n","brand":"Minh Tuấn","offers":[{"title":"Default Title","offer_id":47721832251604,"sku":null,"price":0.0,"currency_code":"VND","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0821\/0264\/9044\/files\/prompt-caching-ti_t-ki_m-90-chi-phi-v_i-cache-thong-minh.jpg?v=1774521674","url":"https:\/\/claude.vn\/products\/prompt-caching-ti%e1%ba%bft-ki%e1%bb%87m-90-chi-phi-v%e1%bb%9bi-cache-thong-minh","provider":"CLAUDE.VN","version":"1.0","type":"link"}