{"product_id":"speculative-caching-giảm-time-to-first-token-với-cache-dự-doan","title":"Speculative Caching: Reducing time-to-first-token with predictive caching","description":"\n\u003cp\u003eIn conversational AI applications, \u003cstrong\u003etime-to-first-token (TTFT)\u003c\/strong\u003e is the metric that most affects user experience. After sending a message, users do not want to stare at a blank screen for 3-5 seconds. \u003cstrong\u003eSpeculative Caching\u003c\/strong\u003e is a technique that pre-warms the cache with predicted context, so that the cache is ready when the real request arrives.\u003c\/p\u003e\n\n\u003ch2\u003ePrompt Caching: the foundation to understand first\u003c\/h2\u003e\n\n\u003cp\u003eAnthropic Prompt Caching lets you cache the leading part of a prompt (system prompt + documents) and reuse it in subsequent requests:\u003c\/p\u003e\n\n\u003cpre\u003e\u003ccode\u003eimport anthropic\n\nclient = anthropic.Anthropic()\n\n# Without caching: every request pays full price for its input tokens\nresponse_no_cache = client.messages.create(\n    model=\"claude-haiku-4-5\",\n    max_tokens=1000,\n    system=\"You are a helpful assistant with extensive knowledge of Vietnamese law...\",  # 5000 tokens\n    messages=[{\"role\": \"user\", \"content\": \"Explain Article 32 of the Civil Code\"}]\n)\n\n# With caching: the system prompt is cached after the first request\nresponse_with_cache = client.messages.create(\n    model=\"claude-haiku-4-5\",\n    max_tokens=1000,\n    system=[{\n        \"type\": \"text\",\n        \"text\": \"You are a helpful assistant with extensive knowledge of Vietnamese law...\",\n        \"cache_control\": {\"type\": \"ephemeral\"}  # Mark for caching\n    }],\n    messages=[{\"role\": \"user\", \"content\": \"Explain Article 32 of the Civil Code\"}]\n)\n\n# From the second request on, the system prompt tokens are read from the cache\n# Cached tokens cost 90% less than regular input tokens\n# TTFT drops because the system prompt does not have to be reprocessed\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003cp\u003eBy default, a cache entry lives for 5 minutes, and that lifetime is refreshed on every hit. 
This is the foundation of Speculative Caching.\u003c\/p\u003e\n\n\u003ch2\u003eWhat is Speculative Caching?\u003c\/h2\u003e\n\n\u003cp\u003eStandard caching is reactive: the cache is warmed only when the user actually triggers an API call. Speculative Caching is \u003cstrong\u003eproactive\u003c\/strong\u003e: you predict what the user will need and warm the cache \u003cem\u003ebefore\u003c\/em\u003e they ask.\u003c\/p\u003e\n\n\u003cp\u003eFor example, a user is viewing the product page for \"iPhone 15 Pro\". They will most likely ask about specs, price, or comparisons. You can pre-warm the cache with the product context as soon as the page loads, before they type anything.\u003c\/p\u003e\n\n\u003ch2\u003ePattern 1: Page-Based Prefetching\u003c\/h2\u003e\n\n\u003cpre\u003e\u003ccode\u003eimport anthropic\nimport asyncio\nimport time\nfrom typing import Optional\n\nclient = anthropic.Anthropic()\nasync_client = anthropic.AsyncAnthropic()\n\nclass SpeculativeCacheManager:\n    def __init__(self):\n        self.cache_registry = {}  # track what's been pre-warmed\n\n    def warm_cache_for_context(self, context_id: str, context_content: str,\n                                system_role: str = \"helpful assistant\") -\u0026gt; dict:\n        \"\"\"\n        Pre-warm the cache for one specific context.\n        Call this as soon as the user starts interacting with a page\/resource.\n        \"\"\"\n        if context_id in self.cache_registry:\n            age_seconds = time.time() - self.cache_registry[context_id][\"warmed_at\"]\n            if age_seconds \u0026lt; 240:  # Cache still warm (5 min TTL - 1 min buffer)\n                return {\"status\": \"already_warm\", \"age_seconds\": int(age_seconds)}\n\n        # Send a small request to warm the cache\n        start = time.time()\n        response = client.messages.create(\n            model=\"claude-haiku-4-5\",\n            max_tokens=1,  # Minimal output: we only need to warm the cache\n            system=[\n                {\n                    \"type\": \"text\",\n                    \"text\": f\"You are a 
{system_role}.\",\n                    \"cache_control\": {\"type\": \"ephemeral\"}\n                },\n                {\n                    \"type\": \"text\",\n                    \"text\": context_content,\n                    \"cache_control\": {\"type\": \"ephemeral\"}\n                }\n            ],\n            messages=[{\"role\": \"user\", \"content\": \"Ready.\"}]\n        )\n\n        warm_time = int((time.time() - start) * 1000)\n        cache_tokens = getattr(response.usage, 'cache_creation_input_tokens', 0) or 0\n\n        self.cache_registry[context_id] = {\n            \"warmed_at\": time.time(),\n            \"cache_tokens\": cache_tokens,\n            \"warm_time_ms\": warm_time\n        }\n\n        print(f\"Cache warmed: {context_id} | {cache_tokens} tokens | {warm_time}ms\")\n        return {\"status\": \"warmed\", \"cache_tokens\": cache_tokens, \"warm_time_ms\": warm_time}\n\n    def query_with_cache(self, context_id: str, context_content: str,\n                          system_role: str, user_message: str) -\u0026gt; dict:\n        \"\"\"Query Claude using pre-warmed cache\"\"\"\n        start = time.time()\n\n        response = client.messages.create(\n            model=\"claude-haiku-4-5\",\n            max_tokens=2000,\n            system=[\n                {\n                    \"type\": \"text\",\n                    \"text\": f\"You are a {system_role}.\",\n                    \"cache_control\": {\"type\": \"ephemeral\"}\n                },\n                {\n                    \"type\": \"text\",\n                    \"text\": context_content,\n                    \"cache_control\": {\"type\": \"ephemeral\"}\n                }\n            ],\n            messages=[{\"role\": \"user\", \"content\": user_message}]\n        )\n\n        latency_ms = int((time.time() - start) * 1000)\n        cache_read = getattr(response.usage, 'cache_read_input_tokens', 0) or 0\n        cache_written = getattr(response.usage, 
'cache_creation_input_tokens', 0) or 0\n        cache_hit = cache_read \u0026gt; 0\n\n        return {\n            \"response\": response.content[0].text,\n            \"latency_ms\": latency_ms,\n            \"cache_hit\": cache_hit,\n            \"cache_read_tokens\": cache_read,\n            \"cache_write_tokens\": cache_written,\n            \"input_tokens\": response.usage.input_tokens,\n            \"output_tokens\": response.usage.output_tokens\n        }\n\ncache_manager = SpeculativeCacheManager()\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003ePattern 2: User Journey Prediction\u003c\/h2\u003e\n\n\u003cpre\u003e\u003ccode\u003eclass JourneyBasedSpeculator:\n    \"\"\"\n    Predict the next context from user journey patterns.\n    Pre-warm the cache for the predicted next step.\n    \"\"\"\n\n    # Journey map: current_page -\u0026gt; likely_next_pages\n    JOURNEY_GRAPH = {\n        \"product_listing\": [\"product_detail\", \"category_filter\"],\n        \"product_detail\": [\"checkout\", \"comparison\", \"reviews\"],\n        \"checkout_cart\": [\"checkout_shipping\", \"product_detail\"],\n        \"checkout_shipping\": [\"checkout_payment\"],\n        \"checkout_payment\": [\"order_confirmation\"],\n        \"user_profile\": [\"order_history\", \"settings\", \"wishlist\"],\n        \"order_history\": [\"order_detail\", \"return_request\"],\n    }\n\n    def __init__(self, content_loader: callable):\n        self.content_loader = content_loader  # Function to load page content\n        self.cache_manager = SpeculativeCacheManager()\n        self.prefetch_tasks = {}\n\n    async def prefetch_predicted_contexts(self, current_page: str, current_entity_id: str):\n        \"\"\"Async prefetch predicted next pages\"\"\"\n        predicted_pages = self.JOURNEY_GRAPH.get(current_page, [])\n\n        async def warm_single(page: str):\n            content = await self.async_load_content(page, current_entity_id)\n            if content:\n                
return await self.async_warm_cache(f\"{page}_{current_entity_id}\", content)\n\n        tasks = [warm_single(page) for page in predicted_pages]\n        results = await asyncio.gather(*tasks, return_exceptions=True)\n\n        # Count only successful warms: skipped pages return None, failures return exceptions\n        warmed = sum(1 for r in results if isinstance(r, str))\n        print(f\"Pre-warmed {warmed} contexts for '{current_page}' journey\")\n        return results\n\n    async def async_warm_cache(self, context_id: str, content: str,\n                               system_role: str = \"helpful assistant\"):\n        \"\"\"Async version of cache warming\"\"\"\n        # The system blocks must match the later query (see query_with_cache) exactly,\n        # including system_role; otherwise the warmed prefix is never hit.\n        await async_client.messages.create(\n            model=\"claude-haiku-4-5\",\n            max_tokens=1,\n            system=[\n                {\n                    \"type\": \"text\",\n                    \"text\": f\"You are a {system_role}.\",\n                    \"cache_control\": {\"type\": \"ephemeral\"}\n                },\n                {\n                    \"type\": \"text\",\n                    \"text\": content,\n                    \"cache_control\": {\"type\": \"ephemeral\"}\n                }\n            ],\n            messages=[{\"role\": \"user\", \"content\": \"Ready.\"}]\n        )\n        return context_id\n\n    async def async_load_content(self, page: str, entity_id: str) -\u0026gt; Optional[str]:\n        \"\"\"Load content for a page; replace with a real data source\"\"\"\n        # Mock implementation\n        content_map = {\n            \"product_detail\": f\"Product ID: {entity_id}\\nName: Sample Product\\nPrice: $99\\nSpecs: ...\",\n            \"checkout\": f\"Cart for user: {entity_id}\\nItems: 3\\nTotal: $297\",\n            \"reviews\": f\"Reviews for product {entity_id}\\n5 stars: 80%\\n4 stars: 15%\\nLatest: ...\"\n        }\n        return content_map.get(page)\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003ePattern 3: Time-Based Refresh\u003c\/h2\u003e\n\n\u003cpre\u003e\u003ccode\u003eclass AutoRefreshCacheManager:\n    \"\"\"\n    Automatically refresh the cache for frequently accessed contexts.\n    Keeps the cache warm for popular content.\n    \"\"\"\n\n    def __init__(self, refresh_interval: int = 240):  # 4 minutes (before 5-min TTL)\n        self.contexts = {}  # context_id -\u0026gt; {content, system_role, last_refresh}\n        self.refresh_interval = 
refresh_interval\n        self.running = False\n\n    def register_hot_context(self, context_id: str, content: str, system_role: str = \"assistant\"):\n        \"\"\"Register a context for auto-refresh\"\"\"\n        self.contexts[context_id] = {\n            \"content\": content,\n            \"system_role\": system_role,\n            \"last_refresh\": 0,\n            \"hit_count\": 0\n        }\n        # Warm immediately\n        self._warm_cache(context_id)\n\n    def _warm_cache(self, context_id: str):\n        ctx = self.contexts[context_id]\n        # Send the role block plus the content, matching the system layout used by\n        # query_with_cache; otherwise the refreshed prefix is never hit by real queries.\n        client.messages.create(\n            model=\"claude-haiku-4-5\",\n            max_tokens=1,\n            system=[\n                {\n                    \"type\": \"text\",\n                    \"text\": f\"You are a {ctx['system_role']}.\",\n                    \"cache_control\": {\"type\": \"ephemeral\"}\n                },\n                {\n                    \"type\": \"text\",\n                    \"text\": ctx[\"content\"],\n                    \"cache_control\": {\"type\": \"ephemeral\"}\n                }\n            ],\n            messages=[{\"role\": \"user\", \"content\": \"Ready.\"}]\n        )\n        self.contexts[context_id][\"last_refresh\"] = time.time()\n\n    async def run_refresh_loop(self):\n        \"\"\"Background loop that refreshes caches before they expire\"\"\"\n        self.running = True\n        while self.running:\n            now = time.time()\n            # Iterate over a snapshot so contexts can be registered while the loop runs\n            for context_id, ctx in list(self.contexts.items()):\n                age = now - ctx[\"last_refresh\"]\n                if age \u0026gt;= self.refresh_interval:\n                    print(f\"Auto-refreshing cache: {context_id}\")\n                    # Run the blocking SDK call in a worker thread to keep the event loop free\n                    await asyncio.to_thread(self._warm_cache, context_id)\n\n            await asyncio.sleep(60)  # Check every minute\n\n    def stop(self):\n        self.running = False\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eMeasuring the impact of Speculative Caching\u003c\/h2\u003e\n\n\u003cpre\u003e\u003ccode\u003edef benchmark_speculative_caching(context_content: str, test_queries: list[str]) -\u0026gt; dict:\n    \"\"\"Compare latency: cold cache vs warm cache\"\"\"\n\n    manager = SpeculativeCacheManager()\n\n    print(\"Benchmarking Speculative Caching...\")\n    print(\"=\" * 
50)\n\n    # Test 1: Cold cache (no pre-warming)\n    # Each cold query gets a unique content prefix; with identical content the first\n    # query would write the cache and the remaining \"cold\" queries would hit it.\n    cold_latencies = []\n    for query in test_queries[:3]:\n        cold_content = f\"Benchmark run {time.time_ns()}\\n{context_content}\"\n        result = manager.query_with_cache(\n            f\"cold_{time.time()}\", cold_content, \"assistant\", query\n        )\n        cold_latencies.append(result[\"latency_ms\"])\n        print(f\"Cold: {result['latency_ms']}ms | Cache hit: {result['cache_hit']}\")\n\n    # Test 2: Pre-warm then query\n    warm_latencies = []\n    ctx_id = \"benchmark_warm\"\n\n    # Pre-warm with the same system_role the queries use, so the cached prefix matches\n    warm_result = manager.warm_cache_for_context(ctx_id, context_content, \"assistant\")\n    print(f\"\\nPre-warm: {warm_result['warm_time_ms']}ms | {warm_result['cache_tokens']} tokens cached\")\n\n    # Now query with warm cache\n    for query in test_queries[:3]:\n        result = manager.query_with_cache(ctx_id, context_content, \"assistant\", query)\n        warm_latencies.append(result[\"latency_ms\"])\n        print(f\"Warm: {result['latency_ms']}ms | Cache hit: {result['cache_hit']}\")\n\n    avg_cold = sum(cold_latencies) \/ len(cold_latencies) if cold_latencies else 0\n    avg_warm = sum(warm_latencies) \/ len(warm_latencies) if warm_latencies else 0\n    improvement = ((avg_cold - avg_warm) \/ avg_cold * 100) if avg_cold \u0026gt; 0 else 0\n\n    # latency_ms is the end-to-end time of a non-streaming call, an upper bound on TTFT\n    print(\"\\nResults:\")\n    print(f\"  Avg cold latency: {avg_cold:.0f}ms\")\n    print(f\"  Avg warm latency: {avg_warm:.0f}ms\")\n    print(f\"  Latency improvement: {improvement:.1f}%\")\n\n    return {\n        \"avg_cold_ms\": avg_cold,\n        \"avg_warm_ms\": avg_warm,\n        \"ttft_improvement_pct\": improvement\n    }\n\n# Benchmark\n# The context must exceed the model's minimum cacheable prompt length, or nothing is cached\ntest_context = \"Vietnamese Legal Code Content:\\n\" + \"Article text... \" * 2000  # Long context\ntest_queries = [\n    \"What does Article 32 say about personal rights?\",\n    \"What is the penalty for violating Article 15?\",\n    \"Explain the concept of 'legal entity' under the code\"\n]\n\nbenchmark_results = benchmark_speculative_caching(test_context, test_queries)\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eCost and ROI of Speculative Caching\u003c\/h2\u003e\n\n\u003ctable\u003e\n  \u003cthead\u003e\n    \u003ctr\u003e\n\u003cth\u003eMetric\u003c\/th\u003e\n\u003cth\u003eWithout Speculative Cache\u003c\/th\u003e\n\u003cth\u003eWith Speculative Cache\u003c\/th\u003e\n\u003c\/tr\u003e\n  \u003c\/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n\u003ctd\u003eTTFT (first request)\u003c\/td\u003e\n\u003ctd\u003e2,000-5,000ms\u003c\/td\u003e\n\u003ctd\u003e300-800ms\u003c\/td\u003e\n\u003c\/tr\u003e\n    \u003ctr\u003e\n\u003ctd\u003eTTFT (subsequent requests)\u003c\/td\u003e\n\u003ctd\u003e2,000-5,000ms\u003c\/td\u003e\n\u003ctd\u003e200-400ms\u003c\/td\u003e\n\u003c\/tr\u003e\n    \u003ctr\u003e\n\u003ctd\u003eInput token cost\u003c\/td\u003e\n\u003ctd\u003e100% (full price)\u003c\/td\u003e\n\u003ctd\u003e10% (cache read)\u003c\/td\u003e\n\u003c\/tr\u003e\n    \u003ctr\u003e\n\u003ctd\u003ePre-warm cost\u003c\/td\u003e\n\u003ctd\u003eN\/A\u003c\/td\u003e\n\u003ctd\u003eOne cache write = 125% of the input price\u003c\/td\u003e\n\u003c\/tr\u003e\n    \u003ctr\u003e\n\u003ctd\u003eBreak-even\u003c\/td\u003e\n\u003ctd\u003eN\/A\u003c\/td\u003e\n\u003ctd\u003eAfter about 1.4 cache hits (1.25 + 0.1n = n gives n ≈ 1.39)\u003c\/td\u003e\n\u003c\/tr\u003e\n  \u003c\/tbody\u003e\n\u003c\/table\u003e\n\n\u003cp\u003eFor any context that is hit 2 or more times before the cache expires, Speculative Caching saves both cost and latency. A pre-warmed context that is never queried still pays for the cache write, so prediction accuracy matters.\u003c\/p\u003e\n\n\u003ch2\u003eWhen to use Speculative Caching\u003c\/h2\u003e\n\n\u003cul\u003e\n  \u003cli\u003e\n\u003cstrong\u003eDocumentation chatbots\u003c\/strong\u003e: pre-warm when the user opens a doc page\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eCustomer support\u003c\/strong\u003e: pre-warm 
product\/order context when the user logs in\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eE-commerce\u003c\/strong\u003e: pre-warm the product catalog when the user browses a category\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eLegal\/Medical AI\u003c\/strong\u003e: pre-warm the case\/patient file when the practitioner opens it\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eCode assistants\u003c\/strong\u003e: pre-warm the codebase context when the developer opens a project\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch2\u003eSummary\u003c\/h2\u003e\n\n\u003cp\u003eSpeculative Caching is a high-impact, low-complexity technique. By predicting the user journey and pre-warming the cache, you can cut TTFT from seconds to hundreds of milliseconds, a substantial UX improvement that requires no change of model or infrastructure.\u003c\/p\u003e\n\n\u003cp\u003eKey implementation steps:\u003c\/p\u003e\n\u003cul\u003e\n  \u003cli\u003eAdd \u003ccode\u003ecache_control: {type: \"ephemeral\"}\u003c\/code\u003e to the system prompt and to long documents\u003c\/li\u003e\n  \u003cli\u003eSend the cache-warming request as soon as the user starts a session or navigates to a page\u003c\/li\u003e\n  \u003cli\u003eRefresh the cache every 4 minutes for hot contexts to avoid expiry\u003c\/li\u003e\n  \u003cli\u003eMonitor the cache hit rate to evaluate prediction accuracy\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003cp\u003eFurther reading: \u003ca href=\"\/collections\/nang-cao\"\u003eUsage \u0026amp; Cost API\u003c\/a\u003e for tracking cache hit rates and tuning your caching strategy.\u003c\/p\u003e\n\n\u003chr\u003e\n\u003ch3\u003eRelated articles\u003c\/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"\/products\/contextual-retrieval-nang-c%E1%BA%A5p-rag-v%E1%BB%9Bi-embeddings-ng%E1%BB%AF-c%E1%BA%A3nh\"\u003eContextual Retrieval — Nâng cấp RAG với embeddings ngữ cảnh\u003c\/a\u003e\u003c\/li\u003e\n\u003cli\u003e\u003ca href=\"\/products\/best-practices-cho-vision-t%E1%BB%91i-%C6%B0u-hinh-%E1%BA%A3nh-g%E1%BB%ADi-claude\"\u003eBest Practices cho Vision — Tối ưu hình ảnh gửi 
Claude\u003c\/a\u003e\u003c\/li\u003e\n\u003cli\u003e\u003ca href=\"\/products\/agent-workflows-chaining-routing-parallelization\"\u003eAgent Workflows — Chaining, Routing, Parallelization\u003c\/a\u003e\u003c\/li\u003e\n\u003cli\u003e\u003ca href=\"\/products\/context-engineering-ngh%E1%BB%87-thu%E1%BA%ADt-qu%E1%BA%A3n-ly-context-cho-claude\"\u003eContext Engineering — Nghệ thuật quản lý context cho Claude\u003c\/a\u003e\u003c\/li\u003e\n\u003cli\u003e\u003ca href=\"\/products\/claude-cho-engineering-standup-va-bao-cao-ti%E1%BA%BFn-d%E1%BB%99\"\u003eClaude cho Engineering: Standup và báo cáo tiến độ\u003c\/a\u003e\u003c\/li\u003e\n\u003c\/ul\u003e","brand":"Minh Tuấn","offers":[{"title":"Default Title","offer_id":47721900179668,"sku":null,"price":0.0,"currency_code":"VND","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0821\/0264\/9044\/files\/speculative-caching-gi_m-time-to-first-token-v_i-cache-d_-doan.jpg?v=1774521783","url":"https:\/\/claude.vn\/products\/speculative-caching-gi%e1%ba%a3m-time-to-first-token-v%e1%bb%9bi-cache-d%e1%bb%b1-doan","provider":"CLAUDE.VN","version":"1.0","type":"link"}