{"product_id":"sre-agent-tự-dộng-incident-response-với-claude-sdk","title":"SRE Agent: Automated Incident Response with the Claude SDK","description":"\n\u003cp\u003eIt is 3 a.m., the pager goes off, and the API is returning 500s. Half awake, you stare at dashboards, trying to correlate metrics and logs across dozens of services while customer impact grows by the minute. This article builds an \u003cstrong\u003eSRE incident response agent\u003c\/strong\u003e that handles that workflow automatically: it investigates incidents, identifies root causes, applies remediations, and documents the results.\u003c\/p\u003e\n\n\u003cp\u003eThis article is based on the \u003cstrong\u003eofficial Claude Cookbooks\u003c\/strong\u003e from Anthropic.\u003c\/p\u003e\n\n\u003ch2\u003eWhat you will learn\u003c\/h2\u003e\n\n\u003cul\u003e\n  \u003cli\u003eHow to give the agent \u003cstrong\u003esafe write access\u003c\/strong\u003e to infrastructure by scoping MCP tools with restricted directories, command allowlists, and validation hooks\u003c\/li\u003e\n  \u003cli\u003eWhy \u003cstrong\u003eclear tool descriptions\u003c\/strong\u003e drive agent behavior more effectively than elaborate prompts\u003c\/li\u003e\n  \u003cli\u003eHow the agent \u003cstrong\u003esynthesizes across production signals\u003c\/strong\u003e (metrics, logs, alerts, config) to build a diagnosis that no single data source reveals\u003c\/li\u003e\n  \u003cli\u003eHow to structure \u003cstrong\u003ehuman-in-the-loop workflows\u003c\/strong\u003e that separate investigation from remediation\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch2\u003eArchitecture: MCP Pattern\u003c\/h2\u003e\n\n\u003cpre\u003e\u003ccode\u003eClaude Agent SDK  \u0026lt;-- query() loop streams responses\n    |\n    v\nMCP Server (subprocess via stdio\/JSON-RPC)\n    |\n    +-- Prometheus (metrics \u0026amp; health checks)\n    +-- Docker (container logs \u0026amp; commands)\n    +-- Config Management (read\/edit env files)\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003cp\u003e\u003cstrong\u003eWhy a subprocess?\u003c\/strong\u003e Isolation: the agent 
is unaffected if a tool handler crashes, and the reasoning loop stays cleanly separated from the infrastructure access layer.\u003c\/p\u003e\n\n\u003ch2\u003eMCP Server: 12 Tools in 4 Categories\u003c\/h2\u003e\n\n\u003ctable\u003e\n  \u003cthead\u003e\n    \u003ctr\u003e\n\u003cth\u003eCategory\u003c\/th\u003e\n\u003cth\u003eTools\u003c\/th\u003e\n\u003cth\u003ePurpose\u003c\/th\u003e\n\u003c\/tr\u003e\n  \u003c\/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n\u003ctd\u003e\u003cstrong\u003ePrometheus\u003c\/strong\u003e\u003c\/td\u003e\n\u003ctd\u003equery_metrics, list_metrics, get_service_health\u003c\/td\u003e\n\u003ctd\u003eQuery metrics, discover data, health summaries\u003c\/td\u003e\n\u003c\/tr\u003e\n    \u003ctr\u003e\n\u003ctd\u003e\u003cstrong\u003eInfrastructure\u003c\/strong\u003e\u003c\/td\u003e\n\u003ctd\u003eread_config_file, edit_config_file, run_shell_command, get_container_logs\u003c\/td\u003e\n\u003ctd\u003eRead\/write configs, Docker commands, inspect logs\u003c\/td\u003e\n\u003c\/tr\u003e\n    \u003ctr\u003e\n\u003ctd\u003e\u003cstrong\u003eDiagnostics\u003c\/strong\u003e\u003c\/td\u003e\n\u003ctd\u003eget_logs, get_alerts, get_recent_deployments, execute_runbook\u003c\/td\u003e\n\u003ctd\u003eApplication logs, alert history, deployment tracking\u003c\/td\u003e\n\u003c\/tr\u003e\n    \u003ctr\u003e\n\u003ctd\u003e\u003cstrong\u003eDocumentation\u003c\/strong\u003e\u003c\/td\u003e\n\u003ctd\u003ewrite_postmortem\u003c\/td\u003e\n\u003ctd\u003eWrite incident post-mortems\u003c\/td\u003e\n\u003c\/tr\u003e\n  \u003c\/tbody\u003e\n\u003c\/table\u003e\n\n\u003cp\u003eEach tool has a \u003cstrong\u003eJSON Schema definition with a rich description\u003c\/strong\u003e, which is what the agent reads to decide when and how to use the tool. 
Good tool descriptions are \u003cstrong\u003ethe most important factor\u003c\/strong\u003e in agent effectiveness.\u003c\/p\u003e\n\n\u003ch2\u003eSafety: Scoped Write Access\u003c\/h2\u003e\n\n\u003cp\u003eGive the agent write access, but behind tight guardrails:\u003c\/p\u003e\n\n\u003ch3\u003e1. Restricted directories\u003c\/h3\u003e\n\u003cpre\u003e\u003ccode\u003eimport os\n\nCONFIG_DIR = os.path.realpath(\"config\")\n\n# edit_config_file may ONLY write inside config\/\ndef handle_edit_config(args):\n    # Resolve \"..\" and symlinks first so \"config\/..\/secrets\" cannot escape\n    filepath = os.path.realpath(args[\"filepath\"])\n    if os.path.commonpath([CONFIG_DIR, filepath]) != CONFIG_DIR:\n        return {\"error\": \"Write restricted to config\/ directory\"}\n    # ... proceed with edit\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch3\u003e2. Command allowlists\u003c\/h3\u003e\n\u003cpre\u003e\u003ccode\u003eimport shlex\nimport subprocess\n\n# run_shell_command ONLY allows docker commands\ndef handle_shell_command(args):\n    # Tokenize and skip the shell so \"docker ps; rm -rf \/\" cannot chain commands\n    argv = shlex.split(args[\"command\"])\n    if not argv or argv[0] not in (\"docker\", \"docker-compose\"):\n        return {\"error\": \"Only docker commands allowed\"}\n    proc = subprocess.run(argv, capture_output=True, text=True)\n    return {\"stdout\": proc.stdout, \"stderr\": proc.stderr, \"returncode\": proc.returncode}\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch3\u003e3. Container name validation\u003c\/h3\u003e\n\u003cpre\u003e\u003ccode\u003e# get_container_logs validates the container name against an allowlist\nALLOWED_CONTAINERS = [\"api-server\", \"postgres\", \"prometheus\"]\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eInvestigation → Remediation Workflow\u003c\/h2\u003e\n\n\u003cp\u003eSplit the work into two clearly separated phases, with a human in the loop between investigating and fixing:\u003c\/p\u003e\n\n\u003ch3\u003ePhase 1: Investigation (automatic)\u003c\/h3\u003e\n\u003cpre\u003e\u003ccode\u003efrom claude_agent_sdk import ClaudeAgentOptions, query\n\noptions = ClaudeAgentOptions(\n    mcp_servers={\"sre\": sre_mcp_config},\n    allowed_tools=[\n        \"mcp__sre__query_metrics\",\n        \"mcp__sre__get_service_health\",\n        \"mcp__sre__get_container_logs\",\n        \"mcp__sre__get_alerts\",\n        \"mcp__sre__read_config_file\",\n    ],  # Read-only tools\n)\n# query() streams messages, so iterate instead of awaiting a single result\nasync for message in query(\n    prompt=\"The API server is returning 500 errors. 
Investigate the root cause.\",\n    options=options,\n):\n    print(message)\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003cp\u003eThe agent synthesizes on its own: Prometheus metrics + Docker logs + alerts + config → \u003cstrong\u003ea complete diagnosis\u003c\/strong\u003e.\u003c\/p\u003e\n\n\u003ch3\u003eHuman Review\u003c\/h3\u003e\n\u003cp\u003eThe agent presents the root cause, the evidence, and a proposed remediation. An engineer reviews and approves.\u003c\/p\u003e\n\n\u003ch3\u003ePhase 2: Remediation (after approval)\u003c\/h3\u003e\n\u003cpre\u003e\u003ccode\u003eoptions = ClaudeAgentOptions(\n    mcp_servers={\"sre\": sre_mcp_config},\n    allowed_tools=[\n        \"mcp__sre__edit_config_file\",\n        \"mcp__sre__run_shell_command\",\n    ],  # Write tools enabled\n)\nasync for message in query(prompt=\"Approved: fix DB pool size from 1 to 10\", options=options):\n    print(message)\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eReal-World Example: DB Pool Size Incident\u003c\/h2\u003e\n\n\u003cp\u003eScenario: the API server error rate spikes because DB_POOL_SIZE was set to 1 (instead of 10).\u003c\/p\u003e\n\n\u003col\u003e\n  \u003cli\u003eAgent \u003cstrong\u003equeries Prometheus\u003c\/strong\u003e: error rate 45%, latency p99 = 5.2s\u003c\/li\u003e\n  \u003cli\u003eAgent \u003cstrong\u003echecks container logs\u003c\/strong\u003e: \"connection pool exhausted\" errors\u003c\/li\u003e\n  \u003cli\u003eAgent \u003cstrong\u003ereads config file\u003c\/strong\u003e: DB_POOL_SIZE=1 (too low)\u003c\/li\u003e\n  \u003cli\u003eAgent \u003cstrong\u003ecorrelates\u003c\/strong\u003e: recent deployment changed pool size\u003c\/li\u003e\n  \u003cli\u003eAgent \u003cstrong\u003eproposes fix\u003c\/strong\u003e: set DB_POOL_SIZE=10, restart API server\u003c\/li\u003e\n  \u003cli\u003eAfter approval: agent \u003cstrong\u003eedits config\u003c\/strong\u003e and \u003cstrong\u003erestarts 
container\u003c\/strong\u003e\n\u003c\/li\u003e\n  \u003cli\u003eAgent \u003cstrong\u003everifies\u003c\/strong\u003e: error rate drops to 0%, latency normalizes\u003c\/li\u003e\n  \u003cli\u003eAgent \u003cstrong\u003ewrites postmortem\u003c\/strong\u003e document\u003c\/li\u003e\n\u003c\/ol\u003e\n\n\u003ch2\u003eProduction Extensions\u003c\/h2\u003e\n\n\u003cp\u003eThe MCP server supports additional production tools when API keys are provided:\u003c\/p\u003e\n\n\u003cul\u003e\n  \u003cli\u003e\n\u003cstrong\u003ePagerDuty\u003c\/strong\u003e: Incident management, auto-acknowledge, escalation\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eConfluence\u003c\/strong\u003e: Automated post-mortem documentation\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eSlack\u003c\/strong\u003e: Notify the team about incidents and resolutions\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eDatadog\/Grafana\u003c\/strong\u003e: Extended metrics and dashboards\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch2\u003eBest Practices for SRE Agents\u003c\/h2\u003e\n\n\u003col\u003e\n  \u003cli\u003e\n\u003cstrong\u003eTool descriptions \u0026gt; prompts\u003c\/strong\u003e: Invest in writing clear descriptions for each tool. 
The agent relies on these descriptions to decide which tool to call.\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eScope write access\u003c\/strong\u003e: Restricted directories, command allowlists, container allowlists.\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eHuman-in-the-loop\u003c\/strong\u003e: Separate investigation (automatic) from remediation (after approval).\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eValidation hooks\u003c\/strong\u003e: PreToolUse hooks check destructive commands before the agent executes them.\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003ePostmortem documentation\u003c\/strong\u003e: The agent documents everything itself: timeline, root cause, fix, lessons learned.\u003c\/li\u003e\n\u003c\/ol\u003e\n\n\u003cp\u003eNext steps: read \u003ca href=\"\/collections\/nang-cao\"\u003eMigration from the OpenAI SDK\u003c\/a\u003e if you are migrating, or go back to \u003ca href=\"\/collections\/nang-cao\"\u003eResearch Agent\u003c\/a\u003e to start from the basics.\u003c\/p\u003e\n","brand":"Minh Tuấn","offers":[{"title":"Default Title","offer_id":47721724281044,"sku":null,"price":0.0,"currency_code":"VND","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0821\/0264\/9044\/files\/sre-agent-t_-d_ng-incident-response-v_i-claude-sdk.jpg?v=1774505859","url":"https:\/\/claude.vn\/products\/sre-agent-t%e1%bb%b1-d%e1%bb%99ng-incident-response-v%e1%bb%9bi-claude-sdk","provider":"CLAUDE.VN","version":"1.0","type":"link"}