SRE Agent: Automated Incident Response with the Claude SDK

Minh Tuấn, CTO, Transform Group

26/03/2026 · 3 min read

It's 3 a.m., the pager goes off, and the API is returning 500s. Half awake, you stare at dashboards and try to correlate metrics and logs across dozens of services while customer impact grows by the minute. This article builds an SRE incident response agent that handles that workflow automatically: it investigates incidents, identifies root causes, applies remediations, and documents the results.

This article is based on Anthropic's official Claude Cookbooks.

What you will learn

How to give the agent safe write access to infrastructure by scoping MCP tools with restricted directories, command allowlists, and validation hooks
Why clear tool descriptions drive agent behavior more effectively than elaborate prompts
How the agent synthesizes across production signals (metrics, logs, alerts, config) to build a diagnosis that no single data source could reveal
How to structure human-in-the-loop workflows that separate investigation from remediation

Architecture: the MCP Pattern

Claude Agent SDK  <-- query() loop streams responses
    |
    v
MCP Server (subprocess via stdio/JSON-RPC)
    |
    +-- Prometheus (metrics & health checks)
    +-- Docker (container logs & commands)
    +-- Config Management (read/edit env files)

Why a subprocess? Isolation: the agent is not affected if a tool handler crashes. It also gives a clean separation between the reasoning loop and the infrastructure access layer.
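To make the stdio/JSON-RPC link concrete, here is a minimal sketch of the kind of message that travels from the SDK to the server subprocess when the agent calls a tool. It assumes the standard MCP `tools/call` request shape over JSON-RPC 2.0; the helper name `make_tool_call` and the empty argument object are illustrative, not taken from the cookbook.

```python
import json


def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize one MCP tools/call request as a single line of JSON-RPC 2.0."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # The stdio transport carries one JSON message per line,
    # so the serialized message must not contain raw newlines.
    return json.dumps(request) + "\n"


# Hypothetical call: the real tool may require arguments.
print(make_tool_call(1, "get_service_health", {}), end="")
```

In practice the SDK builds and sends these messages for you; the sketch only shows why a crash in a tool handler stays on the far side of the pipe.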

MCP Server: 12 Tools in 4 Categories

Category	Tools	Purpose
Prometheus	query_metrics, list_metrics, get_service_health	Query metrics, discover data, health summaries
Infrastructure	read_config_file, edit_config_file, run_shell_command, get_container_logs	Read/write configs, Docker commands, inspect logs
Diagnostics	get_logs, get_alerts, get_recent_deployments, execute_runbook	Application logs, alert history, deployment tracking
Documentation	write_postmortem	Write incident post-mortems

Each tool has a JSON Schema definition with a rich description; this is what the agent reads to decide when and how to use the tool. Good tool descriptions are the most important factor in agent effectiveness.
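As an illustration of what a "rich description" might look like, the sketch below defines `query_metrics` in the shape MCP uses for tool listings (`name`, `description`, `inputSchema`). The description wording and the `query` parameter are assumptions made for this example, not copied from the cookbook.

```python
# Hypothetical definition for query_metrics. The wording and the "query"
# parameter are illustrative assumptions, not the cookbook's actual text.
QUERY_METRICS_TOOL = {
    "name": "query_metrics",
    "description": (
        "Run a PromQL query against Prometheus and return the matching series. "
        "Use this to check error rates, latency, or resource usage for a service. "
        "Call list_metrics first if you do not know which metric names exist."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "PromQL expression, e.g. rate(http_requests_total[5m])",
            }
        },
        "required": ["query"],
    },
}

print(QUERY_METRICS_TOOL["description"])
```

Note that the description says what the tool does, when to reach for it, and which tool to call first. That is the information the agent would otherwise need from a long system prompt.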

Safety: Scoped Write Access

Give the agent write access, but behind tight guardrails:

1. Restricted directories

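from pathlib import Path

CONFIG_DIR = Path("config").resolve()

# edit_config_file may ONLY write inside config/
def handle_edit_config(args):
    # Resolve first, so a path like "config/../.env" cannot escape the directory
    filepath = Path(args["filepath"]).resolve()
    if not filepath.is_relative_to(CONFIG_DIR):  # Python 3.9+
        return {"error": "Write restricted to config/ directory"}
    # ... proceed with edit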

2. Command allowlists

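import shlex
import subprocess

# run_shell_command may ONLY run docker commands
def handle_shell_command(args):
    # Parse into argv and skip the shell, so "docker ps; rm -rf /" cannot chain commands
    argv = shlex.split(args["command"])
    if not argv or argv[0] not in ("docker", "docker-compose"):
        return {"error": "Only docker commands allowed"}
    proc = subprocess.run(argv, capture_output=True, text=True)
    return {"stdout": proc.stdout, "stderr": proc.stderr}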

3. Container name validation

# get_container_logs validates the container name against an allowlist
ALLOWED_CONTAINERS = ["api-server", "postgres", "prometheus"]
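The allowlist only helps if the handler checks it before building the command. A minimal sketch of that check follows; the argument names `container` and `lines` and the helper `build_logs_command` are assumptions for illustration, and the function returns the argv rather than running Docker so it can be tested without a daemon.

```python
ALLOWED_CONTAINERS = ["api-server", "postgres", "prometheus"]


def build_logs_command(args: dict) -> dict:
    """Return the docker argv for an allowed container, or an error payload."""
    name = args.get("container", "")
    if name not in ALLOWED_CONTAINERS:
        return {"error": f"Unknown container {name!r}; allowed: {ALLOWED_CONTAINERS}"}
    tail = int(args.get("lines", 100))
    # A list argv (no shell) keeps the name from being interpreted as shell syntax
    return {"argv": ["docker", "logs", "--tail", str(tail), name]}


print(build_logs_command({"container": "api-server", "lines": 50}))
print(build_logs_command({"container": "redis; rm -rf /"}))
```

Returning the list of allowed names in the error message gives the agent something to recover with on its next call.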

Investigation → Remediation Workflow

Split the work into two distinct phases, with a human in the loop between investigation and fix:

Phase 1: Investigation (automatic)

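from claude_agent_sdk import ClaudeAgentOptions, query

# sre_mcp_config: the stdio MCP server config, defined elsewhere
options = ClaudeAgentOptions(
    mcp_servers={"sre": sre_mcp_config},
    allowed_tools=[
        "mcp__sre__query_metrics",
        "mcp__sre__get_service_health",
        "mcp__sre__get_container_logs",
        "mcp__sre__get_alerts",
        "mcp__sre__read_config_file",
    ],  # Read-only tools
)

async def investigate():
    # query() streams messages as the agent works, so iterate instead of awaiting
    async for message in query(
        prompt="The API server is returning 500 errors. Investigate the root cause.",
        options=options,
    ):
        print(message)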

The agent synthesizes the signals on its own: Prometheus metrics + Docker logs + alerts + config → a complete diagnosis.

Human Review

The agent presents the root cause, the evidence, and a proposed remediation. An engineer reviews and approves.
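One way to enforce this gate in code is to make the write tools unreachable until an engineer has explicitly approved. The sketch below is an assumption about how that could be wired, not the cookbook's implementation; `Proposal` and `plan_remediation` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

WRITE_TOOLS = ["mcp__sre__edit_config_file", "mcp__sre__run_shell_command"]


@dataclass
class Proposal:
    root_cause: str
    evidence: list
    remediation: str


def plan_remediation(proposal: Proposal, decision: str) -> Optional[dict]:
    """Return the remediation prompt and write tools only on explicit approval."""
    if decision.strip().lower() != "approve":
        return None  # anything else keeps the agent read-only
    return {
        "prompt": f"Approved: {proposal.remediation}",
        "allowed_tools": list(WRITE_TOOLS),
    }


proposal = Proposal(
    root_cause="DB_POOL_SIZE set to 1",
    evidence=["connection pool exhausted errors in api-server logs"],
    remediation="set DB_POOL_SIZE=10 and restart the API server",
)
print(plan_remediation(proposal, "approve"))
print(plan_remediation(proposal, "not yet"))
```

The design choice is that approval is the only path that yields a tool list containing write tools, so a forgotten check fails closed.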

Phase 2: Remediation (after approval)

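# Same SDK imports and sre_mcp_config as in Phase 1
options = ClaudeAgentOptions(
    mcp_servers={"sre": sre_mcp_config},
    allowed_tools=[
        "mcp__sre__edit_config_file",
        "mcp__sre__run_shell_command",
    ],  # Write tools enabled
)

async def remediate():
    async for message in query(
        prompt="Approved: fix DB pool size from 1 to 10", options=options
    ):
        print(message)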

Worked example: DB Pool Size Incident

Scenario: the API server's error rate spikes because DB_POOL_SIZE was set to 1 (instead of 10).

Agent queries Prometheus: error rate 45%, latency p99 = 5.2s
Agent checks container logs: "connection pool exhausted" errors
Agent reads config file: DB_POOL_SIZE=1 (too low)
Agent correlates: recent deployment changed pool size
Agent proposes fix: set DB_POOL_SIZE=10, restart API server
After approval: agent edits config and restarts container
Agent verifies: error rate drops to 0%, latency normalizes
Agent writes postmortem document
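The final step in the sequence above, writing the postmortem, could reduce to rendering a structured record into text. The following is a rough sketch under that assumption; `render_postmortem` and its section layout are hypothetical, and the sample values are taken from this scenario, with the lessons left as a placeholder.

```python
def render_postmortem(title, timeline, root_cause, fix, lessons):
    """Render a plain-text postmortem from structured fields."""
    lines = [title, "", "Timeline"]
    lines += [f"- {event}" for event in timeline]
    lines += ["", "Root cause", root_cause, "", "Fix", fix, "", "Lessons learned"]
    lines += [f"- {lesson}" for lesson in lessons]
    return "\n".join(lines) + "\n"


report = render_postmortem(
    title="Postmortem: API server 500 errors",
    timeline=[
        "Error rate reached 45%, latency p99 = 5.2s",
        "Container logs showed 'connection pool exhausted'",
        "Config showed DB_POOL_SIZE=1 after a recent deployment",
        "Fix approved, config edited, container restarted",
    ],
    root_cause="DB_POOL_SIZE was set to 1 instead of 10.",
    fix="Set DB_POOL_SIZE=10 and restarted the API server.",
    lessons=["(to be filled in during review)"],
)
print(report)
```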

Production Extensions

The MCP server supports additional production tools when API keys are available:

PagerDuty: incident management, auto-acknowledge, escalation
Confluence: automated post-mortem documentation
Slack: notify the team about incidents and resolutions
Datadog/Grafana: extended metrics and dashboards
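Gating on API keys can be as simple as checking the environment when the server starts and registering only the integrations whose credentials are present. A minimal sketch follows; the environment variable names are assumptions for illustration, not the names the cookbook uses.

```python
import os

# Hypothetical environment variable names, one per optional integration
INTEGRATION_KEYS = {
    "pagerduty": "PAGERDUTY_API_KEY",
    "confluence": "CONFLUENCE_API_TOKEN",
    "slack": "SLACK_BOT_TOKEN",
    "datadog": "DATADOG_API_KEY",
}


def enabled_integrations(env=None):
    """Return the integrations whose API key is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in INTEGRATION_KEYS.items() if env.get(var)]


print(enabled_integrations({"SLACK_BOT_TOKEN": "xoxb-example"}))
```

Tools for a disabled integration are then never listed to the agent, so it cannot attempt calls that would fail for lack of credentials.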

Best Practices for SRE Agents

Tool descriptions > prompts: invest in writing a clear description for each tool. The agent relies on descriptions to decide what to call.
Scope write access: restricted directories, command allowlists, container allowlists.
Human-in-the-loop: separate investigation (automatic) from remediation (after approval).
Validation hooks: PreToolUse hooks check destructive commands before the agent executes them.
Postmortem documentation: the agent documents everything itself, including timeline, root cause, fix, and lessons learned.
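For the validation-hook practice, here is a sketch of a PreToolUse callback that denies shell commands containing destructive subcommands. The callback signature and the `hookSpecificOutput` return shape follow my understanding of the Python Claude Agent SDK's hook interface and should be checked against the SDK documentation; the token list and the name `block_destructive` are assumptions. Registration would presumably go through the `hooks` option on `ClaudeAgentOptions` with a `HookMatcher`, which is not shown here so the example runs without the SDK installed.

```python
import asyncio

# Hypothetical list of subcommands treated as destructive
DESTRUCTIVE_TOKENS = ("rm", "down", "prune", "kill")


async def block_destructive(input_data, tool_use_id, context):
    """PreToolUse callback: deny shell commands containing destructive tokens."""
    if input_data.get("tool_name") != "mcp__sre__run_shell_command":
        return {}
    command = input_data.get("tool_input", {}).get("command", "")
    words = command.split()
    hit = next((token for token in DESTRUCTIVE_TOKENS if token in words), None)
    if hit is None:
        return {}  # an empty result lets the call proceed
    return {
        "hookSpecificOutput": {
            "hookEventName": "PreToolUse",
            "permissionDecision": "deny",
            "permissionDecisionReason": f"Blocked destructive token: {hit}",
        }
    }


call = {
    "tool_name": "mcp__sre__run_shell_command",
    "tool_input": {"command": "docker system prune -f"},
}
print(asyncio.run(block_destructive(call, None, None)))
```

A hook like this complements the allowlist inside the tool handler: the handler limits which binaries can run, while the hook limits which subcommands the agent may request.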

Next steps: read Migration from the OpenAI SDK if you are switching over, or go back to Research Agent to start from the basics.
