Trung cấpHướng dẫnClaude APINguồn: Anthropic

Observability cho Claude API — Metrics, Logging và Alerting production

Minh TuấnCTO, Transform GroupTheo dõi

28/03/2026 723 0 13 phút đọc

Nghe bài viết

00:00

1 Đây là metric quan trọng nhất với trải nghiệm người dùng vì nó quyết định thời gian người dùng phải chờ trước khi thấy kết quả bắt đầu xuất hiện.
2 Với ứng dụng Claude API, bạn nên định nghĩa SLO cho các khía cạnh sau: SLO 1: Availability Tỷ lệ request thành công trong khoảng thời gian.
3 Track chi phí theo user, feature, hoặc department để phân bổ ngân sách chính xác.
4 Dưới đây là các panel quan trọng bạn cần có trong dashboard giám sát Claude API.
5 Vi du: feature tom tat van ban don gian co the dung Haiku thay vi Sonnet, giam chi phi 75% ma chat luong van dap ung.

Khi ứng dụng sử dụng Claude API chuyển từ giai đoạn prototype sang production, một câu hỏi quan trọng xuất hiện: làm sao bạn biết hệ thống đang hoạt động tốt? Làm sao phát hiện sớm khi latency tăng đột biến, error rate vượt ngưỡng, hoặc chi phí API vượt budget? Câu trả lời nằm ở observability — khả năng quan sát và hiểu trạng thái bên trong hệ thống thông qua dữ liệu nó tạo ra.

Bài viết này hướng dẫn bạn xây dựng observability stack hoàn chỉnh cho ứng dụng Claude API, từ việc xác định metrics quan trọng, thiết lập structured logging, tích hợp OpenTelemetry, cho đến xây dựng dashboard và cấu hình alert rules.

Tại sao observability quan trọng với LLM API

Khác với API truyền thống, LLM API có những đặc điểm riêng khiến observability trở nên phức tạp hơn:

Latency biến động lớn: Một request đơn giản có thể mất 500ms, trong khi request phức tạp với output dài mất 30 giây trở lên
Chi phí theo token: Mỗi request có chi phí khác nhau tùy thuộc vào số lượng input và output tokens
Streaming response: Response được trả về theo chunks, khiến việc đo lường end-to-end latency phức tạp hơn
Rate limiting: Anthropic áp dụng rate limits theo tokens per minute và requests per minute, cần giám sát để tránh bị throttle
Chất lượng output: Không chỉ quan tâm hệ thống có hoạt động không, mà còn chất lượng kết quả có đáp ứng yêu cầu không

Các metrics quan trọng cần thu thập

1. Time to First Token (TTFT)

TTFT đo thời gian từ khi gửi request đến khi nhận được token đầu tiên trong response. Đây là metric quan trọng nhất với trải nghiệm người dùng vì nó quyết định thời gian người dùng phải chờ trước khi thấy kết quả bắt đầu xuất hiện.

import time
import anthropic

client = anthropic.Anthropic()

def call_claude_with_metrics(messages, model="claude-sonnet-4-20250514"):
    start_time = time.monotonic()
    ttft = None
    total_input_tokens = 0
    total_output_tokens = 0
    chunks_received = 0

    with client.messages.stream(
        model=model,
        max_tokens=1024,
        messages=messages
    ) as stream:
        for event in stream:
            if ttft is None and hasattr(event, 'type') and event.type == 'content_block_delta':
                ttft = time.monotonic() - start_time
            chunks_received += 1

        response = stream.get_final_message()
        total_time = time.monotonic() - start_time
        total_input_tokens = response.usage.input_tokens
        total_output_tokens = response.usage.output_tokens

    return {
        "ttft_seconds": ttft,
        "total_time_seconds": total_time,
        "input_tokens": total_input_tokens,
        "output_tokens": total_output_tokens,
        "tokens_per_second": total_output_tokens / total_time if total_time > 0 else 0,
        "chunks_received": chunks_received,
    }

2. Tokens Per Second (TPS)

TPS đo tốc độ sinh token của model. Metric này giúp bạn phát hiện khi model đang chạy chậm hơn bình thường, có thể do tải cao phía Anthropic hoặc vấn đề mạng.

Công thức: TPS = output_tokens / (total_time - ttft). Lưu ý trừ đi TTFT vì giai đoạn đầu là thời gian model xử lý prompt, không phải thời gian sinh token.

3. Error Rate

Phân loại error theo HTTP status code để có hành động phù hợp:

400 Bad Request: Lỗi từ phía ứng dụng — prompt quá dài, format sai
401/403: Vấn đề authentication — API key hết hạn hoặc không hợp lệ
429 Rate Limited: Vượt quá giới hạn — cần điều chỉnh concurrency hoặc nâng tier
500/529 Server Error: Vấn đề phía Anthropic — cần retry với backoff

4. Cost Per Request

Tính chi phí theo công thức của Anthropic: (input_tokens * input_price + output_tokens * output_price). Track chi phí theo user, feature, hoặc department để phân bổ ngân sách chính xác.

PRICING = {
    "claude-sonnet-4-20250514": {"input": 3.0, "output": 15.0},
    "claude-haiku-4-20250414": {"input": 0.80, "output": 4.0},
    "claude-opus-4-20250514": {"input": 15.0, "output": 75.0},
}

def calculate_cost(model, input_tokens, output_tokens):
    prices = PRICING.get(model, PRICING["claude-sonnet-4-20250514"])
    input_cost = (input_tokens / 1_000_000) * prices["input"]
    output_cost = (output_tokens / 1_000_000) * prices["output"]
    return {
        "input_cost_usd": input_cost,
        "output_cost_usd": output_cost,
        "total_cost_usd": input_cost + output_cost,
    }

5. Rate Limit Utilization

Anthropic trả về rate limit headers trong mỗi response. Thu thập và giám sát chúng để biết bạn đang sử dụng bao nhiêu phần trăm quota:

def extract_rate_limit_headers(response_headers):
    return {
        "requests_remaining": int(response_headers.get("anthropic-ratelimit-requests-remaining", 0)),
        "requests_limit": int(response_headers.get("anthropic-ratelimit-requests-limit", 0)),
        "tokens_remaining": int(response_headers.get("anthropic-ratelimit-tokens-remaining", 0)),
        "tokens_limit": int(response_headers.get("anthropic-ratelimit-tokens-limit", 0)),
        "requests_reset": response_headers.get("anthropic-ratelimit-requests-reset", ""),
        "tokens_reset": response_headers.get("anthropic-ratelimit-tokens-reset", ""),
    }

Structured Logging cho Claude API

Log không cấu trúc như "Called Claude API successfully" gần như vô dụng trong production. Structured logging sử dụng format JSON để mỗi log entry chứa đầy đủ context cần thiết cho việc debug và phân tích.

import logging
import json
import uuid
from datetime import datetime

class ClaudeAPILogger:
    def __init__(self):
        self.logger = logging.getLogger("claude_api")
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter('%(message)s'))
        self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)

    def log_request(self, request_id, model, messages, metadata=None):
        log_entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "event": "claude_api_request",
            "request_id": request_id,
            "model": model,
            "message_count": len(messages),
            "estimated_input_tokens": sum(len(m.get("content", "")) // 4 for m in messages),
            "metadata": metadata or {},
        }
        self.logger.info(json.dumps(log_entry))

    def log_response(self, request_id, metrics, metadata=None):
        log_entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "event": "claude_api_response",
            "request_id": request_id,
            "ttft_ms": round(metrics["ttft_seconds"] * 1000, 2),
            "total_time_ms": round(metrics["total_time_seconds"] * 1000, 2),
            "input_tokens": metrics["input_tokens"],
            "output_tokens": metrics["output_tokens"],
            "tps": round(metrics["tokens_per_second"], 2),
            "cost_usd": metrics.get("cost_usd", 0),
            "metadata": metadata or {},
        }
        self.logger.info(json.dumps(log_entry))

    def log_error(self, request_id, error, metadata=None):
        log_entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "event": "claude_api_error",
            "request_id": request_id,
            "error_type": type(error).__name__,
            "error_message": str(error),
            "status_code": getattr(error, 'status_code', None),
            "metadata": metadata or {},
        }
        self.logger.error(json.dumps(log_entry))

Nguyên tắc logging hiệu quả

Luôn gắn request_id: Cho phép trace toàn bộ lifecycle của một request từ khi nhận đến khi trả kết quả
Không log nội dung prompt/response: Tránh rủi ro bảo mật và tuân thủ GDPR/privacy. Chỉ log metadata
Log ở mức phù hợp: INFO cho request/response bình thường, WARNING cho retry, ERROR cho failures
Thêm business context: user_id, feature_name, department giúp phân tích chi phí và usage pattern

Tích hợp OpenTelemetry

OpenTelemetry (OTel) là tiêu chuẩn mở để thu thập telemetry data gồm traces, metrics và logs. Tích hợp OTel giúp bạn có cái nhìn end-to-end về performance và dễ dàng kết nối với nhiều backend khác nhau.

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.trace.export import BatchSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

def setup_telemetry(service_name="claude-api-service", endpoint="localhost:4317"):
    # Traces
    trace_provider = TracerProvider()
    trace_provider.add_span_processor(
        BatchSpanExporter(OTLPSpanExporter(endpoint=endpoint))
    )
    trace.set_tracer_provider(trace_provider)

    # Metrics
    metric_reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint=endpoint),
        export_interval_millis=30000
    )
    meter_provider = MeterProvider(metric_readers=[metric_reader])
    metrics.set_meter_provider(meter_provider)

    return trace.get_tracer(service_name), metrics.get_meter(service_name)

tracer, meter = setup_telemetry()

# Tao cac metrics instruments
request_counter = meter.create_counter(
    "claude_api.requests.total",
    description="Total number of Claude API requests"
)
error_counter = meter.create_counter(
    "claude_api.errors.total",
    description="Total number of Claude API errors"
)
ttft_histogram = meter.create_histogram(
    "claude_api.ttft.milliseconds",
    description="Time to first token in milliseconds"
)
cost_counter = meter.create_counter(
    "claude_api.cost.usd",
    description="Total cost in USD"
)
token_counter = meter.create_counter(
    "claude_api.tokens.total",
    description="Total tokens processed"
)

Instrumented API Client

Kết hợp tracing và metrics vào API client:

import anthropic
import time

class InstrumentedClaudeClient:
    def __init__(self):
        self.client = anthropic.Anthropic()

    def create_message(self, messages, model="claude-sonnet-4-20250514",
                       max_tokens=1024, feature="default", user_id="unknown"):

        attributes = {
            "claude.model": model,
            "claude.feature": feature,
            "claude.user_id": user_id,
        }

        with tracer.start_as_current_span("claude_api_call", attributes=attributes) as span:
            start = time.monotonic()
            request_counter.add(1, {"model": model, "feature": feature})

            try:
                response = self.client.messages.create(
                    model=model,
                    max_tokens=max_tokens,
                    messages=messages
                )

                elapsed_ms = (time.monotonic() - start) * 1000
                input_tokens = response.usage.input_tokens
                output_tokens = response.usage.output_tokens
                cost = calculate_cost(model, input_tokens, output_tokens)

                ttft_histogram.record(elapsed_ms, {"model": model})
                token_counter.add(input_tokens, {"type": "input", "model": model})
                token_counter.add(output_tokens, {"type": "output", "model": model})
                cost_counter.add(cost["total_cost_usd"], {"model": model, "feature": feature})

                span.set_attribute("claude.input_tokens", input_tokens)
                span.set_attribute("claude.output_tokens", output_tokens)
                span.set_attribute("claude.cost_usd", cost["total_cost_usd"])
                span.set_attribute("claude.latency_ms", elapsed_ms)

                return response

            except anthropic.APIError as e:
                error_counter.add(1, {"model": model, "status_code": str(e.status_code)})
                span.set_attribute("error", True)
                span.set_attribute("error.type", type(e).__name__)
                span.set_attribute("error.status_code", e.status_code)
                raise

Thiết lập Grafana Dashboard

Grafana là lựa chọn phổ biến để visualize metrics từ nhiều nguồn. Dưới đây là các panel quan trọng bạn cần có trong dashboard giám sát Claude API.

Panel 1: Request Overview

Panel tổng quan hiển thị số lượng request, error rate, và latency trong khoảng thời gian chọn. Sử dụng Prometheus query:

# Request rate (requests per second)
rate(claude_api_requests_total[5m])

# Error rate percentage
rate(claude_api_errors_total[5m]) / rate(claude_api_requests_total[5m]) * 100

# P50, P95, P99 latency
histogram_quantile(0.50, rate(claude_api_ttft_milliseconds_bucket[5m]))
histogram_quantile(0.95, rate(claude_api_ttft_milliseconds_bucket[5m]))
histogram_quantile(0.99, rate(claude_api_ttft_milliseconds_bucket[5m]))

Panel 2: Cost Tracking

Biểu đồ chi phí theo thời gian, chia theo model và feature:

# Cumulative cost today
increase(claude_api_cost_usd_total[1d])

# Cost by model
sum by (model) (increase(claude_api_cost_usd_total[1h]))

# Cost by feature
sum by (feature) (increase(claude_api_cost_usd_total[1h]))

# Projected daily cost (based on current rate)
rate(claude_api_cost_usd_total[1h]) * 24

Panel 3: Token Usage

Theo dõi token consumption để phát hiện prompt quá dài hoặc response bất thường:

# Average input tokens per request
rate(claude_api_tokens_total{type="input"}[5m]) / rate(claude_api_requests_total[5m])

# Average output tokens per request
rate(claude_api_tokens_total{type="output"}[5m]) / rate(claude_api_requests_total[5m])

# Token ratio (output/input)
rate(claude_api_tokens_total{type="output"}[5m]) / rate(claude_api_tokens_total{type="input"}[5m])

Panel 4: Rate Limit Utilization

Gauge hiển thị mức sử dụng rate limit hiện tại. Cảnh báo khi vượt 80% để có thời gian xử lý trước khi bị throttle.

Sử dụng Datadog thay thế

Nếu tổ chức bạn đã sử dụng Datadog, quy trình tương tự. Thay vì Prometheus exporter, sử dụng DogStatsD hoặc Datadog APM:

from datadog import statsd

def report_to_datadog(metrics, model, feature):
    tags = [f"model:{model}", f"feature:{feature}"]

    statsd.increment("claude_api.requests", tags=tags)
    statsd.histogram("claude_api.ttft_ms", metrics["ttft_seconds"] * 1000, tags=tags)
    statsd.histogram("claude_api.total_time_ms", metrics["total_time_seconds"] * 1000, tags=tags)
    statsd.increment("claude_api.tokens.input", metrics["input_tokens"], tags=tags)
    statsd.increment("claude_api.tokens.output", metrics["output_tokens"], tags=tags)
    statsd.increment("claude_api.cost_usd", metrics.get("cost_usd", 0), tags=tags)

Thiết lập Alert Rules

Dashboard chỉ hữu ích khi có người xem. Alert rules giúp phát hiện vấn đề tự động và thông báo cho team kịp thời.

Alert 1: Latency Spike

Cảnh báo khi P95 TTFT vượt ngưỡng bình thường. Ngưỡng phụ thuộc vào use case — chatbot cần TTFT thấp hơn batch processing.

# Grafana Alert Rule: Latency Spike
# Condition: P95 TTFT > 5000ms for 5 minutes
# Severity: warning

histogram_quantile(0.95, rate(claude_api_ttft_milliseconds_bucket[5m])) > 5000

# Critical: P95 TTFT > 15000ms for 3 minutes
histogram_quantile(0.95, rate(claude_api_ttft_milliseconds_bucket[3m])) > 15000

Alert 2: Error Burst

Phát hiện khi error rate tăng đột biến, đặc biệt phân biệt giữa client errors (4xx) và server errors (5xx):

# Error rate exceeds 5% for 3 minutes
rate(claude_api_errors_total[3m]) / rate(claude_api_requests_total[3m]) > 0.05

# Specific: Rate limit errors spike (429s)
rate(claude_api_errors_total{status_code="429"}[5m]) > 0.1

# Server errors (500/529) - indicates Anthropic-side issues
rate(claude_api_errors_total{status_code=~"5.."}[3m]) > 0

Alert 3: Cost Anomaly

Phát hiện chi phí bất thường — có thể do bug trong prompt construction gây token explosion hoặc sử dụng sai model:

# Hourly cost exceeds budget threshold
increase(claude_api_cost_usd_total[1h]) > 50

# Daily projected cost exceeds budget
rate(claude_api_cost_usd_total[1h]) * 24 > 500

# Average cost per request abnormally high
increase(claude_api_cost_usd_total[5m]) / increase(claude_api_requests_total[5m]) > 0.10

Alert 4: Rate Limit Approaching

# Rate limit utilization exceeds 80%
(claude_api_ratelimit_tokens_limit - claude_api_ratelimit_tokens_remaining)
  / claude_api_ratelimit_tokens_limit > 0.80

Thiết lập notification channels

Cấu hình alert gửi qua nhiều kênh tùy mức độ nghiêm trọng:

Warning: Gửi qua Slack channel #claude-api-alerts
Critical: Gửi qua Slack + PagerDuty on-call
Cost alerts: Gửi qua email cho engineering lead và finance team

Định nghĩa SLO (Service Level Objectives)

SLO là cam kết về chất lượng dịch vụ mà team engineering đặt ra cho hệ thống. Với ứng dụng Claude API, bạn nên định nghĩa SLO cho các khía cạnh sau:

SLO 1: Availability

Tỷ lệ request thành công trong khoảng thời gian. Lưu ý: availability ở đây bao gồm cả phía Anthropic, nên target không nên quá cao.

Target: 99.5% requests thành công trong 30 ngày (cho phép khoảng 3.6 giờ downtime/tháng)
Measurement: 1 - (error_requests / total_requests), tính trên rolling 30 ngày
Exclusions: Không tính 400 errors (lỗi client-side)

SLO 2: Latency

Target: P95 TTFT dưới 3 giây cho interactive use cases
Target: P99 total response time dưới 30 giây

SLO 3: Cost Efficiency

Target: Chi phí trung bình mỗi request dưới 0.05 USD
Target: Chi phí hàng tháng không vượt 120% budget đã duyệt

Error Budget

Error budget là phần SLO còn lại cho phép fail. Với SLO 99.5% availability trong 30 ngày, error budget là 0.5% = khoảng 2,160 failed requests trên 432,000 total requests. Khi error budget sắp hết, team nên:

Dừng deploy feature mới, tập trung fix reliability
Tăng số lượng on-call engineers
Review và cải thiện retry logic, circuit breaker

SLO Burn Rate Dashboard

Burn rate cho biết tốc độ tiêu thụ error budget. Nếu burn rate = 1, bạn đang tiêu thụ error budget đúng theo kế hoạch. Nếu burn rate = 2, bạn đang tiêu thụ gấp đôi và sẽ hết error budget trong 15 ngày thay vì 30 ngày.

# SLO burn rate calculation
# Window: 1 hour
# Budget consumption rate
(1 - (sum(rate(claude_api_requests_total[1h])) - sum(rate(claude_api_errors_total{status_code!~"4.."}[1h])))
  / sum(rate(claude_api_requests_total[1h])))
/ (1 - 0.995)  # 0.995 = SLO target

Thiết lập alert theo burn rate:

Burn rate > 2 trong 1 giờ: Warning — đang tiêu thụ error budget nhanh
Burn rate > 6 trong 5 phút: Critical — có sự cố nghiêm trọng, cần xử lý ngay
Burn rate > 10 trong 2 phút: Page on-call — hệ thống có thể đang down

Production Checklist

Trước khi go-live, đảm bảo bạn đã thiết lập đầy đủ các thành phần sau:

Metrics Collection

TTFT, TPS, total latency được thu thập cho mọi request
Error rate phân loại theo status code
Token usage (input và output) theo model, feature, user
Cost per request được tính và lưu trữ
Rate limit utilization được giám sát

Logging

Structured JSON logs với request_id, timestamp, model, tokens, latency
Không log nội dung prompt/response (bảo mật)
Log rotation và retention policy đã cấu hình
Log aggregation (ELK Stack, Loki, CloudWatch) đã thiết lập

Alerting

Latency spike alerts (warning + critical)
Error burst alerts phân biệt 4xx và 5xx
Cost anomaly alerts (hourly và daily)
Rate limit approaching alerts
Notification channels đã cấu hình và test

Dashboard

Request overview panel
Cost tracking panel
Token usage panel
Rate limit utilization panel
SLO burn rate panel

Tối ưu hóa dựa trên dữ liệu observability

Sau khi có dữ liệu, bạn có thể đưa ra quyết định tối ưu chính xác. Đây là lúc observability chuyển từ "biết chuyện gì đang xảy ra" sang "biết cần làm gì".

Model Selection theo use case

Phân tích cost và quality theo feature để chọn model phù hợp. Ví dụ: feature tóm tắt văn bản đơn giản có thể dùng Haiku thay vì Sonnet, giảm chi phí 75% mà chất lượng vẫn đáp ứng. Dashboard cho thấy feature nào đang dùng model đắt mà không cần thiết, giúp bạn tối ưu chi phí mà không ảnh hưởng đến trải nghiệm người dùng.

Prompt Optimization

Metrics cho thấy average input tokens cao bất thường ở một feature cụ thể? Có thể prompt đang bao gồm quá nhiều context không cần thiết. Tối ưu prompt dựa trên dữ liệu thực tế, không phải phỏng đoán. So sánh input token count trước và sau khi tối ưu để đo lường hiệu quả.

Caching Strategy

Phân tích request patterns để xác định đâu là ứng viên tốt cho prompt caching. Nếu một system prompt dài được sử dụng lặp lại, bật prompt caching có thể giảm chi phí và latency đáng kể. Anthropic hỗ trợ prompt caching cho các prefix prompt giống nhau — metrics sẽ cho thấy bao nhiêu phần trăm request có thể hưởng lợi từ tính năng này.

Capacity Planning

Dựa trên trend request volume và token usage, dự đoán khi nào cần nâng rate limit tier với Anthropic. Chuẩn bị trước thay vì chờ đến khi bị throttle. Một cách tiếp cận hiệu quả là xây dựng forecast dựa trên dữ liệu sử dụng 30 ngày gần nhất, ngoại suy thêm 30-60 ngày để lên kế hoạch.

Anomaly Detection

Ngoài các alert cố định, thiết lập anomaly detection để phát hiện các pattern bất thường mà alert thông thường không bắt được. Ví dụ: một user cụ thể tăng sử dụng gấp 10 lần trong 1 ngày có thể là dấu hiệu của automation loop không mong muốn hoặc là dấu hiệu abuse. Kết hợp dữ liệu observability với business context để phân biệt giữa tăng trưởng tự nhiên và vấn đề cần xử lý.

Bước tiếp theo

Observability là nền tảng cho mọi hệ thống production đáng tin cậy. Bắt đầu với những metrics cơ bản nhất (latency, error rate, cost), sau đó mở rộng dần khi hiểu rõ hơn về patterns sử dụng của ứng dụng. Tham khảo thêm các hướng dẫn về xây dựng ứng dụng Claude API quy mô lớn tại Thư viện Nâng cao.

Tính năng liên quan:Metrics Logging Alerting OpenTelemetry Grafana

Bai viet co huu ich khong?

Writer cho nền tảng kiến thức Claude AI cho người Việt. Software engineer với hơn 20 năm kinh nghiệm, đam mê AI và chia sẻ kiến thức công nghệ.

5 bài viết · 16K lượt đọc

Bình luận (0)

Đăng nhập để bình luận...

Đăng nhập để bình luận

Đang tải bình luận...

Gợi ý cho bạn

Triển khai Claude trên AWS Bedrock — Hướng dẫn production cho doanh nghiệp