Controlling AI Agent Costs: Token Budgets, Max Iterations, and Monitoring

Minh Tuấn, CTO, Transform Group

28/03/2026 · 13 min read


AI agents are powerful tools, but they come with a serious risk: uncontrolled costs. Unlike a single prompt or a prompt chain, whose cost is predictable, an agent loop can run for dozens of rounds, each round accumulating more context tokens, and the total cost can be 50-100 times higher than expected. This article walks through building a complete cost-control system for AI agents in production.

The problem: exploding agent costs

To understand why an agent can blow up costs, look at how tokens accumulate in an agent loop:

# Illustration of token accumulation in an agent loop
# Each round, the ENTIRE preceding conversation history is sent again

# Round 1: 1000 input tokens, 500 output tokens
# Round 2: 1000 + 500 + 200 = 1700 input, 500 output
# Round 3: 1700 + 500 + 200 = 2400 input, 500 output
# Round 4: 2400 + 500 + 200 = 3100 input, 500 output
# ...
# Round 10: 7300 input, 500 output
# Round 20: 14300 input, 500 output

# Total tokens after 20 rounds:
# Input: 1000 + 1700 + 2400 + ... + 14300 = 153,000 tokens
# Output: 20 * 500 = 10,000 tokens

# Cost with Claude Sonnet:
# Input: 153,000 * $3/1M = $0.459
# Output: 10,000 * $15/1M = $0.15
# Total: $0.609 for ONE request from ONE user

# With 1000 users/day, 5 requests each:
# $0.609 * 5000 = $3,045/day = $91,350/month

These numbers show why agent cost control is not a "nice to have" but a "must have" before deploying to production.
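The accumulation above can be reproduced with a few lines of Python. This is a minimal sketch: the function name and parameters are illustrative, and it assumes, as in the illustration, that each round adds the previous output (500 tokens) plus about 200 further tokens (for example a tool result) to the next input.

```python
def simulate_agent_cost(rounds, first_input=1000, output_per_round=500,
                        extra_per_round=200,
                        input_price_per_m=3.0, output_price_per_m=15.0):
    """Return (total_input, total_output, cost_usd) for an agent loop
    in which every round resends the whole history."""
    total_input = 0
    round_input = first_input
    for _ in range(rounds):
        total_input += round_input
        # The next round resends everything so far, plus this round's
        # output and the extra tokens appended after it.
        round_input += output_per_round + extra_per_round
    total_output = rounds * output_per_round
    cost = (total_input * input_price_per_m
            + total_output * output_price_per_m) / 1_000_000
    return total_input, total_output, cost


total_in, total_out, cost = simulate_agent_cost(20)
print(total_in, total_out, round(cost, 3))
```

Under these assumptions the exact sum for 20 rounds is 153,000 input tokens and roughly $0.61 per request. Input grows quadratically with the number of rounds, which is why capping rounds is the first lever to pull.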

Basic limits: max_turns and max_tokens

The first two limits, and the most important ones, to set on every agent:

def agent_loop_with_limits(
    client,
    user_message,
    tools,
    max_turns=10,
    max_tokens_per_turn=2048,
    max_total_tokens=50000
):
    """Agent loop with limits on the number of turns and tokens."""
    messages = [{"role": "user", "content": user_message}]
    total_input_tokens = 0
    total_output_tokens = 0

    for turn in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=max_tokens_per_turn,
            tools=tools,
            messages=messages
        )

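        # Update token counts
        total_input_tokens += response.usage.input_tokens
        total_output_tokens += response.usage.output_tokens
        total_tokens = total_input_tokens + total_output_tokens

        # Check the total token budget
        if total_tokens > max_total_tokens:
            print(f"Token budget reached: {total_tokens}/{max_total_tokens}")
            # Ask the model to summarize immediately. Pending tool_use blocks
            # get an error tool_result so the message history stays valid.
            messages.append({"role": "assistant", "content": response.content})
            closing = [
                {"type": "tool_result", "tool_use_id": block.id,
                 "content": "Skipped: token budget exhausted.",
                 "is_error": True}
                for block in response.content if block.type == "tool_use"
            ]
            closing.append({
                "type": "text",
                "text": "You have used up your token budget. "
                        "Summarize the current results now."
            })
            messages.append({"role": "user", "content": closing})
            final = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                tools=tools,  # the history contains tool blocks
                tool_choice={"type": "none"},  # no further tool calls
                messages=messages
            )
            return {
                "response": "".join(
                    b.text for b in final.content if b.type == "text"
                ),
                "reason": "token_budget_exceeded",
                "total_tokens": total_tokens,
                "turns": turn + 1
            }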

        # Check whether the model has finished (no tool call requested).
        # Any stop reason other than "tool_use" ends the loop.
        if response.stop_reason != "tool_use":
            text = ""
            for block in response.content:
                if block.type == "text":
                    text += block.text
            return {
                "response": text,
                "reason": "completed",
                "total_tokens": total_tokens,
                "turns": turn + 1
            }

        # Handle tool calls
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })
        messages.append({"role": "user", "content": tool_results})

    # Ran out of turns
    return {
        "response": "Reached the turn limit.",
        "reason": "max_turns_exceeded",
        "total_tokens": total_tokens,
        "turns": max_turns
    }
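The loop above calls execute_tool, which the article does not define. A minimal sketch of what such a dispatcher could look like follows; TOOL_REGISTRY and the sample "add" tool are hypothetical and only stand in for real tools.

```python
import json

# Hypothetical registry mapping tool names to plain Python callables.
TOOL_REGISTRY = {
    "add": lambda a, b: a + b,
}


def execute_tool(name, tool_input):
    """Run a registered tool and return its result as a string.

    Errors are returned as text rather than raised, so the agent loop
    can pass them back to the model inside a tool_result block.
    """
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return f"Error: unknown tool '{name}'"
    try:
        return json.dumps(handler(**tool_input), ensure_ascii=False)
    except Exception as exc:  # report the failure instead of crashing the loop
        return f"Error: {type(exc).__name__}: {exc}"


print(execute_tool("add", {"a": 2, "b": 3}))
```

Returning a string keeps the result directly usable as the "content" of a tool_result block, and makes its length easy to cap.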

Token budget by task type

Not every task needs the same budget. Set a different budget for each task type:

TASK_BUDGETS = {
    "simple_query": {
        "max_turns": 3,
        "max_tokens_per_turn": 1024,
        "max_total_tokens": 5000,
        "model": "claude-3-5-haiku-20241022",
        "estimated_cost_usd": 0.01
    },
    "data_analysis": {
        "max_turns": 8,
        "max_tokens_per_turn": 2048,
        "max_total_tokens": 30000,
        "model": "claude-sonnet-4-20250514",
        "estimated_cost_usd": 0.15
    },
    "complex_research": {
        "max_turns": 15,
        "max_tokens_per_turn": 4096,
        "max_total_tokens": 100000,
        "model": "claude-sonnet-4-20250514",
        "estimated_cost_usd": 0.50
    },
    "code_generation": {
        "max_turns": 10,
        "max_tokens_per_turn": 4096,
        "max_total_tokens": 60000,
        "model": "claude-sonnet-4-20250514",
        "estimated_cost_usd": 0.25
    }
}

def classify_task(client, user_message):
    """Classify the request to pick a suitable budget."""
    # Use Haiku for classification (cheap and fast)
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": "Classify the following request into one of 4 types: "
                       "simple_query, data_analysis, complex_research, "
                       "code_generation. Reply with the type name only.\n\n"
                       f"Request: {user_message}"
        }]
    )
    task_type = response.content[0].text.strip().lower()

    if task_type in TASK_BUDGETS:
        return task_type
    return "simple_query"  # Default

def run_agent_with_budget(client, user_message, tools):
    """Run the agent with a budget chosen automatically from the task type."""
    task_type = classify_task(client, user_message)
    budget = TASK_BUDGETS[task_type]

    print(f"Task type: {task_type}")
    print(f"Budget: {budget['max_total_tokens']} tokens, "
          f"{budget['max_turns']} turns, "
          f"model: {budget['model']}")

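    # Note: budget["model"] only takes effect if agent_loop_with_limits
    # accepts a model argument and uses it for its API calls.
    return agent_loop_with_limits(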
        client=client,
        user_message=user_message,
        tools=tools,
        max_turns=budget["max_turns"],
        max_tokens_per_turn=budget["max_tokens_per_turn"],
        max_total_tokens=budget["max_total_tokens"]
    )
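It can help to sanity-check each estimated_cost_usd against what the token limits actually allow. The sketch below is an approximation under stated assumptions: the helper name is hypothetical, prices are the per-million figures quoted in this article, and it ignores the overshoot of the last call and the extra summarization call.

```python
# Prices in USD per million tokens, as quoted in this article.
PRICES = {
    "haiku": {"input": 0.80, "output": 4.0},
    "sonnet": {"input": 3.0, "output": 15.0},
}


def approx_worst_case_cost(max_turns, max_tokens_per_turn,
                           max_total_tokens, prices):
    """Approximate upper bound on the cost of one run.

    Output tokens are the expensive ones, so the worst case spends as
    much of the total budget on output as the per-turn cap allows and
    the remainder on input.
    """
    max_output = min(max_turns * max_tokens_per_turn, max_total_tokens)
    max_input = max_total_tokens - max_output
    return (max_input * prices["input"]
            + max_output * prices["output"]) / 1_000_000


# Limits of the "data_analysis" budget: 8 turns, 2048 tokens/turn, 30000 total.
print(round(approx_worst_case_cost(8, 2048, 30000, PRICES["sonnet"]), 4))
```

For the data_analysis limits this comes to about $0.29, roughly twice the $0.15 estimate in the table, so the estimate reads as a typical cost rather than a cap.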

Model cascade: start cheap, upgrade when needed

Model cascade is the strategy of starting with the cheapest model (Haiku) and moving up to a more expensive model (Sonnet, Opus) only when the task really needs it. This lowers the average cost per request, provided most requests are resolved by the cheap model.

class ModelCascade:
    """Start with a cheap model and escalate when necessary."""

    MODELS = [
        {
            "name": "claude-3-5-haiku-20241022",
            "input_price": 0.80 / 1_000_000,
            "output_price": 4.0 / 1_000_000,
            "max_complexity": "low"
        },
        {
            "name": "claude-sonnet-4-20250514",
            "input_price": 3.0 / 1_000_000,
            "output_price": 15.0 / 1_000_000,
            "max_complexity": "high"
        },
        {
            "name": "claude-opus-4-20250514",
            "input_price": 15.0 / 1_000_000,
            "output_price": 75.0 / 1_000_000,
            "max_complexity": "very_high"
        }
    ]

    def __init__(self, client):
        self.client = client

    def call_with_cascade(self, messages, tools=None, **kwargs):
        """Try the cheap model first; escalate if the result is not good enough."""
        request = {
            "max_tokens": kwargs.get("max_tokens", 2048),
            "messages": messages,
        }
        if tools:
            # Only send `tools` when there are any; tools=None may be rejected
            request["tools"] = tools

        # Step 1: try Haiku first
        haiku_response = self.client.messages.create(
            model=self.MODELS[0]["name"], **request
        )

        # Check response quality
        if self._is_quality_sufficient(haiku_response, messages):
            return {
                "response": haiku_response,
                "model_used": self.MODELS[0]["name"],
                "escalated": False
            }

        # Step 2: escalate to Sonnet
        print("Haiku was not sufficient, escalating to Sonnet")
        sonnet_response = self.client.messages.create(
            model=self.MODELS[1]["name"], **request
        )

        return {
            "response": sonnet_response,
            "model_used": self.MODELS[1]["name"],
            "escalated": True
        }

    def _is_quality_sufficient(self, response, messages):
        """Heuristically check whether the response is good enough."""
        # Join text blocks only; the first block may be a tool_use block
        content = "".join(
            block.text for block in response.content if block.type == "text"
        )
        lowered = content.lower()

        # Basic evaluation criteria
        if response.stop_reason == "max_tokens":
            return False
        if response.stop_reason == "tool_use":
            return True  # a tool call is a valid step, not a weak answer
        if len(content) < 50:
            return False
        if "i cannot" in lowered:
            return False
        if "sorry" in lowered and len(content) < 200:
            return False

        return True
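Whether a cascade actually saves money depends on how often requests escalate, because an escalated request pays for both calls. A small sketch of that arithmetic follows; the function names are illustrative, and it assumes both models consume about the same number of tokens for a given request.

```python
def expected_cascade_cost(cheap_cost, strong_cost, escalation_rate):
    """Average cost per request when every request first tries the cheap
    model and a fraction `escalation_rate` is retried on the strong one."""
    return cheap_cost + escalation_rate * strong_cost


def break_even_escalation_rate(cheap_cost, strong_cost):
    """Escalation rate above which the cascade costs more than always
    calling the strong model directly."""
    return 1 - cheap_cost / strong_cost


# Example: a request costing $0.008 on Haiku and $0.030 on Sonnet.
print(round(expected_cascade_cost(0.008, 0.030, 0.30), 4))
print(round(break_even_escalation_rate(0.008, 0.030), 3))
```

With the prices quoted in this article, Haiku costs about 27% of Sonnet for the same token counts, so the cascade stays cheaper than calling Sonnet directly as long as fewer than roughly 73% of requests escalate.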

The complete CostController class

Below is a CostController class that integrates all of the cost-control techniques, ready to use in production:

import time
import logging
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger("cost_controller")

@dataclass
class CostRecord:
    """Cost record for a single agent run."""
    run_id: str
    task_type: str
    model: str
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0
    turns: int = 0
    cost_usd: float = 0.0
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None
    status: str = "running"

class CostController:
    """Control AI agent costs in production."""

    PRICING = {
        "claude-3-5-haiku-20241022": {
            "input": 0.80 / 1_000_000,
            "output": 4.0 / 1_000_000,
            "cache_read": 0.08 / 1_000_000,
            "cache_write": 1.0 / 1_000_000
        },
        "claude-sonnet-4-20250514": {
            "input": 3.0 / 1_000_000,
            "output": 15.0 / 1_000_000,
            "cache_read": 0.30 / 1_000_000,
            "cache_write": 3.75 / 1_000_000
        },
        "claude-opus-4-20250514": {
            "input": 15.0 / 1_000_000,
            "output": 75.0 / 1_000_000,
            "cache_read": 1.50 / 1_000_000,
            "cache_write": 18.75 / 1_000_000
        }
    }

    def __init__(
        self,
        max_cost_per_run=1.0,
        max_cost_per_hour=50.0,
        max_cost_per_day=500.0,
        alert_threshold_pct=0.8
    ):
        self.max_cost_per_run = max_cost_per_run
        self.max_cost_per_hour = max_cost_per_hour
        self.max_cost_per_day = max_cost_per_day
        self.alert_threshold_pct = alert_threshold_pct

        self.active_runs = {}
        self.completed_runs = []
        self.hourly_cost = 0.0
        self.daily_cost = 0.0
        self.last_hour_reset = time.time()
        self.last_day_reset = time.time()

    def start_run(self, run_id, task_type, model):
        """Start tracking an agent run."""
        self._reset_windows()

        # Check the global budgets
        if self.hourly_cost >= self.max_cost_per_hour:
            raise BudgetExceededError(
                f"Hourly budget exceeded: "
                f"${self.hourly_cost:.2f}/${self.max_cost_per_hour:.2f}"
            )
        if self.daily_cost >= self.max_cost_per_day:
            raise BudgetExceededError(
                f"Daily budget exceeded: "
                f"${self.daily_cost:.2f}/${self.max_cost_per_day:.2f}"
            )

        record = CostRecord(
            run_id=run_id,
            task_type=task_type,
            model=model
        )
        self.active_runs[run_id] = record
        return record

    def record_usage(self, run_id, usage, model=None):
        """Record token usage after each API call."""
        record = self.active_runs.get(run_id)
        if not record:
            return

        model = model or record.model
        pricing = self.PRICING.get(model, self.PRICING["claude-sonnet-4-20250514"])

        # Update token counts
        record.input_tokens += usage.input_tokens
        record.output_tokens += usage.output_tokens

        cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
        cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
        record.cache_read_tokens += cache_read
        record.cache_write_tokens += cache_write

        record.turns += 1

        # Compute the cost of this call
        cost = (
            usage.input_tokens * pricing["input"] +
            usage.output_tokens * pricing["output"] +
            cache_read * pricing["cache_read"] +
            cache_write * pricing["cache_write"]
        )
        record.cost_usd += cost
        self.hourly_cost += cost
        self.daily_cost += cost

        # Check the alert threshold
        if record.cost_usd >= self.max_cost_per_run * self.alert_threshold_pct:
            logger.warning(
                f"Run {run_id}: cost reached "
                f"{record.cost_usd / self.max_cost_per_run:.0%} of budget "
                f"(${record.cost_usd:.4f}/${self.max_cost_per_run:.2f})"
            )

        # Check for a budget overrun
        if record.cost_usd >= self.max_cost_per_run:
            record.status = "budget_exceeded"
            raise BudgetExceededError(
                f"Run {run_id} exceeded budget: "
                f"${record.cost_usd:.4f} >= ${self.max_cost_per_run:.2f}"
            )

    def end_run(self, run_id, status="completed"):
        """Stop tracking an agent run."""
        record = self.active_runs.pop(run_id, None)
        if record:
            record.end_time = time.time()
            record.status = status
            self.completed_runs.append(record)
            logger.info(
                f"Run {run_id} finished: "
                f"${record.cost_usd:.4f}, "
                f"{record.turns} turns, "
                f"{record.input_tokens + record.output_tokens} tokens"
            )
        return record

    def get_dashboard(self):
        """Return aggregated cost statistics for the last 100 runs."""
        self._reset_windows()

        all_runs = self.completed_runs[-100:]
        if not all_runs:
            return {"message": "No data yet"}

        total_cost = sum(r.cost_usd for r in all_runs)
        avg_cost = total_cost / len(all_runs)
        max_cost_run = max(all_runs, key=lambda r: r.cost_usd)

        costs_by_task = {}
        for run in all_runs:
            if run.task_type not in costs_by_task:
                costs_by_task[run.task_type] = {
                    "count": 0, "total_cost": 0.0
                }
            costs_by_task[run.task_type]["count"] += 1
            costs_by_task[run.task_type]["total_cost"] += run.cost_usd

        costs_by_model = {}
        for run in all_runs:
            if run.model not in costs_by_model:
                costs_by_model[run.model] = {
                    "count": 0, "total_cost": 0.0
                }
            costs_by_model[run.model]["count"] += 1
            costs_by_model[run.model]["total_cost"] += run.cost_usd

        return {
            "summary": {
                "total_runs": len(all_runs),
                "total_cost_usd": round(total_cost, 4),
                "avg_cost_per_run": round(avg_cost, 4),
                "max_cost_run": {
                    "run_id": max_cost_run.run_id,
                    "cost": round(max_cost_run.cost_usd, 4)
                }
            },
            "current_budget": {
                "hourly": f"${self.hourly_cost:.2f}/${self.max_cost_per_hour:.2f}",
                "daily": f"${self.daily_cost:.2f}/${self.max_cost_per_day:.2f}"
            },
            "by_task_type": costs_by_task,
            "by_model": costs_by_model
        }

    def _reset_windows(self):
        """Reset the hourly and daily windows when they have elapsed."""
        now = time.time()
        if now - self.last_hour_reset >= 3600:
            self.hourly_cost = 0.0
            self.last_hour_reset = now
        if now - self.last_day_reset >= 86400:
            self.daily_cost = 0.0
            self.last_day_reset = now


class BudgetExceededError(Exception):
    pass
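The _reset_windows method uses fixed windows that drop back to zero, so a burst that straddles a reset can spend close to twice the hourly cap within 60 minutes. One possible alternative is a sliding window that only counts spend from the trailing period. The class below is a hypothetical sketch, not part of the CostController above; the injectable clock exists only to make it testable.

```python
import time
from collections import deque


class SlidingWindowSpend:
    """Track spend over the trailing `window_seconds` (hypothetical helper)."""

    def __init__(self, window_seconds, clock=time.time):
        self.window_seconds = window_seconds
        self.clock = clock
        self.events = deque()  # (timestamp, cost_usd) pairs, oldest first
        self.total = 0.0

    def _expire(self):
        """Drop events that have fallen out of the window."""
        cutoff = self.clock() - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, cost = self.events.popleft()
            self.total -= cost

    def add(self, cost_usd):
        """Record a new cost at the current time."""
        self._expire()
        self.events.append((self.clock(), cost_usd))
        self.total += cost_usd

    def current(self):
        """Return the spend inside the trailing window."""
        self._expire()
        return max(self.total, 0.0)  # guard against float drift below zero


hourly = SlidingWindowSpend(window_seconds=3600)
hourly.add(0.25)
print(hourly.current())
```

The hourly budget check would then compare hourly.current() against the cap instead of a counter that resets. Memory grows with the number of calls inside the window, which is usually acceptable for an hour of traffic.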

Integrating CostController into the agent loop

Here is how to integrate CostController into a complete agent loop:

import uuid

def production_agent(
    client,
    user_message,
    tools,
    cost_controller,
    task_type="data_analysis"
):
    """Production-ready agent loop with cost control."""
    run_id = str(uuid.uuid4())[:8]
    budget = TASK_BUDGETS.get(task_type, TASK_BUDGETS["simple_query"])

    # Start tracking
    cost_controller.start_run(
        run_id=run_id,
        task_type=task_type,
        model=budget["model"]
    )

    messages = [{"role": "user", "content": user_message}]

    try:
        for turn in range(budget["max_turns"]):
            response = client.messages.create(
                model=budget["model"],
                max_tokens=budget["max_tokens_per_turn"],
                tools=tools,
                messages=messages
            )

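            # Record the cost of this call
            try:
                cost_controller.record_usage(run_id, response.usage)
            except BudgetExceededError:
                # Ask the model to summarize. Pending tool_use blocks get an
                # error tool_result so the message history stays valid.
                messages.append(
                    {"role": "assistant", "content": response.content}
                )
                closing = [
                    {"type": "tool_result", "tool_use_id": block.id,
                     "content": "Skipped: budget exhausted.",
                     "is_error": True}
                    for block in response.content if block.type == "tool_use"
                ]
                closing.append({
                    "type": "text",
                    "text": "The budget is exhausted. "
                            "Summarize the current results."
                })
                messages.append({"role": "user", "content": closing})
                # Note: this final call is not recorded against the budget.
                final = client.messages.create(
                    model="claude-3-5-haiku-20241022",
                    max_tokens=512,
                    tools=tools,  # the history contains tool blocks
                    tool_choice={"type": "none"},  # no further tool calls
                    messages=messages
                )
                cost_controller.end_run(run_id, "budget_exceeded")
                return {
                    "response": "".join(
                        b.text for b in final.content if b.type == "text"
                    ),
                    "status": "budget_exceeded",
                    "run_id": run_id
                }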

            # The model has finished (no tool call requested)
            if response.stop_reason != "tool_use":
                text = ""
                for block in response.content:
                    if block.type == "text":
                        text += block.text
                cost_controller.end_run(run_id, "completed")
                return {
                    "response": text,
                    "status": "completed",
                    "run_id": run_id
                }

            # Handle tool calls
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            messages.append({"role": "user", "content": tool_results})

        cost_controller.end_run(run_id, "max_turns")
        return {
            "response": "Reached the turn limit.",
            "status": "max_turns_exceeded",
            "run_id": run_id
        }

    except Exception as e:
        cost_controller.end_run(run_id, f"error: {str(e)}")
        raise

Alert thresholds and notification

Set up several alert tiers to catch cost problems early:

class CostAlertManager:
    """Manage multi-level cost alerts."""

    def __init__(self):
        self.thresholds = [
            {"level": "info", "pct": 0.5,
             "message": "50% of budget used"},
            {"level": "warning", "pct": 0.8,
             "message": "80% of budget used - pay attention"},
            {"level": "critical", "pct": 0.95,
             "message": "95% of budget used - almost exhausted"},
            {"level": "emergency", "pct": 1.0,
             "message": "Budget exceeded!"}
        ]
        self.triggered = set()

    def check(self, current_cost, budget, run_id):
        """Check thresholds and send each alert at most once per run."""
        ratio = current_cost / budget if budget > 0 else 0

        for threshold in self.thresholds:
            key = f"{run_id}_{threshold['level']}"
            if ratio >= threshold["pct"] and key not in self.triggered:
                self.triggered.add(key)
                self._send_alert(
                    level=threshold["level"],
                    message=f"Run {run_id}: {threshold['message']} "
                            f"(${current_cost:.4f}/${budget:.2f})",
                    ratio=ratio
                )

    def _send_alert(self, level, message, ratio):
        """Send the alert through the configured channels."""
        logger.log(
            logging.CRITICAL if level in ("critical", "emergency")
            else logging.WARNING,
            message
        )

        # Slack integration
        # if level in ("critical", "emergency"):
        #     slack_webhook.send(message)

        # PagerDuty integration
        # if level == "emergency":
        #     pagerduty.trigger(message)
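The Slack integration is left as commented-out placeholders. One way it could be filled in, assuming a Slack incoming webhook, is sketched below using only the standard library. The function names are hypothetical, and the webhook URL must come from your own Slack workspace configuration.

```python
import json
import urllib.request


def build_slack_request(webhook_url, message):
    """Build a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_alert(webhook_url, message, timeout=5):
    """Send the alert; return True on HTTP 200, False on any failure."""
    try:
        request = build_slack_request(webhook_url, message)
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError and timeouts are OSError subclasses
        return False
```

Returning False instead of raising keeps a failed notification from interrupting the agent run itself. A call such as send_slack_alert(url, message) would replace the slack_webhook.send(message) placeholder.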

A simple cost dashboard

Build a text-based dashboard to track costs during development and testing:

def print_cost_dashboard(cost_controller):
    """Print the cost dashboard to the console."""
    dashboard = cost_controller.get_dashboard()
    if "message" in dashboard:
        print(dashboard["message"])
        return

    s = dashboard["summary"]
    print("=" * 60)
    print("           COST DASHBOARD")
    print("=" * 60)
    print(f"Total runs:       {s['total_runs']}")
    print(f"Total cost:       ${s['total_cost_usd']}")
    print(f"Average/run:      ${s['avg_cost_per_run']}")
    print(f"Most expensive:   {s['max_cost_run']['run_id']} "
          f"(${s['max_cost_run']['cost']})")
    print()

    b = dashboard["current_budget"]
    print(f"This hour:        {b['hourly']}")
    print(f"Today:            {b['daily']}")
    print()

    print("Cost by task type:")
    for task, data in dashboard["by_task_type"].items():
        avg = data["total_cost"] / data["count"] if data["count"] > 0 else 0
        print(f"  {task}: {data['count']} runs, "
              f"${data['total_cost']:.4f} total, "
              f"${avg:.4f} avg")
    print()

    print("Cost by model:")
    for model, data in dashboard["by_model"].items():
        # Use the family name (e.g. "sonnet") wherever it sits in the model ID
        short_name = next(
            (part for part in model.split("-")[1:] if part.isalpha()), model
        )
        print(f"  {short_name}: {data['count']} runs, "
              f"${data['total_cost']:.4f}")
    print("=" * 60)

Practical cost-reduction strategies

Beyond setting limits, several techniques reduce what an agent actually costs:

1. Summarize context instead of keeping it verbatim

Instead of accumulating the entire conversation history, summarize older rounds to reduce input tokens:

def summarize_old_turns(client, messages, keep_recent=4):
    """Summarize older messages to reduce tokens."""
    if len(messages) <= keep_recent * 2:
        return messages

    # Caveat: this split can separate a tool_use block from its tool_result;
    # choose the cut so that such pairs stay together.
    old_messages = messages[:-keep_recent * 2]
    recent_messages = messages[-keep_recent * 2:]

    # Summarize the old part (only messages with plain string content)
    transcript = "\n".join(
        f"{m['role']}: {m['content']}"
        for m in old_messages
        if isinstance(m.get("content"), str)
    )
    summary_response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": "Summarize the following conversation concisely, "
                       "keeping the important information:\n\n" + transcript
        }]
    )

    summary = summary_response.content[0].text

    # Replace the old history with the summary
    compressed = [
        {"role": "user",
         "content": f"[Summary of earlier conversation: {summary}]"},
        {"role": "assistant", "content": "Understood. I will continue based "
                                         "on the summary."}
    ] + recent_messages

    return compressed

2. Tool result truncation

Limit the size of results returned from tools to avoid token bloat:

def execute_tool_with_limit(tool_name, tool_input, max_chars=2000):
    """Run the tool and cap the size of its result."""
    result = execute_tool(tool_name, tool_input)

    if len(result) > max_chars:
        truncated = result[:max_chars]
        truncated += f"\n... ({len(result) - max_chars} characters truncated)"
        return truncated

    return result
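Cutting at a fixed prefix discards the end of the output, which is often where logs and command output put their errors or totals. A possible variant keeps both the head and the tail. The function below is an illustrative sketch, not part of the article's code.

```python
def truncate_middle(text, max_chars=2000):
    """Keep the first and last parts of `text`, dropping the middle."""
    if len(text) <= max_chars:
        return text
    head = max_chars // 2
    tail = max_chars - head
    omitted = len(text) - max_chars
    marker = f"\n... ({omitted} characters omitted) ...\n"
    # Slice from an explicit start index so that tail == 0 yields "".
    return text[:head] + marker + text[len(text) - tail:]


print(truncate_middle("a" * 10 + "b" * 10, max_chars=6))
```

Whether the head, the tail, or both matter depends on the tool, so the cut strategy is best chosen per tool rather than globally.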

3. Early exit once there is enough information

Add a mechanism that lets the model stop early instead of running all allowed rounds. The simplest way is to add instructions to the system prompt:

COST_AWARE_SYSTEM_PROMPT = """You are an AI assistant. Efficiency rules:
1. Call a tool only when it is REALLY necessary
2. If you already have enough information to answer, answer NOW without calling more tools
3. Prefer the fewest tool calls possible
4. If a tool returns an error, do not retry more than 2 times
5. When asked to summarize, reply concisely in a single message"""
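The prompt only has an effect if it is sent with every call in the loop. In the Messages API the system prompt is a top-level `system` parameter rather than a message with a "system" role. A minimal sketch of assembling the request follows; the helper name build_request_kwargs is hypothetical.

```python
def build_request_kwargs(messages, tools, system_prompt,
                         model="claude-sonnet-4-20250514", max_tokens=2048):
    """Assemble keyword arguments for client.messages.create()."""
    kwargs = {
        "model": model,
        "max_tokens": max_tokens,
        "system": system_prompt,  # top-level parameter, not a message role
        "messages": messages,
    }
    if tools:  # omit the key entirely when there are no tools
        kwargs["tools"] = tools
    return kwargs


kwargs = build_request_kwargs(
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    tools=[],
    system_prompt="Answer as soon as you have enough information.",
)
print(sorted(kwargs))
```

Inside the agent loop this would be used as client.messages.create(**build_request_kwargs(messages, tools, COST_AWARE_SYSTEM_PROMPT)).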

Production cost-control checklist

Before deploying an agent to production, make sure every item below is in place:

max_turns set for every agent loop (recommended: 5-15 depending on the task)
max_tokens_per_turn set to a sensible value (do not rely on an overly high default)
A total token budget for each agent run
Hourly and daily budgets to prevent cost blow-ups
Model cascade: start with Haiku, escalate when needed
Context summarization for long conversations
Tool result truncation to limit input tokens
Alert thresholds at several levels (50%, 80%, 95%)
A cost dashboard to track cost trends
Graceful degradation when the budget runs out (summarize instead of crashing)

Next steps

Cost control is critical when deploying AI agents in production. By combining CostController with robust error handling and Prompt Caching, you get an agent system that is both powerful and economical. Explore more in the Claude Advanced Library.

About the author: writer for a Claude AI knowledge platform for Vietnamese readers. Software engineer with more than 20 years of experience, passionate about AI and sharing technology knowledge.