Testing AI Agents: A Framework for Evaluating and Testing Production Agents

Minh Tuấn, CTO, Transform Group

28/03/2026 · 14 min read

AI agents differ fundamentally from traditional software: their output is not determined in advance, their behavior changes with context, and "correct" comes in degrees rather than as a simple true/false. That makes testing agents one of the biggest challenges in bringing AI to production. This article presents a practical testing framework, from basic principles to a complete CI/CD pipeline.

Why is testing AI agents hard?

Traditional software testing rests on one principle: the same input produces the same output. AI agents break that principle at several levels:

Non-deterministic output

The same question can be answered in many different ways, all of them correct. For example, "What is the weather in Hanoi today?" might be answered with "Hanoi is 28 degrees C and sunny today" or "Current weather in Hanoi: temperature 28 degrees C, clear skies, humidity 75%". Both are correct, yet the text is completely different, so an exact string comparison would wrongly reject one of them.
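One way to keep assertions robust to wording is to compare extracted facts instead of full strings. The following is a minimal sketch of that idea, not part of the original framework; `extractTemperatureC` is a hypothetical helper and the regular expression only covers the phrasing used in this example.

```typescript
// Hypothetical helper: pull the first "<number> degrees C" or "<number> °C" value out of an answer.
function extractTemperatureC(answer: string): number | null {
  const match = answer.match(/(-?\d+(?:\.\d+)?)\s*(?:°\s*C|degrees?\s*C)/i);
  return match ? Number(match[1]) : null;
}

const answerA: string = "Hanoi is 28 degrees C and sunny today";
const answerB: string =
  "Current weather in Hanoi: temperature 28 degrees C, clear skies, humidity 75%";

// Exact string equality fails even though both answers are correct.
const sameText = answerA === answerB;

// Comparing the extracted fact succeeds.
const tempA = extractTemperatureC(answerA);
const tempB = extractTemperatureC(answerB);
const sameTemperature = tempA !== null && tempA === tempB;
```

Fact extraction like this only works for answers with a checkable core fact; open-ended answers need the assertion types and LLM-based evaluation described later in the article.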

Multi-step complexity

An agent works through several steps, each depending on the result of the one before. A small change at step 2 can lead to a completely different result at step 5. Testing each step in isolation does not guarantee that the whole pipeline works correctly.

Tool interaction uncertainty

Agents call external tools (APIs, databases, the web) whose results change over time. Today's data differs from yesterday's, and an API may time out or return a different result.
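A common way to take this variability out of tests is to replay recorded tool responses instead of calling the live service. The sketch below illustrates the idea under stated assumptions: `FixtureStore`, the `get_weather` tool name, and the fixture shape are all hypothetical, and the lookup is synchronous for brevity (a real tool handler would typically wrap it in an async function).

```typescript
// A recorded outcome for one tool call: either a response body or a failure.
type Fixture = { ok: true; body: string } | { ok: false; error: string };

class FixtureStore {
  private fixtures = new Map<string, Fixture>();

  private key(tool: string, args: Record<string, unknown>): string {
    // Sort argument names so {a, b} and {b, a} map to the same fixture.
    const sorted = Object.keys(args).sort().map((k) => [k, args[k]]);
    return tool + ":" + JSON.stringify(sorted);
  }

  record(tool: string, args: Record<string, unknown>, fixture: Fixture): void {
    this.fixtures.set(this.key(tool, args), fixture);
  }

  // Replays the recorded outcome; throws for recorded failures and for unknown calls.
  replay(tool: string, args: Record<string, unknown>): string {
    const fixture = this.fixtures.get(this.key(tool, args));
    if (!fixture) throw new Error("No fixture recorded for " + tool);
    if (!fixture.ok) throw new Error(fixture.error);
    return fixture.body;
  }
}

const store = new FixtureStore();
store.record("get_weather", { city: "Hanoi" }, { ok: true, body: '{"tempC":28}' });
store.record("get_weather", { city: "Hue" }, { ok: false, error: "Request timed out" });
```

Recording a failure fixture alongside the success case lets the same test suite exercise the timeout path deterministically.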

Evaluation subjectivity

Judging whether an answer is "good" or "bad" is often subjective. A concise answer may count as good in a customer-support context but lack detail in a research context.

Test Categories for AI Agents

Split testing into four levels, from specific to general:

Level 1: Unit Tests - testing each component

Test the agent's individual components: tool handlers, prompt templates, output parsers, and utility functions.

// tests/unit/tool-handlers.test.ts
import { describe, it, expect, vi } from "vitest";
import { handleSearchTool } from "../../src/tools/search.js";

describe("Search Tool Handler", () => {
  it("returns results when matches are found", async () => {
    // Mock database that resolves with two hits
    const mockDb = {
      search: vi.fn().mockResolvedValue([
        { id: "1", title: "Post 1", score: 0.95 },
        { id: "2", title: "Post 2", score: 0.82 },
      ]),
    };

    const result = await handleSearchTool(
      { query: "AI agent", limit: 5 },
      mockDb as any
    );

    // The handler returns a JSON string, so parse it once before asserting
    const parsed = JSON.parse(result);
    expect(parsed).toHaveProperty("results");
    expect(parsed.results).toHaveLength(2);
  });

  it("returns an empty array when nothing is found", async () => {
    const mockDb = {
      search: vi.fn().mockResolvedValue([]),
    };

    const result = await handleSearchTool(
      { query: "xyz does not exist", limit: 5 },
      mockDb as any
    );

    expect(JSON.parse(result).results).toHaveLength(0);
  });

  it("propagates database errors to the caller", async () => {
    const mockDb = {
      search: vi.fn().mockRejectedValue(
        new Error("Connection refused")
      ),
    };

    // The handler is expected to reject rather than swallow the error
    await expect(
      handleSearchTool({ query: "test", limit: 5 }, mockDb as any)
    ).rejects.toThrow("Connection refused");
  });
});
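The article does not show `handleSearchTool` itself. As a rough sketch of a handler that would satisfy the three tests above, assuming a database object with an async `search(query, limit)` method (the `SearchDb` and `SearchHit` shapes are assumptions, not the author's implementation):

```typescript
// Assumed shapes; the real implementation is not shown in the article.
interface SearchHit { id: string; title: string; score: number }
interface SearchDb {
  search(query: string, limit: number): Promise<SearchHit[]>;
}

// Returns a JSON string because tool results are handed back to the model as text.
async function handleSearchTool(
  input: { query: string; limit: number },
  db: SearchDb
): Promise<string> {
  // Database errors are deliberately not caught here, so they reach the caller.
  const hits = await db.search(input.query, input.limit);
  // Enforce the limit even if the database returns more rows than requested.
  return JSON.stringify({ results: hits.slice(0, input.limit) });
}
```

Letting the error propagate keeps the decision about retrying or reporting in the agent loop, which is what the third unit test checks.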

Level 2: Integration Tests - testing tool orchestration

Test how the agent calls tools, handles their results, and decides on the next step.

// tests/integration/agent-tools.test.ts
import { describe, it, expect } from "vitest";
import { createTestAgent } from "../helpers/test-agent.js";

describe("Agent Tool Orchestration", () => {
  it("calls the right tool for a data question", async () => {
    const agent = createTestAgent({
      tools: ["search_docs", "query_db", "calculate"],
      mockResponses: {
        search_docs: '{"results": [{"title": "Guide"}]}',
        query_db: '{"rows": [{"count": 42}]}',
      },
    });

    await agent.run("How many orders were placed this month?");

    // The agent must call query_db, not search_docs
    expect(agent.toolCallLog).toContainEqual(
      expect.objectContaining({ name: "query_db" })
    );
    expect(agent.toolCallLog).not.toContainEqual(
      expect.objectContaining({ name: "search_docs" })
    );
  });

  it("calls several tools when information comes from several sources", async () => {
    const agent = createTestAgent({
      tools: ["search_docs", "query_db"],
      mockResponses: {
        search_docs: '{"results": [{"content": "Policy A"}]}',
        query_db: '{"rows": [{"revenue": 1000000}]}',
      },
    });

    await agent.run(
      "Compare revenue with the target stated in the policy"
    );

    const toolNames = agent.toolCallLog.map(
      (c: any) => c.name
    );
    expect(toolNames).toContain("search_docs");
    expect(toolNames).toContain("query_db");
  });

  it("handles a tool error by retrying or reporting it", async () => {
    const agent = createTestAgent({
      tools: ["search_docs"],
      mockResponses: {
        // An Error value makes the mock tool fail instead of returning data
        search_docs: new Error("Service unavailable"),
      },
    });

    const result = await agent.run("Find documents about X");

    // The agent must report the failure and never fabricate a result
    expect(result.text).toMatch(
      /cannot|unable|error|could not find|try again/i
    );
  });
});
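The `createTestAgent` helper is not shown in the article. Its tool-facing half could look roughly like the sketch below: a layer that answers from canned responses and records every call in a `toolCallLog`. The name `createMockToolLayer` and its shape are hypothetical, and the model loop that decides which tool to call is left out.

```typescript
interface ToolCall { name: string; input: unknown }
type MockResponse = string | Error;

// Builds a tool executor that logs every call and answers from canned responses.
function createMockToolLayer(mockResponses: Record<string, MockResponse>) {
  const toolCallLog: ToolCall[] = [];

  function callTool(name: string, input: unknown): string {
    // Log first, so failed calls also show up in the log.
    toolCallLog.push({ name, input });
    const response = mockResponses[name];
    if (response === undefined) throw new Error("No mock response for tool " + name);
    if (response instanceof Error) throw response;
    return response;
  }

  return { toolCallLog, callTool };
}

const layer = createMockToolLayer({
  query_db: '{"rows": [{"count": 42}]}',
  search_docs: new Error("Service unavailable"),
});
const rows = layer.callTool("query_db", { sql: "SELECT COUNT(*) FROM orders" });
```

Because the log is a plain array of `{ name, input }` objects, assertions such as `toContainEqual(expect.objectContaining({ name: "query_db" }))` work against it directly.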

Level 3: End-to-End Tests - testing the whole pipeline

Test the agent from start to finish with real or near-real data, measuring overall quality.

// tests/e2e/agent-quality.test.ts
import { describe, it, expect } from "vitest";
import { ProductionAgent } from "../../src/agent.js";
import { goldenDataset } from "../fixtures/golden-dataset.js";

describe("Agent End-to-End Quality", () => {
  const agent = new ProductionAgent({
    model: "claude-sonnet-4-20250514",
    // Use a test database with fixed data
    dbUrl: process.env.TEST_DATABASE_URL,
  });

  // Run every case in the golden dataset
  for (const testCase of goldenDataset) {
    it("[" + testCase.id + "] " + testCase.description, async () => {
      const result = await agent.run(testCase.input);

      // Check each required condition
      for (const assertion of testCase.assertions) {
        switch (assertion.type) {
          case "contains":
            expect(result.text.toLowerCase()).toContain(
              assertion.value.toLowerCase()
            );
            break;
          case "not_contains":
            expect(result.text.toLowerCase()).not.toContain(
              assertion.value.toLowerCase()
            );
            break;
          case "tool_used":
            expect(
              result.toolCalls.map((c: any) => c.name)
            ).toContain(assertion.value);
            break;
          case "max_steps":
            expect(result.steps).toBeLessThanOrEqual(
              assertion.value
            );
            break;
          case "max_tokens":
            expect(result.totalTokens).toBeLessThanOrEqual(
              assertion.value
            );
            break;
        }
      }
    }, 30000); // 30-second timeout for each E2E test
  }
});
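The assertion switch above lives inside the test body, which makes it hard to reuse when computing metrics outside the test runner. One option, sketched here as an assumption rather than the author's design, is to pull it into a pure function that returns a failure reason. The `AgentResult` shape mirrors the fields the test reads (`text`, `toolCalls`, `steps`, `totalTokens`).

```typescript
type Assertion =
  | { type: "contains" | "not_contains" | "tool_used"; value: string }
  | { type: "max_steps" | "max_tokens"; value: number };

interface AgentResult {
  text: string;
  toolCalls: { name: string }[];
  steps: number;
  totalTokens: number;
}

// Returns null when the assertion holds, otherwise a human-readable failure reason.
function checkAssertion(result: AgentResult, a: Assertion): string | null {
  const text = result.text.toLowerCase();
  switch (a.type) {
    case "contains":
      return text.includes(a.value.toLowerCase()) ? null : 'missing "' + a.value + '"';
    case "not_contains":
      return text.includes(a.value.toLowerCase()) ? 'unexpected "' + a.value + '"' : null;
    case "tool_used":
      return result.toolCalls.some((c) => c.name === a.value) ? null : "tool not used: " + a.value;
    case "max_steps":
      return result.steps <= a.value ? null : "steps " + result.steps + " > " + a.value;
    case "max_tokens":
      return result.totalTokens <= a.value ? null : "tokens " + result.totalTokens + " > " + a.value;
  }
}

const sample: AgentResult = {
  text: "The MacBook Pro uses the M3 chip.",
  toolCalls: [{ name: "search_products" }],
  steps: 2,
  totalTokens: 900,
};
const checks: Assertion[] = [
  { type: "contains", value: "m3" },
  { type: "tool_used", value: "search_products" },
  { type: "max_steps", value: 1 },
];
const failures = checks
  .map((a) => checkAssertion(sample, a))
  .filter((f): f is string => f !== null);
```

The returned reason strings can feed the `failureReason` field used by the metrics code later in the article.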

Level 4: Evaluation Tests - LLM-based evaluation

Use another LLM (or the same model with a different prompt) to evaluate answer quality. This is the "LLM-as-Judge" approach.

// tests/evaluation/llm-judge.test.ts
import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default
const judge = new Anthropic();

async function evaluateAnswer(
  question: string,
  answer: string,
  referenceAnswer: string,
  criteria: string[]
): Promise<{
  score: number;
  reasoning: string;
  criteria_scores: Record<string, number>;
}> {
  const response = await judge.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{
      role: "user",
      content: "Evaluate the quality of the following answer.\n\n"
        + "Question: " + question + "\n\n"
        + "Answer to evaluate:\n" + answer + "\n\n"
        + "Reference answer (ground truth):\n"
        + referenceAnswer + "\n\n"
        + "Evaluation criteria:\n"
        + criteria.map((c, i) => (i + 1) + ". " + c).join("\n")
        + "\n\nReturn JSON in this format:\n"
        + '{ "score": 0-100, "reasoning": "...", '
        + '"criteria_scores": { ... } }\n\n'
        + "Return only the JSON, with no other text."
    }],
  });

  // Use the first content block if it is text; otherwise fall back to an empty string
  const text = response.content[0].type === "text"
    ? response.content[0].text : "";
  // Throws if the judge did not return valid JSON
  return JSON.parse(text);
}
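Judge models do not always return bare JSON, even when asked to, so calling `JSON.parse` on the raw text can fail or accept out-of-range scores. A defensive parser might look like the sketch below; `parseJudgeVerdict` is a hypothetical helper, and the validation rules are assumptions based on the format requested in the prompt.

```typescript
interface JudgeVerdict {
  score: number;
  reasoning: string;
  criteria_scores: Record<string, number>;
}

// Takes the text between the first "{" and the last "}" and validates its shape.
function parseJudgeVerdict(raw: string): JudgeVerdict {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("No JSON object in judge output");

  const parsed = JSON.parse(raw.slice(start, end + 1));
  if (typeof parsed.score !== "number" || parsed.score < 0 || parsed.score > 100) {
    throw new Error("score must be a number between 0 and 100");
  }
  if (typeof parsed.reasoning !== "string") {
    throw new Error("reasoning must be a string");
  }
  if (typeof parsed.criteria_scores !== "object" || parsed.criteria_scores === null) {
    throw new Error("criteria_scores must be an object");
  }
  return parsed as JudgeVerdict;
}

const verdict = parseJudgeVerdict(
  'Here is my evaluation:\n{"score": 85, "reasoning": "Accurate", "criteria_scores": {"accuracy": 90}}'
);
```

A test can treat a thrown error as an evaluation failure rather than as a low score, which keeps judge formatting problems separate from agent quality problems.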

Golden Dataset - The Benchmark Test Set

A golden dataset is a collection of questions with reference answers, used as the benchmark for the agent. It is the most important foundation of agent testing.

Golden dataset structure

// tests/fixtures/golden-dataset.ts
export interface GoldenTestCase {
  id: string;
  category: string;
  description: string;
  input: string;
  expected_behavior: {
    should_use_tools: boolean;
    expected_tools?: string[];
    max_steps?: number;
  };
  reference_answer: string;
  assertions: Array<{
    type: "contains" | "not_contains" | "tool_used"
      | "max_steps" | "max_tokens";
    value: any;
    description: string;
  }>;
  difficulty: "easy" | "medium" | "hard";
  tags: string[];
}

export const goldenDataset: GoldenTestCase[] = [
  {
    id: "GD-001",
    category: "product_inquiry",
    description: "Asks about a specific product",
    input: "Tell me the technical specifications of the MacBook Pro M3",
    expected_behavior: {
      should_use_tools: true,
      expected_tools: ["search_products"],
      max_steps: 3,
    },
    reference_answer: "The MacBook Pro M3 has an M3 chip, "
      + "18GB RAM, 512GB SSD, a 14.2-inch display...",
    assertions: [
      {
        type: "tool_used",
        value: "search_products",
        description: "Must call the product search tool",
      },
      {
        type: "contains",
        value: "M3",
        description: "Must mention the M3 chip",
      },
      {
        type: "not_contains",
        value: "I don't know",
        description: "Must not claim ignorance when data is available",
      },
      {
        type: "max_steps",
        value: 5,
        description: "Must not exceed 5 steps",
      },
    ],
    difficulty: "easy",
    tags: ["product", "search"],
  },
  {
    id: "GD-002",
    category: "multi_step_analysis",
    description: "Multi-step analysis request",
    input: "Compare revenue for January and February, "
      + "and explain the cause of the difference "
      + "based on customer feedback",
    expected_behavior: {
      should_use_tools: true,
      expected_tools: ["query_db", "search_feedback"],
      max_steps: 8,
    },
    reference_answer: "Revenue Jan: 500M, Feb: 650M, "
      + "up 30%. Main causes from feedback: ...",
    assertions: [
      {
        type: "tool_used",
        value: "query_db",
        description: "Must query the database for revenue",
      },
      {
        type: "tool_used",
        value: "search_feedback",
        description: "Must search customer feedback",
      },
      {
        type: "contains",
        value: "revenue",
        description: "Must mention revenue",
      },
      {
        type: "max_steps",
        value: 10,
        description: "Must not exceed 10 steps",
      },
    ],
    difficulty: "hard",
    tags: ["analysis", "multi-step", "data"],
  },
  {
    id: "GD-003",
    category: "boundary",
    description: "Out-of-scope question",
    input: "How do I cook beef pho?",
    expected_behavior: {
      should_use_tools: false,
      max_steps: 1,
    },
    reference_answer: "Sorry, I am a product support assistant. "
      + "I cannot help you with cooking recipes.",
    assertions: [
      {
        type: "not_contains",
        value: "ingredients",
        description: "Must not answer an out-of-scope question",
      },
      {
        type: "max_steps",
        value: 2,
        description: "Must decline quickly, without searching",
      },
    ],
    difficulty: "easy",
    tags: ["boundary", "out-of-scope"],
  },
];
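As the dataset grows, it helps to lint it before running any agent. The sketch below checks a few structural rules. The rules are assumptions, not requirements stated in the article: ids must be unique, every case needs at least one assertion, and a `max_steps` assertion (read here as the hard limit) should not be tighter than `expected_behavior.max_steps` (read as the typical step count). `GoldenCaseLike` is a minimal structural type that objects of `GoldenTestCase` would satisfy.

```typescript
interface GoldenCaseLike {
  id: string;
  expected_behavior: { max_steps?: number };
  assertions: { type: string; value: unknown }[];
}

// Returns a list of problems; an empty list means the dataset passed these checks.
function validateDataset(cases: GoldenCaseLike[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const c of cases) {
    if (seen.has(c.id)) problems.push(c.id + ": duplicate id");
    seen.add(c.id);
    if (c.assertions.length === 0) problems.push(c.id + ": no assertions");

    const expected = c.expected_behavior.max_steps;
    for (const a of c.assertions) {
      // Assumed rule: the hard limit must be at least the expected step count.
      if (a.type === "max_steps" && typeof a.value === "number"
          && expected !== undefined && a.value < expected) {
        problems.push(c.id + ": max_steps assertion (" + a.value
          + ") is below expected max_steps (" + expected + ")");
      }
    }
  }
  return problems;
}

const problems = validateDataset([
  { id: "GD-001", expected_behavior: { max_steps: 3 }, assertions: [{ type: "max_steps", value: 5 }] },
  { id: "GD-001", expected_behavior: {}, assertions: [] },
  { id: "GD-002", expected_behavior: { max_steps: 8 }, assertions: [{ type: "max_steps", value: 4 }] },
]);
```

Running a check like this as a unit test keeps dataset mistakes from showing up as agent failures.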

Building an effective golden dataset

Start small: 20-30 test cases for an MVP, then expand gradually
Cover categories evenly: Make sure every type of question the agent handles has test cases
Include edge cases: Ambiguous questions, out-of-scope questions, malformed input
Collect from production: Log real questions and pick representative cases
Version control: Keep the golden dataset in git, with a changelog
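The "cover categories evenly" guideline can be checked mechanically. A minimal sketch, assuming you maintain a list of categories the agent is expected to handle (`findUnderCoveredCategories` and the minimum of two cases per category are illustrative choices, not values from the article):

```typescript
// Counts test cases per category and reports required categories below a minimum.
function findUnderCoveredCategories(
  cases: { category: string }[],
  requiredCategories: string[],
  minPerCategory: number
): string[] {
  const counts = new Map<string, number>();
  for (const c of cases) {
    counts.set(c.category, (counts.get(c.category) ?? 0) + 1);
  }
  return requiredCategories.filter(
    (cat) => (counts.get(cat) ?? 0) < minPerCategory
  );
}

const gaps = findUnderCoveredCategories(
  [
    { category: "product_inquiry" },
    { category: "product_inquiry" },
    { category: "boundary" },
  ],
  ["product_inquiry", "multi_step_analysis", "boundary"],
  2
);
// gaps is ["multi_step_analysis", "boundary"]
```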

Success Rate Measurement

Measuring success rate is the main way to track agent quality over time.

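// src/metrics/success-rate.ts
// Assumption: the golden dataset is importable from the test fixtures;
// adjust the path to your project layout.
import { goldenDataset } from "../../tests/fixtures/golden-dataset.js";

// Looks up a golden test case by id (used below to group results by difficulty)
function getTestCase(id: string) {
  return goldenDataset.find(tc => tc.id === id);
}
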
export interface TestRun {
  runId: string;
  timestamp: Date;
  model: string;
  results: TestResult[];
}

export interface TestResult {
  testCaseId: string;
  passed: boolean;
  score: number;
  latencyMs: number;
  tokensUsed: number;
  stepsCount: number;
  failureReason?: string;
}

export function calculateSuccessMetrics(run: TestRun) {
  const total = run.results.length;
  const passed = run.results.filter(r => r.passed).length;

  // Group results by difficulty
  const byDifficulty = {
    easy: run.results.filter(
      r => getTestCase(r.testCaseId)?.difficulty === "easy"
    ),
    medium: run.results.filter(
      r => getTestCase(r.testCaseId)?.difficulty === "medium"
    ),
    hard: run.results.filter(
      r => getTestCase(r.testCaseId)?.difficulty === "hard"
    ),
  };

  const avgLatency = run.results.reduce(
    (sum, r) => sum + r.latencyMs, 0
  ) / total;

  const avgTokens = run.results.reduce(
    (sum, r) => sum + r.tokensUsed, 0
  ) / total;

  const avgSteps = run.results.reduce(
    (sum, r) => sum + r.stepsCount, 0
  ) / total;

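  // Rough cost estimate: applies the Claude Sonnet input price ($3 per 1M tokens)
  // to all tokens, so it understates cost when output tokens are a large share
  const costPerQuery = avgTokens * 0.000003;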

  return {
    overall_pass_rate: Math.round(passed / total * 100),
    easy_pass_rate: calculatePassRate(byDifficulty.easy),
    medium_pass_rate: calculatePassRate(byDifficulty.medium),
    hard_pass_rate: calculatePassRate(byDifficulty.hard),
    avg_latency_ms: Math.round(avgLatency),
    avg_tokens: Math.round(avgTokens),
    avg_steps: Math.round(avgSteps * 10) / 10,
    cost_per_query_usd:
      Math.round(costPerQuery * 10000) / 10000,
    failed_cases: run.results
      .filter(r => !r.passed)
      .map(r => ({
        id: r.testCaseId,
        reason: r.failureReason,
      })),
  };
}

function calculatePassRate(
  results: TestResult[]
): number {
  if (results.length === 0) return 0;
  const passed = results.filter(r => r.passed).length;
  return Math.round(passed / results.length * 100);
}

Cost Benchmarking

Tracking cost is essential for production agents. Every step in which the agent calls the API consumes tokens, and cost rises quickly as the agent runs more steps.

// src/metrics/cost-tracker.ts
export class CostTracker {
  private records: CostRecord[] = [];

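  // Prices per model (USD per 1M tokens). Check these against the current
  // Anthropic pricing page before relying on them.
  private pricing: Record<string, {
    input: number; output: number
  }> = {
    "claude-sonnet-4-20250514": { input: 3, output: 15 },
    // The original listing used "claude-haiku-4-20250414", which does not match
    // a known model ID; these prices correspond to Claude 3 Haiku.
    "claude-3-haiku-20240307": { input: 0.25, output: 1.25 },
    "claude-opus-4-20250514": { input: 15, output: 75 },
  };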

  record(entry: {
    model: string;
    inputTokens: number;
    outputTokens: number;
    testCaseId: string;
  }) {
    const price = this.pricing[entry.model]
      || { input: 3, output: 15 };
    const cost =
      (entry.inputTokens * price.input +
       entry.outputTokens * price.output) / 1_000_000;

    this.records.push({
      ...entry,
      costUsd: cost,
      timestamp: new Date(),
    });
  }

  getSummary() {
    const totalCost = this.records.reduce(
      (sum, r) => sum + r.costUsd, 0
    );
    const totalQueries = this.records.length;

    return {
      total_cost_usd:
        Math.round(totalCost * 10000) / 10000,
      avg_cost_per_query:
        Math.round(totalCost / totalQueries * 10000) / 10000,
      total_queries: totalQueries,
      cost_by_model: this.groupByModel(),
      projected_monthly_cost_1k_queries:
        Math.round(totalCost / totalQueries * 1000 * 100) / 100,
    };
  }

  private groupByModel() {
    const groups: Record<string, number> = {};
    for (const record of this.records) {
      groups[record.model] =
        (groups[record.model] || 0) + record.costUsd;
    }
    return groups;
  }
}

interface CostRecord {
  model: string;
  inputTokens: number;
  outputTokens: number;
  testCaseId: string;
  costUsd: number;
  timestamp: Date;
}
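To make the cost formula concrete, here is a worked example using the same arithmetic as `CostTracker.record` with the Sonnet prices from the table above. The four-step run and its token counts are made up for illustration; the point is that input tokens grow at each step because earlier tool results are sent back to the model.

```typescript
// USD per 1M tokens (Sonnet prices as listed in the pricing table above).
const price = { input: 3, output: 15 };

function costUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Hypothetical 4-step agent run with a growing context.
const steps = [
  { inputTokens: 1_000, outputTokens: 200 },
  { inputTokens: 1_500, outputTokens: 200 },
  { inputTokens: 2_000, outputTokens: 200 },
  { inputTokens: 2_500, outputTokens: 400 },
];

// 7,000 input tokens * $3 + 1,000 output tokens * $15, per 1M tokens: about $0.036
const totalCost = steps.reduce(
  (sum, s) => sum + costUsd(s.inputTokens, s.outputTokens),
  0
);

// At 1,000 queries per month this run profile costs about $36.
const monthlyCost1k = totalCost * 1000;
```

In this example the last step alone accounts for more than a third of the total, which is why step count and cost are worth tracking together.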

Regression Testing

Regression testing makes sure that new changes (prompt updates, model upgrades, added or modified tools) do not degrade agent quality.

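// tests/regression/compare-runs.ts
// Assumption: TestRun and calculateSuccessMetrics come from the success-rate module;
// RegressionReport and findImprovements are project-defined and not shown here.
import {
  type TestRun,
  calculateSuccessMetrics,
} from "../../src/metrics/success-rate.js";
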
export function compareRuns(
  baseline: TestRun,
  current: TestRun,
  threshold: number = 5 // Maximum allowed drop in pass rate, in percentage points
): RegressionReport {
  const baselineMetrics = calculateSuccessMetrics(baseline);
  const currentMetrics = calculateSuccessMetrics(current);

  const passRateDiff =
    currentMetrics.overall_pass_rate
    - baselineMetrics.overall_pass_rate;

  const latencyDiff =
    ((currentMetrics.avg_latency_ms
      - baselineMetrics.avg_latency_ms)
    / baselineMetrics.avg_latency_ms) * 100;

  const costDiff =
    ((currentMetrics.cost_per_query_usd
      - baselineMetrics.cost_per_query_usd)
    / baselineMetrics.cost_per_query_usd) * 100;

  // Find test cases that passed in the baseline but fail now
  const regressions: { testCaseId: string; reason?: string }[] = [];
  for (const currentResult of current.results) {
    const baselineResult = baseline.results.find(
      r => r.testCaseId === currentResult.testCaseId
    );
    if (baselineResult?.passed && !currentResult.passed) {
      regressions.push({
        testCaseId: currentResult.testCaseId,
        reason: currentResult.failureReason,
      });
    }
  }

  return {
    is_regression: passRateDiff < -threshold,
    pass_rate_change: passRateDiff,
    latency_change_percent: Math.round(latencyDiff),
    cost_change_percent: Math.round(costDiff),
    regressions: regressions,
    improvements: findImprovements(baseline, current),
    recommendation: passRateDiff < -threshold
      ? "BLOCK - Regression exceeds the allowed threshold"
      : passRateDiff < 0
      ? "WARNING - Slight drop, needs review"
      : "PASS - No regression",
  };
}
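`compareRuns` calls `findImprovements`, which the article does not define. A plausible sketch is the mirror image of the regression loop: cases that failed in the baseline and pass in the current run. The return shape (a list of test case ids) is an assumption.

```typescript
interface CaseResult { testCaseId: string; passed: boolean }
interface RunLike { results: CaseResult[] }

// Returns ids of cases that failed in the baseline and pass now.
// Cases missing from the baseline are ignored, since there is nothing to compare against.
function findImprovements(baseline: RunLike, current: RunLike): string[] {
  const baselinePassed = new Map(
    baseline.results.map((r) => [r.testCaseId, r.passed] as const)
  );
  return current.results
    .filter((r) => r.passed && baselinePassed.get(r.testCaseId) === false)
    .map((r) => r.testCaseId);
}

const improved = findImprovements(
  {
    results: [
      { testCaseId: "GD-001", passed: true },
      { testCaseId: "GD-002", passed: false },
      { testCaseId: "GD-003", passed: false },
    ],
  },
  {
    results: [
      { testCaseId: "GD-001", passed: false },
      { testCaseId: "GD-002", passed: true },
      { testCaseId: "GD-003", passed: false },
      { testCaseId: "GD-004", passed: true },
    ],
  }
);
// improved is ["GD-002"]
```

Building the lookup map once also avoids the repeated `find` over the baseline that the regression loop performs for every current result.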

A/B Testing Agents

A/B testing compares two agent versions on the same dataset to decide which one is better.

// src/ab-testing/runner.ts
export async function runABTest(config: {
  agentA: AgentConfig;
  agentB: AgentConfig;
  testCases: GoldenTestCase[];
  metrics: string[];
}): Promise<ABTestResult> {
  const resultsA: TestResult[] = [];
  const resultsB: TestResult[] = [];

  // Run both agents on the same test cases
  for (const testCase of config.testCases) {
    const [resultA, resultB] = await Promise.all([
      runAgent(config.agentA, testCase),
      runAgent(config.agentB, testCase),
    ]);
    resultsA.push(resultA);
    resultsB.push(resultB);
  }

  const metricsA = calculateSuccessMetrics({
    runId: "A", timestamp: new Date(),
    model: config.agentA.model, results: resultsA,
  });
  const metricsB = calculateSuccessMetrics({
    runId: "B", timestamp: new Date(),
    model: config.agentB.model, results: resultsB,
  });

  return {
    agent_a: {
      config: config.agentA,
      metrics: metricsA,
    },
    agent_b: {
      config: config.agentB,
      metrics: metricsB,
    },
    winner: determineWinner(metricsA, metricsB),
    detailed_comparison: compareMetrics(metricsA, metricsB),
  };
}
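The article leaves `determineWinner` undefined, and the right rule depends on your priorities. One possible rule, sketched here purely as an assumption: pass rate decides unless the two agents are within a small tolerance, in which case the cheaper agent wins, and latency breaks any remaining tie. The two-point default tolerance is arbitrary.

```typescript
interface Metrics {
  overall_pass_rate: number;   // percent, 0-100
  cost_per_query_usd: number;
  avg_latency_ms: number;
}

// Assumed decision rule: quality first, then cost, then latency.
function determineWinner(
  a: Metrics,
  b: Metrics,
  tolerancePoints = 2
): "A" | "B" | "tie" {
  const gap = a.overall_pass_rate - b.overall_pass_rate;
  if (Math.abs(gap) > tolerancePoints) return gap > 0 ? "A" : "B";
  if (a.cost_per_query_usd !== b.cost_per_query_usd) {
    return a.cost_per_query_usd < b.cost_per_query_usd ? "A" : "B";
  }
  if (a.avg_latency_ms !== b.avg_latency_ms) {
    return a.avg_latency_ms < b.avg_latency_ms ? "A" : "B";
  }
  return "tie";
}

// Pass rates are within tolerance, so the cheaper agent B wins.
const winner = determineWinner(
  { overall_pass_rate: 90, cost_per_query_usd: 0.02, avg_latency_ms: 3000 },
  { overall_pass_rate: 89, cost_per_query_usd: 0.004, avg_latency_ms: 1500 }
);
```

With a golden dataset of only 20-30 cases, a one-case difference already moves the pass rate by several points, so the tolerance should be chosen with the dataset size in mind.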

Common variants to A/B test

Model comparison: Claude Sonnet vs Claude Haiku on the same task (quality vs cost trade-off)
Prompt variants: System prompt A vs B (different instructions or examples)
Tool sets: Tool set A vs B (adding or removing tools, changing descriptions)
Temperature: temperature 0 vs 0.3 (consistency vs creativity trade-off)
Architecture: Single agent vs multi-agent on the same task

Pytest Fixtures for Agents

If you use Python, pytest fixtures help organize agent tests effectively.

# tests/conftest.py
import json
from types import SimpleNamespace
from unittest.mock import MagicMock

import anthropic
import pytest


@pytest.fixture
def claude_client():
    """Real client for tests that call the API (reads ANTHROPIC_API_KEY from the environment)."""
    return anthropic.Anthropic()


@pytest.fixture
def mock_claude_client():
    """Mock client for unit tests; makes no network calls."""
    # anthropic.Anthropic is the synchronous client, so MagicMock is used
    # instead of AsyncMock. The SDK returns objects with attributes, not dicts.
    client = MagicMock()
    client.messages.create.return_value = SimpleNamespace(
        content=[SimpleNamespace(type="text", text="Mock response")],
        stop_reason="end_turn",
        usage=SimpleNamespace(input_tokens=100, output_tokens=50),
    )
    return client


@pytest.fixture
def golden_dataset():
    """Load the golden dataset from a JSON file."""
    with open("tests/fixtures/golden-dataset.json", encoding="utf-8") as f:
        return json.load(f)


@pytest.fixture
def test_agent(mock_claude_client):
    """Agent wired to mock dependencies."""
    # Assumption: Agent and get_test_tools are project code not shown in this
    # article; the tests.helpers import path is hypothetical.
    from src.agent import Agent
    from tests.helpers import get_test_tools
    return Agent(
        client=mock_claude_client,
        model="claude-sonnet-4-20250514",
        tools=get_test_tools(),
    )


@pytest.fixture
def cost_tracker():
    """Cost tracker for benchmark tests."""
    # Assumption: a Python port of CostTracker that exposes get_summary()
    from src.metrics import CostTracker
    tracker = CostTracker()
    yield tracker
    # Print the summary once the test has finished
    print(tracker.get_summary())

CI/CD Pipeline for Agent Testing

# .github/workflows/agent-tests.yml
name: Agent Test Suite

on:
  push:
    branches: [main]
    paths: ["src/agent/**", "src/tools/**", "tests/**"]
  pull_request:
    branches: [main]

env:
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "20" }
      - run: npm ci
      - run: npx vitest run tests/unit --coverage
      - uses: actions/upload-artifact@v4
        with:
          name: unit-coverage
          path: coverage/

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "20" }
      - run: npm ci
      - run: npx vitest run tests/integration

  e2e-evaluation:
    runs-on: ubuntu-latest
    needs: integration-tests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "20" }
      - run: npm ci

      # Run the golden dataset tests
      - run: npx vitest run tests/e2e
        env:
          TEST_DATABASE_URL: ${{ secrets.TEST_DATABASE_URL }}

      # Regression check
      - name: Compare with baseline
        run: |
          node scripts/compare-baseline.js \
            --baseline results/baseline.json \
            --current results/current.json \
            --threshold 5

      # Upload metrics
      - uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/

Monitoring Agents in Production

After deployment, the agent needs continuous monitoring:

Success rate over time: Track the trend and catch quality drops early
Latency P50/P95/P99: Make sure response time meets the SLA
Cost per query: Catch abnormal cost increases early (for example, an agent stuck in a loop)
Tool error rate: Identify which tools fail most often and should be fixed first
User satisfaction: Collect direct feedback from users (thumbs up/down, CSAT)
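For the latency metrics in the list above, percentiles can be computed from raw samples with the nearest-rank method. This is a minimal sketch with made-up latency values; production systems usually rely on their metrics backend for this rather than computing it by hand.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("No samples");
  const sorted = [...samples].sort((x, y) => x - y);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical per-query latencies in milliseconds.
const latenciesMs = [820, 950, 1100, 1200, 1300, 1450, 1600, 2100, 4800, 9500];

const p50 = percentile(latenciesMs, 50); // 1300
const p95 = percentile(latenciesMs, 95); // 9500
const p99 = percentile(latenciesMs, 99); // 9500
```

The gap between P50 and P95 in this sample shows why averages alone hide slow outliers, such as an agent that takes many extra steps on a few queries.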

Next steps

Testing AI agents is a continuous process, not a one-off task. Start with a small golden dataset (20-30 cases), set up a CI/CD pipeline, and expand test coverage gradually. Prioritize regression testing whenever you change a prompt or model, and use A/B testing for major decisions. Explore more about building production AI agents in the Thư viện Nâng cao Claude (Advanced Claude Library).
