{"product_id":"testing-ai-agent-framework-danh-gia-va-kiểm-thử-agent-production","title":"Testing AI Agents: A Framework for Evaluating and Testing Production Agents","description":"\n\u003cp\u003eAI agents differ fundamentally from traditional software: their output is not known in advance, their behavior changes with context, and \"correct\" comes in degrees rather than a plain true\/false. This makes agent testing one of the biggest challenges in bringing AI to production. This article presents a practical testing framework, from basic principles to a complete CI\/CD pipeline.\u003c\/p\u003e\n\n\u003ch2\u003eWhy is testing AI agents hard?\u003c\/h2\u003e\n\u003cp\u003eTraditional software testing rests on one principle: the same input produces the same output. AI agents break this principle at several levels:\u003c\/p\u003e\n\n\u003ch3\u003eNon-deterministic output\u003c\/h3\u003e\n\u003cp\u003eAsked the same question, an agent can answer in many different ways, all of them correct. For example, \"What is the weather like in Hanoi today?\" could be answered with \"Hanoi is 28°C and sunny today\" or \"Current weather in Hanoi: temperature 28°C, clear skies, humidity 75%\". Both are correct, yet the text is completely different.\u003c\/p\u003e\n\n\u003ch3\u003eMulti-step complexity\u003c\/h3\u003e\n\u003cp\u003eAn agent performs many steps, each depending on the result of the previous one. A small change at step 2 can lead to a completely different result at step 5. Testing each step in isolation does not guarantee that the whole pipeline works correctly.\u003c\/p\u003e\n\n\u003ch3\u003eTool interaction uncertainty\u003c\/h3\u003e\n\u003cp\u003eAgents call external tools (APIs, databases, the web) whose results change over time. Today's data differs from yesterday's, and an API may time out or return a different result.\u003c\/p\u003e\n\n\u003ch3\u003eEvaluation subjectivity\u003c\/h3\u003e\n\u003cp\u003eJudging whether an answer is \"good\" or \"bad\" is often subjective. 
A concise answer may be considered good in a customer support context yet lack detail in a research context.\u003c\/p\u003e\n\n\u003ch2\u003eTest Categories for AI Agents\u003c\/h2\u003e\n\u003cp\u003eSplit testing into 4 levels, from specific to general:\u003c\/p\u003e\n\n\u003ch3\u003eLevel 1: Unit Tests (testing each component)\u003c\/h3\u003e\n\u003cp\u003eTest the agent's individual parts: tool handlers, prompt templates, output parsers, and utility functions.\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003e\/\/ tests\/unit\/tool-handlers.test.ts\nimport { describe, it, expect, vi } from \"vitest\";\nimport { handleSearchTool } from \"..\/..\/src\/tools\/search.js\";\n\ndescribe(\"Search Tool Handler\", () =\u0026gt; {\n  it(\"returns results when matches are found\", async () =\u0026gt; {\n    const mockDb = {\n      search: vi.fn().mockResolvedValue([\n        { id: \"1\", title: \"Article 1\", score: 0.95 },\n        { id: \"2\", title: \"Article 2\", score: 0.82 },\n      ]),\n    };\n\n    const result = await handleSearchTool(\n      { query: \"AI agent\", limit: 5 },\n      mockDb as any\n    );\n\n    expect(JSON.parse(result)).toHaveProperty(\"results\");\n    expect(JSON.parse(result).results).toHaveLength(2);\n  });\n\n  it(\"returns an empty array when nothing is found\", async () =\u0026gt; {\n    const mockDb = {\n      search: vi.fn().mockResolvedValue([]),\n    };\n\n    const result = await handleSearchTool(\n      { query: \"xyz does not exist\", limit: 5 },\n      mockDb as any\n    );\n\n    expect(JSON.parse(result).results).toHaveLength(0);\n  });\n\n  it(\"propagates database errors to the caller\", async () =\u0026gt; {\n    const mockDb = {\n      search: vi.fn().mockRejectedValue(\n        new Error(\"Connection refused\")\n      ),\n    };\n\n    await expect(\n      handleSearchTool({ query: \"test\", limit: 5 }, mockDb as any)\n    ).rejects.toThrow(\"Connection refused\");\n  });\n});\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch3\u003eLevel 2: 
Integration Tests (testing tool orchestration)\u003c\/h3\u003e\n\u003cp\u003eTest how the agent calls tools, handles their results, and decides on the next step.\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003e\/\/ tests\/integration\/agent-tools.test.ts\nimport { describe, it, expect } from \"vitest\";\nimport { createTestAgent } from \"..\/helpers\/test-agent.js\";\n\ndescribe(\"Agent Tool Orchestration\", () =\u0026gt; {\n  it(\"calls the right tool for a data question\", async () =\u0026gt; {\n    const agent = createTestAgent({\n      tools: [\"search_docs\", \"query_db\", \"calculate\"],\n      mockResponses: {\n        search_docs: '{\"results\": [{\"title\": \"Guide\"}]}',\n        query_db: '{\"rows\": [{\"count\": 42}]}',\n      },\n    });\n\n    await agent.run(\n      \"How many orders were placed this month?\"\n    );\n\n    \/\/ The agent must call query_db, not search_docs\n    expect(agent.toolCallLog).toContainEqual(\n      expect.objectContaining({ name: \"query_db\" })\n    );\n    expect(agent.toolCallLog).not.toContainEqual(\n      expect.objectContaining({ name: \"search_docs\" })\n    );\n  });\n\n  it(\"calls several tools when information comes from several sources\",\n    async () =\u0026gt; {\n    const agent = createTestAgent({\n      tools: [\"search_docs\", \"query_db\"],\n      mockResponses: {\n        search_docs: '{\"results\": [{\"content\": \"Policy A\"}]}',\n        query_db: '{\"rows\": [{\"revenue\": 1000000}]}',\n      },\n    });\n\n    await agent.run(\n      \"Compare revenue against the target in the policy\"\n    );\n\n    const toolNames = agent.toolCallLog.map(\n      (c: any) =\u0026gt; c.name\n    );\n    expect(toolNames).toContain(\"search_docs\");\n    expect(toolNames).toContain(\"query_db\");\n  });\n\n  it(\"handles a tool error by retrying or reporting it\",\n    async () =\u0026gt; {\n    const agent = createTestAgent({\n      tools: [\"search_docs\"],\n      mockResponses: {\n        search_docs: new Error(\"Service unavailable\"),\n    
  },\n    });\n\n    const result = await agent.run(\"Find documents about X\");\n\n    \/\/ The agent must report the error and never fabricate results\n    expect(result.text).toMatch(\n      \/cannot|error|could not find|try again\/i\n    );\n  });\n});\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch3\u003eLevel 3: End-to-End Tests (testing the whole pipeline)\u003c\/h3\u003e\n\u003cp\u003eTest the agent from start to finish with real or near-real data, and measure overall quality.\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003e\/\/ tests\/e2e\/agent-quality.test.ts\nimport { describe, it, expect } from \"vitest\";\nimport { ProductionAgent } from \"..\/..\/src\/agent.js\";\nimport { goldenDataset } from \"..\/fixtures\/golden-dataset.js\";\n\ndescribe(\"Agent End-to-End Quality\", () =\u0026gt; {\n  const agent = new ProductionAgent({\n    model: \"claude-sonnet-4-20250514\",\n    \/\/ Use a test database with fixed data\n    dbUrl: process.env.TEST_DATABASE_URL,\n  });\n\n  \/\/ Run every case in the golden dataset\n  for (const testCase of goldenDataset) {\n    it(\"[\" + testCase.id + \"] \" + testCase.description, async () =\u0026gt; {\n      const result = await agent.run(testCase.input);\n\n      \/\/ Check the required conditions\n      for (const assertion of testCase.assertions) {\n        switch (assertion.type) {\n          case \"contains\":\n            expect(result.text.toLowerCase()).toContain(\n              assertion.value.toLowerCase()\n            );\n            break;\n          case \"not_contains\":\n            expect(result.text.toLowerCase()).not.toContain(\n              assertion.value.toLowerCase()\n            );\n            break;\n          case \"tool_used\":\n            expect(\n              result.toolCalls.map((c: any) =\u0026gt; c.name)\n            ).toContain(assertion.value);\n            break;\n          case \"max_steps\":\n            expect(result.steps).toBeLessThanOrEqual(\n              assertion.value\n            );\n          
  break;\n          case \"max_tokens\":\n            expect(result.totalTokens).toBeLessThanOrEqual(\n              assertion.value\n            );\n            break;\n        }\n      }\n    }, 30000); \/\/ 30-second timeout for each E2E test\n  }\n});\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch3\u003eLevel 4: Evaluation Tests (LLM-based evaluation)\u003c\/h3\u003e\n\u003cp\u003eUse another LLM (or the same model with a different prompt) to assess answer quality. This is the \"LLM-as-Judge\" approach.\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003e\/\/ tests\/evaluation\/llm-judge.test.ts\nimport Anthropic from \"@anthropic-ai\/sdk\";\n\nconst judge = new Anthropic();\n\nasync function evaluateAnswer(\n  question: string,\n  answer: string,\n  referenceAnswer: string,\n  criteria: string[]\n): Promise\u0026lt;{\n  score: number;\n  reasoning: string;\n  criteria_scores: Record\u0026lt;string, number\u0026gt;;\n}\u0026gt; {\n  const response = await judge.messages.create({\n    model: \"claude-sonnet-4-20250514\",\n    max_tokens: 1024,\n    messages: [{\n      role: \"user\",\n      content: \"Evaluate the quality of the following answer.\\n\\n\"\n        + \"Question: \" + question + \"\\n\\n\"\n        + \"Answer to evaluate:\\n\" + answer + \"\\n\\n\"\n        + \"Reference answer (ground truth):\\n\"\n        + referenceAnswer + \"\\n\\n\"\n        + \"Evaluation criteria:\\n\"\n        + criteria.map((c, i) =\u0026gt; (i + 1) + \". \" + c).join(\"\\n\")\n        + \"\\n\\nReturn JSON in this format:\\n\"\n        + '{ \"score\": 0-100, \"reasoning\": \"...\", '\n        + '\"criteria_scores\": { ... } }\\n\\n'\n        + \"Return only the JSON, with no other text.\"\n    }],\n  });\n\n  const text = response.content[0].type === \"text\"\n    ? 
response.content[0].text : \"\";\n  \/\/ JSON.parse throws if the judge wraps the JSON in any extra text\n  return JSON.parse(text);\n}\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eGolden Dataset: the reference test set\u003c\/h2\u003e\n\u003cp\u003eA golden dataset is a collection of questions with reference answers, used as a benchmark for the agent. It is the most important foundation of agent testing.\u003c\/p\u003e\n\n\u003ch3\u003eGolden dataset structure\u003c\/h3\u003e\n\u003cpre\u003e\u003ccode\u003e\/\/ tests\/fixtures\/golden-dataset.ts\nexport interface GoldenTestCase {\n  id: string;\n  category: string;\n  description: string;\n  input: string;\n  expected_behavior: {\n    should_use_tools: boolean;\n    expected_tools?: string[];\n    max_steps?: number;\n  };\n  reference_answer: string;\n  assertions: Array\u0026lt;{\n    type: \"contains\" | \"not_contains\" | \"tool_used\"\n      | \"max_steps\" | \"max_tokens\";\n    value: any;\n    description: string;\n  }\u0026gt;;\n  difficulty: \"easy\" | \"medium\" | \"hard\";\n  tags: string[];\n}\n\nexport const goldenDataset: GoldenTestCase[] = [\n  {\n    id: \"GD-001\",\n    category: \"product_inquiry\",\n    description: \"Asks about a specific product\",\n    input: \"Tell me the technical specifications of the MacBook Pro M3\",\n    expected_behavior: {\n      should_use_tools: true,\n      expected_tools: [\"search_products\"],\n      max_steps: 3,\n    },\n    reference_answer: \"The MacBook Pro M3 has an M3 chip, \"\n      + \"18GB RAM, a 512GB SSD, a 14.2-inch display...\",\n    assertions: [\n      {\n        type: \"tool_used\",\n        value: \"search_products\",\n        description: \"Must call the product search tool\",\n      },\n      {\n        type: \"contains\",\n        value: \"M3\",\n        description: \"Must mention the M3 chip\",\n      },\n      {\n        type: \"not_contains\",\n        value: \"I don't know\",\n        description: \"Must not claim ignorance when data exists\",\n      },\n      {\n        type: \"max_steps\",\n        value: 5,\n        description: \"Must not 
exceed 5 steps\",\n      },\n    ],\n    difficulty: \"easy\",\n    tags: [\"product\", \"search\"],\n  },\n  {\n    id: \"GD-002\",\n    category: \"multi_step_analysis\",\n    description: \"Analyzes a multi-step request\",\n    input: \"Compare revenue for January and February, \"\n      + \"and explain the cause of the difference \"\n      + \"based on customer feedback\",\n    expected_behavior: {\n      should_use_tools: true,\n      expected_tools: [\"query_db\", \"search_feedback\"],\n      max_steps: 8,\n    },\n    reference_answer: \"Revenue Jan: 500M, Feb: 650M, \"\n      + \"up 30%. Main causes from feedback: ...\",\n    assertions: [\n      {\n        type: \"tool_used\",\n        value: \"query_db\",\n        description: \"Must query the database to get revenue\",\n      },\n      {\n        type: \"tool_used\",\n        value: \"search_feedback\",\n        description: \"Must search customer feedback\",\n      },\n      {\n        type: \"contains\",\n        value: \"revenue\",\n        description: \"Must mention revenue\",\n      },\n      {\n        type: \"max_steps\",\n        value: 10,\n        description: \"Must not exceed 10 steps\",\n      },\n    ],\n    difficulty: \"hard\",\n    tags: [\"analysis\", \"multi-step\", \"data\"],\n  },\n  {\n    id: \"GD-003\",\n    category: \"boundary\",\n    description: \"Out-of-scope question\",\n    input: \"How do I cook beef pho?\",\n    expected_behavior: {\n      should_use_tools: false,\n      max_steps: 1,\n    },\n    reference_answer: \"Sorry, I am a product support assistant. 
\"\n      + \"I cannot help you with cooking recipes.\",\n    assertions: [\n      {\n        type: \"not_contains\",\n        value: \"ingredients\",\n        description: \"Must not answer an out-of-scope question\",\n      },\n      {\n        type: \"max_steps\",\n        value: 2,\n        description: \"Must decline quickly, without searching\",\n      },\n    ],\n    difficulty: \"easy\",\n    tags: [\"boundary\", \"out-of-scope\"],\n  },\n];\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch3\u003eBuilding an effective golden dataset\u003c\/h3\u003e\n\u003cul\u003e\n  \u003cli\u003e\n\u003cstrong\u003eStart small:\u003c\/strong\u003e 20-30 test cases for the MVP, then expand gradually\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eCover categories evenly:\u003c\/strong\u003e Make sure every type of question the agent handles has test cases\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eInclude edge cases:\u003c\/strong\u003e Ambiguous questions, out-of-scope questions, malformed input\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eCollect from production:\u003c\/strong\u003e Log real questions and pick representative cases\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eVersion control:\u003c\/strong\u003e The golden dataset must be managed in git, with a changelog\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch2\u003eSuccess Rate Measurement\u003c\/h2\u003e\n\u003cp\u003eMeasuring the success rate is the primary way to track agent quality over time.\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003e\/\/ src\/metrics\/success-rate.ts\nexport interface TestRun {\n  runId: string;\n  timestamp: Date;\n  model: string;\n  results: TestResult[];\n}\n\nexport interface TestResult {\n  testCaseId: string;\n  passed: boolean;\n  score: number;\n  latencyMs: number;\n  tokensUsed: number;\n  stepsCount: number;\n  failureReason?: string;\n}\n\nexport function calculateSuccessMetrics(run: TestRun) {\n  const total = run.results.length;\n  const passed = run.results.filter(r =\u0026gt; r.passed).length;\n\n  
\/\/ Group results by difficulty (getTestCase is assumed to look up a golden test case by id; it is not shown here)\n  const byDifficulty = {\n    easy: run.results.filter(\n      r =\u0026gt; getTestCase(r.testCaseId)?.difficulty === \"easy\"\n    ),\n    medium: run.results.filter(\n      r =\u0026gt; getTestCase(r.testCaseId)?.difficulty === \"medium\"\n    ),\n    hard: run.results.filter(\n      r =\u0026gt; getTestCase(r.testCaseId)?.difficulty === \"hard\"\n    ),\n  };\n\n  const avgLatency = run.results.reduce(\n    (sum, r) =\u0026gt; sum + r.latencyMs, 0\n  ) \/ total;\n\n  const avgTokens = run.results.reduce(\n    (sum, r) =\u0026gt; sum + r.tokensUsed, 0\n  ) \/ total;\n\n  const avgSteps = run.results.reduce(\n    (sum, r) =\u0026gt; sum + r.stepsCount, 0\n  ) \/ total;\n\n  \/\/ Rough cost estimate: prices every token at the Claude Sonnet\n  \/\/ input rate; output tokens are billed at a higher rate\n  const costPerQuery = avgTokens * 0.000003; \/\/ $3\/1M tokens\n\n  return {\n    overall_pass_rate: Math.round(passed \/ total * 100),\n    easy_pass_rate: calculatePassRate(byDifficulty.easy),\n    medium_pass_rate: calculatePassRate(byDifficulty.medium),\n    hard_pass_rate: calculatePassRate(byDifficulty.hard),\n    avg_latency_ms: Math.round(avgLatency),\n    avg_tokens: Math.round(avgTokens),\n    avg_steps: Math.round(avgSteps * 10) \/ 10,\n    cost_per_query_usd:\n      Math.round(costPerQuery * 10000) \/ 10000,\n    failed_cases: run.results\n      .filter(r =\u0026gt; !r.passed)\n      .map(r =\u0026gt; ({\n        id: r.testCaseId,\n        reason: r.failureReason,\n      })),\n  };\n}\n\nfunction calculatePassRate(\n  results: TestResult[]\n): number {\n  if (results.length === 0) return 0;\n  const passed = results.filter(r =\u0026gt; r.passed).length;\n  return Math.round(passed \/ results.length * 100);\n}\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eCost Benchmarking\u003c\/h2\u003e\n\u003cp\u003eTracking cost is essential for a production agent. 
Every agent step that calls the API consumes tokens, and cost grows quickly as the agent runs more steps.\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003e\/\/ src\/metrics\/cost-tracker.ts\nexport class CostTracker {\n  private records: CostRecord[] = [];\n\n  \/\/ Price per model (USD per 1M tokens)\n  private pricing: Record\u0026lt;string, {\n    input: number; output: number\n  }\u0026gt; = {\n    \"claude-sonnet-4-20250514\": { input: 3, output: 15 },\n    \"claude-3-haiku-20240307\": { input: 0.25, output: 1.25 },\n    \"claude-opus-4-20250514\": { input: 15, output: 75 },\n  };\n\n  record(entry: {\n    model: string;\n    inputTokens: number;\n    outputTokens: number;\n    testCaseId: string;\n  }) {\n    \/\/ Unknown models fall back to Sonnet pricing\n    const price = this.pricing[entry.model]\n      || { input: 3, output: 15 };\n    const cost =\n      (entry.inputTokens * price.input +\n       entry.outputTokens * price.output) \/ 1_000_000;\n\n    this.records.push({\n      ...entry,\n      costUsd: cost,\n      timestamp: new Date(),\n    });\n  }\n\n  getSummary() {\n    const totalCost = this.records.reduce(\n      (sum, r) =\u0026gt; sum + r.costUsd, 0\n    );\n    const totalQueries = this.records.length;\n\n    return {\n      total_cost_usd:\n        Math.round(totalCost * 10000) \/ 10000,\n      avg_cost_per_query:\n        Math.round(totalCost \/ totalQueries * 10000) \/ 10000,\n      total_queries: totalQueries,\n      cost_by_model: this.groupByModel(),\n      projected_monthly_cost_1k_queries:\n        Math.round(totalCost \/ totalQueries * 1000 * 100) \/ 100,\n    };\n  }\n\n  private groupByModel() {\n    const groups: Record\u0026lt;string, number\u0026gt; = {};\n    for (const record of this.records) {\n      groups[record.model] =\n        (groups[record.model] || 0) + record.costUsd;\n    }\n    return groups;\n  }\n}\n\ninterface CostRecord {\n  model: string;\n  inputTokens: number;\n  outputTokens: number;\n  testCaseId: string;\n  costUsd: number;\n  timestamp: 
Date;\n}\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eRegression Testing\u003c\/h2\u003e\n\u003cp\u003eRegression testing ensures that new changes (prompt updates, model upgrades, added or modified tools) do not degrade agent quality.\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003e\/\/ tests\/regression\/compare-runs.ts\nimport {\n  calculateSuccessMetrics,\n  type TestRun,\n} from \"..\/..\/src\/metrics\/success-rate.js\";\n\/\/ RegressionReport and findImprovements are assumed to be defined elsewhere\n\nexport function compareRuns(\n  baseline: TestRun,\n  current: TestRun,\n  threshold: number = 5 \/\/ Allow a drop of at most 5 percentage points\n): RegressionReport {\n  const baselineMetrics = calculateSuccessMetrics(baseline);\n  const currentMetrics = calculateSuccessMetrics(current);\n\n  const passRateDiff =\n    currentMetrics.overall_pass_rate\n    - baselineMetrics.overall_pass_rate;\n\n  const latencyDiff =\n    ((currentMetrics.avg_latency_ms\n      - baselineMetrics.avg_latency_ms)\n    \/ baselineMetrics.avg_latency_ms) * 100;\n\n  const costDiff =\n    ((currentMetrics.cost_per_query_usd\n      - baselineMetrics.cost_per_query_usd)\n    \/ baselineMetrics.cost_per_query_usd) * 100;\n\n  \/\/ Find test cases that passed in the baseline but fail now\n  const regressions = [];\n  for (const currentResult of current.results) {\n    const baselineResult = baseline.results.find(\n      r =\u0026gt; r.testCaseId === currentResult.testCaseId\n    );\n    if (baselineResult?.passed \u0026amp;\u0026amp; !currentResult.passed) {\n      regressions.push({\n        testCaseId: currentResult.testCaseId,\n        reason: currentResult.failureReason,\n      });\n    }\n  }\n\n  return {\n    is_regression: passRateDiff \u0026lt; -threshold,\n    pass_rate_change: passRateDiff,\n    latency_change_percent: Math.round(latencyDiff),\n    cost_change_percent: Math.round(costDiff),\n    regressions: regressions,\n    improvements: findImprovements(baseline, current),\n    recommendation: passRateDiff \u0026lt; -threshold\n      ? \"BLOCK - Regression exceeds the allowed threshold\"\n      : passRateDiff \u0026lt; 0\n      ? 
\"WARNING - Slight drop, needs review\"\n      : \"PASS - No regression\",\n  };\n}\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eA\/B Testing Agents\u003c\/h2\u003e\n\u003cp\u003eA\/B testing lets you compare two agent versions on the same dataset to decide which one is better.\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003e\/\/ src\/ab-testing\/runner.ts\nexport async function runABTest(config: {\n  agentA: AgentConfig;\n  agentB: AgentConfig;\n  testCases: GoldenTestCase[];\n  metrics: string[];\n}): Promise\u0026lt;ABTestResult\u0026gt; {\n  const resultsA: TestResult[] = [];\n  const resultsB: TestResult[] = [];\n\n  \/\/ Run both agents on the same test cases\n  for (const testCase of config.testCases) {\n    const [resultA, resultB] = await Promise.all([\n      runAgent(config.agentA, testCase),\n      runAgent(config.agentB, testCase),\n    ]);\n    resultsA.push(resultA);\n    resultsB.push(resultB);\n  }\n\n  const metricsA = calculateSuccessMetrics({\n    runId: \"A\", timestamp: new Date(),\n    model: config.agentA.model, results: resultsA,\n  });\n  const metricsB = calculateSuccessMetrics({\n    runId: \"B\", timestamp: new Date(),\n    model: config.agentB.model, results: resultsB,\n  });\n\n  return {\n    agent_a: {\n      config: config.agentA,\n      metrics: metricsA,\n    },\n    agent_b: {\n      config: config.agentB,\n      metrics: metricsB,\n    },\n    winner: determineWinner(metricsA, metricsB),\n    detailed_comparison: compareMetrics(metricsA, metricsB),\n  };\n}\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch3\u003eCommon variants to A\/B test\u003c\/h3\u003e\n\u003cul\u003e\n  \u003cli\u003e\n\u003cstrong\u003eModel comparison:\u003c\/strong\u003e Claude Sonnet vs Claude Haiku on the same task (quality vs cost trade-off)\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003ePrompt variants:\u003c\/strong\u003e System prompt A vs B (changed instructions or examples)\u003c\/li\u003e\n  
\u003cli\u003e\n\u003cstrong\u003eTool sets:\u003c\/strong\u003e Tool set A vs B (adding or removing tools, changing descriptions)\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eTemperature:\u003c\/strong\u003e temperature 0 vs 0.3 (consistency vs creativity trade-off)\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eArchitecture:\u003c\/strong\u003e Single agent vs multi-agent on the same task\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch2\u003ePytest Fixtures for Agents\u003c\/h2\u003e\n\u003cp\u003eIf you use Python, pytest fixtures help organize agent tests effectively.\u003c\/p\u003e\n\u003cpre\u003e\u003ccode\u003e# tests\/conftest.py\nimport pytest\nimport anthropic\nfrom unittest.mock import AsyncMock\n\n@pytest.fixture\ndef claude_client():\n    \"\"\"Real client for production tests.\"\"\"\n    return anthropic.Anthropic()\n\n@pytest.fixture\ndef mock_claude_client():\n    \"\"\"Mock client for unit tests.\"\"\"\n    client = AsyncMock(spec=anthropic.Anthropic)\n    client.messages.create = AsyncMock(return_value={\n        \"content\": [{\"type\": \"text\", \"text\": \"Mock response\"}],\n        \"stop_reason\": \"end_turn\",\n        \"usage\": {\"input_tokens\": 100, \"output_tokens\": 50},\n    })\n    return client\n\n@pytest.fixture\ndef golden_dataset():\n    \"\"\"Load the golden dataset from a file.\"\"\"\n    import json\n    with open(\"tests\/fixtures\/golden-dataset.json\") as f:\n        return json.load(f)\n\n@pytest.fixture\ndef test_agent(mock_claude_client):\n    \"\"\"Agent with mock dependencies.\"\"\"\n    from src.agent import Agent\n    return Agent(\n        client=mock_claude_client,\n        model=\"claude-sonnet-4-20250514\",\n        tools=get_test_tools(),  # assumed helper, defined elsewhere\n    )\n\n@pytest.fixture\ndef cost_tracker():\n    \"\"\"Cost tracker for benchmark tests.\"\"\"\n    from src.metrics import CostTracker\n    tracker = CostTracker()\n    yield tracker\n    # Print the summary after the test finishes\n    
print(tracker.getSummary())\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eCI\/CD Pipeline for Agent Testing\u003c\/h2\u003e\n\u003cpre\u003e\u003ccode\u003e# .github\/workflows\/agent-tests.yml\nname: Agent Test Suite\n\non:\n  push:\n    branches: [main]\n    paths: [\"src\/agent\/**\", \"src\/tools\/**\", \"tests\/**\"]\n  pull_request:\n    branches: [main]\n\nenv:\n  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}\n\njobs:\n  unit-tests:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions\/checkout@v4\n      - uses: actions\/setup-node@v4\n        with: { node-version: \"20\" }\n      - run: npm ci\n      - run: npx vitest run tests\/unit --coverage\n      - uses: actions\/upload-artifact@v4\n        with:\n          name: unit-coverage\n          path: coverage\/\n\n  integration-tests:\n    runs-on: ubuntu-latest\n    needs: unit-tests\n    steps:\n      - uses: actions\/checkout@v4\n      - uses: actions\/setup-node@v4\n        with: { node-version: \"20\" }\n      - run: npm ci\n      - run: npx vitest run tests\/integration\n\n  e2e-evaluation:\n    runs-on: ubuntu-latest\n    needs: integration-tests\n    steps:\n      - uses: actions\/checkout@v4\n      - uses: actions\/setup-node@v4\n        with: { node-version: \"20\" }\n      - run: npm ci\n\n      # Run the golden dataset tests\n      - run: npx vitest run tests\/e2e\n        env:\n          TEST_DATABASE_URL: ${{ secrets.TEST_DATABASE_URL }}\n\n      # Regression check\n      - name: Compare with baseline\n        run: |\n          node scripts\/compare-baseline.js \\\n            --baseline results\/baseline.json \\\n            --current results\/current.json \\\n            --threshold 5\n\n      # Upload metrics\n      - uses: actions\/upload-artifact@v4\n        with:\n          name: eval-results\n          path: results\/\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eMonitoring Agents in Production\u003c\/h2\u003e\n\u003cp\u003eAfter deployment, the agent needs continuous 
monitoring:\u003c\/p\u003e\n\u003cul\u003e\n  \u003cli\u003e\n\u003cstrong\u003eSuccess rate over time:\u003c\/strong\u003e Track the trend to catch quality drops early\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eLatency P50\/P95\/P99:\u003c\/strong\u003e Make sure response time meets the SLA\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eCost per query:\u003c\/strong\u003e Catch abnormal cost increases early (for example, an agent stuck in an infinite loop)\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eTool error rate:\u003c\/strong\u003e Identify which tools fail most often and should be fixed first\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eUser satisfaction:\u003c\/strong\u003e Collect feedback directly from users (thumbs up\/down, CSAT)\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch2\u003eNext steps\u003c\/h2\u003e\n\u003cp\u003eTesting AI agents is a continuous process, not a one-off task. Start with a small golden dataset (20-30 cases), set up a CI\/CD pipeline, and expand test coverage gradually. Prioritize regression testing whenever you change a prompt or model, and use A\/B testing for major decisions. Learn more about building production AI agents in the \u003ca href=\"\/en\/collections\/nang-cao\"\u003eClaude Advanced Library\u003c\/a\u003e.\u003c\/p\u003e\n","brand":"Minh Tuấn","offers":[{"title":"Default Title","offer_id":47730150310100,"sku":null,"price":0.0,"currency_code":"VND","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0821\/0264\/9044\/files\/testing-ai-agent-framework-danh-gia-va-ki_m-th_-agent-production.jpg?v=1774715482","url":"https:\/\/claude.vn\/en\/products\/testing-ai-agent-framework-danh-gia-va-ki%e1%bb%83m-th%e1%bb%ad-agent-production","provider":"CLAUDE.VN","version":"1.0","type":"link"}