{"product_id":"tool-evaluation-danh-gia-hiệu-quả-tools-trong-agent-systems","title":"Tool Evaluation — Đánh giá hiệu quả tools trong agent systems","description":"\n\u003cp\u003eXây dựng agent với tools không khó. Khó là biết agent của bạn đang \u003cem\u003ehoạt động tốt đến đâu\u003c\/em\u003e. Khi nào Claude chọn đúng tool? Khi nào nó hallucinate tool parameters? Khi nào nó gọi tool không cần thiết? Không có systematic evaluation, bạn chỉ đang đoán mò.\u003c\/p\u003e\n\n\u003cp\u003eBài này trình bày framework đầy đủ để đánh giá tool effectiveness — từ định nghĩa metrics đến automated test suites và continuous monitoring.\u003c\/p\u003e\n\n\u003ch2\u003eTại sao Tool Evaluation quan trọng?\u003c\/h2\u003e\n\n\u003cp\u003eAgent failures thường không rõ ràng như code errors. Agent có thể:\u003c\/p\u003e\n\u003cul\u003e\n  \u003cli\u003eGọi đúng tool nhưng với wrong parameters — kết quả sai nhưng không crash\u003c\/li\u003e\n  \u003cli\u003eKhông gọi tool khi nên gọi — trả lời từ knowledge cũ, không accurate\u003c\/li\u003e\n  \u003cli\u003eGọi tool không cần thiết — tốn tokens và latency\u003c\/li\u003e\n  \u003cli\u003eGọi tool sai — confused giữa search vs calculate vs lookup\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003cp\u003eChỉ với systematic evaluation bạn mới phát hiện và fix những vấn đề này.\u003c\/p\u003e\n\n\u003ch2\u003eFramework: 5 Dimensions of Tool Evaluation\u003c\/h2\u003e\n\n\u003cpre\u003e\u003ccode\u003efrom dataclasses import dataclass, field\nfrom typing import Optional, Callable\nimport anthropic\nimport json\nimport time\n\nclient = anthropic.Anthropic()\n\n@dataclass\nclass ToolCallExpectation:\n    \"\"\"Kỳ vọng về một tool call cụ thể\"\"\"\n    tool_name: str                          # Tool phải được gọi\n    required_params: dict = field(default_factory=dict)    # Params bắt buộc và giá trị mong đợi\n    forbidden_params: list = field(default_factory=list)   # Params không được có\n    param_validators: dict = field(default_factory=dict)  
# Custom validators (key: validator_fn)\n\n@dataclass\nclass TestCase:\n    \"\"\"A single test case for agent evaluation\"\"\"\n    id: str\n    user_message: str\n    expected_tool_calls: list = field(default_factory=list)  # List[ToolCallExpectation]\n    should_not_call_tools: bool = False      # Sometimes the agent should NOT call any tool\n    expected_output_contains: list = field(default_factory=list)  # Keywords expected in the final response\n    max_tool_calls: int = 5                  # Cap on how many tool calls are allowed\n    max_latency_ms: int = 10000             # Latency budget\n\n@dataclass\nclass EvalResult:\n    \"\"\"Evaluation result for a single test case\"\"\"\n    test_id: str\n    passed: bool\n    score: float  # 0.0 - 1.0\n    tool_precision: float   # Correct tools called \/ Total tools called\n    tool_recall: float      # Correct tools called \/ Expected tools\n    param_accuracy: float   # Correct params \/ Total params\n    latency_ms: int\n    tool_calls_made: list\n    issues: list\n    response: str\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eCore Evaluation Engine\u003c\/h2\u003e\n\n\u003cpre\u003e\u003ccode\u003eclass ToolEvaluator:\n    def __init__(self, tools: list, system_prompt: str = \"\"):\n        self.tools = tools\n        self.system_prompt = system_prompt\n        self.client = anthropic.Anthropic()\n\n    def evaluate_test_case(self, test: TestCase) -\u0026gt; EvalResult:\n        \"\"\"Run a single test case and return detailed results\"\"\"\n        start_time = time.time()\n        tool_calls_made = []\n        messages = [{\"role\": \"user\", \"content\": test.user_message}]\n        final_response = \"\"\n        issues = []\n\n        # Run agent loop\n        iteration = 0\n        while iteration \u0026lt; test.max_tool_calls + 1:\n            response = self.client.messages.create(\n                model=\"claude-haiku-4-5\",\n                max_tokens=2000,\n                system=self.system_prompt,\n                tools=self.tools,\n      
          messages=messages\n            )\n\n            if response.stop_reason == \"end_turn\":\n                final_response = next((b.text for b in response.content if b.type == \"text\"), \"\")\n                break\n\n            elif response.stop_reason == \"tool_use\":\n                # Record tool calls\n                tool_results = []\n                for block in response.content:\n                    if block.type == \"tool_use\":\n                        tool_call = {\n                            \"name\": block.name,\n                            \"input\": block.input,\n                            \"id\": block.id\n                        }\n                        tool_calls_made.append(tool_call)\n\n                        # Mock tool execution\n                        result = self._mock_tool_execution(block.name, block.input)\n                        tool_results.append({\n                            \"type\": \"tool_result\",\n                            \"tool_use_id\": block.id,\n                            \"content\": json.dumps(result)\n                        })\n\n                messages.append({\"role\": \"assistant\", \"content\": response.content})\n                messages.append({\"role\": \"user\", \"content\": tool_results})\n                iteration += 1\n\n            else:\n                break\n\n        latency_ms = int((time.time() - start_time) * 1000)\n\n        # Check latency\n        if latency_ms \u0026gt; test.max_latency_ms:\n            issues.append(f\"Latency {latency_ms}ms exceeded budget {test.max_latency_ms}ms\")\n\n        # Evaluate results\n        scores = self._score_results(\n            test, tool_calls_made, final_response, latency_ms, issues\n        )\n\n        return EvalResult(\n            test_id=test.id,\n            passed=scores[\"passed\"],\n            score=scores[\"overall_score\"],\n            tool_precision=scores[\"precision\"],\n            tool_recall=scores[\"recall\"],\n   
         param_accuracy=scores[\"param_accuracy\"],\n            latency_ms=latency_ms,\n            tool_calls_made=tool_calls_made,\n            issues=issues,\n            response=final_response\n        )\n\n    def _score_results(self, test: TestCase, tool_calls: list,\n                       response: str, latency_ms: int, issues: list) -\u0026gt; dict:\n        \"\"\"Score a single test case\"\"\"\n\n        # Check: should NOT call tools\n        if test.should_not_call_tools and tool_calls:\n            issues.append(f\"Called tools when shouldn't: {[tc['name'] for tc in tool_calls]}\")\n            return {\"passed\": False, \"overall_score\": 0.0, \"precision\": 0.0, \"recall\": 0.0, \"param_accuracy\": 0.0}\n\n        if not test.expected_tool_calls:\n            # No expectations about tools — only check response content\n            content_ok = all(kw.lower() in response.lower() for kw in test.expected_output_contains)\n            return {\n                \"passed\": content_ok,\n                \"overall_score\": 1.0 if content_ok else 0.5,\n                \"precision\": 1.0, \"recall\": 1.0, \"param_accuracy\": 1.0\n            }\n\n        # Tool precision: out of all tools called, how many were expected?\n        called_names = [tc[\"name\"] for tc in tool_calls]\n        expected_names = [e.tool_name for e in test.expected_tool_calls]\n\n        correct_calls = [name for name in called_names if name in expected_names]\n        precision = len(correct_calls) \/ len(called_names) if called_names else 1.0\n        recall = len(correct_calls) \/ len(expected_names) if expected_names else 1.0\n\n        # Unexpected tools\n        unexpected = [name for name in called_names if name not in expected_names]\n        if unexpected:\n            issues.append(f\"Unexpected tool calls: {unexpected}\")\n\n        # Missing tools\n        missing = [name for name in expected_names if name not in called_names]\n        if missing:\n            
issues.append(f\"Missing expected tool calls: {missing}\")\n\n        # Parameter accuracy\n        param_scores = []\n        for expected in test.expected_tool_calls:\n            actual_call = next((tc for tc in tool_calls if tc[\"name\"] == expected.tool_name), None)\n            if not actual_call:\n                param_scores.append(0.0)\n                continue\n\n            actual_input = actual_call.get(\"input\", {})\n            param_score = self._score_params(expected, actual_input, issues)\n            param_scores.append(param_score)\n\n        param_accuracy = sum(param_scores) \/ len(param_scores) if param_scores else 1.0\n\n        # Output content check\n        content_score = 1.0\n        if test.expected_output_contains:\n            matched = sum(1 for kw in test.expected_output_contains if kw.lower() in response.lower())\n            content_score = matched \/ len(test.expected_output_contains)\n            if content_score \u0026lt; 1.0:\n                missing_kw = [kw for kw in test.expected_output_contains if kw.lower() not in response.lower()]\n                issues.append(f\"Response missing keywords: {missing_kw}\")\n\n        overall_score = (precision * 0.3 + recall * 0.3 + param_accuracy * 0.3 + content_score * 0.1)\n        passed = overall_score \u0026gt;= 0.8 and not any(\"Missing expected\" in i for i in issues)\n\n        return {\n            \"passed\": passed,\n            \"overall_score\": round(overall_score, 3),\n            \"precision\": round(precision, 3),\n            \"recall\": round(recall, 3),\n            \"param_accuracy\": round(param_accuracy, 3)\n        }\n\n    def _score_params(self, expected: ToolCallExpectation, actual: dict, issues: list) -\u0026gt; float:\n        \"\"\"Score parameter correctness\"\"\"\n        scores = []\n\n        for param, expected_val in expected.required_params.items():\n            if param not in actual:\n                issues.append(f\"Missing param '{param}' in 
{expected.tool_name}\")\n                scores.append(0.0)\n            elif expected_val is not None and actual[param] != expected_val:\n                issues.append(f\"Wrong '{param}': expected '{expected_val}', got '{actual[param]}'\")\n                scores.append(0.5)  # Partial credit — param exists but wrong value\n            else:\n                scores.append(1.0)\n\n        # Check forbidden params\n        for param in expected.forbidden_params:\n            if param in actual:\n                issues.append(f\"Forbidden param '{param}' used in {expected.tool_name}\")\n                scores.append(0.0)\n\n        # Custom validators\n        for param, validator in expected.param_validators.items():\n            if param in actual:\n                if not validator(actual[param]):\n                    issues.append(f\"Param '{param}' failed validation in {expected.tool_name}\")\n                    scores.append(0.0)\n                else:\n                    scores.append(1.0)  # Count passing validators too\n\n        return sum(scores) \/ len(scores) if scores else 1.0\n\n    def _mock_tool_execution(self, tool_name: str, tool_input: dict) -\u0026gt; dict:\n        \"\"\"Mock tool results for evaluation purposes\"\"\"\n        return {\"status\": \"success\", \"result\": f\"Mock result for {tool_name}\", \"input_received\": tool_input}\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eTest Suite Builder\u003c\/h2\u003e\n\n\u003cpre\u003e\u003ccode\u003eclass ToolTestSuite:\n    def __init__(self, name: str, evaluator: ToolEvaluator):\n        self.name = name\n        self.evaluator = evaluator\n        self.test_cases = []\n\n    def add_test(self, test: TestCase):\n        self.test_cases.append(test)\n\n    def run(self) -\u0026gt; dict:\n        \"\"\"Run the entire test suite\"\"\"\n        results = []\n        print(f\"\\nRunning: {self.name} ({len(self.test_cases)} tests)\")\n        print(\"=\" * 60)\n\n        for test in self.test_cases:\n            result = self.evaluator.evaluate_test_case(test)\n            
results.append(result)\n            status = \"PASS\" if result.passed else \"FAIL\"\n            print(f\"  [{status}] {test.id} — score: {result.score:.2f} | latency: {result.latency_ms}ms\")\n            if result.issues:\n                for issue in result.issues[:2]:\n                    print(f\"       Issue: {issue}\")\n\n        return self._aggregate_results(results)\n\n    def _aggregate_results(self, results: list) -\u0026gt; dict:\n        passed = sum(1 for r in results if r.passed)\n        total = len(results)\n\n        report = {\n            \"suite\": self.name,\n            \"total\": total,\n            \"passed\": passed,\n            \"failed\": total - passed,\n            \"pass_rate\": round(passed \/ total * 100, 1) if total \u0026gt; 0 else 0,\n            \"avg_score\": round(sum(r.score for r in results) \/ total, 3) if total \u0026gt; 0 else 0,\n            \"avg_precision\": round(sum(r.tool_precision for r in results) \/ total, 3) if total \u0026gt; 0 else 0,\n            \"avg_recall\": round(sum(r.tool_recall for r in results) \/ total, 3) if total \u0026gt; 0 else 0,\n            \"avg_param_accuracy\": round(sum(r.param_accuracy for r in results) \/ total, 3) if total \u0026gt; 0 else 0,\n            \"avg_latency_ms\": round(sum(r.latency_ms for r in results) \/ total) if total \u0026gt; 0 else 0,\n            \"failures\": [r for r in results if not r.passed]\n        }\n\n        print(f\"\\nResults: {passed}\/{total} passed ({report['pass_rate']}%)\")\n        print(f\"  Avg precision: {report['avg_precision']:.3f}\")\n        print(f\"  Avg recall: {report['avg_recall']:.3f}\")\n        print(f\"  Avg param accuracy: {report['avg_param_accuracy']:.3f}\")\n        print(f\"  Avg latency: {report['avg_latency_ms']}ms\")\n        return report\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eExample: Test Suite for a Weather Agent\u003c\/h2\u003e\n\n\u003cpre\u003e\u003ccode\u003eweather_tools = [\n    {\n        \"name\": 
\"get_weather\",\n        \"description\": \"Get current weather for a city\",\n        \"input_schema\": {\n            \"type\": \"object\",\n            \"properties\": {\n                \"city\": {\"type\": \"string\"},\n                \"units\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"], \"default\": \"celsius\"}\n            },\n            \"required\": [\"city\"]\n        }\n    },\n    {\n        \"name\": \"get_forecast\",\n        \"description\": \"Get weather forecast for next N days\",\n        \"input_schema\": {\n            \"type\": \"object\",\n            \"properties\": {\n                \"city\": {\"type\": \"string\"},\n                \"days\": {\"type\": \"integer\", \"minimum\": 1, \"maximum\": 7}\n            },\n            \"required\": [\"city\", \"days\"]\n        }\n    }\n]\n\nevaluator = ToolEvaluator(tools=weather_tools, system_prompt=\"You are a weather assistant.\")\nsuite = ToolTestSuite(\"Weather Agent Tests\", evaluator)\n\n# Test 1: Simple current weather\nsuite.add_test(TestCase(\n    id=\"T001_current_weather\",\n    user_message=\"What's the weather in Hanoi right now?\",\n    expected_tool_calls=[\n        ToolCallExpectation(\n            tool_name=\"get_weather\",\n            required_params={\"city\": \"Hanoi\"},\n            param_validators={\"units\": lambda v: v in [\"celsius\", \"fahrenheit\"]}\n        )\n    ],\n    expected_output_contains=[\"Hanoi\", \"weather\"]\n))\n\n# Test 2: Forecast request\nsuite.add_test(TestCase(\n    id=\"T002_5day_forecast\",\n    user_message=\"Give me the 5-day forecast for Ho Chi Minh City\",\n    expected_tool_calls=[\n        ToolCallExpectation(\n            tool_name=\"get_forecast\",\n            required_params={\"city\": \"Ho Chi Minh City\", \"days\": 5}\n        )\n    ]\n))\n\n# Test 3: Should NOT call tools\nsuite.add_test(TestCase(\n    id=\"T003_no_tool_needed\",\n    user_message=\"What's the difference between weather and climate?\",\n    
should_not_call_tools=True,\n    expected_output_contains=[\"weather\", \"climate\"]\n))\n\n# Run it\nreport = suite.run()\nprint(f\"\\nFinal pass rate: {report['pass_rate']}%\")\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eContinuous Evaluation in CI\/CD\u003c\/h2\u003e\n\n\u003cpre\u003e\u003ccode\u003eimport sys\n\ndef run_eval_in_ci(suite: ToolTestSuite, min_pass_rate: float = 0.9) -\u0026gt; bool:\n    \"\"\"Gate the CI\/CD pipeline on eval results\"\"\"\n    report = suite.run()\n    pass_rate = report[\"pass_rate\"] \/ 100\n\n    if pass_rate \u0026lt; min_pass_rate:\n        print(f\"\\nCI GATE FAILED: {pass_rate*100:.1f}% \u0026lt; {min_pass_rate*100:.1f}% required\")\n        print(\"Failing tests:\")\n        for failure in report[\"failures\"]:\n            print(f\"  - {failure.test_id}: {failure.issues}\")\n        return False\n\n    print(f\"\\nCI GATE PASSED: {pass_rate*100:.1f}% \u0026gt;= {min_pass_rate*100:.1f}%\")\n    return True\n\n# In CI pipeline\nsuccess = run_eval_in_ci(suite, min_pass_rate=0.85)\nif not success:\n    sys.exit(1)  # Fail the CI build\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eAnalyzing Failure Patterns\u003c\/h2\u003e\n\n\u003cpre\u003e\u003ccode\u003edef analyze_failures(results: list[EvalResult]) -\u0026gt; dict:\n    \"\"\"Analyze patterns in test failures to find root causes\"\"\"\n    failures = [r for r in results if not r.passed]\n    if not failures:\n        return {\"message\": \"No failures to analyze\"}\n\n    # Group by failure type\n    param_errors = []\n    wrong_tool = []\n    missing_tool = []\n\n    for f in failures:\n        for issue in f.issues:\n            if \"Wrong\" in issue or \"Missing param\" in issue:\n                param_errors.append(f.test_id)\n            elif \"Unexpected tool\" in issue:\n                wrong_tool.append(f.test_id)\n            elif \"Missing expected\" in issue:\n                
missing_tool.append(f.test_id)\n\n    print(\"\\nFailure Analysis:\")\n    print(f\"  Parameter errors: {len(param_errors)} cases ({param_errors[:3]})\")\n    print(f\"  Wrong tool called: {len(wrong_tool)} cases\")\n    print(f\"  Tool not called: {len(missing_tool)} cases\")\n\n    recommendations = []\n    if param_errors:\n        recommendations.append(\"Improve tool input_schema descriptions — add examples for each param\")\n    if wrong_tool:\n        recommendations.append(\"Clarify tool descriptions — disambiguate similar tools more clearly\")\n    if missing_tool:\n        recommendations.append(\"Add more trigger phrases in tool descriptions for missed tools\")\n\n    return {\"recommendations\": recommendations}\n\nanalyze_failures(report[\"failures\"])  # Analyze the failures from the suite run above\u003c\/code\u003e\u003c\/pre\u003e\n\n\u003ch2\u003eSummary\u003c\/h2\u003e\n\n\u003cp\u003eTool evaluation is not a nice-to-have; it is a requirement for production agent systems. This framework provides:\u003c\/p\u003e\n\n\u003cul\u003e\n  \u003cli\u003e\n\u003cstrong\u003e5 metrics\u003c\/strong\u003e: precision, recall, param accuracy, latency, content quality\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eAutomated test suites\u003c\/strong\u003e that run in CI\/CD\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eFailure analysis\u003c\/strong\u003e to find root causes and improve tool descriptions\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003ePass\/fail gates\u003c\/strong\u003e to prevent regressions when tools are updated\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003cp\u003eFurther reading: \u003ca href=\"\/collections\/nang-cao\"\u003eExtended Thinking + Tool Use\u003c\/a\u003e — improving tool selection accuracy by reasoning before acting.\u003c\/p\u003e\n\n\u003chr\u003e\n\u003ch3\u003eRelated articles\u003c\/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"\/products\/extended-thinking-tool-use-suy-lu%E1%BA%ADn-sau-k%E1%BA%BFt-h%E1%BB%A3p-cong-c%E1%BB%A5\"\u003eExtended Thinking + Tool Use — 
Deep reasoning combined with tools\u003c\/a\u003e\u003c\/li\u003e\n\u003cli\u003e\u003ca href=\"\/products\/xay-d%E1%BB%B1ng-llm-agent-t%E1%BB%AB-d%E1%BA%A7u-reference-implementation\"\u003eBuilding an LLM Agent from Scratch — Reference Implementation\u003c\/a\u003e\u003c\/li\u003e\n\u003cli\u003e\u003ca href=\"\/products\/context-compaction-t%E1%BB%B1-d%E1%BB%99ng-nen-context-cho-conversations-dai\"\u003eContext Compaction — Automatically compacting context for long conversations\u003c\/a\u003e\u003c\/li\u003e\n\u003cli\u003e\u003ca href=\"\/products\/claude-cho-engineering-chi%E1%BA%BFn-l%C6%B0%E1%BB%A3c-testing-toan-di%E1%BB%87n\"\u003eClaude for Engineering: A comprehensive testing strategy\u003c\/a\u003e\u003c\/li\u003e\n\u003cli\u003e\u003ca href=\"\/products\/claude-cho-data-phan-tich-d%E1%BB%AF-li%E1%BB%87u-t%E1%BB%B1-d%E1%BB%99ng\"\u003eClaude for Data: Automated data analysis\u003c\/a\u003e\u003c\/li\u003e\n\u003c\/ul\u003e","brand":"Minh Tuấn","offers":[{"title":"Default Title","offer_id":47721899884756,"sku":null,"price":0.0,"currency_code":"VND","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0821\/0264\/9044\/files\/tool-evaluation-danh-gia-hi_u-qu_-tools-trong-agent-systems.jpg?v=1774513988","url":"https:\/\/claude.vn\/products\/tool-evaluation-danh-gia-hi%e1%bb%87u-qu%e1%ba%a3-tools-trong-agent-systems","provider":"CLAUDE.VN","version":"1.0","type":"link"}