Intermediate · Technical · Claude API · Source: Anthropic

Evaluator-Optimizer: Self-improving output with a feedback loop

Minh Tuấn, CTO, Transform Group

26/03/2026 · 6 min read

When writing, programming, or analyzing, people usually work in a loop: produce something, review it, improve it, review it again. The Evaluator-Optimizer pattern brings this natural loop to AI, producing a system that improves its own output without manual intervention.

Evaluator-Optimizer architecture

The pattern has two main components that run in a loop:

Generator: produces the initial output, or revises it based on feedback
Evaluator: scores the output against specific criteria and returns actionable feedback

The loop continues until the Evaluator accepts the output (it is good enough) or the maximum number of iterations is reached.

import anthropic
import json

client = anthropic.Anthropic()

class EvaluatorOptimizer:
    def __init__(self, task_description: str, evaluation_criteria: list[str],
                 max_iterations: int = 4, quality_threshold: int = 8):
        self.task = task_description
        self.criteria = evaluation_criteria
        self.max_iterations = max_iterations
        self.threshold = quality_threshold
        self.history = []

    def run(self, initial_input: str = None) -> dict:
        current_output = None
        feedback_history = []

        for iteration in range(self.max_iterations):
            print(f"\n--- Iteration {iteration + 1}/{self.max_iterations} ---")

            # Generate
            current_output = self._generate(
                initial_input or self.task,
                current_output,
                feedback_history
            )
            print(f"Generated ({len(current_output)} chars)")

            # Evaluate
            eval_result = self._evaluate(current_output)
            score = eval_result["score"]
            feedback = eval_result["feedback"]

            print(f"Score: {score}/10")
            print(f"Feedback: {feedback[:100]}...")

            self.history.append({
                "iteration": iteration + 1,
                "output": current_output,
                "score": score,
                "feedback": feedback
            })

            if score >= self.threshold:
                print(f"Quality threshold {self.threshold} reached!")
                return {
                    "output": current_output,
                    "final_score": score,
                    "iterations_used": iteration + 1,
                    "approved": True,
                    "history": self.history
                }

            feedback_history.append(f"Round {iteration+1} (Score {score}/10): {feedback}")

        return {
            "output": current_output,
            "final_score": eval_result["score"],
            "iterations_used": self.max_iterations,
            "approved": False,
            "history": self.history
        }

    def _generate(self, task: str, previous_output: str, feedback_history: list) -> str:
        if not previous_output:
            prompt = f"Complete this task:\n\n{task}"
        else:
            feedback_text = "\n".join(feedback_history)
            prompt = f"""Task: {task}

Your previous attempt:
{previous_output}

Feedback from evaluator (most recent last):
{feedback_text}

Rewrite the output addressing ALL feedback points. Be specific about what you improved."""

        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=3000,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text

    def _evaluate(self, output: str) -> dict:
        criteria_text = "\n".join([f"- {c}" for c in self.criteria])

        eval_prompt = f"""Evaluate this output against the criteria below.

Output to evaluate:
{output}

Evaluation criteria:
{criteria_text}

Respond in JSON format:
{{
  "score": [1-10 integer],
  "criteria_scores": {{"criterion": score}},
  "strengths": ["list of what's good"],
  "improvements": ["specific actionable improvements needed"],
  "feedback": "concise summary for the generator"
}}"""

        response = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=1000,
            messages=[{"role": "user", "content": eval_prompt}]
        )

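        text = response.content[0].text
        try:
            # Extract the outermost JSON object from the reply
            start = text.find('{')
            end = text.rfind('}') + 1
            result = json.loads(text[start:end])
            # Guarantee the two keys that run() reads
            result.setdefault("score", 5)
            result.setdefault("feedback", text)
            return result
        except json.JSONDecodeError:
            # No parseable JSON: use a neutral score and pass the raw reply as feedback
            return {"score": 5, "feedback": text}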
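
The stopping logic of the loop can be exercised offline, without any API calls. The following is a minimal sketch, not the article's class: `refine`, `fake_generate`, and `fake_evaluate` are hypothetical names, and the two stub callables stand in for the Generator and Evaluator model calls.

```python
from typing import Callable, Optional

def refine(
    generate: Callable[[Optional[str], list[str]], str],
    evaluate: Callable[[str], tuple[int, str]],
    max_iterations: int = 4,
    threshold: int = 8,
) -> dict:
    """Run generate -> evaluate until the score reaches threshold or the budget runs out."""
    output: Optional[str] = None
    feedback_history: list[str] = []
    score = 0
    for iteration in range(1, max_iterations + 1):
        output = generate(output, feedback_history)
        score, feedback = evaluate(output)
        if score >= threshold:
            return {"output": output, "score": score,
                    "iterations": iteration, "approved": True}
        feedback_history.append(feedback)
    return {"output": output, "score": score,
            "iterations": max_iterations, "approved": False}


# Stub generator (assumption): adds one detail per round of feedback received.
def fake_generate(previous: Optional[str], feedback: list[str]) -> str:
    return "draft" + " +detail" * len(feedback)

# Stub evaluator (assumption): starts at 4 and awards two points per added detail.
def fake_evaluate(output: str) -> tuple[int, str]:
    score = min(10, 4 + 2 * output.count("+detail"))
    return score, "add one more concrete detail"

result = refine(fake_generate, fake_evaluate)
print(result)
```

With these stubs the scores climb 4, 6, 8, so the loop is approved on the third round; lowering `max_iterations` to 2 shows the unapproved exit path.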

Example 1: Optimizing marketing content

# Write a high-quality product description
optimizer = EvaluatorOptimizer(
    task_description="""Write a product description for:
Product: Claude API Professional Plan
Target audience: Vietnamese tech startup CTOs
Goal: Drive trial sign-ups""",

    evaluation_criteria=[
        "Clear value proposition in first sentence",
        "Addresses specific pain points of Vietnamese startups",
        "Includes concrete numbers or metrics",
        "Has strong call-to-action",
        "Tone is professional but approachable",
        "Length: 150-200 words",
        "Mentions at least 3 specific features",
    ],
    max_iterations=4,
    quality_threshold=8
)

result = optimizer.run()

print(f"\nFinal output (score: {result['final_score']}/10):")
print(result['output'])
print(f"\nIterations used: {result['iterations_used']}")
print(f"Approved: {result['approved']}")

Example 2: Code review loop

class CodeOptimizer(EvaluatorOptimizer):
    """Specialized for code generation + review"""

    def __init__(self, language: str, task: str):
        super().__init__(
            task_description=task,
            evaluation_criteria=[
                f"Code is valid {language} syntax",
                "No obvious bugs or edge case failures",
                "Has error handling for common failures",
                "Variables and functions have clear, descriptive names",
                "Has docstrings/comments for complex logic",
                "Time complexity is reasonable",
                "No hardcoded secrets or credentials",
                "Follows language best practices"
            ],
            max_iterations=3,
            quality_threshold=8
        )
        self.language = language

    def _generate(self, task, previous_output, feedback_history):
        if not previous_output:
            prompt = f"Write {self.language} code for: {task}"
        else:
            feedback_text = "\n".join(feedback_history)
            prompt = f"""Fix and improve this {self.language} code.

Original task: {task}

Current code:
{previous_output}

Issues to fix:
{feedback_text}

Provide the complete improved code."""

        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=4000,
            system=f"You are an expert {self.language} developer. Write production-quality code.",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text

code_optimizer = CodeOptimizer(
    language="Python",
    task="""Create a rate-limited API client class that:
- Makes HTTP requests with retry logic
- Respects rate limits (max 10 req/second)
- Handles 429 responses with exponential backoff
- Logs all requests and errors"""
)

result = code_optimizer.run()
print(result['output'])
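
The first criterion, valid syntax, does not have to be left to the model's judgment. As a hedged sketch, assuming the generator's reply may wrap the code in a Markdown fence and that the target language is Python, the code can be extracted and parsed with the standard `ast` module before the evaluator is called. `extract_code` and `python_syntax_error` are hypothetical helpers, not part of the classes above.

```python
import ast
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built this way so the example is easy to embed

def extract_code(reply: str) -> str:
    """Return the first fenced block in reply, or the whole reply if nothing is fenced."""
    pattern = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(reply)
    return match.group(1) if match else reply

def python_syntax_error(code: str) -> Optional[str]:
    """Return a short message if code does not parse as Python, else None."""
    try:
        ast.parse(code)
        return None
    except SyntaxError as exc:
        return f"SyntaxError on line {exc.lineno}: {exc.msg}"

reply = ("Here is the code:\n" + FENCE + "python\n"
         "def add(a, b):\n    return a + b\n" + FENCE + "\nDone.")
code = extract_code(reply)
print(python_syntax_error(code))            # None when the code parses
print(python_syntax_error("def broken(:"))  # a SyntaxError message
```

A message returned by `python_syntax_error` could be appended to `feedback_history` directly, which would save an evaluator call on drafts that cannot run at all.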

Example 3: Translation quality loop

def translation_optimizer(source_text: str, target_language: str = "Vietnamese"):
    """Refine a translation over several iterations"""

    optimizer = EvaluatorOptimizer(
        task_description=f"""Translate this text to {target_language}:

{source_text}""",

        evaluation_criteria=[
            "Meaning is fully preserved — no information lost",
            "Natural, fluent language (not robotic machine translation)",
            "Technical terms are translated consistently",
            "Tone matches the original (formal/casual)",
            f"Grammar is correct {target_language}",
            "Cultural adaptations are appropriate",
            "No untranslated segments remain"
        ],
        max_iterations=3,
        quality_threshold=9  # Translation needs a higher bar
    )

    return optimizer.run()

text_to_translate = """
The Evaluator-Optimizer pattern represents a significant advancement
in agentic AI systems. Rather than accepting first-pass outputs,
this pattern implements a quality feedback loop that mirrors
human revision processes, leading to substantially better results.
"""

result = translation_optimizer(text_to_translate)
print(result['output'])

Advanced: Async Evaluator-Optimizer

When you need to process many items in parallel, each item gets its own feedback loop:

import asyncio
import anthropic

async_client = anthropic.AsyncAnthropic()

async def async_optimize_batch(tasks: list[str], max_iter: int = 3) -> list[dict]:
    """Optimize many tasks in parallel; each task runs its own loop"""

    async def optimize_single(task: str, task_id: int) -> dict:
        current = None
        feedback_hist = []

        for i in range(max_iter):
            # Generate
            gen_prompt = task if not current else f"""
Task: {task}
Previous: {current}
Feedback: {chr(10).join(feedback_hist)}
Improve:"""

            gen = await async_client.messages.create(
                model="claude-haiku-4-5",
                max_tokens=1000,
                messages=[{"role": "user", "content": gen_prompt}]
            )
            current = gen.content[0].text

            # Evaluate
            eval_r = await async_client.messages.create(
                model="claude-haiku-4-5",
                max_tokens=200,
                messages=[{
                    "role": "user",
                    "content": (
                        "Rate this output 1-10 and give brief feedback.\n"
                        f"Task: {task}\n"
                        f"Output: {current}\n"
                        "Respond: SCORE:X FEEDBACK:..."
                    )
                }]
            )
            eval_text = eval_r.content[0].text

            try:
                score = int(eval_text.split("SCORE:")[1].split()[0])
            except Exception:
                score = 5

            if score >= 8:
                return {"id": task_id, "output": current, "score": score, "iterations": i+1}

            feedback_hist.append(eval_text)

        return {"id": task_id, "output": current, "score": score, "iterations": max_iter}

    tasks_coroutines = [optimize_single(task, i) for i, task in enumerate(tasks)]
    return await asyncio.gather(*tasks_coroutines)

# Run
tasks = [
    "Write a tweet about AI in healthcare",
    "Write a tweet about climate change solutions",
    "Write a tweet about remote work productivity"
]
results = asyncio.run(async_optimize_batch(tasks))
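
`asyncio.gather` starts every loop at once, so a large batch may run into API rate limits. One common way to cap the number of loops in flight is an `asyncio.Semaphore`. The sketch below is an assumption about how that could be wired in, not part of the original implementation: `run_bounded` is a hypothetical helper, and `fake_worker` is a stub standing in for `optimize_single`.

```python
import asyncio
from typing import Awaitable, Callable

async def run_bounded(
    items: list[str],
    worker: Callable[[str], Awaitable[dict]],
    max_concurrent: int = 5,
) -> list[dict]:
    """Run worker over items with at most max_concurrent coroutines in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item: str) -> dict:
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in items))


# Stub worker (assumption): sleeps instead of calling the API and tracks peak concurrency.
in_flight = 0
peak = 0

async def fake_worker(task: str) -> dict:
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return {"task": task, "score": 8}

demo_results = asyncio.run(
    run_bounded([f"task {i}" for i in range(10)], fake_worker, max_concurrent=3)
)
print(len(demo_results), "results, peak concurrency:", peak)
```

The right value for `max_concurrent` depends on the rate limits of your account, so treat the default here as a placeholder.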

Metrics and monitoring

def analyze_optimization_history(history: list[dict]) -> dict:
    """Analyze how effective the optimization loop was"""
    if not history:
        return {}

    scores = [h["score"] for h in history]
    improvements = [scores[i+1] - scores[i] for i in range(len(scores)-1)]

    return {
        "initial_score": scores[0],
        "final_score": scores[-1],
        "total_improvement": scores[-1] - scores[0],
        "iterations": len(history),
        "avg_improvement_per_round": sum(improvements) / len(improvements) if improvements else 0,
        "best_iteration": max(range(len(scores)), key=lambda i: scores[i]) + 1,
        "diminishing_returns": improvements[-1] < improvements[0] if len(improvements) >= 2 else None
    }

result = optimizer.run()
stats = analyze_optimization_history(result["history"])
print(f"Improved from {stats['initial_score']} to {stats['final_score']} in {stats['iterations']} iterations")
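
When the loop runs out of iterations without approval, `run()` returns the last output, which is not necessarily the highest-scoring one. A small sketch of selecting the best attempt from the logged history follows; `best_attempt` is a hypothetical helper and the sample history is made-up illustration data, not measured results.

```python
from typing import Optional

def best_attempt(history: list[dict]) -> Optional[dict]:
    """Return the highest-scoring history entry; ties go to the earliest iteration."""
    if not history:
        return None
    # max() keeps the first maximal element, so earlier iterations win ties
    return max(history, key=lambda entry: entry["score"])

# Illustrative history: the score regresses on the final round
sample_history = [
    {"iteration": 1, "output": "v1", "score": 5, "feedback": "too long"},
    {"iteration": 2, "output": "v2", "score": 7, "feedback": "add a metric"},
    {"iteration": 3, "output": "v3", "score": 6, "feedback": "lost the CTA"},
]

best = best_attempt(sample_history)
print(best["iteration"], best["output"])
```

This relies only on the `score` key that each history entry already carries.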

When to use Evaluator-Optimizer

Scenario	Good fit?	Reason
Content marketing	Very good fit	Quality is subjective and benefits from iteration
Code generation	Very good fit	Bugs can be fixed through feedback
Translation	Good fit	Fluency improves with review
Data extraction	Poor fit	Answers are clearly right or wrong; no loop needed
Simple Q&A	Not a fit	Overkill; spends tokens unnecessarily
Real-time chat	Not a fit	Latency is too high

Summary

Evaluator-Optimizer is at its strongest when output quality is the top priority. It automates the review-revise cycle that would normally require human editorial judgment.

Key insights:

The Evaluator needs specific criteria, not vague ones: "good writing" is not enough; use "150 words, 3 features mentioned, CTA present"
Use a smaller model (Haiku) for the evaluator to save cost
3-4 iterations is usually the sweet spot; returns diminish after that
Log the history so you can analyze it and improve your prompts

See also: Orchestrator-Workers Pattern, for combining both patterns on the most complex tasks.
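
Criteria as specific as "150 words, 3 features mentioned, CTA present" can also be checked in plain code before, or alongside, the model-based evaluator. The sketch below is one possible way to do that; `check_description`, the feature list, and the CTA phrases are all hypothetical, and the thresholds mirror the criteria used in Example 1.

```python
import re

def check_description(text: str, features: list[str], cta_phrases: list[str]) -> list[str]:
    """Return concrete, actionable problems; an empty list means every check passed."""
    problems = []
    word_count = len(re.findall(r"\b\w+\b", text))
    if not 150 <= word_count <= 200:
        problems.append(f"Length is {word_count} words; target is 150-200")
    lowered = text.lower()
    mentioned = [f for f in features if f.lower() in lowered]
    if len(mentioned) < 3:
        problems.append(f"Mentions {len(mentioned)} of the listed features; need at least 3")
    if not any(phrase.lower() in lowered for phrase in cta_phrases):
        problems.append("No call-to-action phrase found")
    return problems

# Hypothetical inputs for illustration
features = ["streaming", "batch", "tool use"]
cta_phrases = ["free trial"]

issues = check_description("Fast API with streaming.", features, cta_phrases)
for issue in issues:
    print("-", issue)
```

Each returned string is already phrased as feedback, so it could be added to the generator's feedback list as-is.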
