Model-based grading: Claude grading Claude (Building with the Claude API)

You have 100 outputs, and each one needs a score from 1 to 10.

What you will learn

Distinguish the 3 types of graders (code, model, human) and know when to use each
Write a model-based grader that uses Claude to evaluate Claude's output
Design a rubric and ask for strengths/weaknesses to avoid the "6/10" bias
Integrate the grader into the pipeline from lesson 6.25

3 types of graders

Code graders: when?

Use them for tasks that are programmatically verifiable:

JSON syntax is valid
Length within [X, Y] words
Contains a keyword / does not contain profanity
Matches a regex
Parses as Python syntax

Use them when the check is yes/no or something you can count.

Model graders: when?

Use them for tasks that are subjective but can be described with a rubric:

Response quality (clear, concise, accurate)
Helpfulness
Tone matches the target
Completeness
Task following

Use them when code cannot do the check but you can list the criteria for Claude.

Human graders: when?

Use them for edge cases, initial calibration, and audits:

Checking whether the model grader is right (10-20 cases at the start)
Ambiguous cases where the model grader disagrees
High-stakes domains (legal, medical, safety)

Use them when the quality bar is extremely high and volume is low.

Production is usually hybrid: code for format + model for content + human spot-checks.
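As an illustration of the code-grader checks listed in this section, here is a minimal sketch using only the Python standard library. The function names are hypothetical (they are not part of the course notebook), and each check returns a plain pass/fail.

```python
import ast
import json
import re


def is_valid_json(text: str) -> bool:
    """True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def word_count_in_range(text: str, low: int, high: int) -> bool:
    """True if the whitespace-separated word count is within [low, high]."""
    return low <= len(text.split()) <= high


def matches_pattern(text: str, pattern: str) -> bool:
    """True if the regex matches anywhere in the text."""
    return re.search(pattern, text) is not None


def is_valid_python(text: str) -> bool:
    """True if the text parses as Python source (it is parsed, not executed)."""
    try:
        ast.parse(text)
        return True
    except SyntaxError:
        return False
```

Checks like these cost nothing to run and give the same answer every time, which is why they suit the format layer.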

Type	Flexibility	Speed	Cost	Consistency
Code	Low (only checks what is measurable)	Fast	Free	Perfect
Model	High (subjective dimensions)	Medium	$$	Medium
Human	Highest	Slowest	$$$$$	Low-Medium
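One possible way to wire up the hybrid setup (code for format, model for content, human spot-checks) is sketched below. Everything here is an assumption made for illustration: the function name `grade_hybrid`, the choice to give a score of 1 when a format check fails, and the 10% default spot-check rate. The model grader is passed in as a callable so the sketch runs without an API call.

```python
import random
from typing import Callable, Optional, Sequence


def grade_hybrid(
    output: str,
    format_checks: Sequence[Callable[[str], bool]],
    model_grader: Callable[[str], dict],
    spot_check_rate: float = 0.1,
    rng: Optional[random.Random] = None,
) -> dict:
    """Run code checks first, then the model grader, and flag a sample for humans."""
    rng = rng or random.Random()

    # Cheap deterministic checks run first; a failure skips the model call.
    # Scoring a format failure as 1 is an assumption of this sketch.
    if not all(check(output) for check in format_checks):
        return {"score": 1, "stage": "code", "needs_human": False}

    grade = model_grader(output)

    # Flag a random fraction of model-graded outputs for human spot-checking.
    needs_human = rng.random() < spot_check_rate
    return {"score": grade["score"], "stage": "model", "needs_human": needs_human}
```

Running the code checks first means the model grader is only paid for on outputs that already have the right shape.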

Implement model grader

Simple version (but usually biased)

def grade_simple(test_case, output):
    # `chat` is assumed to be a helper defined elsewhere in the notebook that
    # sends the messages to Claude and returns the reply text.
    prompt = f"""Rate this output 1-10:

Task: {test_case['task']}
Output: {output}

Just return the number."""
    
    score_text = chat([{"role": "user", "content": prompt}])
    return float(score_text.strip())

Problem: Claude is biased toward middle scores (6-7). Almost every answer gets ~6, so the scores cannot tell good outputs from bad ones.
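To see whether a grader's scores really bunch in the middle, one option is to summarize the score distribution after a run. The sketch below is illustrative only: `score_spread` is a hypothetical helper, and treating 6-7 as the "middle band" simply mirrors the range mentioned in this section.

```python
from collections import Counter
from statistics import mean, pstdev


def score_spread(scores: list) -> dict:
    """Summarize how spread out a list of 1-10 scores is."""
    counts = Counter(scores)
    # Share of scores that fall in the 6-7 band.
    middle_share = sum(counts[s] for s in (6, 7)) / len(scores)
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),
        "middle_share": middle_share,
        "histogram": dict(sorted(counts.items())),
    }
```

A high `middle_share` together with a small `stdev` suggests the grader is not separating good outputs from bad ones.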

Better version: ask for strengths/weaknesses first

Key insight: asking Claude to articulate strengths and weaknesses first makes it reason before scoring, so the scores become differentiated (spread across 1-10 rather than bunched at 6-7).

import json

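def grade_by_model(test_case: dict, output: str) -> dict:
    """Model grader returning score + reasoning.

    Expects `client` (an Anthropic client) and `model` (a model name)
    to be defined earlier in the notebook.
    """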
    
    eval_prompt = f"""You are an expert reviewer. Evaluate this solution.

Task:
{test_case['task']}

Solution:
{output}

Evaluate against these criteria:
1. Correctness: Does the solution actually solve the task?
2. Format: Is output clean (no extra markdown/explanation)?
3. Completeness: Does it cover all aspects of the task?

Provide evaluation as JSON:
{{
  "strengths": ["strength 1", "strength 2"],
  "weaknesses": ["weakness 1", "weakness 2"],
  "reasoning": "Overall assessment in 2 sentences",
  "score": <integer 1-10>
}}

Scoring rubric:
- 10: Perfect — fully correct, clean format, complete
- 7-9: Good with minor issues
- 4-6: Partially correct, significant issues
- 1-3: Mostly wrong or off-topic"""

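    # Prefill the assistant turn so Claude continues with raw JSON; the stop
    # sequence below ends the reply at the closing fence. The prefill has no
    # trailing newline because the API can reject prefills ending in whitespace.
    messages = [
        {"role": "user", "content": eval_prompt},
        {"role": "assistant", "content": "```json"}
    ]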
    
    msg = client.messages.create(
        model=model,
        max_tokens=500,
        messages=messages,
        stop_sequences=["```"],
        temperature=0
    )
    
    return json.loads(msg.content[0].text.strip())
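The grader's reply is still model output, so it can help to validate it before trusting the fields. The following is a minimal sketch, assuming the four keys requested in the prompt above; `parse_grade` and `REQUIRED_KEYS` are hypothetical names, not part of the course code.

```python
import json

# Keys the eval prompt asks the grader to return.
REQUIRED_KEYS = {"strengths", "weaknesses", "reasoning", "score"}


def parse_grade(raw_text: str) -> dict:
    """Parse the grader's JSON reply and check the expected fields."""
    grade = json.loads(raw_text.strip())
    if not isinstance(grade, dict):
        raise ValueError("Grader reply is not a JSON object")

    missing = REQUIRED_KEYS - grade.keys()
    if missing:
        raise ValueError(f"Grader reply is missing keys: {sorted(missing)}")

    score = grade["score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
        raise ValueError(f"Score must be an integer from 1 to 10, got {score!r}")
    return grade
```

Raising on a malformed reply makes a bad grade visible in the eval run instead of letting it skew the average.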

Integrate into the pipeline

Update run_test_case:

def run_test_case(test_case):
    output = run_prompt(test_case)
    
    # Grade with the model
    grade = grade_by_model(test_case, output)
    
    return {
        "output": output,
        "test_case": test_case,
        "score": grade["score"],
        "reasoning": grade["reasoning"],
        "strengths": grade["strengths"],
        "weaknesses": grade["weaknesses"]
    }

Bonus: storing strengths/weaknesses helps with debugging, because you can see why a case failed.

Run the eval with the real grader

Focus on the worst cases: that is where the prompt needs to improve.

results = run_eval(dataset)

avg_score = sum(r["score"] for r in results) / len(results)
print(f"Average: {avg_score:.2f}")

# Inspect worst cases
worst = sorted(results, key=lambda r: r["score"])[:3]
print("\n=== WORST CASES ===")
for r in worst:
    print(f"\nCase: {r['test_case']['task']}")
    print(f"Score: {r['score']}")
    print(f"Weaknesses: {r['weaknesses']}")

Rubric customization

For a specific domain, make the rubric more detailed. An example for an email drafter:

eval_prompt = f"""Evaluate this email draft.

Context: {test_case['context']}
Draft: {output}

Criteria:
1. Tone match (professional but warm): 0-3 points
2. Addresses customer's concern: 0-3 points
3. Clear next steps: 0-2 points
4. Length appropriate (100-150 words): 0-2 points

Total score = sum (max 10).

Return JSON:
{{
  "criterion_scores": {{"tone": X, "concern": Y, "next_steps": Z, "length": W}},
  "total_score": <sum>,
  "reasoning": "..."
}}"""

Breaking the score down into sub-scores helps you:

Debug exactly which criterion failed
Apply weighted scoring (more important criteria get a higher weight)
Track improvement per criterion
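With per-criterion scores available, the total can be recomputed in code instead of relying on the grader's own arithmetic, and weights can be applied afterwards. The sketch below assumes the four criteria and maximums from the email rubric; the function names and the 0-10 scaling of the weighted score are illustrative choices, not part of the course code.

```python
# Maximum points per criterion, copied from the email rubric.
MAX_POINTS = {"tone": 3, "concern": 3, "next_steps": 2, "length": 2}


def total_from_criteria(criterion_scores: dict, max_points: dict = MAX_POINTS) -> int:
    """Check each sub-score against its maximum and return the sum."""
    if set(criterion_scores) != set(max_points):
        raise ValueError("Criterion names do not match the rubric")
    for name, score in criterion_scores.items():
        if not 0 <= score <= max_points[name]:
            raise ValueError(f"{name}={score} is outside 0-{max_points[name]}")
    return sum(criterion_scores.values())


def weighted_score(criterion_scores: dict, weights: dict,
                   max_points: dict = MAX_POINTS) -> float:
    """Weighted average of each criterion's fraction of its maximum, scaled to 0-10."""
    total_weight = sum(weights[name] for name in max_points)
    fraction = sum(
        weights[name] * criterion_scores[name] / max_points[name]
        for name in max_points
    ) / total_weight
    return 10 * fraction
```

For example, giving "concern" twice the weight of the other criteria makes a weak answer to the customer's actual question pull the score down more than a slightly long email would.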

Meta-evaluation: is the grader right?

Problem: a model grader can be biased, and it can be wrong.

Solution: human spot-check of 10-20 cases.

Count the agreement rate:

Case 1: Model grader: 8/10. You read it → agree.
Case 2: Model grader: 3/10. You read it → agree.
Case 3: Model grader: 9/10. You read it → but there is a bug. Disagree.
...

Agreement > 80% → the grader is trustworthy
< 70% → the rubric is not clear enough; refine the grader prompt

Cycle:

Build grader v1
Human calibrates 20 cases
Refine the grader prompt based on disagreements
Human re-calibrates 20 cases
Agreement > 80% → production
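The agreement count can be automated once both sets of scores are recorded. This is a rough sketch under stated assumptions: counting two scores as agreeing when they differ by at most 1 point is a choice made here (the lesson does not define a tolerance), and the "borderline" label for the 70-80% range is also an assumption, since the lesson only gives the > 80% and < 70% thresholds.

```python
def agreement_rate(model_scores: list, human_scores: list, tolerance: int = 1) -> float:
    """Fraction of cases where model and human scores differ by at most `tolerance`."""
    if not model_scores or len(model_scores) != len(human_scores):
        raise ValueError("Need two non-empty score lists of equal length")
    agree = sum(abs(m - h) <= tolerance for m, h in zip(model_scores, human_scores))
    return agree / len(model_scores)


def calibration_verdict(rate: float) -> str:
    """Map an agreement rate to the thresholds from the lesson."""
    if rate > 0.8:
        return "trust"       # grader is trustworthy
    if rate < 0.7:
        return "refine"      # rubric unclear: refine the grader prompt
    return "borderline"      # range not covered by the lesson's thresholds
```

Looking at the individual disagreements, not just the rate, is what tells you which part of the rubric to rewrite.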

Case study: Meal planner grader

Each criterion has an explicit maximum, so the grader works through a checklist instead of scoring an overall impression.

def grade_meal_plan(test_case, output):
    eval_prompt = f"""Evaluate this athlete meal plan.

Athlete:
- Height: {test_case['height']} cm
- Weight: {test_case['weight']} kg
- Goal: {test_case['goal']}
- Restrictions: {test_case['restrictions']}

Meal plan:
{output}

Criteria (10 points total):
1. Caloric accuracy (target ±200 cal of what athlete needs): 0-3
2. Macro breakdown included (protein/fat/carb in g): 0-2
3. Meal timing specified: 0-2
4. Respects restrictions (no forbidden foods): 0-2
5. Portion sizes in grams (not "a handful"): 0-1

JSON output:
{{
  "criterion_scores": {{"cal": X, "macro": Y, "timing": Z, "restrictions": W, "portion": V}},
  "total_score": <sum>,
  "strengths": ["..."],
  "weaknesses": ["..."],
  "reasoning": "..."
}}"""
    # ... call Claude, parse, return

Anti-patterns

❌ Asking for a score without reasoning

"Score 1-10" → Claude defaults to ~6.

Fix: ask for strengths/weaknesses first, then the score.

❌ Vague rubric

"Evaluate helpfulness" → ambiguous.

Fix: break it down into measurable sub-criteria.

❌ Using the same model to run and to grade

Run prompt v1 with Sonnet, grade with Sonnet → the grader may be biased toward its own output style.

Fix: use a different model (for example, run with Haiku and grade with Sonnet), or make the grading clearly rubric-based to reduce model bias.

❌ Temperature > 0 for the grader

The grader needs reproducible scores.

Fix: temperature=0.

❌ Skipping meta-evaluation

Deploying a grader that has not been calibrated → the grader can be consistently wrong.

Fix: human spot-check of 20 cases before trusting the grader.
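For the reproducibility anti-pattern, a quick way to check a grader is to grade the same output several times and look at the spread. Setting temperature=0 reduces variation but may not remove it entirely, so measuring it can be useful. The helper below is a hypothetical sketch: it takes any grader with the same call shape as grade_by_model, and the default of 3 runs is arbitrary.

```python
from typing import Callable


def score_range_over_runs(
    grade_fn: Callable[[dict, str], dict],
    test_case: dict,
    output: str,
    runs: int = 3,
) -> int:
    """Grade the same output `runs` times and return max score minus min score."""
    scores = [grade_fn(test_case, output)["score"] for _ in range(runs)]
    return max(scores) - min(scores)
```

A range of 0 means the grader repeated itself exactly; a large range on the same input is a sign that the rubric leaves too much open to interpretation.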

Apply it now

Exercise 1: Build a grader for your prompt (30 minutes)

In the notebook:

List 3-5 criteria for the output
Write an eval_prompt with a detailed rubric
Implement the grade_by_model function
Run it on 10 outputs and review the scores

Exercise 2: Meta-evaluate (20 minutes)

Pick 15 outputs that the model has already graded. Grade them yourself (read each one and score it).

Compare:

Model grades 10, you grade 8 → disagree (off by 2)
Model grades 5, you grade 5 → agree
...

Compute agreement rate = #agree / 15.

If it is < 70% → refine the rubric. Repeat.

Summary

🎯 3 types of graders: Code (measurable), Model (subjective), Human (high-stakes).

🎯 Model graders are the most flexible automated option for subjective dimensions.

🎯 Ask for strengths/weaknesses before the score to avoid the "6/10" bias.

🎯 Temperature=0 + a clear rubric for reproducible grading.

🎯 Meta-evaluate: have a human calibrate on 20 cases before trusting the grader.
