Exercise: End-to-end eval for your app

Module 4: Prompt Evaluation · Intermediate · 90 minutes

What you will learn
  • Build a full eval pipeline for 1 prompt
  • Iterate the prompt through 3 versions with measurable improvement
  • Generate a report with a breakdown per case and per criterion
  • Ship a production-ready prompt with 90%+ confidence

Assignment

Choose 1 prompt from your app (or 1 scenario from lesson 6.20). Build a full eval pipeline for it.

Deliverables

1. File dataset.json: 30+ test cases

Must include:

  • 20 normal cases
  • 6 edge cases
  • 4 adversarial / boundary cases
[
  {"id": "case_001", "input": {...}, "category": "...", "difficulty": "..."},
  ...
]
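
The composition requirements above can be checked mechanically before any eval run. The following is a minimal sketch, assuming each case's `category` field takes one of the hypothetical values `normal`, `edge`, or `adversarial` (the schema above leaves the actual category values open). `check_dataset` and `REQUIRED` are illustrative names, not part of the exercise.

```python
from collections import Counter

# Minimum counts per category, taken from the deliverable list above.
# The category names themselves are hypothetical labels.
REQUIRED = {"normal": 20, "edge": 6, "adversarial": 4}


def check_dataset(cases: list) -> list:
    """Return a list of problems; an empty list means the dataset passes."""
    problems = []

    # Every case needs a unique id so results can be compared per case.
    ids = [case.get("id") for case in cases]
    duplicates = [i for i, n in Counter(ids).items() if n > 1]
    if duplicates:
        problems.append(f"duplicate ids: {duplicates}")

    # Count cases per category and compare against the minimums.
    counts = Counter(case.get("category") for case in cases)
    for category, minimum in REQUIRED.items():
        if counts[category] < minimum:
            problems.append(
                f"{category}: {counts[category]} cases, need at least {minimum}"
            )
    return problems
```

Running this right after loading dataset.json catches a short or unbalanced dataset before any API calls are spent on it.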

2. File grader.py: Hybrid grader

Must include:

  • 1 code validator (format / length / keyword)
  • 1 model grader (content quality)
  • Weighted combine
from typing import Dict

def grade(test_case: dict, output: str) -> Dict:
    """
    Returns:
      {
        "total_score": float,
        "breakdown": {"format": int, "content": int, "safety": int},
        "reasoning": str
      }
    """
    pass
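
The code validator in this deliverable covers format, length, and keyword checks. As a rough sketch of the length and keyword parts, assuming two hypothetical test-case fields, `max_chars` and `required_keywords`, that are not part of the schema above:

```python
def validate_length(output: str, max_chars: int) -> int:
    """Score 10 if the output is non-empty and within the limit, else 0."""
    stripped = output.strip()
    return 10 if 0 < len(stripped) <= max_chars else 0


def validate_keywords(output: str, required: list) -> int:
    """Score 0-10 by the fraction of required keywords present (case-insensitive)."""
    if not required:
        return 10  # nothing required, nothing to fail
    lowered = output.lower()
    hits = sum(1 for keyword in required if keyword.lower() in lowered)
    return round(10 * hits / len(required))


def code_checks(test_case: dict, output: str) -> dict:
    """Run the deterministic checks; the field names and the 2000 default are assumptions."""
    return {
        "length": validate_length(output, test_case.get("max_chars", 2000)),
        "keywords": validate_keywords(output, test_case.get("required_keywords", [])),
    }
```

Checks like these are cheap and deterministic, so they can run on every output before the model grader is called.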

3. File pipeline.py: Runner

async def run_eval_async(dataset: list, prompt_version: str) -> list:
    """Runs full eval pipeline, saves results to file."""
    pass

4. Files results_v1.json, results_v2.json, results_v3.json

Results of the 3 iterations. Same dataset, different prompt version.

5. File REPORT.md: Changelog + insights

# Eval Report

## v1 Baseline
Score: X.X/10
Biggest gaps:
- ...

## v2 Changes
Applied: [technique]
Score: Y.Y/10 (+Z%)
Improvements:
- Cases 5, 12, 18 fixed (format issues)

## v3 Changes
Applied: [technique]
Score: A.A/10
Still struggling with:
- ...

## Final Decision
Ship v3.
Reason: Score > 8.5, stable across 3 runs.
Known limitations: [list]

## Future work
- Add category X to dataset
- Grader disagreement on case Y (need human calibration)
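
The per-version scores and deltas in this template can be rendered from the run averages rather than typed by hand. A minimal sketch, assuming the averages are passed in as a dict ordered from baseline to latest version; `score_table` is a hypothetical helper name:

```python
def score_table(averages: dict) -> str:
    """Render a Markdown table of average scores with change vs. the first version."""
    versions = list(averages)
    baseline = averages[versions[0]]
    lines = ["| Version | Score | vs baseline |", "|---|---|---|"]
    for version in versions:
        score = averages[version]
        if version == versions[0] or baseline == 0:
            change = "n/a"  # the baseline has nothing to compare against
        else:
            change = f"{(score - baseline) / baseline * 100:+.0f}%"
        lines.append(f"| {version} | {score:.1f}/10 | {change} |")
    return "\n".join(lines)
```

The returned string can be pasted into REPORT.md or written there by the pipeline.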

Suggested workflow

Day 1 (2-3 hours)

Phase 1: Setup (30 minutes)

  • Choose the prompt/app
  • Write the goal + success criteria (5-criteria checklist)
  • Create the project folder

Phase 2: Dataset (1 hour)

  • 5 manual seed cases
  • Have Claude generate 25 cases
  • Review + add 5 edge cases

Phase 3: Grader (1 hour)

  • Write grade_syntax (code)
  • Write grade_by_model with a rubric
  • Human-calibrate 10 cases

Day 2 (2-3 hours)

Phase 4: v1 baseline (30 minutes)

  • Simple prompt
  • Run eval
  • Review worst cases

Phase 5: v2, Clear & Direct (45 minutes)

  • Rewrite first line + structure
  • Re-run
  • Compare per-case

Phase 6: v3, add a technique (45 minutes)

  • Pick a technique based on grader feedback (examples? guidelines? XML?)
  • Re-run
  • Compare

Day 3 (1 hour)

Phase 7: Report + Ship (1 hour)

  • Write REPORT.md
  • Run v3 3 times → check stability
  • Commit all files
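
The "compare per-case" step can be done with a small helper instead of reading two result files side by side. The sketch below assumes each result row has the shape saved by the reference implementation in this lesson (a `test_case` dict with an `id`, plus a `total_score`); the 1.0-point threshold is an arbitrary choice, not something the exercise prescribes.

```python
def compare_per_case(old_results: list, new_results: list, threshold: float = 1.0) -> dict:
    """Split case ids into improved / regressed / unchanged between two runs."""
    old_scores = {r["test_case"]["id"]: r["total_score"] for r in old_results}
    report = {"improved": [], "regressed": [], "unchanged": []}
    for row in new_results:
        case_id = row["test_case"]["id"]
        if case_id not in old_scores:
            continue  # only compare cases present in both runs
        delta = row["total_score"] - old_scores[case_id]
        if delta >= threshold:
            report["improved"].append(case_id)
        elif delta <= -threshold:
            report["regressed"].append(case_id)
        else:
            report["unchanged"].append(case_id)
    return report
```

The `regressed` list is the one to read first: a higher average can hide individual cases that the new prompt broke.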

Reference implementation

# pipeline.py
import asyncio
import json
from anthropic import AsyncAnthropic

client = AsyncAnthropic()
MODEL = "claude-sonnet-5-20260205"  # replace with a model ID available to your account


async def run_prompt_v1(test_case):
    """Baseline."""
    prompt = f"Please solve: {test_case['task']}"
    msg = await client.messages.create(
        model=MODEL, max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    )
    return msg.content[0].text


async def run_prompt_v2(test_case):
    """+ Clear & Direct + format instruction."""
    prompt = f"""Generate only the solution (code/JSON/regex only, no explanation):

{test_case['task']}"""
    msg = await client.messages.create(
        model=MODEL, max_tokens=1000,
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": "```code\n"}
        ],
        stop_sequences=["```"]
    )
    return msg.content[0].text


async def run_prompt_v3(test_case):
    """+ Guidelines + Examples."""
    prompt = f"""Generate code solution. Follow these rules:
1. Output only the code, no markdown, no explanation
2. For Python: valid syntax, no imports if not needed
3. For JSON: valid syntax, minify
4. For Regex: plain pattern, no delimiters

<example>
Task: Function to check even number
Output:
def is_even(n):
    return n % 2 == 0
</example>

Task: {test_case['task']}"""
    msg = await client.messages.create(
        model=MODEL, max_tokens=1000,
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": "```code\n"}
        ],
        stop_sequences=["```"]
    )
    return msg.content[0].text


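# grader.py
# Reuses `client` and `MODEL` from pipeline.py above; import them if you split
# this listing into separate files.
import ast
import json
import re

def validate_python(text):
    try:
        ast.parse(text.strip())
        return 10
    except Exception:  # any parse failure counts as invalid
        return 0

def validate_json(text):
    try:
        json.loads(text.strip())
        return 10
    except Exception:
        return 0

def validate_regex(text):
    try:
        re.compile(text.strip())
        return 10
    except Exception:
        return 0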

async def grade_by_model(test_case, output):
    prompt = f"""Evaluate:
Task: {test_case['task']}
Output: {output}

Return JSON: {{"score": 1-10, "reasoning": "..."}}"""
    msg = await client.messages.create(
        model=MODEL, max_tokens=300,
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": "```json\n"}
        ],
        stop_sequences=["```"], temperature=0
    )
    return json.loads(msg.content[0].text.strip())


async def grade(test_case, output):
    # Code grader
    format_type = test_case.get("format", "text")
    code_score = {
        "python": validate_python,
        "json": validate_json,
        "regex": validate_regex
    }.get(format_type, lambda x: 10)(output)
    
    # Model grader
    model_result = await grade_by_model(test_case, output)
    model_score = model_result["score"]
    
    # Combine
    total = 0.4 * code_score + 0.6 * model_score
    
    return {
        "total_score": total,
        "code_score": code_score,
        "model_score": model_score,
        "reasoning": model_result["reasoning"]
    }


async def run_eval(dataset, prompt_fn, label):
    """Run eval and save results."""
    semaphore = asyncio.Semaphore(5)
    
    async def _run_case(case):
        async with semaphore:
            output = await prompt_fn(case)
            grade_result = await grade(case, output)
            return {
                "test_case": case,
                "output": output,
                **grade_result
            }
    
    results = await asyncio.gather(*(_run_case(c) for c in dataset))
    
    avg = sum(r["total_score"] for r in results) / len(results)
    print(f"{label}: {avg:.2f}/10")
    
    # Save (UTF-8 so that ensure_ascii=False output is written portably)
    with open(f"results_{label}.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    
    return results, avg


# main.py
async def main():
    with open("dataset.json", encoding="utf-8") as f:
        dataset = json.load(f)
    
    r1, s1 = await run_eval(dataset, run_prompt_v1, "v1")
    r2, s2 = await run_eval(dataset, run_prompt_v2, "v2")
    r3, s3 = await run_eval(dataset, run_prompt_v3, "v3")
    
    print(f"\nImprovement: {s1:.2f} → {s2:.2f} → {s3:.2f}")
    if s1 > 0:  # avoid dividing by zero when the baseline scores 0
        print(f"Delta: {((s3 - s1) / s1) * 100:+.0f}% over baseline")


if __name__ == "__main__":
    asyncio.run(main())

Expected results

For an AWS code task:

v1: 4.2/10
v2: 6.8/10  (+62% vs. v1)
v3: 8.7/10  (+107% vs. v1)

Breakdown v3:
  code_score avg: 9.5
  model_score avg: 8.2

v3 is ready to ship. The score is stable across 3 runs (8.5-8.8).
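
"Stable across 3 runs" is easier to apply when it is written down as a rule. A minimal sketch, assuming a ship threshold of 8.0 and a maximum spread of 0.5 points between the best and worst run; both numbers are assumptions to tune for your own app, and `is_stable` is a hypothetical helper.

```python
from statistics import mean


def is_stable(run_averages: list, min_score: float = 8.0, max_spread: float = 0.5) -> bool:
    """True if every run beats min_score and the runs stay within max_spread."""
    if len(run_averages) < 3:
        return False  # the exercise asks for at least 3 runs
    spread = max(run_averages) - min(run_averages)
    return min(run_averages) > min_score and spread <= max_spread


def summarize(run_averages: list) -> str:
    """One-line summary suitable for the report."""
    spread = max(run_averages) - min(run_averages)
    return f"mean={mean(run_averages):.2f}, spread={spread:.2f}"
```

With run averages such as 8.5, 8.7, and 8.8, the check passes because the worst run clears the threshold and the spread is 0.3.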

Self-review checklist

Dataset

  • [ ] 30+ cases?
  • [ ] Includes edge + adversarial cases?
  • [ ] Saved to dataset.json?
  • [ ] Consistent format field?

Grader

  • [ ] Code validator implemented?
  • [ ] Model grader has a clear rubric?
  • [ ] Combine strategy documented?
  • [ ] Human-calibrated on 10 cases?

Pipeline

  • [ ] Async with concurrency?
  • [ ] Retry logic?
  • [ ] Results saved on each run?

Iteration

  • [ ] v1 baseline run?
  • [ ] v2 with 1 technique?
  • [ ] v3 with an additional technique?
  • [ ] Per-case regression check?

Report

  • [ ] Score table for v1/v2/v3?
  • [ ] Technique applied in each version explained?
  • [ ] Known limitations listed?
  • [ ] Future work?
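
The checklist asks for retry logic, which the reference implementation does not include. One possible shape is a generic wrapper with exponential backoff, sketched below. `with_retry` is a hypothetical helper, and catching every `Exception` is a simplification: a real pipeline would likely retry only transient failures such as rate limits or timeouts.

```python
import asyncio
import random


async def with_retry(make_call, max_attempts: int = 3, base_delay: float = 1.0):
    """Await make_call(), retrying with exponential backoff on any exception.

    make_call is a zero-argument function that returns a fresh awaitable,
    for example: lambda: prompt_fn(case).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Backoff: base, 2x base, 4x base, ... plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay * (1 + 0.1 * random.random()))
```

Inside `_run_case`, the call `await prompt_fn(case)` could become `await with_retry(lambda: prompt_fn(case))`, so that one transient failure does not abort the whole `asyncio.gather`.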

Common pitfalls

Pitfall 1: Dataset changes between runs

Error: Generating a new dataset for each version makes the scores impossible to compare.

Fix: Generate once, save, and reuse.
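
One way to enforce this is to fingerprint the dataset and store the fingerprint next to each results file, so that a mismatch is caught before two runs are compared. A minimal sketch; `dataset_fingerprint` is a hypothetical helper, and storing the hash alongside the results is an assumption about how you organize your files.

```python
import hashlib
import json


def dataset_fingerprint(dataset: list) -> str:
    """SHA-256 hash of the dataset content, independent of key order within a case."""
    canonical = json.dumps(dataset, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def same_dataset(dataset_a: list, dataset_b: list) -> bool:
    """True if both runs used identical cases in the same order."""
    return dataset_fingerprint(dataset_a) == dataset_fingerprint(dataset_b)
```

Because the hash covers the whole list, adding, removing, editing, or reordering a case changes the fingerprint.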

Pitfall 2: Grader disagrees with humans

Error: The model grader gives 9, but a human reader rates the same output 4.

Fix: Refine the rubric. Add more concrete criteria. Calibrate again.
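
Calibration is easier to repeat when the disagreement is measured. The sketch below compares model-grader scores with human scores on the same case ids. It assumes both are given as dicts mapping a case id to a 1-10 score; the tolerance of 2 points is an arbitrary choice, and `calibration_report` is a hypothetical helper.

```python
def calibration_report(model_scores: dict, human_scores: dict, tolerance: int = 2) -> dict:
    """Compare model-grader scores with human scores on the shared case ids."""
    shared = sorted(set(model_scores) & set(human_scores))
    if not shared:
        raise ValueError("no overlapping case ids to compare")
    # Positive gap: the model grader is more generous than the human.
    gaps = {cid: model_scores[cid] - human_scores[cid] for cid in shared}
    disagreements = [cid for cid in shared if abs(gaps[cid]) > tolerance]
    return {
        "mean_abs_gap": sum(abs(g) for g in gaps.values()) / len(shared),
        "disagreements": disagreements,
    }
```

The cases listed under `disagreements` are the ones to reread when refining the rubric; rerunning the report after each rubric change shows whether the gap is shrinking.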

Pitfall 3: v3 is worse than v2 in a sub-category

Error: The average goes up but 1 category goes down.

Fix: Review per-category scores. A technique may help group A but hurt group B.
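
A per-category review can be sketched as follows, assuming each result row keeps the original test case (with its `category` field) and a `total_score`, as in the results saved by the reference implementation. The function names are illustrative.

```python
from collections import defaultdict


def category_averages(results: list) -> dict:
    """Average total_score per test-case category."""
    scores = defaultdict(list)
    for row in results:
        category = row["test_case"].get("category", "uncategorized")
        scores[category].append(row["total_score"])
    return {category: sum(vals) / len(vals) for category, vals in scores.items()}


def category_regressions(old_results: list, new_results: list) -> dict:
    """Categories whose average dropped, mapped to (old average, new average)."""
    old_avg = category_averages(old_results)
    new_avg = category_averages(new_results)
    return {
        category: (old_avg[category], new_avg[category])
        for category in old_avg
        if category in new_avg and new_avg[category] < old_avg[category]
    }
```

An empty result from `category_regressions` means no category got worse, even if some improved less than others.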

Pitfall 4: Over-engineering the prompt

Error: v4 is longer with more guidelines and the score plateaus; at v5 it starts to drop.

Fix: The sweet spot is around v3-v4. Stop iterating when returns diminish.

Summary

🎯 This exercise is the Module 4 capstone: you ship 1 production-ready prompt.

🎯 Full pipeline = dataset + grader + pipeline + report. Each file has 1 responsibility.

🎯 Iterate through 3 versions, 1 technique per version. Do not change 3 things at once.

🎯 The report states score deltas and known limitations clearly. It is the audit trail for the team.

🎯 Ship when score > 8, stable across 3 runs, and human calibration checks out.
