- Build a full eval pipeline for one prompt
- Iterate the prompt through 3 versions with measurable improvement
- Generate a report with a breakdown per case and per criterion
- Ship a production-ready prompt with 90%+ confidence

Assignment
Pick one prompt from your own app (or one scenario from lesson 6.20). Build a full eval pipeline for it.

Deliverables
1. File dataset.json: 30+ test cases
Must include:
- 20 normal cases
- 6 edge cases
- 4 adversarial / boundary cases

    [
      {"id": "case_001", "input": {...}, "category": "...", "difficulty": "..."},
      ...
    ]

2. File grader.py: hybrid grader
Must include:
- 1 code validator (format / length / keyword)
- 1 model grader (content quality)
- A weighted combination of the two

    from typing import Dict

    def grade(test_case: dict, output: str) -> Dict:
        """
        Returns:
            {
                "total_score": float,
                "breakdown": {"format": int, "content": int, "safety": int},
                "reasoning": str
            }
        """
        pass

3. File pipeline.py: runner

    async def run_eval_async(dataset: list, prompt_version: str) -> list:
        """Runs full eval pipeline, saves results to file."""
        pass

4. Files results_v1.json, results_v2.json, results_v3.json
Results of the 3 iterations: same dataset, different prompt version.

5. File REPORT.md: changelog + insights

    # Eval Report

    ## v1 Baseline
    Score: X.X/10
    Biggest gaps:
    - ...

    ## v2 Changes
    Applied: [technique]
    Score: Y.Y/10 (+Z%)
    Improvements:
    - Cases 5, 12, 18 fixed (format issues)

    ## v3 Changes
    Applied: [technique]
    Score: A.A/10
    Still struggling with:
    - ...

    ## Final Decision
    Ship v3.
    Reason: Score > 8.5, stable across 3 runs.
    Known limitations: [list]

    ## Future work
    - Add category X to dataset
    - Grader disagreement on case Y (need human calibration)

Suggested workflow
Day 1 (2-3 hours)
Phase 1: Setup (30 min)
- Choose the prompt/app
- Write the goal + success criteria (a 5-criterion checklist)
- Create the project folder
Phase 2: Dataset (1 hour)
- Write 5 seed cases by hand
- Have Claude generate 25 cases
- Review + add 5 edge cases
Phase 3: Grader (1 hour)
- Write grade_syntax (code)
- Write grade_by_model with a rubric
- Calibrate against human scores on 10 cases

Day 2 (2-3 hours)
Phase 4: v1 baseline (30 min)
- Simple prompt
- Run the eval
- Review the worst cases
Phase 5: v2, Clear & Direct (45 min)
- Rewrite the first line + structure
- Re-run
- Compare per case
Phase 6: v3, add a technique (45 min)
- Pick a technique based on grader feedback (examples? guidelines? XML?)
- Re-run
- Compare

Day 3 (1 hour)
Phase 7: Report + Ship (1 hour)
- Write REPORT.md
- Run v3 3 times → check stability
- Commit all files
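
Phase 7 asks for three runs of v3 and a stability check. A minimal sketch of that check, assuming each run is summarized by its average score. The name `is_stable` and the `max_spread` tolerance are assumptions made for illustration; `min_score=8.0` mirrors the "score > 8" ship rule used in this lesson.

```python
from statistics import mean


def is_stable(run_scores, min_score=8.0, max_spread=0.5):
    """Judge repeated runs of one prompt version.

    run_scores: average score of each run, e.g. three runs of v3.
    min_score / max_spread: illustrative thresholds (assumptions).
    Returns (ok, summary).
    """
    if len(run_scores) < 3:
        raise ValueError("need at least 3 runs to judge stability")
    spread = max(run_scores) - min(run_scores)
    summary = {
        "mean": round(mean(run_scores), 2),
        "min": min(run_scores),
        "max": max(run_scores),
        "spread": round(spread, 2),
    }
    # Every run must clear the bar, and the runs must agree with each other.
    ok = min(run_scores) > min_score and spread <= max_spread
    return ok, summary


ok, summary = is_stable([8.5, 8.7, 8.8])
print(ok, summary)
```

Checking the minimum rather than the mean keeps one lucky run from hiding a weak one.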
Reference implementation

    # pipeline.py
    import asyncio
    import json

    from anthropic import AsyncAnthropic

    client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
    MODEL = "claude-sonnet-5-20260205"  # swap in a model ID available to your account


    # Every test case is expected to carry a "task" key.
    async def run_prompt_v1(test_case):
        """Baseline."""
        prompt = f"Please solve: {test_case['task']}"
        msg = await client.messages.create(
            model=MODEL, max_tokens=1000,
            messages=[{"role": "user", "content": prompt}]
        )
        return msg.content[0].text


    async def run_prompt_v2(test_case):
        """+ Clear & Direct + format instruction."""
        prompt = f"""Generate only the solution (code/JSON/regex only, no explanation):
    {test_case['task']}"""
        msg = await client.messages.create(
            model=MODEL, max_tokens=1000,
            messages=[
                {"role": "user", "content": prompt},
                # Prefill opens a code fence (a prefill may not end in whitespace);
                # the stop sequence cuts the reply at the closing fence.
                {"role": "assistant", "content": "```code"}
            ],
            stop_sequences=["```"]
        )
        return msg.content[0].text.strip()


    async def run_prompt_v3(test_case):
        """+ Guidelines + Examples."""
        prompt = f"""Generate code solution. Follow these rules:
    1. Output only the code, no markdown, no explanation
    2. For Python: valid syntax, no imports if not needed
    3. For JSON: valid syntax, minify
    4. For Regex: plain pattern, no delimiters
    <example>
    Task: Function to check even number
    Output:
    def is_even(n):
        return n % 2 == 0
    </example>
    Task: {test_case['task']}"""
        msg = await client.messages.create(
            model=MODEL, max_tokens=1000,
            messages=[
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": "```code"}
            ],
            stop_sequences=["```"]
        )
        return msg.content[0].text.strip()

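    # grader.py
    import ast
    import json
    import re

    from anthropic import AsyncAnthropic

    client = AsyncAnthropic()
    MODEL = "claude-sonnet-5-20260205"  # same model ID as in pipeline.py


    def validate_python(text):
        """Return 10 if the text parses as Python, else 0."""
        try:
            ast.parse(text.strip())
            return 10
        except (SyntaxError, ValueError):
            return 0


    def validate_json(text):
        """Return 10 if the text is valid JSON, else 0."""
        try:
            json.loads(text.strip())
            return 10
        except json.JSONDecodeError:
            return 0


    def validate_regex(text):
        """Return 10 if the text compiles as a regex, else 0."""
        try:
            re.compile(text.strip())
            return 10
        except re.error:
            return 0


    async def grade_by_model(test_case, output):
        """Ask the model for a 1-10 score plus reasoning, returned as a dict."""
        prompt = f"""Evaluate:
    Task: {test_case['task']}
    Output: {output}
    Return JSON: {{"score": 1-10, "reasoning": "..."}}"""
        msg = await client.messages.create(
            model=MODEL, max_tokens=300,
            messages=[
                {"role": "user", "content": prompt},
                # Prefill opens a JSON fence (a prefill may not end in whitespace).
                {"role": "assistant", "content": "```json"}
            ],
            stop_sequences=["```"], temperature=0
        )
        return json.loads(msg.content[0].text.strip())


    async def grade(test_case, output):
        # Code grader: pick the validator from the case's "format" field.
        # Unknown formats (including the default "text") get full format marks.
        format_type = test_case.get("format", "text")
        code_score = {
            "python": validate_python,
            "json": validate_json,
            "regex": validate_regex
        }.get(format_type, lambda _text: 10)(output)

        # Model grader
        model_result = await grade_by_model(test_case, output)
        model_score = model_result["score"]

        # Combine: 40% format validity, 60% content quality
        total = 0.4 * code_score + 0.6 * model_score
        return {
            "total_score": total,
            "code_score": code_score,
            "model_score": model_score,
            "reasoning": model_result["reasoning"]
        }
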
    # Runner: belongs in pipeline.py next to the run_prompt_* functions
    # (it needs asyncio, json, and `from grader import grade`).
    async def run_eval(dataset, prompt_fn, label):
        """Run eval and save results."""
        semaphore = asyncio.Semaphore(5)  # at most 5 cases in flight

        async def _run_case(case):
            async with semaphore:
                output = await prompt_fn(case)
                grade_result = await grade(case, output)
                return {
                    "test_case": case,
                    "output": output,
                    **grade_result
                }

        results = await asyncio.gather(*(_run_case(c) for c in dataset))
        avg = sum(r["total_score"] for r in results) / len(results)
        print(f"{label}: {avg:.2f}/10")

        # Save (UTF-8, because ensure_ascii=False may write non-ASCII text)
        with open(f"results_{label}.json", "w", encoding="utf-8") as f:
            json.dump(results, f, indent=2, ensure_ascii=False)
        return results, avg

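    # main.py
    import asyncio
    import json

    from pipeline import run_eval, run_prompt_v1, run_prompt_v2, run_prompt_v3


    async def main():
        with open("dataset.json", encoding="utf-8") as f:
            dataset = json.load(f)
        _, s1 = await run_eval(dataset, run_prompt_v1, "v1")
        _, s2 = await run_eval(dataset, run_prompt_v2, "v2")
        _, s3 = await run_eval(dataset, run_prompt_v3, "v3")
        print(f"\nImprovement: {s1:.2f} → {s2:.2f} → {s3:.2f}")
        # ":+" prints the sign itself, so a regression shows as "-5%" rather than "+-5%".
        print(f"Delta: {((s3 - s1) / s1) * 100:+.0f}% over baseline")


    if __name__ == "__main__":
        asyncio.run(main())

Expected results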
For the AWS code task:

    v1: 4.2/10
    v2: 6.8/10 (+62%)
    v3: 8.7/10 (+107%)

    Breakdown v3:
    code_score avg: 9.5
    model_score avg: 8.2

Percentages are relative to the v1 baseline. v3 is shippable: the score is stable across 3 runs (8.5-8.8).

Self-review checklist
Dataset
- [ ] 30+ cases?
- [ ] Includes edge + adversarial cases?
- [ ] Saved to dataset.json?
- [ ] Consistent format field?
Grader
- [ ] Code validator implemented?
- [ ] Model grader has a clear rubric?
- [ ] Combine strategy documented?
- [ ] Calibrated against human scores on 10 cases?
Pipeline
- [ ] Async with concurrency?
- [ ] Retry logic?
- [ ] Results saved on each run?
Iteration
- [ ] v1 baseline run?
- [ ] v2 with 1 technique?
- [ ] v3 with an additional technique?
- [ ] Per-case regression check?
Report
- [ ] Score table for v1/v2/v3?
- [ ] Technique applied in each version explained?
- [ ] Known limitations listed?
- [ ] Future work?
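
The checklist asks for retry logic, which the reference `run_eval` does not include. One possible shape is a small wrapper with exponential backoff and jitter. The name `with_retry` and its parameters are assumptions for illustration, not part of any SDK.

```python
import asyncio
import random


async def with_retry(make_call, max_attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Await make_call() and retry on failure with exponential backoff.

    make_call: zero-argument callable returning a fresh awaitable per attempt.
    retry_on: exception types worth retrying (assumption: narrow this in real use).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1x, 2x, 4x, ...
            # Jitter spreads out retries from concurrent cases.
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

Inside `_run_case` this could be used as `output = await with_retry(lambda: prompt_fn(case))`. The Anthropic Python SDK also retries some failures on its own (see its `max_retries` client option), so a wrapper like this is mostly useful for errors you want to handle at the case level.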
Common pitfalls
Pitfall 1: Dataset changes between runs
Error: Generating a new dataset for each version makes the scores impossible to compare.
Fix: Generate once, save, reuse.
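
One way to catch this pitfall is to store a fingerprint of the dataset alongside each results file and refuse to compare runs whose fingerprints differ. A minimal sketch; `dataset_fingerprint` is a hypothetical helper and the sample cases are made up for illustration.

```python
import hashlib
import json


def dataset_fingerprint(dataset):
    """Return a short SHA-256 digest of the dataset's canonical JSON form."""
    canonical = json.dumps(
        dataset, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical cases, for illustration only.
frozen = [
    {"id": "case_001", "category": "normal"},
    {"id": "case_002", "category": "edge"},
]
reordered_keys = [
    {"category": "normal", "id": "case_001"},
    {"category": "edge", "id": "case_002"},
]
regenerated = [
    {"id": "case_001", "category": "normal"},
    {"id": "case_002", "category": "adversarial"},
]

# Key order inside a case is ignored; any change in content is not.
print(dataset_fingerprint(frozen) == dataset_fingerprint(reordered_keys))
print(dataset_fingerprint(frozen) == dataset_fingerprint(regenerated))
```

The order of cases in the list still affects the digest, which is usually what you want for a frozen file.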
Pitfall 2: Grader disagrees with humans
Error: The model grader gives 9; a human reader would give 4.
Fix: Refine the rubric. Add more concrete criteria. Calibrate again.
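
Calibration can be made concrete by comparing human and model scores case by case. A sketch under stated assumptions: scores are held in dicts keyed by case id, and `calibration_report` and its `tolerance` are illustrative names and values, not part of the reference implementation.

```python
def calibration_report(human_scores, model_scores, tolerance=2):
    """Compare human and model scores for the cases both have graded.

    Returns the mean absolute error and the case ids where the two
    differ by more than `tolerance` points.
    """
    shared = sorted(human_scores.keys() & model_scores.keys())
    if not shared:
        raise ValueError("no overlapping case ids")
    gaps = {cid: model_scores[cid] - human_scores[cid] for cid in shared}
    mae = sum(abs(g) for g in gaps.values()) / len(shared)
    disagreements = [cid for cid in shared if abs(gaps[cid]) > tolerance]
    return {"mean_abs_error": round(mae, 2), "disagreements": disagreements}


# Hypothetical scores; case_002 mirrors the "model 9, human 4" situation.
human = {"case_001": 8, "case_002": 4, "case_003": 7}
model = {"case_001": 9, "case_002": 9, "case_003": 6}
report = calibration_report(human, model)
print(report)
```

The flagged cases are the ones to read closely when refining the rubric.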
Pitfall 3: v3 is worse than v2 in a sub-category
Error: The average goes up but one category goes down.
Fix: Review per-category scores. A technique may help group A but hurt group B.
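
A per-category comparison can be sketched as follows, assuming results shaped like the reference `run_eval` output (each entry has `test_case` and `total_score`) and a `category` field on every test case. The function names and the `min_drop` threshold are assumptions for illustration.

```python
from collections import defaultdict


def category_means(results):
    """Average total_score per test-case category."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["test_case"]["category"]].append(r["total_score"])
    return {cat: sum(scores) / len(scores) for cat, scores in buckets.items()}


def category_regressions(old_results, new_results, min_drop=0.5):
    """Return {category: (old_mean, new_mean)} where the mean fell by min_drop or more."""
    old, new = category_means(old_results), category_means(new_results)
    return {
        cat: (old[cat], new[cat])
        for cat in old
        if cat in new and old[cat] - new[cat] >= min_drop
    }


# Hypothetical results: the overall average rises while "edge" gets worse.
v2_results = [
    {"test_case": {"category": "normal"}, "total_score": 7.0},
    {"test_case": {"category": "normal"}, "total_score": 7.0},
    {"test_case": {"category": "edge"}, "total_score": 6.0},
]
v3_results = [
    {"test_case": {"category": "normal"}, "total_score": 9.0},
    {"test_case": {"category": "normal"}, "total_score": 9.0},
    {"test_case": {"category": "edge"}, "total_score": 4.0},
]
regressions = category_regressions(v2_results, v3_results)
print(regressions)
```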
Pitfall 4: Over-engineering the prompt
Error: v4 gets longer with more guidelines → the score plateaus → v5 starts to drop.
Fix: The sweet spot is around v3-v4. Stop iterating once returns diminish.
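
"Diminishing returns" can be turned into an explicit stop rule. A small sketch, assuming one average score per version; `should_stop` and the `min_gain` threshold are illustrative choices, and the v4 score below is hypothetical.

```python
def should_stop(version_scores, min_gain=0.3):
    """True when the latest version gained less than min_gain over the previous one."""
    if len(version_scores) < 2:
        return False  # nothing to compare against yet
    return version_scores[-1] - version_scores[-2] < min_gain


# v1..v3 use the expected results from this lesson; v4 is a made-up follow-up.
keep_going = should_stop([4.2, 6.8, 8.7])
plateau = should_stop([4.2, 6.8, 8.7, 8.8])
print(keep_going, plateau)
```

A drop in score also satisfies the rule, which matches the advice to stop before v5 starts to regress.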
Summary
🎯 This exercise is the Module 4 capstone: you ship one production-ready prompt.
🎯 Full pipeline = dataset + grader + pipeline + report. One responsibility per file.
🎯 Iterate through 3 versions, one technique per version. Don't change 3 things at once.
🎯 The report makes score deltas and known limitations explicit. It is the audit trail for your team.
🎯 Ship when the score is > 8, stable across 3 runs, and human calibration checks out.