The 5-step eval workflow: Building with the Claude API

┌───────────────────────────────────────────────────┐
│                                                   │
│   Step 1: Draft prompt                            │
│        │                                          │
│        ▼                                          │
│   Step 2: Create eval dataset                     │
│        │                                          │
│        ▼                                          │
│   Step 3: Feed through Claude (get outputs)       │
│        │                                          │
│        ▼                                          │
│   Step 4: Feed through Grader (get scores)        │
│        │                                          │
│        ▼                                          │
│   Step 5: Change prompt → repeat 3-4              │
│                                                   │
└───────────────────────────────────────────────────┘

What you will learn

Memorize the 5 steps of a typical eval workflow
Distinguish the role of each step: draft / dataset / run / grade / iterate
Read a score report and know what to do next
Avoid confusing a single run with the iterate loop

Step 1: Draft prompt

Start with a baseline that is as simple as possible.

No engineering, no guidelines. Goal: have a baseline to compare against.

Pitfall: If you start with an already-engineered prompt, you cannot tell how much gain the engineering produced.

prompt_template = """Please answer the user's question:

{question}"""

Step 2: Create eval dataset

The dataset is the list of inputs you will test the prompt against.

3 ways to create a dataset

1. Manual curation
You write the test cases yourself
Pros: accurate for your domain
Cons: slow, can be biased

2. User logs
Extract real queries from prod
Pros: representative of real users
Cons: no logs exist yet if you are pre-launch

3. Claude-generated (synthetic)
Prompt Claude to create test cases
Pros: scales quickly
Cons: may not cover the real distribution

Production usually uses a hybrid: manual + synthetic + logs.

Details are in lesson 6.24.

dataset = [
    {"question": "What's 2+2?"},
    {"question": "How do I make oatmeal?"},
    {"question": "How far away is the Moon?"},
]
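
One way to combine the three sources into a hybrid dataset is to tag each case with its origin and drop duplicate questions. This is only a minimal sketch: the `build_hybrid_dataset` helper, the `source` field, and the sample cases attributed to logs and synthetic generation are illustrative assumptions, not part of the lesson.

```python
def build_hybrid_dataset(manual, logs, synthetic):
    """Merge test cases from three sources, tagging origin and dropping duplicates.

    Earlier sources win when the same question appears twice, so manually
    curated cases take priority over logs, and logs over synthetic cases.
    """
    merged = []
    seen = set()
    for source, cases in (("manual", manual), ("logs", logs), ("synthetic", synthetic)):
        for case in cases:
            # Normalize so "What's 2+2?" and " what's 2+2? " count as one case.
            key = case["question"].strip().lower()
            if key in seen:
                continue
            seen.add(key)
            merged.append({**case, "source": source})
    return merged


# Hypothetical inputs: the log entry repeats a manual case and gets dropped.
manual = [{"question": "What's 2+2?"}]
logs = [{"question": "How do I make oatmeal?"}, {"question": "what's 2+2?"}]
synthetic = [{"question": "How far away is the Moon?"}]

hybrid_dataset = build_hybrid_dataset(manual, logs, synthetic)
```

Keeping the `source` tag makes it possible to report scores per source later, which can show whether synthetic cases are easier or harder than real traffic.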

Step 3: Feed through Claude

Merge each case with the prompt template → send it to Claude → save the output.

Example:

Input: "What's 2+2?"
Output: "2 + 2 = 4"

Input: "How do I make oatmeal?"
Output: "To make oatmeal: 1. Measure 1/2 cup oats... 2. Boil water... 3. Cook 5 min..."

Input: "How far away is the Moon?"
Output: "The Moon is approximately 384,400 km from Earth on average..."

This step simply runs the prompt over every test case.

outputs = []
for case in dataset:
    prompt = prompt_template.format(**case)
    output = call_claude(prompt)
    outputs.append({
        "case": case,
        "output": output
    })
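
The loop above relies on a `call_claude` helper that the lesson does not define. A minimal sketch follows, assuming the official `anthropic` Python SDK is installed and `ANTHROPIC_API_KEY` is set. The `make_call_claude` factory, the `max_tokens` default, and passing the model name in as an argument are assumptions made for illustration; substitute whatever model ID you actually use.

```python
def make_call_claude(client, model, max_tokens=1024):
    """Return a call_claude(prompt) function bound to one client and model.

    `client` is expected to be an anthropic.Anthropic() instance; it is passed
    in (rather than created here) so the eval loop can be tested with a stub.
    """
    def call_claude(prompt):
        response = client.messages.create(
            model=model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        # The reply is a list of content blocks; take the text of the first one.
        return response.content[0].text

    return call_claude


# Real usage (requires network access and an API key):
#   import anthropic
#   call_claude = make_call_claude(anthropic.Anthropic(), model="<your-model-id>")
```

Injecting the client keeps the eval loop independent of the network, so the pipeline itself can be checked with a fake client before any API budget is spent.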

Step 4: Feed through Grader

The grader scores each output.

Example with a model-based grader:

for result in outputs:
    score = grade(result["case"], result["output"])
    result["score"] = score
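
The `grade` function is also left undefined above. One possible shape for a model-based grader is to ask a model for a JSON verdict and parse the score out of it. The sketch below is illustrative only: the grading prompt wording, the `make_grader` name, and the `{"score": ..., "reasoning": ...}` schema are hypothetical, and `ask_model` stands for any function that sends a prompt to a model and returns its text.

```python
import json

GRADER_TEMPLATE = """You are grading an AI assistant's answer.

Question: {question}
Answer: {output}

Rate the answer from 1 (poor) to 10 (excellent).
Respond with JSON only: {{"score": <integer>, "reasoning": "<one sentence>"}}"""


def make_grader(ask_model):
    """Build grade(case, output) on top of any prompt -> text function."""
    def grade(case, output):
        reply = ask_model(GRADER_TEMPLATE.format(question=case["question"], output=output))
        # Tolerate extra text around the JSON by slicing from the first "{" to the last "}".
        start, end = reply.find("{"), reply.rfind("}")
        if start == -1 or end == -1:
            raise ValueError(f"grader reply contained no JSON object: {reply!r}")
        verdict = json.loads(reply[start:end + 1])
        score = int(verdict["score"])
        # Clamp to the 1-10 scale in case the model drifts outside it.
        return max(1, min(10, score))

    return grade
```

Asking for a short reasoning string alongside the score is a design choice: it gives you the kind of feedback ("missing detail about water ratio") that guides the next prompt change.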

Task: What's 2+2?
Output: 2 + 2 = 4
Grader: Score 10/10 — perfect, direct

Task: How do I make oatmeal?
Output: To make oatmeal: ...
Grader: Score 4/10 — correct but missing detail about water ratio, timing

Task: How far away is the Moon?
Output: ~384,400 km
Grader: Score 9/10 — accurate, concise

Aggregate

avg_score = sum(r["score"] for r in outputs) / len(outputs)
# (10 + 4 + 9) / 3 = 7.67

This average is the baseline score for prompt v1.

Step 5: Change prompt → repeat

Based on the scores and the grader feedback, iterate on the prompt:

prompt_template_v2 = """Please answer the user's question:

{question}

Answer with ample detail."""

Run again from step 3 with the new template.

Results v2:

Case	v1 score	v2 score
2+2	10	10
oatmeal	4	9
moon	9	8
Avg	7.67	9.0

v2 improves the average. But "moon" drops by 1 point (possibly because the answer is more verbose than necessary).

Decision:

v2 scores higher → ship v2, or
Iterate to v3 to fix the moon regression
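
The v1 versus v2 comparison in this step can be automated so that a per-case drop such as "moon" is flagged even when the average rises. The following is a small sketch using the scores from the results table; the `compare_runs` helper and the dictionary layout are assumptions made for illustration.

```python
# Per-case scores copied from the v1/v2 results table.
v1_scores = {"2+2": 10, "oatmeal": 4, "moon": 9}
v2_scores = {"2+2": 10, "oatmeal": 9, "moon": 8}


def compare_runs(old, new):
    """Return (old_avg, new_avg, regressions) for two {case: score} dicts.

    `regressions` maps each case whose score dropped to its (old, new) pair.
    """
    old_avg = sum(old.values()) / len(old)
    new_avg = sum(new.values()) / len(new)
    regressions = {
        case: (old[case], new[case])
        for case in old
        if case in new and new[case] < old[case]
    }
    return old_avg, new_avg, regressions


old_avg, new_avg, regressions = compare_runs(v1_scores, v2_scores)
print(f"v1 avg: {old_avg:.2f}, v2 avg: {new_avg:.2f}")  # v1 avg: 7.67, v2 avg: 9.00
for case, (before, after) in regressions.items():
    print(f"regression on {case!r}: {before} -> {after}")  # regression on 'moon': 9 -> 8
```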

Case study: AWS code generator

Goal

A prompt that generates Python/JSON/Regex for AWS tasks.

Dataset (3 cases)

[
  {"task": "Create Python function to validate IAM username"},
  {"task": "Write JSON for CloudWatch Events rule for EC2 stops"},
  {"task": "Regex to extract S3 bucket name from ARN"}
]

v1 Prompt (baseline)

prompt = f"Please provide a solution to:\n{task}"

v1 Run

The output is too verbose, includes explanations, and is wrapped in markdown.

Score: 3.5 (format fails; the syntax is there but not clean)

v2: Clear & Direct + format instruction

prompt = f"""Generate only the code solution for:
{task}

Respond with only the code. No explanations, no markdown."""

Score: 6.8 (better format, but markdown still appears occasionally)

v3: Add prefill "```code"

messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": "```code"}
]
# stop_sequences=["```"]

Score: 9.1 (clean code, no wrap)

Ship v3

Insight: Concrete numbers (3.5 → 6.8 → 9.1) replace a vague "this prompt feels better".
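
To make the v3 prefill idea concrete, here is a sketch that assembles the full set of request arguments, including the stop sequence that the lesson shows only as a comment. `build_prefill_request` is a hypothetical helper, and the sketch assumes the model you call accepts a trailing assistant turn as a prefill; `model`, `max_tokens`, `messages`, and `stop_sequences` are parameters of the SDK's `messages.create`.

```python
def build_prefill_request(task, model, max_tokens=1024):
    """Build keyword arguments for client.messages.create(**kwargs)."""
    prompt = (
        f"Generate only the code solution for:\n{task}\n\n"
        "Respond with only the code. No explanations, no markdown."
    )
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {"role": "user", "content": prompt},
            # Prefill: the model continues from inside an already-open code fence.
            {"role": "assistant", "content": "```code"},
        ],
        # Generation stops at the closing fence, so only the code body comes back.
        "stop_sequences": ["```"],
    }


request = build_prefill_request("Regex to extract S3 bucket name from ARN", model="<your-model-id>")
# With a real client: response = client.messages.create(**request)
```

Because the response starts after the prefill and ends before the stop sequence, the returned text should need no fence stripping before it is graded.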

Scaling up

The example above uses 3 cases. Real production:

Scale	Dataset (cases)	Runtime	Cost
Prototype	10-30	1 min	$0.01
Dev	100-500	10 min	$0.50
Production	1000-5000	1-2 hours	$5-20

Note: The full eval does not need to run on every commit. A typical schedule:

Pre-commit: 30 cases (fast)
CI nightly: 500 cases
Monthly: 5000 cases (comprehensive)
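
A tiered schedule like the one above needs a way to pick a smaller subset of the full dataset. One option, sketched here under the assumption that the dataset is a plain Python list, is a seeded random sample; the `TIER_SIZES` names and the `select_cases` helper are hypothetical.

```python
import random

# Tier sizes follow the schedule in this section; the tier names are illustrative.
TIER_SIZES = {"pre-commit": 30, "nightly": 500, "monthly": 5000}


def select_cases(dataset, tier, seed=0):
    """Pick a reproducible subset of `dataset` sized for the given tier."""
    size = min(TIER_SIZES[tier], len(dataset))
    # A fixed seed keeps the subset identical between runs, so score changes
    # come from the prompt rather than from a different sample.
    return random.Random(seed).sample(dataset, size)


full_dataset = [{"question": f"question {i}"} for i in range(100)]
fast_subset = select_cases(full_dataset, "pre-commit")
```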

Integration with prompt engineering

Module 3 (engineering) + Module 4 (eval) = workflow:

┌──────────────────────────────────────────────────┐
│                                                  │
│   1. Draft v1 (baseline)                         │
│   2. Eval v1 → score X                           │
│   3. Apply technique 1 (Clear & Direct)          │
│   4. Eval v2 → score Y                           │
│       X → Y improved? → continue                 │
│       Y <= X? → rollback, try another technique  │
│   5. Apply technique 2 (Specific)                │
│   6. Eval v3 → score Z                           │
│   ... repeat                                     │
│                                                  │
│   Stop when:                                     │
│   - Score > threshold (e.g. 8.5)                 │
│   - Diminishing returns (v5 → v6 only +0.1)      │
│                                                  │
└──────────────────────────────────────────────────┘
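
The loop in the diagram can be sketched in code as follows. This is a hedged illustration rather than a prescribed implementation: `iterate_prompt`, the `(name, function)` technique pairs, and the `evaluate` callback (which would wrap steps 3 and 4 and return an average score) are all assumptions, and the defaults mirror the example threshold of 8.5 and the +0.1 gain mentioned above.

```python
def iterate_prompt(baseline_prompt, techniques, evaluate, threshold=8.5, min_gain=0.1):
    """Apply techniques one at a time, keeping only those that raise the score.

    techniques: list of (name, function) pairs; each function maps a prompt
                string to a modified prompt string.
    evaluate:   function mapping a prompt string to its average eval score.
    """
    best_prompt = baseline_prompt
    best_score = evaluate(baseline_prompt)
    history = [("baseline", best_score, "kept")]

    for name, apply_technique in techniques:
        candidate = apply_technique(best_prompt)
        score = evaluate(candidate)
        gain = score - best_score

        if gain <= 0:
            # No improvement: roll back and try the next technique.
            history.append((name, score, "rolled back"))
            continue

        best_prompt, best_score = candidate, score
        history.append((name, score, "kept"))

        # Stop once the score clears the threshold or the gains become marginal.
        if best_score > threshold or gain <= min_gain:
            break

    return best_prompt, best_score, history
```

Each technique is applied on top of the best prompt so far, one at a time, which keeps the comparison between consecutive versions a single-variable change.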

Anti-patterns

❌ Skip baseline

Starting with an engineered prompt means you cannot tell how much the engineering contributed.

Fix: Always keep a simple v1 baseline.

❌ Change 2+ things per iteration

"Try examples + XML + guidelines all at once" → the score goes up, but you cannot tell which change caused it.

Fix: 1 change per iteration. Isolate the variable.

❌ Dataset too small

3 cases → high noise. If run 1 and run 2 differ by more than 2 points, the score is not reliable.

Fix: Use 30+ cases for a stable score.

❌ Not saving outputs

You run the eval, look at the score, and close the notebook. Later you cannot look up the outputs to debug.

Fix: Save results to a JSON file on every run.
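
Saving each run takes only a few lines. The sketch below writes one timestamped JSON file per run; the `save_results` name, the `eval_runs` directory, and the file naming scheme are assumptions chosen for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_results(results, prompt_version, out_dir="eval_runs"):
    """Write one eval run to a timestamped JSON file and return its path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    file_path = out_path / f"{prompt_version}_{stamp}.json"
    payload = {"prompt_version": prompt_version, "created_at": stamp, "results": results}
    # ensure_ascii=False keeps non-English outputs readable in the file.
    file_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
    return file_path
```

Calling `save_results(outputs, "v1")` after the grading loop keeps the case, output, and score together, so a low score can be traced back to the exact text that earned it.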

❌ Aggregate hides regressions

Avg v1: 7.0. Avg v2: 7.2. It looks better, but case 5 actually dropped from 10 → 3 while case 6 rose from 4 → 8.

Fix: Review per-case scores, not just the average.

Apply it now

Exercise 1: Plan an eval pipeline (15 minutes)

Design an eval for your own app:

## My Eval Plan

Prompt: [prompt name]

Dataset:
- Size: ___ cases
- Source: manual | synthetic | logs | hybrid
- Covers: [list categories/edge cases]

Run:
- Model: Sonnet / Haiku
- Temperature: [0-1]
- Concurrency: [1-5]

Grader:
- Type: code / model / hybrid
- Criteria: [list 3-5 criteria]
- Scale: 1-10 / pass-fail

Iteration budget:
- Time: [X] hours
- Cost: [Y] $
- Target score: ≥ [Z] / 10

Exercise 2: Build a mini-pipeline (30 minutes)

In a notebook, implement 3 functions:

def run_prompt(test_case):
    """Merge + call Claude, return output string."""
    pass

def grade(test_case, output):
    """Return score 1-10. Start with a hardcoded 10."""
    return 10

def run_eval(dataset):
    """Loop dataset, return list results."""
    pass

Test with 3 cases. Run it. Look at the aggregate.

(The next lesson fills in a real grader.)

Summary

🎯 5 steps: Draft → Dataset → Run → Grade → Iterate.

🎯 Step 5 is a loop, not a one-off. Repeat until you reach the threshold.

🎯 1 change per iteration: isolate the variable.

🎯 Per-case scores matter more than the average: they catch regressions.

🎯 30+ cases for a stable score. 100+ for production-critical prompts.
