Code-based grading: Validation with Python

4. Prompt Evaluation | Intermediate | 20 minutes

For a "generate JSON" task, a model grader can:
  • Return 7/10 even when the JSON is malformed (Claude does not actually parse it)
  • Cost 500 tokens per grade
  • Give inconsistent scores across runs

What you will learn
  • Write validators for JSON, Python, regex, length, and keywords
  • Combine code and model graders into a hybrid score
  • Update the dataset format with a format field so each output is routed to the right validator
  • Know when a code grader is more effective than a model grader

Basic validators

Python syntax

ast.parse does not execute the code (so it is safe); it only checks syntax.

import ast

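def validate_python(text: str) -> int:
    """Returns 10 if valid Python, 0 if syntax error."""
    try:
        # Parse only: the code is never executed
        ast.parse(text.strip())
        return 10
    except (SyntaxError, ValueError):
        # ValueError covers null bytes in the source on older Python versions
        return 0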

JSON

import json

def validate_json(text: str) -> int:
    """Returns 10 if the text parses as JSON, 0 otherwise."""
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 0

Regex

import re

def validate_regex(text: str) -> int:
    """Returns 10 if the text compiles as a regular expression, 0 otherwise."""
    try:
        re.compile(text.strip())
        return 10
    except re.error:
        return 0

Length

def validate_length(text: str, min_words: int, max_words: int) -> int:
    """Returns 10 inside [min_words, max_words], partial credit outside."""
    words = len(text.split())
    if min_words <= words <= max_words:
        return 10
    # Too short: partial credit proportional to the word count
    if words < min_words:
        return max(0, int(10 * words / min_words))
    # Too long: lose 1 point for every full 10 words over the limit
    return max(0, 10 - (words - max_words) // 10)

Keyword presence

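def validate_contains(text: str, required_keywords: list) -> int:
    """Returns 0-10 in proportion to how many required keywords appear."""
    if not required_keywords:
        return 10  # nothing required, so nothing can be missing (avoids division by zero)
    text_lower = text.lower()
    found = sum(1 for kw in required_keywords if kw.lower() in text_lower)
    return int(10 * found / len(required_keywords))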

Keyword absence (safety)

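def validate_no_profanity(text: str) -> int:
    """Returns 0 if any blocked word appears in the text, 10 otherwise."""
    blocked = ["word1", "word2"]  # expand
    text_lower = text.lower()  # lowercase once, not once per blocked word
    if any(w in text_lower for w in blocked):
        return 0
    return 10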
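
The check above matches substrings, so a blocked word also fires when it appears inside a longer, harmless word. A possible refinement, assuming whole-word matching is what you want, is sketched below. The names BLOCKED_WORDS and validate_no_blocked_words are hypothetical and the word list is a placeholder.

```python
import re

# Placeholder list, mirroring the example above; replace with your own.
BLOCKED_WORDS = ["word1", "word2"]

# \b anchors restrict matches to whole words; re.escape keeps any
# punctuation in a blocked word from being read as regex syntax.
_BLOCKED_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in BLOCKED_WORDS) + r")\b",
    re.IGNORECASE,
)

def validate_no_blocked_words(text: str) -> int:
    """Returns 0 if a blocked word appears as a whole word, 10 otherwise."""
    return 0 if _BLOCKED_RE.search(text) else 10

print(validate_no_blocked_words("this contains Word1 here"))  # 0
print(validate_no_blocked_words("password1x is fine"))        # 10
```

Whether whole-word matching is the right trade-off depends on the blocklist: it reduces false positives but will miss blocked words glued to other characters.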

Routing validators by format

The dataset format now has a format field:

[
  {"task": "Python function to validate IAM username", "format": "python"},
  {"task": "JSON for EventBridge rule on EC2 stop", "format": "json"},
  {"task": "Regex to extract bucket name from ARN", "format": "regex"}
]

Update the dataset generator:

# In the generate_dataset prompt:
prompt = f"""...
Each object must have:
- task: string description
- format: "python", "json", or "regex"
..."""

Route each output to the matching validator:

def grade_syntax(output: str, test_case: dict) -> int:
    """Route to correct validator based on expected format."""
    format_type = test_case.get("format", "text")
    
    if format_type == "python":
        return validate_python(output)
    elif format_type == "json":
        return validate_json(output)
    elif format_type == "regex":
        return validate_regex(output)
    else:
        return 10  # No validation for plain text
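
As the number of formats grows, the if/elif chain can be replaced by a lookup table. The sketch below is one way to do that, not the course's implementation. The names VALIDATORS, _parses, and grade_syntax_by_table are hypothetical, and the validators are rebuilt inline so the block runs on its own.

```python
import ast
import json
import re
from typing import Callable, Dict, Tuple

def _parses(parser: Callable[[str], object],
            errors: Tuple[type, ...]) -> Callable[[str], int]:
    """Build a binary validator: 10 if parser accepts the text, else 0."""
    def validator(text: str) -> int:
        try:
            parser(text.strip())
            return 10
        except errors:
            return 0
    return validator

# Hypothetical registry: format name -> validator function.
VALIDATORS: Dict[str, Callable[[str], int]] = {
    "python": _parses(ast.parse, (SyntaxError, ValueError)),
    "json": _parses(json.loads, (json.JSONDecodeError,)),
    "regex": _parses(re.compile, (re.error,)),
}

def grade_syntax_by_table(output: str, test_case: dict) -> int:
    """Route to the validator registered for the test case's format."""
    validator = VALIDATORS.get(test_case.get("format", "text"))
    # Unknown or missing format: treat as plain text, no validation.
    return validator(output) if validator else 10

print(grade_syntax_by_table('{"a": 1}', {"format": "json"}))  # 10
print(grade_syntax_by_table("{a: 1}", {"format": "json"}))    # 0
```

Adding a new format then means adding one entry to the table rather than another branch.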

Combine code + model grading

Hybrid score:

def grade(test_case: dict, output: str) -> dict:
    # Model grader for content quality
    model_result = grade_by_model(test_case, output)
    
    # Code grader for format/syntax
    syntax_score = grade_syntax(output, test_case)
    
    # Combine (equal weight)
    final_score = (model_result["score"] + syntax_score) / 2
    
    return {
        "final_score": final_score,
        "model_score": model_result["score"],
        "syntax_score": syntax_score,
        "reasoning": model_result["reasoning"]
    }

Weighted combine

If syntax is critical (the product is code), give it more weight:

final_score = 0.3 * model_result["score"] + 0.7 * syntax_score

With these weights, a format failure (syntax_score = 0) caps the final score at 0.3 × 10 = 3, no matter how good the content is.
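
To make the weighting explicit and easy to tune, the formula can be wrapped in a small helper. This is a minimal sketch. The function name combine_scores and the syntax_weight parameter are assumptions for illustration, not part of the original grader.

```python
def combine_scores(model_score: float, syntax_score: float,
                   syntax_weight: float = 0.7) -> float:
    """Weighted average of a model score and a syntax score (both 0-10)."""
    if not 0.0 <= syntax_weight <= 1.0:
        raise ValueError("syntax_weight must be between 0 and 1")
    # The two weights always sum to 1, so the result stays on the 0-10 scale.
    return (1.0 - syntax_weight) * model_score + syntax_weight * syntax_score

# Perfect content but broken format: capped near 3 with the 0.3/0.7 split.
print(round(combine_scores(10, 0), 2))         # 3.0
# Equal weighting reproduces the simple average used by grade().
print(round(combine_scores(7.2, 3.5, 0.5), 2)) # 5.35
```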

Improving the prompt based on grader feedback

Running v1 gives low scores, mainly because of format issues (the output is wrapped in markdown).

Update the prompt:

def run_prompt_v2(test_case):
    prompt = f"""Generate only the code solution for:
{test_case['task']}

Respond with only the code. No explanations, no markdown wrappers."""
    
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": "```code"}  # prefill: the reply starts inside a code fence
    ]
    
    # Stop at the closing fence so only the code itself is returned
    return chat(messages, stop_sequences=["```"])

Re-run the eval. Scores:

Version   Model   Syntax   Total
v1        7.2     3.5      5.35
v2        7.8     9.1      8.45

The v2 syntax score jumped (3.5 → 9.1) thanks to the prefill. The model score rose slightly (7.2 → 7.8).

Insight: a code grader exposes format issues more clearly than a model grader.
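
The markdown-wrap failure is easy to reproduce. The sketch below rebuilds validate_json so it runs on its own and feeds it the same JSON with and without a code fence. The sample payload is made up for illustration, and the fence is built with "`" * 3 only to avoid writing a literal fence inside the example.

```python
import json

def validate_json(text: str) -> int:
    """Returns 10 if the text parses as JSON, 0 otherwise."""
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 0

FENCE = "`" * 3  # a markdown code fence
bare = '{"source": "aws.ec2"}'
wrapped = f"{FENCE}json\n{bare}\n{FENCE}"

print(validate_json(bare))     # 10
print(validate_json(wrapped))  # 0: the fence makes the output unparseable
```

A model grader reading the wrapped version may still see "correct JSON", which is why the syntax column is the one that moves between v1 and v2.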

More validators (advanced)

Validate JSON schema

import json
import jsonschema

def validate_json_schema(text: str, schema: dict) -> int:
    """Returns 10 if the text is JSON that conforms to schema, 0 otherwise."""
    try:
        data = json.loads(text)
        jsonschema.validate(data, schema)
        return 10
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return 0

Validate Python runs (execute)

Careful: only run this inside a sandbox.

def validate_python_runs(text: str, test_input: str, expected: str) -> int:
    """Execute code, check output. UNSAFE for untrusted code!"""
    # Only use inside a sandbox (Docker, Firejail)
    ...
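
The body above is left elided in the source. As a rough illustration only, one way to run a snippet and compare its output is a child process with a timeout, sketched below. The function name run_python_with_timeout and the timeout_s parameter are hypothetical, and a subprocess is NOT a sandbox: this is meant to run inside an already isolated environment such as the Docker or Firejail setups mentioned above.

```python
import subprocess
import sys

def run_python_with_timeout(code: str, test_input: str, expected: str,
                            timeout_s: float = 5.0) -> int:
    """Run code in a child interpreter and compare stdout to expected.

    WARNING: a subprocess is not a sandbox. Call this only from inside
    an isolated environment (container, VM, Firejail, ...).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it runs too long
        )
    except subprocess.TimeoutExpired:
        return 0
    if proc.returncode != 0:  # uncaught exception or syntax error
        return 0
    return 10 if proc.stdout.strip() == expected.strip() else 0

print(run_python_with_timeout("print(input()[::-1])", "abc\n", "cba"))  # 10
```

The timeout guards against infinite loops, and the return-code check treats any crash as a failure rather than letting it propagate into the eval run.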

Readability score

import textstat

def validate_readability(text: str, target_grade: int) -> int:
    """Returns 10 within one Flesch-Kincaid grade of the target, else 10 minus twice the gap (floored at 0)."""
    grade = textstat.flesch_kincaid_grade(text)
    diff = abs(grade - target_grade)
    if diff < 1:
        return 10
    return max(0, 10 - int(diff * 2))

SQL valid

import sqlparse

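def validate_sql(text: str) -> int:
    # sqlparse is a non-validating parser: this mostly catches empty output,
    # it does not prove that the SQL is valid.
    parsed = sqlparse.parse(text)
    if parsed and parsed[0].tokens:
        return 10
    return 0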

Case study: Full hybrid grader

def comprehensive_grade(test_case: dict, output: str) -> dict:
    scores = {}
    
    # Format validation (binary)
    scores["format"] = grade_syntax(output, test_case)
    
    # Length check
    scores["length"] = validate_length(output, 10, 500)
    
    # Content quality (model)
    model = grade_by_model(test_case, output)
    scores["content"] = model["score"]
    
    # Safety
    scores["safety"] = validate_no_profanity(output)
    
    # Weighted total
    weights = {"format": 0.3, "length": 0.1, "content": 0.5, "safety": 0.1}
    total = sum(scores[k] * weights[k] for k in scores)
    
    return {
        "total_score": total,
        "breakdown": scores,
        "reasoning": model["reasoning"]
    }

Report output:

Case 1:
  Format: 10
  Length: 8
  Content: 7
  Safety: 10
  Total: 8.3

The total is 0.3 × 10 + 0.1 × 8 + 0.5 × 7 + 0.1 × 10 = 8.3. Clear attribution: you can see exactly which component failed.
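
A report in this shape can be produced directly from the breakdown. The helper below is a sketch. The name format_report and its signature are assumptions, and it takes the breakdown and weights as plain dictionaries so it runs without the model grader.

```python
def format_report(case_id: int, breakdown: dict, weights: dict) -> str:
    """Render one case as an indented per-component report with a weighted total."""
    total = sum(breakdown[name] * weights[name] for name in breakdown)
    lines = [f"Case {case_id}:"]
    for name, score in breakdown.items():
        lines.append(f"  {name.capitalize()}: {score}")
    lines.append(f"  Total: {total:.1f}")
    return "\n".join(lines)

weights = {"format": 0.3, "length": 0.1, "content": 0.5, "safety": 0.1}
breakdown = {"format": 10, "length": 8, "content": 7, "safety": 10}
print(format_report(1, breakdown, weights))
```

Printing the components next to the total keeps the attribution visible even when only the aggregate is tracked over time.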

Anti-patterns

❌ Using code to grade something subjective

"Evaluate creativity" with code is impossible.

Fix: use code for what is measurable and a model for what is subjective.

❌ Inverted weights

Format weight 0.1, content weight 0.9. Malformed output still scores high, and downstream systems crash.

Fix: weight by business impact. If format is critical, give it a high weight.

❌ Not handling validator edge-case errors

ast.parse(None) raises a TypeError and crashes the eval run.

Fix: wrap validators in try-except and return 0 on any error.
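
One way to apply this fix uniformly is a decorator that turns any exception into a score of 0. The sketch below is an illustration under that assumption. The name safe_validator is hypothetical, and validate_python is re-declared here so the block runs on its own.

```python
import ast
import functools
from typing import Callable

def safe_validator(func: Callable[..., int]) -> Callable[..., int]:
    """Wrap a validator so that any exception becomes a score of 0."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> int:
        try:
            return func(*args, **kwargs)
        except Exception:
            # Bad input (None, wrong type, malformed text) counts as a failure
            # instead of crashing the whole eval run.
            return 0
    return wrapper

@safe_validator
def validate_python(text: str) -> int:
    ast.parse(text.strip())
    return 10

print(validate_python("x = 1"))  # 10
print(validate_python("x = "))   # 0 (SyntaxError)
print(validate_python(None))     # 0 (AttributeError on None.strip())
```

Catching every exception also hides genuine bugs in a validator, so logging the exception inside the wrapper is worth considering.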

❌ Validator too strict

Applying the regex ^[a-z]+$ to the output rejects anything containing uppercase letters, even when it is valid.

Fix: design the validator from the real spec and test it against known-good examples.
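
Testing a validator against known-good and known-bad samples can be as small as the sketch below. The spec (an identifier made of ASCII letters, digits, and underscores), the sample lists, and the names validate_identifier and check_validator are all made up for illustration.

```python
import re
from typing import Callable, List

def validate_identifier(text: str) -> int:
    """Hypothetical spec: letters, digits, underscores; no leading digit."""
    return 10 if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", text.strip()) else 0

def too_strict(text: str) -> int:
    """The over-strict validator from the anti-pattern above."""
    return 10 if re.fullmatch(r"[a-z]+", text) else 0

KNOWN_GOOD = ["bucket_name", "BucketName", "_tmp1"]
KNOWN_BAD = ["1abc", "has space", ""]

def check_validator(validator: Callable[[str], int],
                    good: List[str], bad: List[str]) -> List[str]:
    """Return the samples the validator misjudges (an empty list means it passes)."""
    failures = [s for s in good if validator(s) != 10]
    failures += [s for s in bad if validator(s) != 0]
    return failures

print(check_validator(validate_identifier, KNOWN_GOOD, KNOWN_BAD))  # []
print(check_validator(too_strict, KNOWN_GOOD, KNOWN_BAD))
# ['bucket_name', 'BucketName', '_tmp1']
```

Running this check whenever a validator changes catches over-strict rules before they distort eval scores.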

Apply it now

Exercise 1: Write 3 code validators (20 minutes)

For your app, write:

def validate_format(output: str) -> int:
    # check the output format (JSON/XML/plain/...)
    pass

def validate_length(output: str) -> int:
    # check that the length is within the expected range
    pass

def validate_content(output: str) -> int:
    # check that required keywords are present and blocked words are absent
    pass

Exercise 2: Run the hybrid grader (20 minutes)

Combine the code and model graders. Run the full eval. Compare the scores against model-only grading.

Expect: the code grader catches format issues that the model grader misses.

Summary

🎯 Use a code grader for what is measurable: JSON, Python syntax, length, keywords.

🎯 Use a model grader for what is subjective: quality, tone, completeness.

🎯 Hybrid scoring (a weighted combination) is the production pattern.

🎯 Format should carry a high weight: malformed output means a downstream crash.

🎯 Give the dataset a format field so each output is routed to the right validator.
