For a "generate JSON" task, a model grader can:
- Return 7/10 even when the JSON is malformed (Claude does not re-parse it)
- Cost 500 tokens per grade
- Score inconsistently across runs
Objectives:
- Write validators for JSON, Python, regex, length, and keywords
- Combine code and model graders into hybrid scoring
- Update the dataset format with a format field so the right validator is routed
- Know when a code grader is more effective than a model grader
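To make the first problem concrete, here is a minimal sketch of a deterministic check. The helper name `is_valid_json` and the sample strings are made up for illustration; the point is only that a parser gives the same verdict on every run and costs no tokens.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True only if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical model outputs, made up for illustration.
good = '{"source": ["aws.ec2"]}'
trailing_comma = '{"source": ["aws.ec2"],}'
markdown_wrapped = '```json\n{"source": ["aws.ec2"]}\n```'

for sample in (good, trailing_comma, markdown_wrapped):
    print(is_valid_json(sample))
```

Only the first sample parses. The other two are the kind of output a model grader may still rate generously because the content looks right.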
3 basic validators
Python syntax
import ast

def validate_python(text: str) -> int:
    """Returns 10 if valid Python, 0 if syntax error."""
    try:
        ast.parse(text.strip())
        return 10
    except (SyntaxError, ValueError):
        # ValueError covers source containing null bytes on older Python versions
        return 0

ast.parse does not execute the code (safe); it only checks syntax.
JSON
import json

def validate_json(text: str) -> int:
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 0

Regex
import re

def validate_regex(text: str) -> int:
    try:
        re.compile(text.strip())
        return 10
    except re.error:
        return 0

Length
def validate_length(text: str, min_words: int, max_words: int) -> int:
    words = len(text.split())
    if min_words <= words <= max_words:
        return 10
    # Too short: partial credit proportional to the length reached
    if words < min_words:
        return max(0, int(10 * words / min_words))
    # Over the limit: lose 1 point per full 10 words over
    return max(0, 10 - (words - max_words) // 10)

Keyword presence
def validate_contains(text: str, required_keywords: list) -> int:
    if not required_keywords:
        return 10  # nothing required, avoid dividing by zero
    text_lower = text.lower()
    found = sum(1 for kw in required_keywords if kw.lower() in text_lower)
    return int(10 * found / len(required_keywords))

Keyword absence (safety)
def validate_no_profanity(text: str) -> int:
    blocked = ["word1", "word2"]  # expand
    text_lower = text.lower()
    if any(w in text_lower for w in blocked):
        return 0
    return 10

Route validator based on format
The dataset format now has a format field:
[
  {"task": "Python function to validate IAM username", "format": "python"},
  {"task": "JSON for EventBridge rule on EC2 stop", "format": "json"},
  {"task": "Regex to extract bucket name from ARN", "format": "regex"}
]
Update the dataset generator:
# In the generate_dataset prompt:
prompt = f"""...
Each object must have:
- task: string description
- format: "python", "json", or "regex"
..."""
Route:
def grade_syntax(output: str, test_case: dict) -> int:
    """Route to correct validator based on expected format."""
    format_type = test_case.get("format", "text")
    if format_type == "python":
        return validate_python(output)
    elif format_type == "json":
        return validate_json(output)
    elif format_type == "regex":
        return validate_regex(output)
    else:
        return 10  # No validation for plain text

Combine code + model grading
Hybrid score:
def grade(test_case: dict, output: str) -> dict:
    # Model grader for content quality
    model_result = grade_by_model(test_case, output)
    # Code grader for format/syntax
    syntax_score = grade_syntax(output, test_case)
    # Combine (equal weight)
    final_score = (model_result["score"] + syntax_score) / 2
    return {
        "final_score": final_score,
        "model_score": model_result["score"],
        "syntax_score": syntax_score,
        "reasoning": model_result["reasoning"]
    }

Weighted combine
If syntax is critical (the product is code), weight it more heavily:
final_score = 0.3 * model_result["score"] + 0.7 * syntax_score
With these weights, a format failure (syntax_score = 0) caps the final score at 3, regardless of content.
Improved prompt based on grader feedback
Running v1 gave low scores, mainly because of format issues (output wrapped in markdown).
Update the prompt:
def run_prompt_v2(test_case):
    prompt = f"""Generate only the code solution for:
{test_case['task']}
Respond with only the code. No explanations, no markdown wrappers."""
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": "```code"},  # prefill
    ]
    return chat(messages, stop_sequences=["```"])

Re-run the eval. Scores:
| Version | Model | Syntax | Total |
|---|---|---|---|
| v1 | 7.2 | 3.5 | 5.35 |
| v2 | 7.8 | 9.1 | 8.45 |
v2 syntax jumped (3.5 → 9.1) thanks to the prefill. The model score rose slightly.
Insight: a code grader exposes format issues more clearly than a model grader.
More validators (advanced)
Validate JSON schema
import json
import jsonschema

def validate_json_schema(text: str, schema: dict) -> int:
    try:
        data = json.loads(text)
        jsonschema.validate(data, schema)
        return 10
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return 0

Validate Python runs (execute)
Careful: only run this in a sandbox.
def validate_python_runs(text: str, test_input: str, expected: str) -> int:
    """Execute code, check output. UNSAFE for untrusted code!"""
    # Only use inside a sandbox (Docker, Firejail)
    ...

Readability score
import textstat

def validate_readability(text: str, target_grade: int) -> int:
    grade = textstat.flesch_kincaid_grade(text)
    diff = abs(grade - target_grade)
    if diff < 1:
        return 10
    return max(0, 10 - int(diff * 2))

Valid SQL
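import sqlparse

def validate_sql(text: str) -> int:
    # Note: sqlparse is a non-validating parser, so this only checks that
    # the text splits into at least one non-empty statement.
    parsed = sqlparse.parse(text)
    if parsed and parsed[0].tokens:
        return 10
    return 0

Case study: Full hybrid grader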
def comprehensive_grade(test_case: dict, output: str) -> dict:
    scores = {}
    # Format validation (binary)
    scores["format"] = grade_syntax(output, test_case)
    # Length check
    scores["length"] = validate_length(output, 10, 500)
    # Content quality (model)
    model = grade_by_model(test_case, output)
    scores["content"] = model["score"]
    # Safety
    scores["safety"] = validate_no_profanity(output)
    # Weighted total
    weights = {"format": 0.3, "length": 0.1, "content": 0.5, "safety": 0.1}
    total = sum(scores[k] * weights[k] for k in scores)
    return {
        "total_score": total,
        "breakdown": scores,
        "reasoning": model["reasoning"]
    }

Report output:
Case 1:
Format: 10
Length: 8
Content: 7
Safety: 10
Total: 8.3
Clear attribution: you can see exactly which component failed.
Anti-patterns
❌ Using code to grade something subjective
"Evaluate creativity" with code → impossible.
Fix: code for the measurable, model for the subjective.
❌ Inverted weights
Format weight 0.1, content 0.9. Malformed output still scores high → downstream crash.
Fix: weight by business impact. If format is critical, give it a high weight.
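A quick arithmetic sketch of why the weights matter. The helper `weighted_score` and the scores are made up for illustration: a well-written answer (content 9) that fails to parse (format 0), scored once with the inverted weights and once with the 0.7 format weight from the weighted combine.

```python
def weighted_score(format_score: float, content_score: float, format_weight: float) -> float:
    """Weighted total on a 0-10 scale; content gets the remaining weight."""
    return format_weight * format_score + (1 - format_weight) * content_score


# Hypothetical case: good content (9) but output that does not parse (0).
inverted = weighted_score(format_score=0, content_score=9, format_weight=0.1)
format_heavy = weighted_score(format_score=0, content_score=9, format_weight=0.7)

print(round(inverted, 2))      # 8.1
print(round(format_heavy, 2))  # 2.7
```

With the inverted weights the unusable output looks like a strong result; with the format-heavy weights it is clearly flagged as a failure.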
❌ Not handling validator edge-case errors
ast.parse(None) → crash.
Fix: wrap in try-except and return 0 on any error.
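One way to apply this fix to every validator at once is a small decorator. This is a sketch, not part of the original material: the name `safe_validator` is hypothetical, and `validate_python` is redefined here in a deliberately unguarded form so the wrapper has something to catch.

```python
import ast
import functools
from typing import Callable


def safe_validator(func: Callable[..., int]) -> Callable[..., int]:
    """Wrap a validator so that any unexpected exception scores 0."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> int:
        try:
            return func(*args, **kwargs)
        except Exception:
            # Deliberately broad: a crashing validator counts as a failed check.
            return 0
    return wrapper


@safe_validator
def validate_python(text: str) -> int:
    # No try-except here on purpose; the decorator handles every error.
    ast.parse(text.strip())
    return 10


print(validate_python("x = 1"))  # 10
print(validate_python("x = "))   # 0 (SyntaxError)
print(validate_python(None))     # 0 (AttributeError on None.strip())
```

The trade-off is that a bug inside a validator also turns into a silent 0, so logging the exception inside `wrapper` may be worth adding.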
❌ Validator too strict
Regex ^[a-z]+$ applied to the output → rejects output containing uppercase even though it is valid.
Fix: design validators from the real spec and test them with known-good examples.
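Testing against known-good examples can be as simple as the sketch below. The spec (letters only, any case), the example lists, and the helper names `rejected_good` and `accepted_bad` are all assumptions made for illustration.

```python
import re

# Hypothetical spec: letters only, upper or lower case.
TOO_STRICT = re.compile(r"^[a-z]+$")
MATCHES_SPEC = re.compile(r"^[A-Za-z]+$")

KNOWN_GOOD = ["bucket", "MyBucket", "DATA"]
KNOWN_BAD = ["", "my bucket", "bucket1"]


def rejected_good(pattern: re.Pattern, examples: list) -> list:
    """Known-good examples the pattern wrongly rejects."""
    return [ex for ex in examples if not pattern.fullmatch(ex)]


def accepted_bad(pattern: re.Pattern, examples: list) -> list:
    """Known-bad examples the pattern wrongly accepts."""
    return [ex for ex in examples if pattern.fullmatch(ex)]


print(rejected_good(TOO_STRICT, KNOWN_GOOD))    # ['MyBucket', 'DATA']
print(rejected_good(MATCHES_SPEC, KNOWN_GOOD))  # []
print(accepted_bad(MATCHES_SPEC, KNOWN_BAD))    # []
```

Running both lists before trusting a validator shows whether it is too strict (rejects good examples) or too loose (accepts bad ones).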
Apply it now
Exercise 1: Write 3 code validators (20 minutes)
For your app, write:
def validate_format(output: str) -> int:
    # check the output format (JSON/XML/plain/...)
    pass

def validate_length(output: str) -> int:
    # check the length is within the expected range
    pass

def validate_content(output: str) -> int:
    # check required keywords are present and blocked words are absent
    pass

Exercise 2: Run the hybrid grader (20 minutes)
Combine the code and model graders. Run the full eval. Compare the scores against model-only grading.
Expect: the code grader catches format issues that the model grader misses.
Summary
🎯 Code grader for the measurable: JSON, Python syntax, length, keywords.
🎯 Model grader for the subjective: quality, tone, completeness.
🎯 Hybrid (weighted combine) is the production pattern.
🎯 Format should carry a high weight: malformed output = downstream crash.
🎯 The dataset has a format field so the right validator is routed.