evaluation-methodology

评估AI模型输出的方法 - 完全匹配、语义相似性、大语言模型作为评判者、比较评估、ELO排名。在衡量AI质量、构建评估流水线或比较模型时使用。

Evaluation Methodology

Methods for evaluating Foundation Model outputs.

Evaluation Approaches

1. Exact Evaluation

| Method | Use Case | Example | |--------|----------|---------| | Exact Match | QA, Math | "5" == "5" | | Functional Correctness | Code | Pass test cases | | BLEU/ROUGE | Translation | N-gram overlap | | Semantic Similarity | Open-ended | Embedding cosine |

# Semantic Similarity
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')
emb1 = model.encode([generated])
emb2 = model.encode([reference])
similarity = cosine_similarity(emb1, emb2)[0][0]

2. AI as Judge

JUDGE_PROMPT = """Rate the response on a scale of 1-5.

Criteria:
- Accuracy: Is information correct?
- Helpfulness: Does it address the need?
- Clarity: Is it easy to understand?

Query: {query}
Response: {response}

Return JSON: {"score": N, "reasoning": "..."}"""

# Multi-judge for reliability
judges = ["gpt-4", "claude-3"]
scores = [get_score(judge, response) for judge in judges]
final_score = sum(scores) / len(scores)

3. Comparative Evaluation (ELO)

COMPARE_PROMPT = """Compare these responses.

Query: {query}
A: {response_a}
B: {response_b}

Which is better? Return: A, B, or tie"""

def update_elo(rating_a, rating_b, winner, k=32):
    expected_a = 1 / (1 + 10**((rating_b - rating_a) / 400))
    score_a = 1 if winner == "A" else 0 if winner == "B" else 0.5
    return rating_a + k * (score_a - expected_a)

Evaluation Pipeline

1. Define Criteria (accuracy, helpfulness, safety)
   ↓
2. Create Scoring Rubric with Examples
   ↓
3. Select Methods (exact + AI judge + human)
   ↓
4. Create Evaluation Dataset
   ↓
5. Run Evaluation
   ↓
6. Analyze & Iterate

Best Practices

Use multiple evaluation methods
Calibrate AI judges with human data
Include both automatic and human evaluation
Version your evaluation datasets
Track metrics over time
Test for position bias in comparisons