Category: Development & Engineering · Requires API Key

LLM Evaluator Pro

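An LLM-as-a-judge evaluator built on Langfuse. It uses GPT-5-nano as the judge to score traces on relevance, accuracy, hallucination, and helpfulness. Supports single-trace...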

Author: aiwithabidi · ClawHub

LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.
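The LLM-as-a-Judge pattern comes down to three steps per criterion: render a grading prompt, send it to the judge model, and parse a numeric score from the reply. Below is a minimal sketch of that loop, assuming the judge answers with JSON such as `{"score": 0.8, "reasoning": "..."}`. The rubric wording, the reply format, and the `judge` callable are hypothetical illustrations, not the internals of the skill's `evaluator.py`; the callable stands in for the GPT-5-nano call so the sketch runs offline.

```python
import json
from typing import Callable, Dict, Iterable

# Hypothetical rubric text per criterion; the skill's real prompts may differ.
# The direction of the hallucination scale (high = more or less made-up content)
# is an assumption here, since the README does not specify it.
RUBRICS: Dict[str, str] = {
    "relevance": "How relevant is the response to the query?",
    "accuracy": "How factually correct is the response?",
    "hallucination": "To what extent does the response contain made-up information?",
    "helpfulness": "How useful is the response overall?",
}


def build_prompt(criterion: str, query: str, response: str) -> str:
    """Render a grading prompt that asks the judge for a 0-1 score as JSON."""
    return (
        f"You are an impartial evaluator. {RUBRICS[criterion]}\n"
        f"Query: {query}\nResponse: {response}\n"
        'Reply only with JSON: {"score": <number from 0 to 1>, "reasoning": "<short>"}'
    )


def parse_score(reply: str) -> float:
    """Extract the score from the judge's JSON reply and clamp it to [0, 1]."""
    score = float(json.loads(reply)["score"])
    return min(1.0, max(0.0, score))


def score_trace(
    query: str,
    response: str,
    judge: Callable[[str], str],
    evaluators: Iterable[str] = RUBRICS,
) -> Dict[str, float]:
    """Ask the judge once per selected criterion and collect the scores."""
    return {
        name: parse_score(judge(build_prompt(name, query, response)))
        for name in evaluators
    }
```

Passing `["relevance"]` as `evaluators` mirrors what the `--evaluators relevance` flag is described as doing. Injecting the judge as a plain function keeps prompt rendering and score parsing testable without network access or an API key.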

When to Use

  • Evaluating quality of search results or AI responses
  • Scoring traces for relevance and accuracy, and detecting hallucinations
  • Batch scoring recent unscored traces
  • Quality assurance on agent outputs

Usage

```bash
# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with a single evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20
```
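The `backfill` command targets recent traces that carry no scores yet. One plausible selection rule is sketched below, assuming each trace is available as a record holding an id, a timestamp, and the names of scores already attached. `TraceRecord` and `select_for_backfill` are hypothetical names; the real script fetches traces from Langfuse, and its exact selection logic is not documented here.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Set


@dataclass
class TraceRecord:
    """Hypothetical, simplified view of a Langfuse trace."""
    trace_id: str
    timestamp: datetime
    score_names: Set[str] = field(default_factory=set)


def select_for_backfill(traces: List[TraceRecord], limit: int) -> List[str]:
    """Return ids of the most recent traces that have no scores attached."""
    unscored = [t for t in traces if not t.score_names]
    unscored.sort(key=lambda t: t.timestamp, reverse=True)  # newest first
    return [t.trace_id for t in unscored[:limit]]
```

With `limit=20` this corresponds to the `--limit 20` example: at most twenty of the newest unscored traces are handed to the judge, which bounds the number of judge calls per run.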

Evaluators

| Evaluator | Measures | Scale |
|-----------|----------|-------|
| relevance | Response relevance to the query | 0–1 |
| accuracy | Factual correctness | 0–1 |
| hallucination | Detection of made-up information | 0–1 |
| helpfulness | Overall usefulness | 0–1 |
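Because every evaluator reports on the same 0–1 scale, per-evaluator averages over a batch of traces are directly comparable. A small sketch, assuming scores have already been collected as one `{evaluator: value}` dict per trace (a hypothetical shape, not the skill's storage format):

```python
from statistics import mean
from typing import Dict, List


def summarize(batch: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each evaluator's scores across traces, rejecting out-of-range values."""
    collected: Dict[str, List[float]] = {}
    for scores in batch:
        for name, value in scores.items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} score {value} is outside the 0-1 scale")
            collected.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in collected.items()}
```

Traces scored with only a subset of evaluators simply contribute nothing to the other averages, so mixed runs (for example, some traces scored with `--evaluators relevance` only) still summarize cleanly.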

Credits

Built by M. Abidi | agxntsix.ai | YouTube | GitHub
Part of the AgxntSix Skill Suite for OpenClaw agents.

📅 Need help setting up OpenClaw for your business? Book a free consultation