{
  "schemaVersion": "1.0",
  "item": {
    "slug": "reddi-llm-judge",
    "name": "Llm As Judge",
    "source": "tencent",
    "type": "skill",
    "category": "AI 智能",
    "sourceUrl": "https://clawhub.ai/nissan/reddi-llm-judge",
    "canonicalUrl": "https://clawhub.ai/nissan/reddi-llm-judge",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadMode": "redirect",
    "downloadUrl": "/downloads/reddi-llm-judge",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=reddi-llm-judge",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "installMethod": "Manual import",
    "extraction": "Extract archive",
    "prerequisites": [
      "OpenClaw"
    ],
    "packageFormat": "ZIP package",
    "includedAssets": [
      "SKILL.md"
    ],
    "primaryDoc": "SKILL.md",
    "quickSetup": [
      "Download the package from Yavira.",
      "Extract the archive and review SKILL.md first.",
      "Import or place the package into your OpenClaw setup."
    ],
    "agentAssist": {
      "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
      "steps": [
        "Download the package from Yavira.",
        "Extract it into a folder your agent can access.",
        "Paste one of the prompts below and point your agent at the extracted folder."
      ],
      "prompts": [
        {
          "label": "New install",
          "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
        },
        {
          "label": "Upgrade existing",
          "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
        }
      ]
    },
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-04-30T16:55:25.780Z",
      "expiresAt": "2026-05-07T16:55:25.780Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
        "contentDisposition": "attachment; filename=\"network-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/reddi-llm-judge"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    },
    "downloadPageUrl": "https://openagent3.xyz/downloads/reddi-llm-judge",
    "agentPageUrl": "https://openagent3.xyz/skills/reddi-llm-judge/agent",
    "manifestUrl": "https://openagent3.xyz/skills/reddi-llm-judge/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/reddi-llm-judge/agent.md"
  },
  "agentAssist": {
    "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
    "steps": [
      "Download the package from Yavira.",
      "Extract it into a folder your agent can access.",
      "Paste one of the prompts below and point your agent at the extracted folder."
    ],
    "prompts": [
      {
        "label": "New install",
        "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
      },
      {
        "label": "Upgrade existing",
        "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
      }
    ]
  },
  "documentation": {
    "source": "clawhub",
    "primaryDoc": "SKILL.md",
    "sections": [
      {
        "title": "LLM-as-Judge",
        "body": "Build a cost-efficient LLM evaluation ensemble for comparing and scoring generative AI outputs at scale."
      },
      {
        "title": "When to Use",
        "body": "Evaluating generative AI outputs across multiple models at scale (100+ runs)\nComparing local/OSS models against cloud baselines in shadow-testing pipelines\nBuilding promotion gates where models must prove quality before serving production traffic\nAny scenario where deterministic tests alone can't capture output quality"
      },
      {
        "title": "When NOT to Use",
        "body": "One-off evaluations (just read the output yourself)\nTasks with deterministic correct answers (use exact-match or unit tests)\nWhen you can't afford any external API calls (this pattern uses Claude/GPT as judges)"
      },
      {
        "title": "Layer 1: Deterministic Validators (Free, Instant)",
        "body": "Run on 100% of outputs. Zero cost. Catches obvious failures before burning judge tokens.\n\nJSON schema validation — does the output parse? Does it match the expected schema?\nRegex checks — required fields present, format constraints met\nLength bounds — output within acceptable min/max character count\nEntity presence — do required entities from the input appear in the output?\n\nIf Layer 1 fails, score is 0.0 — no need to invoke expensive judges."
      },
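      {
        "title": "Layer 1 Sketch",
        "body": "A minimal sketch of a Layer 1 pass, written here for illustration only: plain Python with no dependencies, and names (layer1_validate, expect_json, required_entities) that are assumptions rather than the package's actual API.\n\nimport json\n\ndef layer1_validate(output: str, expect_json: bool = False,\n                    min_len: int = 50, max_len: int = 500,\n                    required_entities: tuple = ()) -> float:\n    # JSON check: for structured tasks the output must at least parse.\n    if expect_json:\n        try:\n            json.loads(output)\n        except ValueError:\n            return 0.0\n    # Length bounds: reject outputs outside the acceptable size window.\n    if not (min_len <= len(output) <= max_len):\n        return 0.0\n    # Entity presence: required entities from the input must appear somewhere.\n    for entity in required_entities:\n        if entity.lower() not in output.lower():\n            return 0.0\n    return 1.0  # passed; a 0.0 here short-circuits Layers 2 and 3\n\nAnything that fails here is scored 0.0 without ever reaching the paid judges."
      },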
      {
        "title": "Layer 2: Heuristic Drift Detection (Cheap, Fast)",
        "body": "Run on 100% of outputs that pass Layer 1. Minimal cost (local computation only).\n\nEntity overlap — what fraction of entities in the ground truth appear in the candidate?\nNumerical consistency — do numbers in the output match source data?\nNovel fact detection — does the output introduce facts not present in the input/context? Novel facts suggest hallucination.\nStructural similarity — does the output follow the same structural pattern as ground truth?\n\nLayer 2 produces heuristic scores (0.0–1.0) that contribute to the final weighted score."
      },
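      {
        "title": "Layer 2 Sketch: Entity Overlap and Novel Facts",
        "body": "A minimal sketch of the entity-overlap and novel-fact heuristics, assuming a naive capitalised-token extractor instead of real NER; extract_entities, entity_overlap and novel_fact_ratio are illustrative names, not the HeuristicScorer API.\n\ndef extract_entities(text: str) -> set:\n    # Naive stand-in for NER: keep capitalised words and numeric tokens.\n    tokens = (t.strip(\".,;:()\") for t in text.split())\n    return {t for t in tokens if t[:1].isupper() or t.replace(\".\", \"\", 1).isdigit()}\n\ndef entity_overlap(ground_truth: str, candidate: str) -> float:\n    # Fraction of ground-truth entities that survive into the candidate (0.0-1.0).\n    truth = extract_entities(ground_truth)\n    if not truth:\n        return 1.0\n    return len(truth & extract_entities(candidate)) / len(truth)\n\ndef novel_fact_ratio(source: str, candidate: str) -> float:\n    # Fraction of candidate entities absent from the source; high values hint at hallucination.\n    cand = extract_entities(candidate)\n    if not cand:\n        return 0.0\n    return len(cand - extract_entities(source)) / len(cand)\n\nBoth scores are local and cheap, so they can run on every output that clears Layer 1."
      },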
      {
        "title": "Layer 3: LLM Judges (Expensive, High Quality)",
        "body": "Sampled at 15% of runs to control cost. Forced to 100% during promotion gates.\n\nTwo independent judges (e.g., Claude + GPT-4o) score the output. Each judge evaluates all 6 dimensions independently.\n\nTiebreaker pattern: When primary judges disagree by Δ ≥ 0.20 on any dimension, a third judge is invoked. The tiebreaker score replaces the outlier. This reduced score variance by 34% at only 8% additional cost."
      },
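      {
        "title": "Layer 3 Sketch: Tiebreaker Rule",
        "body": "A minimal sketch of the tiebreaker pattern described above, assuming per-dimension score dicts and a callable third judge; resolve and its signature are illustrative, and the JudgeEnsemble shipped with the skill may implement this differently.\n\ndef resolve(judge_a: dict, judge_b: dict, tiebreak, threshold: float = 0.20) -> dict:\n    final = {}\n    for dim in judge_a:\n        a, b = judge_a[dim], judge_b[dim]\n        if a is None or b is None:\n            # Unsampled dimension stays None, never 0.0 (see Critical Lesson below).\n            final[dim] = a if a is not None else b\n        elif abs(a - b) >= threshold:\n            # Disagreement of 0.20 or more: a third judge scores this dimension\n            # and its score replaces whichever primary judge is the outlier.\n            t = tiebreak(dim)\n            final[dim] = (a + t) / 2 if abs(a - t) < abs(b - t) else (b + t) / 2\n        else:\n            final[dim] = (a + b) / 2  # agreement: simple average of both judges\n    return final"
      },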
      {
        "title": "The 6 Scoring Dimensions",
        "body": "DimensionWeightWhat It MeasuresStructural accuracy0.20Format compliance, schema adherenceSemantic similarity0.25Meaning preservation vs ground truthFactual accuracy0.25Correctness of facts, numbers, entitiesTask completion0.15Does it actually answer the question?Tool use correctness0.05Valid tool calls (when applicable)Latency0.10Response time within acceptable bounds\n\nWeights are configurable per task type. Tool use weight is redistributed when not applicable."
      },
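      {
        "title": "Sketch: Weight Redistribution",
        "body": "A minimal sketch of dropping the tool-use dimension and renormalising, assuming proportional redistribution; the actual per-task weight configuration ships with the package and may differ.\n\nWEIGHTS = {\n    \"structural\": 0.20, \"semantic\": 0.25, \"factual\": 0.25,\n    \"completion\": 0.15, \"tool_use\": 0.05, \"latency\": 0.10,\n}\n\ndef effective_weights(applicable: set) -> dict:\n    # Keep only applicable dimensions and rescale so the weights still sum to 1.0.\n    kept = {d: w for d, w in WEIGHTS.items() if d in applicable}\n    total = sum(kept.values())\n    return {d: w / total for d, w in kept.items()}\n\n# Summarisation task with no tool calls: the 0.05 tool-use weight is spread across the rest.\nprint(effective_weights({\"structural\", \"semantic\", \"factual\", \"completion\", \"latency\"}))"
      },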
      {
        "title": "Critical Lesson: None ≠ 0.0",
        "body": "When a dimension is not sampled (LLM judge not invoked on this run), record the score as null, not 0.0. Unsampled dimensions must be excluded from the weighted average, not treated as failures.\n\nEarly bug: recording unsampled dimensions as 0.0 created a systematic 0.03–0.08 downward bias across all models. The fix: null means \"not measured\", which is fundamentally different from \"scored zero\".\n\n# WRONG — penalises unsampled dimensions\nweighted = sum(s * w for s, w in zip(scores, weights)) / sum(weights)\n\n# RIGHT — exclude null dimensions\npairs = [(s, w) for s, w in zip(scores, weights) if s is not None]\nweighted = sum(s * w for s, w in pairs) / sum(w for _, w in pairs)"
      },
      {
        "title": "Cost Estimate",
        "body": "With 15% LLM sampling, average cost per evaluated run: ~$0.003\n\nLayer 1 + Layer 2: $0.00 (local computation)\nLayer 3 (15% of runs): ~$0.02 per judged run × 0.15 = ~$0.003\nTiebreaker (fires ~12% of judged runs): adds ~$0.0003\n\nAt 200 runs for promotion: total judge cost ≈ $0.60 per model per task type."
      },
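      {
        "title": "Cost Arithmetic",
        "body": "The same estimate as a tiny script, using only the rough figures quoted above (the constants are those quoted approximations, not measured values):\n\nSAMPLE_RATE = 0.15\nCOST_PER_JUDGED_RUN = 0.02      # two primary judges\nTIEBREAK_ADD_PER_RUN = 0.0003   # tiebreaker cost amortised over all evaluated runs\n\nper_run = SAMPLE_RATE * COST_PER_JUDGED_RUN + TIEBREAK_ADD_PER_RUN\nprint(round(per_run, 4))                                   # ~0.0033, the ~$0.003 headline figure\nprint(round(200 * SAMPLE_RATE * COST_PER_JUDGED_RUN, 2))   # 0.6 -> the ~$0.60 promotion-batch figure"
      },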
      {
        "title": "Worked Example: Summarisation Evaluation",
        "body": "from evaluation import JudgeEnsemble, DeterministicValidator, HeuristicScorer\n\n# Layer 1: must be valid text, 50-500 chars\nvalidator = DeterministicValidator(\n    min_length=50,\n    max_length=500,\n    required_format=\"text\",\n)\n\n# Layer 2: check entity overlap with source\nheuristic = HeuristicScorer(\n    check_entity_overlap=True,\n    check_novel_facts=True,\n    check_numerical_consistency=True,\n)\n\n# Layer 3: LLM judges (sampled)\nensemble = JudgeEnsemble(\n    judges=[\"claude-sonnet-4-20250514\", \"gpt-4o\"],\n    tiebreaker=\"claude-sonnet-4-20250514\",\n    sample_rate=0.15,\n    tiebreaker_threshold=0.20,\n    dimensions=[\"structural\", \"semantic\", \"factual\", \"completion\", \"latency\"],\n)\n\n# Evaluate\nresult = ensemble.evaluate(\n    task_type=\"summarize\",\n    ground_truth=gt_response,\n    candidate=candidate_response,\n    source_text=original_text,\n    validator=validator,\n    heuristic=heuristic,\n)\n\nprint(f\"Weighted score: {result.weighted_score:.3f}\")\nprint(f\"Dimensions: {result.scores}\")  # {semantic: 0.95, factual: 0.88, ...}\n# None values for unsampled dimensions"
      },
      {
        "title": "Tips",
        "body": "Start with Layer 1 — you'd be surprised how many outputs fail basic validation\nLog everything — store raw judge responses for debugging score disputes\nCalibrate on 50 runs — before trusting the ensemble, manually review 50 outputs against judge scores\nWatch for judge drift — LLM judges can be inconsistent across API versions; pin model versions\nForce judges at gates — 15% sampling is fine for monitoring, but promotion decisions need 100% coverage on the final batch"
      }
    ],
    "body": "LLM-as-Judge\n\nBuild a cost-efficient LLM evaluation ensemble for comparing and scoring generative AI outputs at scale.\n\nWhen to Use\nEvaluating generative AI outputs across multiple models at scale (100+ runs)\nComparing local/OSS models against cloud baselines in shadow-testing pipelines\nBuilding promotion gates where models must prove quality before serving production traffic\nAny scenario where deterministic tests alone can't capture output quality\nWhen NOT to Use\nOne-off evaluations (just read the output yourself)\nTasks with deterministic correct answers (use exact-match or unit tests)\nWhen you can't afford any external API calls (this pattern uses Claude/GPT as judges)\nArchitecture: Three-Layer Evaluation\nLayer 1: Deterministic Validators (Free, Instant)\n\nRun on 100% of outputs. Zero cost. Catches obvious failures before burning judge tokens.\n\nJSON schema validation — does the output parse? Does it match the expected schema?\nRegex checks — required fields present, format constraints met\nLength bounds — output within acceptable min/max character count\nEntity presence — do required entities from the input appear in the output?\n\nIf Layer 1 fails, score is 0.0 — no need to invoke expensive judges.\n\nLayer 2: Heuristic Drift Detection (Cheap, Fast)\n\nRun on 100% of outputs that pass Layer 1. Minimal cost (local computation only).\n\nEntity overlap — what fraction of entities in the ground truth appear in the candidate?\nNumerical consistency — do numbers in the output match source data?\nNovel fact detection — does the output introduce facts not present in the input/context? Novel facts suggest hallucination.\nStructural similarity — does the output follow the same structural pattern as ground truth?\n\nLayer 2 produces heuristic scores (0.0–1.0) that contribute to the final weighted score.\n\nLayer 3: LLM Judges (Expensive, High Quality)\n\nSampled at 15% of runs to control cost. Forced to 100% during promotion gates.\n\nTwo independent judges (e.g., Claude + GPT-4o) score the output. Each judge evaluates all 6 dimensions independently.\n\nTiebreaker pattern: When primary judges disagree by Δ ≥ 0.20 on any dimension, a third judge is invoked. The tiebreaker score replaces the outlier. This reduced score variance by 34% at only 8% additional cost.\n\nThe 6 Scoring Dimensions\nDimension\tWeight\tWhat It Measures\nStructural accuracy\t0.20\tFormat compliance, schema adherence\nSemantic similarity\t0.25\tMeaning preservation vs ground truth\nFactual accuracy\t0.25\tCorrectness of facts, numbers, entities\nTask completion\t0.15\tDoes it actually answer the question?\nTool use correctness\t0.05\tValid tool calls (when applicable)\nLatency\t0.10\tResponse time within acceptable bounds\n\nWeights are configurable per task type. Tool use weight is redistributed when not applicable.\n\nCritical Lesson: None ≠ 0.0\n\nWhen a dimension is not sampled (LLM judge not invoked on this run), record the score as null, not 0.0. Unsampled dimensions must be excluded from the weighted average, not treated as failures.\n\nEarly bug: recording unsampled dimensions as 0.0 created a systematic 0.03–0.08 downward bias across all models. 
The fix: null means \"not measured\", which is fundamentally different from \"scored zero\".\n\n# WRONG — penalises unsampled dimensions\nweighted = sum(s * w for s, w in zip(scores, weights)) / sum(weights)\n\n# RIGHT — exclude null dimensions\npairs = [(s, w) for s, w in zip(scores, weights) if s is not None]\nweighted = sum(s * w for s, w in pairs) / sum(w for _, w in pairs)\n\nCost Estimate\n\nWith 15% LLM sampling, average cost per evaluated run: ~$0.003\n\nLayer 1 + Layer 2: $0.00 (local computation)\nLayer 3 (15% of runs): ~$0.02 per judged run × 0.15 = ~$0.003\nTiebreaker (fires ~12% of judged runs): adds ~$0.0003\n\nAt 200 runs for promotion: total judge cost ≈ $0.60 per model per task type.\n\nWorked Example: Summarisation Evaluation\nfrom evaluation import JudgeEnsemble, DeterministicValidator, HeuristicScorer\n\n# Layer 1: must be valid text, 50-500 chars\nvalidator = DeterministicValidator(\n    min_length=50,\n    max_length=500,\n    required_format=\"text\",\n)\n\n# Layer 2: check entity overlap with source\nheuristic = HeuristicScorer(\n    check_entity_overlap=True,\n    check_novel_facts=True,\n    check_numerical_consistency=True,\n)\n\n# Layer 3: LLM judges (sampled)\nensemble = JudgeEnsemble(\n    judges=[\"claude-sonnet-4-20250514\", \"gpt-4o\"],\n    tiebreaker=\"claude-sonnet-4-20250514\",\n    sample_rate=0.15,\n    tiebreaker_threshold=0.20,\n    dimensions=[\"structural\", \"semantic\", \"factual\", \"completion\", \"latency\"],\n)\n\n# Evaluate\nresult = ensemble.evaluate(\n    task_type=\"summarize\",\n    ground_truth=gt_response,\n    candidate=candidate_response,\n    source_text=original_text,\n    validator=validator,\n    heuristic=heuristic,\n)\n\nprint(f\"Weighted score: {result.weighted_score:.3f}\")\nprint(f\"Dimensions: {result.scores}\")  # {semantic: 0.95, factual: 0.88, ...}\n# None values for unsampled dimensions\n\nTips\nStart with Layer 1 — you'd be surprised how many outputs fail basic validation\nLog everything — store raw judge responses for debugging score disputes\nCalibrate on 50 runs — before trusting the ensemble, manually review 50 outputs against judge scores\nWatch for judge drift — LLM judges can be inconsistent across API versions; pin model versions\nForce judges at gates — 15% sampling is fine for monitoring, but promotion decisions need 100% coverage on the final batch"
  },
  "trust": {
    "sourceLabel": "tencent",
    "provenanceUrl": "https://clawhub.ai/nissan/reddi-llm-judge",
    "publisherUrl": "https://clawhub.ai/nissan/reddi-llm-judge",
    "owner": "nissan",
    "version": "1.0.1",
    "license": null,
    "verificationStatus": "Indexed source record"
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/reddi-llm-judge",
    "downloadUrl": "https://openagent3.xyz/downloads/reddi-llm-judge",
    "agentUrl": "https://openagent3.xyz/skills/reddi-llm-judge/agent",
    "manifestUrl": "https://openagent3.xyz/skills/reddi-llm-judge/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/reddi-llm-judge/agent.md"
  }
}