# Send Llm Eval Router to your agent
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
## Fast path
- Download the package from Yavira.
- Extract it into a folder your agent can access.
- Paste one of the prompts below and point your agent at the extracted folder.
## Suggested prompts
### New install

```text
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
```
### Upgrade existing

```text
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
```
## Machine-readable fields
```json
{
  "schemaVersion": "1.0",
  "item": {
    "slug": "llm-eval-router",
    "name": "Llm Eval Router",
    "source": "tencent",
    "type": "skill",
    "category": "AI 智能",
    "sourceUrl": "https://clawhub.ai/nissan/llm-eval-router",
    "canonicalUrl": "https://clawhub.ai/nissan/llm-eval-router",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadUrl": "/downloads/llm-eval-router",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=llm-eval-router",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "packageFormat": "ZIP package",
    "primaryDoc": "SKILL.md",
    "includedAssets": [
      "SKILL.md"
    ],
    "downloadMode": "redirect",
    "sourceHealth": {
      "source": "tencent",
      "slug": "llm-eval-router",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-05-02T05:56:13.623Z",
      "expiresAt": "2026-05-09T05:56:13.623Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=llm-eval-router",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=llm-eval-router",
        "contentDisposition": "attachment; filename=\"llm-eval-router-1.2.2.zip\"",
        "redirectLocation": null,
        "bodySnippet": null,
        "slug": "llm-eval-router"
      },
      "scope": "item",
      "summary": "Item download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this item.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/llm-eval-router"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    }
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/llm-eval-router",
    "downloadUrl": "https://openagent3.xyz/downloads/llm-eval-router",
    "agentUrl": "https://openagent3.xyz/skills/llm-eval-router/agent",
    "manifestUrl": "https://openagent3.xyz/skills/llm-eval-router/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/llm-eval-router/agent.md"
  }
}
```
## Documentation

### llm-eval-router

Set up a production-quality shadow evaluation pipeline that automatically
promotes local Ollama models when they statistically prove they match cloud
model quality — reducing inference costs with evidence, not hope.

### The core idea

Run every task through your best local model (shadow) in parallel with your
cloud baseline (ground truth). A lightweight judge ensemble scores the local
output. After 200+ runs, if the local model hits 0.95 mean score, promote it
to handle that task type in production. Demote it automatically if quality drops.

### When to use

- You're paying for Claude/GPT API calls on tasks that don't need that quality
- You have Ollama running locally with capable models (qwen2.5, phi4, mistral, etc.)
- You want evidence-based cost reduction, not blind routing
- You have defined task types: summarize, classify, extract, format, analyze, RAG

### When NOT to use

- Tasks that require real-time web knowledge (use cloud)
- Tasks with strict latency requirements < 2 seconds (local models on CPU are slow)
- Tasks with high safety stakes (always use cloud with safety filters)
- You don't have Ollama or a Mac/Linux machine with enough RAM (8GB+ per model)

### Prerequisites

- Ollama installed and running (ollama.com)
- At least one capable model: `ollama pull qwen2.5` or `ollama pull phi4`
- Python 3.10+
- API keys: Anthropic (ground truth) and OpenAI (judge); Gemini is optional (tiebreaker)
- Langfuse for observability (self-hosted or cloud): optional but strongly recommended

### Network & Privacy

This skill makes outbound API calls to:

- Anthropic API: generates ground truth baseline responses (every accumulation cycle)
- OpenAI API: judge scoring (sampled at 15% of runs)
- Google Gemini API: tiebreaker judge only (when primary judges disagree by ≥0.20)

What stays local:

- All Ollama model inference runs entirely on your device
- Scored run data is stored on disk in `data/scores/*.json`
- No telemetry, analytics, or data collection of any kind
- No data is sent anywhere other than the explicit API calls above

Langfuse (optional) can be self-hosted or cloud. If self-hosted, all observability data stays on your network.

### 6-Dimension Evaluation

Every response is scored on:

| Dimension | Default weight | Analyze weight | What it measures |
|---|---|---|---|
| Structural | 25% | 10% | Format compliance, required keys present |
| Semantic | 25% | 40% | Meaning equivalence to ground truth |
| Factual | 20% | 25% | No hallucinated facts/numbers/entities |
| Completion | 15% | 18% | Task fully addressed |
| Tool use | 10% | 4% | Correct tool/format selection |
| Latency | 5% | 3% | Within acceptable bounds |

Important: Use per-task weight overrides. The default 25/25 split treats structural
accuracy equally with semantic similarity — which works for extract/classify/format tasks
(where exact format matters) but is wrong for open-ended analysis. difflib.SequenceMatcher
on two prose analyses of the same question scores ~0.29 even when they're semantically
identical. With structural weight at 25%, this alone caps analyze scores at ~0.59.

```python
# src/evaluator.py: per-task weight profiles
TASK_WEIGHT_OVERRIDES = {
    "analyze": {
        "structural_accuracy": 0.10,   # difflib is NOT meaningful for prose
        "semantic_similarity": 0.40,   # cosine over embeddings captures meaning
        "factual_drift": 0.25,
        "task_completion": 0.18,
        "tool_correctness": 0.04,
        "latency_score": 0.03,
    },
    "code_transform": {
        "structural_accuracy": 0.15,
        "semantic_similarity": 0.35,
        "factual_drift": 0.20,
        "task_completion": 0.20,
        "tool_correctness": 0.07,
        "latency_score": 0.03,
    },
}
```

Also: For analyze tasks, constrain output structure via system_prompt so GT and
candidates produce comparably-formatted responses (Finding/Recommendation/Confidence/Reasoning).
This reduces Layer 2 drift and improves difflib scores even at reduced weight.
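
To make the effect of a weight profile concrete, here is a minimal sketch of how a composite score could be computed from per-dimension scores. The `composite_score` helper, the `DEFAULT_WEIGHTS` dictionary (the table's default column mapped onto the key names used in `TASK_WEIGHT_OVERRIDES`), and the example dimension scores are all illustrative assumptions, not the package's actual implementation or measured data.

```python
# Assumption: default weights from the table, keyed like TASK_WEIGHT_OVERRIDES.
DEFAULT_WEIGHTS = {
    "structural_accuracy": 0.25,
    "semantic_similarity": 0.25,
    "factual_drift": 0.20,
    "task_completion": 0.15,
    "tool_correctness": 0.10,
    "latency_score": 0.05,
}

# Analyze profile, copied from TASK_WEIGHT_OVERRIDES["analyze"].
ANALYZE_WEIGHTS = {
    "structural_accuracy": 0.10,
    "semantic_similarity": 0.40,
    "factual_drift": 0.25,
    "task_completion": 0.18,
    "tool_correctness": 0.04,
    "latency_score": 0.03,
}


def composite_score(dimension_scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (hypothetical helper)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[name] * dimension_scores[name] for name in weights)


# Made-up scores for one prose analysis: low structural (difflib) score,
# otherwise good. These are NOT measurements from the project.
example = {
    "structural_accuracy": 0.29,
    "semantic_similarity": 0.90,
    "factual_drift": 0.85,
    "task_completion": 0.90,
    "tool_correctness": 1.00,
    "latency_score": 0.80,
}

default_total = composite_score(example, DEFAULT_WEIGHTS)
analyze_total = composite_score(example, ANALYZE_WEIGHTS)
```

With these made-up inputs the same response scores about 0.74 under the default profile and about 0.83 under the analyze profile, which illustrates how much a low difflib score can drag down a prose task when structural accuracy carries 25% of the weight.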

### Judge ensemble

- Primary judges (15% sampling rate): Claude Sonnet + gpt-4o-mini score independently
- Tiebreaker (only when |score_A - score_B| ≥ 0.20): Gemini 2.5-flash
- Unsampled runs (85%): Layer 1+2 validators only (deterministic, free)
- Promotion gates always trigger full judge evaluation regardless of sampling rate

### Layer 1+2 validators (free, deterministic)

- Layer 1: JSON validity, required key presence, forbidden pattern check
- Layer 2: Drift detection for novel entities/numbers/URLs not in ground truth

These run on every response at zero cost. Judges only run when L1+L2 pass and
the sampling rate triggers.
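
The two layers can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the package's validators: the function signatures, the example forbidden pattern, and the scoring rule for Layer 2 (share of numbers and URLs in the response that also appear in the ground truth) are hypothetical, and entity detection is left out entirely.

```python
import json
import re

# Hypothetical forbidden pattern; a real deployment would define its own list.
FORBIDDEN_PATTERNS = [re.compile(r"(?i)as an ai language model")]

# Matches URLs first, then bare integers/decimals.
_TOKEN_RE = re.compile(r"https?://\S+|\d+(?:\.\d+)?")


def layer1(response: str, require_json: bool,
           required_keys: tuple[str, ...] = ()) -> float:
    """Deterministic gate: 1.0 on pass, 0.0 on hard fail."""
    if any(p.search(response) for p in FORBIDDEN_PATTERNS):
        return 0.0
    if require_json:
        try:
            parsed = json.loads(response)
        except json.JSONDecodeError:
            return 0.0
        if not isinstance(parsed, dict):
            return 0.0
        if any(key not in parsed for key in required_keys):
            return 0.0
    return 1.0


def layer2(response: str, ground_truth: str) -> float:
    """Drift heuristic: fraction of numbers/URLs in the response that
    also occur in the ground truth (1.0 means nothing novel)."""
    found = set(_TOKEN_RE.findall(response))
    if not found:
        return 1.0
    known = set(_TOKEN_RE.findall(ground_truth))
    return len(found & known) / len(found)
```

For example, `layer2("Revenue rose 15% in 2023", "Revenue rose 12% in 2023")` returns 0.5 because one of the two numbers in the response (15) never appears in the ground truth.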

### Promotion / Demotion

- Promote: 200+ runs, rolling mean ≥ 0.95 for a model/task pair
- Demote: rolling 7-day pass rate < 0.92
- Control floor: one model (phi4, granite4, or similar) serves as the measured floor;
  any model scoring below it should be flagged, not promoted
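
As a rough sketch of how the promotion gate and the control floor could interact, the function below returns a decision label for one model/task pair. The function name, the label strings, and the order of the checks are assumptions made for illustration; only the thresholds (200 runs, 0.95 mean) come from the rules above.

```python
from statistics import mean

MIN_RUNS = 200        # from the promotion rule above
PROMOTE_MEAN = 0.95   # from the promotion rule above


def promotion_decision(candidate_scores: list[float],
                       floor_scores: list[float]) -> str:
    """Hypothetical gate: 'accumulating', 'flag', 'promote', or 'hold'."""
    if len(candidate_scores) < MIN_RUNS:
        return "accumulating"  # not enough evidence yet
    candidate_mean = mean(candidate_scores)
    # A candidate that scores below the measured floor is flagged, never promoted.
    if floor_scores and candidate_mean < mean(floor_scores):
        return "flag"
    if candidate_mean >= PROMOTE_MEAN:
        return "promote"
    return "hold"
```

Checking the floor before the promotion threshold keeps a below-floor candidate from ever reaching the "promote" branch.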

### Step 1 — Define your task types

Create config/task_types.yaml:

```yaml
tasks:
  - id: summarize
    description: "Summarize a document in N sentences"
    require_json: false
    judge_dimensions: [semantic, factual, completion]

  - id: classify
    description: "Classify text into one of N categories"
    require_json: true    # response must be valid JSON
    judge_dimensions: [structural, semantic, completion]

  - id: extract
    description: "Extract structured data from unstructured text"
    require_json: true
    judge_dimensions: [structural, factual, completion]

  - id: format
    description: "Reformat content to match a template"
    require_json: false
    judge_dimensions: [structural, semantic, completion]
```

### Step 2 — Set up the router

The router assigns each task to a model using a round-robin strategy during
burn-in (building n), then switches to confidence-weighted routing after promotion.

```python
# src/router.py (simplified version)
from __future__ import annotations  # ConfidenceTracker is defined elsewhere

from collections import defaultdict


class Router:
    def __init__(self, candidates: list[str], control_floor: str):
        self.candidates = candidates
        self.control_floor = control_floor
        # One counter per task type, so each task type rotates independently.
        self._rr_counters = defaultdict(int)

    def route(self, task_type: str, confidence_tracker: ConfidenceTracker) -> str:
        """Return the best model for this task type."""
        promoted = confidence_tracker.get_promoted(task_type)
        if promoted:
            return promoted  # use promoted model directly

        # Round-robin during burn-in for fair exposure
        idx = self._rr_counters[task_type] % len(self.candidates)
        self._rr_counters[task_type] += 1
        return self.candidates[idx]
```

### Step 3 — Ground truth comparison

For each task, run it through BOTH the local model (candidate) and the cloud
baseline (ground truth). Never use the ground truth response in production —
it's only for evaluation.

```python
import random
from statistics import median

JUDGE_SAMPLE_RATE = 0.15  # fraction of runs sent to the paid judges

# validators, judge_sonnet, judge_gpt4o_mini, judge_gemini and weighted_score
# are project helpers that are not shown in this simplified excerpt.


async def evaluate_pair(prompt: str, local_response: str, gt_response: str,
                        task_type: str) -> float:
    # Layer 1: deterministic
    l1_score = validators.layer1(local_response, task_type)
    if l1_score == 0.0:
        return 0.0  # hard fail: safety or format violation

    # Layer 2: heuristic drift
    l2_score = validators.layer2(local_response, gt_response)

    # Sample judges (15%)
    if random.random() < JUDGE_SAMPLE_RATE:
        sonnet_score = await judge_sonnet(prompt, local_response, gt_response)
        mini_score = await judge_gpt4o_mini(prompt, local_response, gt_response)
        if abs(sonnet_score - mini_score) >= 0.20:
            # Judges disagree: bring in the tiebreaker and take the median.
            gemini_score = await judge_gemini(prompt, local_response, gt_response)
            final = median([sonnet_score, mini_score, gemini_score])
        else:
            final = (sonnet_score + mini_score) / 2
        return weighted_score(l1_score, l2_score, final)
    else:
        # Unsampled run: only the free Layer 1+2 signals are available.
        return weighted_score(l1_score, l2_score, judge_score=None)
```

### Step 4 — Confidence tracker

Track scores per model/task pair on disk (so restarts don't lose data):

```python
# src/scoring/confidence.py (simplified)
from dataclasses import dataclass


@dataclass
class ModelStats:
    model_id: str
    task_type: str
    scores: list[float]   # all scores (None excluded)
    promoted: bool = False
    demoted: bool = False

    @property
    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    @property
    def n(self) -> int:
        return len(self.scores)

    def should_promote(self) -> bool:
        return self.n >= 200 and self.mean >= 0.95 and not self.promoted

    def should_demote(self) -> bool:
        # Last 50 scored runs (this simplified version uses a count-based window).
        recent = self.scores[-50:]
        if not recent:
            return False  # no data yet, so nothing to demote (avoids dividing by zero)
        pass_rate = sum(1 for s in recent if s >= 0.85) / len(recent)
        return pass_rate < 0.92 and not self.demoted
```

### Step 5 — Accumulator loop

Run this on a cron (every 10-20 minutes via launchd/systemd):

```python
# run_accumulate.py
async def accumulate():
    task_type = pick_next_task()  # round-robin across task types
    prompt, gt_response = generate_task(task_type)  # call cloud baseline

    for candidate in router.get_candidates(task_type):
        local_response = await ollama_client.complete(candidate, prompt)
        score = await evaluate_pair(prompt, local_response, gt_response, task_type)
        confidence_tracker.record(candidate, task_type, score)

        if confidence_tracker.should_promote(candidate, task_type):
            router.promote(candidate, task_type)
            langfuse.log_promotion(candidate, task_type, confidence_tracker.stats(candidate, task_type))
```

### Step 6 — Routing policy

```yaml
# config/routing_policy.yaml
control_floor_model: phi4:latest   # never promote below this model's score

task_policies:
  policy_check_high_risk:
    never_local: true              # these tasks always use cloud model

  summarize:
    min_score_for_routing: 0.85
    fallback_chain: [qwen2.5, llama3.1, phi4]

  classify:
    min_score_for_routing: 0.90   # higher bar for classification
    fallback_chain: [qwen2.5, granite4, llama3.1]
```
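
One plausible reading of this policy is sketched below: walk the fallback chain in order, pick the first local model whose mean score clears `min_score_for_routing`, and fall back to the cloud model otherwise. This interpretation, the `pick_model` function, the `CLOUD_MODEL` placeholder, and the in-memory `POLICY` dict (the YAML above, already parsed) are assumptions, not behavior confirmed by the package.

```python
CLOUD_MODEL = "cloud-baseline"  # hypothetical placeholder for the cloud model

# The task_policies section above, as it might look after parsing the YAML.
POLICY = {
    "policy_check_high_risk": {"never_local": True},
    "summarize": {
        "min_score_for_routing": 0.85,
        "fallback_chain": ["qwen2.5", "llama3.1", "phi4"],
    },
    "classify": {
        "min_score_for_routing": 0.90,
        "fallback_chain": ["qwen2.5", "granite4", "llama3.1"],
    },
}


def pick_model(task_type: str,
               mean_scores: dict[tuple[str, str], float]) -> str:
    """Pick the first chain entry whose mean score meets the task's bar."""
    policy = POLICY.get(task_type, {})
    if policy.get("never_local"):
        return CLOUD_MODEL
    # Unknown task types get a threshold no local model can reach.
    threshold = policy.get("min_score_for_routing", 1.01)
    for model in policy.get("fallback_chain", []):
        if mean_scores.get((model, task_type), 0.0) >= threshold:
            return model
    return CLOUD_MODEL


# Made-up mean scores keyed by (model, task_type).
scores = {
    ("qwen2.5", "summarize"): 0.80,
    ("llama3.1", "summarize"): 0.88,
    ("qwen2.5", "classify"): 0.89,
}
```

With these made-up scores, `summarize` would route to `llama3.1` (the first chain entry at or above 0.85), while `classify` would stay on the cloud model because no local model reaches 0.90.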

### Step 7 — API

Expose a simple HTTP API (FastAPI):

```text
POST /run       route a task through the best available model
GET  /health    service status + promoted models + ollama connectivity
GET  /status    full scoreboard (model × task × mean × n)
GET  /report    cost heatmap + efficiency analysis
```

### Key lessons learned (from 900+ production runs)

What worked:

- phi4 as control floor: a measured floor model prevents "promoted because everyone
  else is also bad" errors. If the floor model beats a candidate, flag it; don't promote.
- Thinking token stripping: CoT models (deepseek-r1, qwen2.5-coder with reasoning)
  must have `<think>...</think>` blocks stripped before evaluation. Otherwise Layer 2
  drift detection flags the reasoning chain as hallucinated content.
- None ≠ 0.0 for unsampled runs: a run where no judge scored is not a failing run.
  Store None, exclude from mean. Mixing None with 0.0 poisons the mean.
- `require_json: false` for plain-text tasks: classify and extract tasks that return
  formatted text (not JSON objects) will fail Layer 1 if you require JSON. Separate
  the "is the format correct" check from "is it valid JSON."
- Per-task weight overrides: do not use one weight profile for all task types.
  Structural accuracy (difflib) is wrong for prose analysis; use semantic similarity as
  the primary signal for open-ended tasks. This lifted analyze mean from 0.44–0.59 to 0.70.
- Structured output prompts for analyze tasks: add a system_prompt that specifies
  an exact output format (Finding/Recommendation/Confidence/Reasoning). Both GT and
  candidates follow the same template, improving structural alignment and reducing drift
  penalty. Without this, Layer 2 drift fires on differently-phrased but correct analyses.
- MCP server for agentic access: expose CP as MCP tools (run_task, get_status,
  get_champions, get_promotion_timeline, get_cost_heatmap). Lets an LLM agent
  query evaluation state without bespoke integration work.

What didn't work:

- Large models (>9GB): gpt-oss:20b and similar required 39+ second inference, and
  the latency dimension alone tanks the composite score. Practical ceiling is ~9GB models
  on 24GB unified memory to avoid GPU memory swapping.
- 100% judge sampling: running the full Claude+GPT+Gemini panel on every evaluation
  costs more in judge API fees than you save by routing locally. Sample at 15%.
- Chroma 1.5.1 with Python 3.14: Pydantic V1 BaseSettings incompatibility. Use
  qdrant or numpy cosine store instead.
- One-size-fits-all weight profiles: defining global weights at system init and never
  overriding per task type led to all analyze evals silently failing for 112+ runs.
  Lesson: evaluate your evaluator's scores by task type early. If a whole task type
  caps at a suspicious ceiling (e.g. 0.59), the metric is wrong, not the models.

### Expected timeline

With a 20-minute accumulator cadence and 9 candidates × 7 task types:

- First 50 runs per model: ~5 hours
- First promotions (200 runs): ~1-2 days per model/task pair
- Stable routing layer: 1-2 weeks

### Cost estimate

Per accumulation cycle (one task, one model):

- Ground truth: ~$0.002 (Claude Sonnet, ~500 input + 200 output tokens)
- Judge sample (15%): ~$0.003 (Sonnet + GPT-4o-mini)
- Local model: $0 (Ollama, on-device)

At 6 runs/hour × 24 hours: ~$0.70/day during burn-in.
After first promotions: drops to ~$0.10/day (90%+ of task volume local).
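
The burn-in figure can be reproduced with a few lines of arithmetic. This sketch assumes the ~$0.003 judge figure is already an amortized per-cycle cost (that is, the 15% sampling is baked in), which is the reading under which the numbers above add up to roughly $0.70/day; the variable and function names are illustrative.

```python
GROUND_TRUTH_COST = 0.002   # USD per cycle, from the list above
JUDGE_COST = 0.003          # USD per cycle, assumed already amortized over sampling
LOCAL_COST = 0.0            # Ollama runs on-device
RUNS_PER_HOUR = 6


def daily_cost(runs_per_hour: int, cost_per_run: float) -> float:
    """Total API spend for 24 hours at a fixed cadence."""
    return runs_per_hour * 24 * cost_per_run


cost_per_run = GROUND_TRUTH_COST + JUDGE_COST + LOCAL_COST
burn_in_per_day = daily_cost(RUNS_PER_HOUR, cost_per_run)  # 144 runs * $0.005
```

Under these assumptions the burn-in cost works out to about $0.72/day, in line with the ~$0.70/day estimate.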
## Trust
- Source: tencent
- Verification: Indexed source record
- Publisher: nissan
- Version: 1.2.1
## Source health
- Status: healthy
- Item download looks usable.
- Yavira can redirect you to the upstream package for this item.
- Health scope: item
- Reason: direct_download_ok
- Checked at: 2026-05-02T05:56:13.623Z
- Expires at: 2026-05-09T05:56:13.623Z
- Recommended action: Download for OpenClaw
## Links
- [Detail page](https://openagent3.xyz/skills/llm-eval-router)
- [Send to Agent page](https://openagent3.xyz/skills/llm-eval-router/agent)
- [JSON manifest](https://openagent3.xyz/skills/llm-eval-router/agent.json)
- [Markdown brief](https://openagent3.xyz/skills/llm-eval-router/agent.md)
- [Download page](https://openagent3.xyz/downloads/llm-eval-router)