Tencent SkillHub · AI

LLM Eval Router

Shadow-test local Ollama models against a cloud baseline with a multi-judge ensemble. Automatically promotes models when statistically proven equivalent — reducing inference costs with evidence, not hope.



Install for OpenClaw

Quick setup
  1. Download the package from Yavira.
  2. Extract the archive and review SKILL.md first.
  3. Import or place the package into your OpenClaw setup.

Requirements

Target platform
OpenClaw
Install method
Manual import
Extraction
Extract archive
Prerequisites
OpenClaw
Primary doc
SKILL.md

Package facts

Download mode
Yavira redirect
Package format
ZIP package
Source platform
Tencent SkillHub
What's included
SKILL.md

Validation

  • Use the Yavira download entry.
  • Review SKILL.md after the package is downloaded.
  • Confirm the extracted package contains the expected setup assets.

Install with your agent

Agent handoff

Hand the extracted package to your coding agent with a concrete install brief, rather than leaving it to work out the steps on its own.

  1. Download the package from Yavira.
  2. Extract it into a folder your agent can access.
  3. Paste one of the prompts below and point your agent at the extracted folder.
New install

I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.

Upgrade existing

I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.

Trust & source

Release facts

Source
Tencent SkillHub
Verification
Indexed source record
Version
1.2.1

Documentation

Primary doc: SKILL.md (20 sections)

llm-eval-router

Set up a production-quality shadow evaluation pipeline that automatically promotes local Ollama models when they statistically prove they match cloud model quality — reducing inference costs with evidence, not hope.

The core idea

Run every task through your best local model (the shadow) in parallel with your cloud baseline (the ground truth). A lightweight judge ensemble scores the local output. After 200+ runs, if the local model holds a mean score of at least 0.95, promote it to handle that task type in production. Demote it automatically if quality drops.

When to use

  • You're paying for Claude/GPT API calls on tasks that don't need that quality
  • You have Ollama running locally with capable models (qwen2.5, phi4, mistral, etc.)
  • You want evidence-based cost reduction, not blind routing
  • You have defined task types: summarize, classify, extract, format, analyze, RAG

When NOT to use

  • Tasks that require real-time web knowledge (use cloud)
  • Tasks with strict latency requirements under 2 seconds (local models on CPU are slow)
  • Tasks with high safety stakes (always use cloud with safety filters)
  • You don't have Ollama, or a Mac/Linux machine with enough RAM (8 GB+ per model)

Prerequisites

  • Ollama installed and running (ollama.com)
  • At least one capable model: `ollama pull qwen2.5` or `ollama pull phi4`
  • Python 3.10+
  • API keys: Anthropic (ground truth) + OpenAI (judge); Gemini optional (tiebreaker)
  • Langfuse for observability (self-hosted or cloud) — optional but strongly recommended
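Before wiring anything up, it helps to confirm the local Ollama server actually answers. A minimal preflight sketch, stdlib only (the `ollama_reachable` helper is illustrative; 11434 is Ollama's default port):

```python
import urllib.request

def ollama_reachable(base_url: str = "http://localhost:11434") -> bool:
    """Return True if the local Ollama server answers an HTTP request."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, etc.
        return False
```

Run this once at accumulator startup and fail fast with a clear message instead of letting every candidate call time out individually.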

Network & Privacy

This skill makes outbound API calls to:
  • Anthropic API — ground truth baseline responses (every accumulation cycle)
  • OpenAI API — judge scoring (sampled at 15% of runs)
  • Google Gemini API — tiebreaker judge only (when primary judges disagree by ≥ 0.20)

What stays local:
  • All Ollama model inference runs entirely on your device
  • Scored run data is stored on disk in data/scores/*.json
  • No telemetry, analytics, or data collection of any kind
  • No data is sent anywhere other than the explicit API calls above

Langfuse (optional) can be self-hosted or cloud. If self-hosted, all observability data stays on your network.

6-Dimension Evaluation

Every response is scored on:

| Dimension  | Default weight | Analyze weight | What it measures |
|------------|----------------|----------------|------------------|
| Structural | 25% | 10% | Format compliance, required keys present |
| Semantic   | 25% | 40% | Meaning equivalence to ground truth |
| Factual    | 20% | 25% | No hallucinated facts/numbers/entities |
| Completion | 15% | 18% | Task fully addressed |
| Tool use   | 10% | 4%  | Correct tool/format selection |
| Latency    | 5%  | 3%  | Within acceptable bounds |

Important: use per-task weight overrides. The default 25/25 split treats structural accuracy equally with semantic similarity — which works for extract/classify/format tasks (where exact format matters) but is wrong for open-ended analysis. difflib.SequenceMatcher on two prose analyses of the same question scores ~0.29 even when they're semantically identical. With structural weight at 25%, this alone caps analyze scores at ~0.59.

```python
# src/evaluator.py — per-task weight profiles
TASK_WEIGHT_OVERRIDES = {
    "analyze": {
        "structural_accuracy": 0.10,  # difflib is NOT meaningful for prose
        "semantic_similarity": 0.40,  # cosine over embeddings captures meaning
        "factual_drift": 0.25,
        "task_completion": 0.18,
        "tool_correctness": 0.04,
        "latency_score": 0.03,
    },
    "code_transform": {
        "structural_accuracy": 0.15,
        "semantic_similarity": 0.35,
        "factual_drift": 0.20,
        "task_completion": 0.20,
        "tool_correctness": 0.07,
        "latency_score": 0.03,
    },
}
```

Also: for analyze tasks, constrain output structure via system_prompt so ground truth and candidates produce comparably formatted responses (Finding/Recommendation/Confidence/Reasoning). This reduces Layer 2 drift and improves difflib scores even at reduced weight.

Judge ensemble

  • Primary judges (15% sampling rate): Claude Sonnet + gpt-4o-mini score independently
  • Tiebreaker (only when |score_A − score_B| ≥ 0.20): Gemini 2.5-flash
  • Unsampled runs (85%): Layer 1+2 validators only (deterministic, free)
  • Promotion gates always trigger full judge evaluation regardless of sampling rate

Layer 1+2 validators (free, deterministic)

  • Layer 1: JSON validity, required key presence, forbidden pattern check
  • Layer 2: drift detection — novel entities/numbers/URLs not in ground truth

These run on every response at zero cost. Judges only run when L1+L2 pass and the sampling rate triggers.
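The Layer 2 idea can be sketched in a few lines. This is an illustration only, not the skill's actual validator: `layer2_drift` and its decay formula are assumptions, and a real implementation would also check entities, not just numbers and URLs.

```python
import re

NUM_RE = re.compile(r"\b\d+(?:\.\d+)?\b")   # bare numbers
URL_RE = re.compile(r"https?://\S+")         # URLs

def layer2_drift(candidate: str, ground_truth: str) -> float:
    """Score 1.0 when the candidate introduces no novel numbers/URLs,
    decaying toward 0.0 as novel tokens appear (illustrative formula)."""
    def tokens(text: str) -> set[str]:
        return set(NUM_RE.findall(text)) | set(URL_RE.findall(text))
    novel = tokens(candidate) - tokens(ground_truth)  # present only in candidate
    return 1.0 / (1.0 + len(novel))
```

The key property is directionality: only tokens the candidate adds count as drift. Tokens the ground truth has but the candidate omits are a completion problem, not a drift problem.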

Promotion / Demotion

  • Promote: 200+ runs and a rolling mean ≥ 0.95 for a model/task pair
  • Demote: rolling 7-day pass rate < 0.92
  • Control floor: one model (phi4, granite4, or similar) serves as the measured floor — any model scoring below it should be flagged, not promoted
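A hedged sketch of how the gates above could compose; the `promotion_decision` helper and its string outcomes are assumptions for illustration (in the skill itself this logic lives in the confidence tracker):

```python
def promotion_decision(n: int, mean: float, floor_mean: float) -> str:
    """Apply the promote gate, then the control-floor sanity check."""
    if n < 200 or mean < 0.95:
        return "keep-accumulating"  # not enough statistical evidence yet
    if mean <= floor_mean:
        return "flag"               # everyone may be bad; do not promote
    return "promote"
```

The control-floor check is what prevents "promoted because every candidate is equally weak": a candidate must beat a known-mediocre model's measured score, not just clear an absolute threshold.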

Step 1 — Define your task types

Create config/task_types.yaml:

```yaml
tasks:
  - id: summarize
    description: "Summarize a document in N sentences"
    require_json: false
    judge_dimensions: [semantic, factual, completion]
  - id: classify
    description: "Classify text into one of N categories"
    require_json: true  # response must be valid JSON
    judge_dimensions: [structural, semantic, completion]
  - id: extract
    description: "Extract structured data from unstructured text"
    require_json: true
    judge_dimensions: [structural, factual, completion]
  - id: format
    description: "Reformat content to match a template"
    require_json: false
    judge_dimensions: [structural, semantic, completion]
```

Step 2 — Set up the router

The router assigns each task to a model using a round-robin strategy during burn-in (building n), then switches to confidence-weighted routing after promotion.

```python
# src/router.py — simplified version
from collections import defaultdict

class Router:
    def __init__(self, candidates: list[str], control_floor: str):
        self.candidates = candidates
        self.control_floor = control_floor
        self._rr_counters = defaultdict(int)

    def route(self, task_type: str, confidence_tracker: ConfidenceTracker) -> str:
        """Return the best model for this task type."""
        promoted = confidence_tracker.get_promoted(task_type)
        if promoted:
            return promoted  # use promoted model directly
        # Round-robin during burn-in for fair exposure
        idx = self._rr_counters[task_type] % len(self.candidates)
        self._rr_counters[task_type] += 1
        return self.candidates[idx]
```

Step 3 — Ground truth comparison

For each task, run it through BOTH the local model (candidate) and the cloud baseline (ground truth). Never use the ground truth response in production — it's only for evaluation.

```python
async def evaluate_pair(prompt: str, local_response: str,
                        gt_response: str, task_type: str) -> float:
    # Layer 1: deterministic
    l1_score = validators.layer1(local_response, task_type)
    if l1_score == 0.0:
        return 0.0  # hard fail — safety or format violation

    # Layer 2: heuristic drift
    l2_score = validators.layer2(local_response, gt_response)

    # Sample judges (15%)
    if random.random() < JUDGE_SAMPLE_RATE:
        sonnet_score = await judge_sonnet(prompt, local_response, gt_response)
        mini_score = await judge_gpt4o_mini(prompt, local_response, gt_response)
        if abs(sonnet_score - mini_score) >= 0.20:
            gemini_score = await judge_gemini(prompt, local_response, gt_response)
            final = median([sonnet_score, mini_score, gemini_score])
        else:
            final = (sonnet_score + mini_score) / 2
        return weighted_score(l1_score, l2_score, final)
    else:
        return weighted_score(l1_score, l2_score, judge_score=None)
```

Step 4 — Confidence tracker

Track scores per model/task pair on disk (so restarts don't lose data):

```python
# src/scoring/confidence.py — simplified
@dataclass
class ModelStats:
    model_id: str
    task_type: str
    scores: list[float]  # all scores (None excluded)
    promoted: bool = False
    demoted: bool = False

    @property
    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    @property
    def n(self) -> int:
        return len(self.scores)

    def should_promote(self) -> bool:
        return self.n >= 200 and self.mean >= 0.95 and not self.promoted

    def should_demote(self) -> bool:
        recent = self.scores[-50:]  # last 50
        if not recent:
            return False  # no data yet — nothing to demote
        pass_rate = sum(1 for s in recent if s >= 0.85) / len(recent)
        return pass_rate < 0.92 and not self.demoted
```

Step 5 — Accumulator loop

Run this on a cron (every 10–20 minutes via launchd/systemd):

```python
# run_accumulate.py
async def accumulate():
    task_type = pick_next_task()  # round-robin across task types
    prompt, gt_response = generate_task(task_type)  # call cloud baseline

    for candidate in router.get_candidates(task_type):
        local_response = await ollama_client.complete(candidate, prompt)
        score = await evaluate_pair(prompt, local_response, gt_response, task_type)
        confidence_tracker.record(candidate, task_type, score)

        if confidence_tracker.should_promote(candidate, task_type):
            router.promote(candidate, task_type)
            langfuse.log_promotion(candidate, task_type,
                                   confidence_tracker.stats(candidate, task_type))
```

Step 6 — Routing policy

```yaml
# config/routing_policy.yaml
control_floor_model: phi4:latest  # never promote below this model's score

task_policies:
  policy_check_high_risk:
    never_local: true  # these tasks always use cloud model
  summarize:
    min_score_for_routing: 0.85
    fallback_chain: [qwen2.5, llama3.1, phi4]
  classify:
    min_score_for_routing: 0.90  # higher bar for classification
    fallback_chain: [qwen2.5, granite4, llama3.1]
```

Step 7 — API

Expose a simple HTTP API (FastAPI):
  • POST /run — route a task through the best available model
  • GET /health — service status + promoted models + Ollama connectivity
  • GET /status — full scoreboard (model × task × mean × n)
  • GET /report — cost heatmap + efficiency analysis
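One possible shape for the /status payload. The `build_status` helper and its row format are assumptions, and the FastAPI route wiring is omitted so the logic stays self-contained and testable:

```python
def build_status(stats: dict[tuple[str, str], dict]) -> list[dict]:
    """Flatten per-(model, task) stats into scoreboard rows for GET /status."""
    return [
        {"model": m, "task": t, "mean": round(s["mean"], 3),
         "n": s["n"], "promoted": s.get("promoted", False)}
        for (m, t), s in sorted(stats.items())
    ]

rows = build_status({
    ("qwen2.5", "classify"): {"mean": 0.9612, "n": 214, "promoted": True},
    ("phi4", "classify"): {"mean": 0.9021, "n": 214},
})
```

Keeping the payload builder separate from the route handler means the scoreboard can be unit-tested without spinning up the HTTP server.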

Key lessons learned (from 900+ production runs)

What worked:
  • phi4 as control floor: a measured floor model prevents "promoted because everyone else is also bad" errors. If the floor model beats a candidate, flag it — don't promote.
  • Thinking token stripping: CoT models (deepseek-r1, qwen2.5-coder with reasoning) must have <think>...</think> blocks stripped before evaluation. Otherwise Layer 2 drift detection flags the reasoning chain as hallucinated content.
  • None ≠ 0.0 for unsampled runs: a run where no judge scored is not a failing run. Store None and exclude it from the mean. Mixing None with 0.0 poisons the mean.
  • require_json: false for plain-text tasks: classify and extract tasks that return formatted text (not JSON objects) will fail Layer 1 if you require JSON. Separate the "is the format correct" check from "is it valid JSON."
  • Per-task weight overrides: do not use one weight profile for all task types. Structural accuracy (difflib) is wrong for prose analysis — use semantic similarity as the primary signal for open-ended tasks. This lifted the analyze mean from 0.44–0.59 to 0.70.
  • Structured output prompts for analyze tasks: add a system_prompt that specifies an exact output format (Finding/Recommendation/Confidence/Reasoning). Both ground truth and candidates then follow the same template, improving structural alignment and reducing the drift penalty. Without this, Layer 2 drift fires on differently phrased but correct analyses.
  • MCP server for agentic access: expose CP as MCP tools (run_task, get_status, get_champions, get_promotion_timeline, get_cost_heatmap). This lets an LLM agent query evaluation state without bespoke integration work.

What didn't work:
  • Large models (>9 GB): gpt-oss:20b and similar required 39+ second inference — the latency dimension alone tanks the composite score. The practical ceiling is ~9 GB models on 24 GB unified memory to avoid GPU memory swapping.
  • 100% judge sampling: running the full Claude+GPT+Gemini panel on every evaluation costs more in judge API fees than you save by routing locally. Sample at 15%.
  • Chroma 1.5.1 with Python 3.14: Pydantic V1 BaseSettings incompatibility. Use Qdrant or a numpy cosine store instead.
  • One-size-fits-all weight profiles: defining global weights at system init and never overriding per task type led to all analyze evals silently failing for 112+ runs. Lesson: evaluate your evaluator's scores by task type early — if a whole task type caps at a suspicious ceiling (e.g. 0.59), the metric is wrong, not the models.
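The thinking-token stripping called out above can be a small regex pass before scoring. A minimal sketch (`strip_thinking` is a hypothetical helper; real reasoning models vary in their tag formats):

```python
import re

# DOTALL so multi-line reasoning chains are matched; non-greedy so
# multiple <think> blocks in one response are each removed separately.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Drop <think>...</think> blocks so drift detection only sees the answer."""
    return THINK_RE.sub("", text).strip()
```

Apply this to the candidate response before Layer 2 runs; otherwise every entity and number mentioned in the reasoning chain counts as drift against the ground truth.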

Expected timeline

With a 20-minute accumulator cadence and 9 candidates × 7 task types:
  • First 50 runs per model: ~5 hours
  • First promotions (200 runs): ~1–2 days per model/task pair
  • Stable routing layer: 1–2 weeks

Cost estimate

Per accumulation cycle (one task, one model):
  • Ground truth: ~$0.002 (Claude Sonnet, ~500 input + 200 output tokens)
  • Judge sample (15%): ~$0.003 (Sonnet + GPT-4o-mini)
  • Local model: $0 (Ollama, on-device)

At 6 runs/hour × 24 hours: ~$0.70/day during burn-in. After first promotions this drops to ~$0.10/day (90%+ of task volume runs locally).

Category context

Agent frameworks, memory systems, reasoning layers, and model-native orchestration.

Source: Tencent SkillHub

Largest current source with strong distribution and engagement signals.

Package contents

Included in package
1 Docs
  • SKILL.md Primary doc