
Peer Review

Multi-model peer review layer using local LLMs via Ollama to catch errors in cloud model output. Fan out critiques to 2–3 local models, aggregate flags, and synthesize a consensus.

Use when:
  • Validating trade analyses
  • Reviewing agent output quality
  • Testing local model accuracy
  • Checking any high-stakes Claude output before publishing or acting on it

Don't use when:
  • Simple fact-checking (just search the web)
  • Tasks that don't benefit from multi-model consensus
  • Time-critical decisions where 60s latency is unacceptable
  • Reviewing trivial or low-stakes content

Negative examples:
  • "Check if this date is correct" → No. Just web search it.
  • "Review my grocery list" → No. Not worth multi-model inference.
  • "I need this answer in 5 seconds" → No. Peer review adds 30–60s latency.

Edge cases:
  • Short text (<50 words) → Models may not find meaningful issues. Consider skipping.
  • Highly technical domain → Local models may lack domain knowledge. Weight flags lower.
  • Creative writing → Factual review doesn't apply well. Use only for logical consistency.


Install for OpenClaw

Quick setup
  1. Download the package from Yavira.
  2. Extract the archive and review SKILL.md first.
  3. Import or place the package into your OpenClaw setup.

Requirements

Target platform
OpenClaw
Install method
Manual import
Extraction
Extract archive
Prerequisites
OpenClaw
Primary doc
SKILL.md

Package facts

Download mode
Yavira redirect
Package format
ZIP package
Source platform
Tencent SkillHub
What's included
SKILL.md

Validation

  • Use the Yavira download entry.
  • Review SKILL.md after the package is downloaded.
  • Confirm the extracted package contains the expected setup assets.

Install with your agent

Agent handoff

Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.

  1. Download the package from Yavira.
  2. Extract it into a folder your agent can access.
  3. Paste one of the prompts below and point your agent at the extracted folder.
New install

I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.

Upgrade existing

I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.

Trust & source

Release facts

Source
Tencent SkillHub
Verification
Indexed source record
Version
1.0.0

Documentation

Primary doc: SKILL.md (12 sections)

Peer Review — Local LLM Critique Layer

Hypothesis: Local LLMs can catch ≥30% of real errors in cloud output with a <50% false-positive rate.

Architecture

Cloud Model (Claude) produces analysis
            │
            ▼
┌────────────────────────┐
│  Peer Review Fan-Out   │
├────────────────────────┤
│ Drift (Mistral 7B)     │──► Critique A
│ Pip (TinyLlama 1.1B)   │──► Critique B
│ Lume (Llama 3.1 8B)    │──► Critique C
└────────────────────────┘
            │
            ▼
Aggregator (consensus logic)
            │
            ▼
Final: original + flagged issues
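The fan-out step can be sketched as a small shell helper around Ollama's /api/generate endpoint. This is an illustrative sketch, not the skill's packaged scripts: the function names, prompt wording, and output paths are assumptions.

```shell
#!/usr/bin/env bash
# build_payload MODEL TEXT: emit the JSON request body for Ollama's
# /api/generate endpoint. "stream": false makes Ollama return one JSON
# object whose .response field holds the full completion.
build_payload() {
  jq -n --arg m "$1" --arg text "$2" \
    '{model: $m, prompt: ("Review the following text for errors:\n---\n" + $text), stream: false}'
}

# review_all INPUT_FILE [OUT_DIR]: send the same text to all three swarm
# models in parallel and store each raw critique under OUT_DIR.
review_all() {
  local input="$1" out="${2:-critiques}"
  mkdir -p "$out"
  for model in mistral:7b tinyllama:1.1b llama3.1:8b; do
    build_payload "$model" "$(cat "$input")" \
      | curl -s http://localhost:11434/api/generate -d @- \
      | jq -r '.response' > "$out/${model%%:*}.txt" &
  done
  wait  # block until all three critiques are written
}
```

Running `review_all analysis.txt` would leave `critiques/mistral.txt`, `critiques/tinyllama.txt`, and `critiques/llama3.1.txt` for the aggregator, assuming Ollama is serving on its default port.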

Swarm Bot Roles

  • Drift 🌊 (Mistral 7B): methodical analyst. Structured reasoning, catches logical gaps.
  • Pip 🐣 (TinyLlama 1.1B): fast checker. Quick sanity checks, low latency.
  • Lume 💡 (Llama 3.1 8B): deep thinker. Nuanced analysis, catches subtle issues.

Scripts

  • scripts/peer-review.sh: send a single input to all models, collect critiques
  • scripts/peer-review-batch.sh: run peer review across a corpus of samples
  • scripts/seed-test-corpus.sh: generate a seeded error corpus for testing

Usage

# Single file review
bash scripts/peer-review.sh <input_file> [output_dir]

# Batch review
bash scripts/peer-review-batch.sh <corpus_dir> [results_dir]

# Generate test corpus
bash scripts/seed-test-corpus.sh [count] [output_dir]

Scripts live at workspace/scripts/ and are not bundled in the skill to avoid duplication.

Critique Prompt Template

You are a skeptical reviewer. Analyze the following text for errors.

For each issue found, output JSON:
{"category": "factual|logical|missing|overconfidence|hallucinated_source", "quote": "...", "issue": "...", "confidence": 0-100}

If no issues found, output: {"issues": []}

TEXT:
---
{cloud_output}
---
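Given critiques that follow this template, a first aggregation step might filter issues by the model's self-reported confidence. The helper below is a sketch; in practice, raw model output often needs cleanup before it parses as a JSON array at all.

```shell
# high_conf_issues [MIN]: read a JSON array of issue objects on stdin and
# keep only those whose self-reported confidence meets the threshold
# (default 60).
high_conf_issues() {
  jq --argjson min "${1:-60}" '[.[] | select(.confidence >= $min)]'
}
```

Usage: `high_conf_issues 70 < critique.json` keeps only issues the model rated at 70 or above.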

Error Categories

  • factual: wrong numbers, dates, names. Example: "Bitcoin launched in 2010"
  • logical: non-sequiturs, unsupported conclusions. Example: "X is rising, therefore Y will fall"
  • missing: important context omitted. Example: ignoring a major counterargument
  • overconfidence: certainty without justification. Example: "This will definitely happen" on a 55% event
  • hallucinated_source: citing nonexistent sources. Example: "According to a 2024 Reuters report..."

Discord Workflow

  1. Post the analysis to #the-deep (or #swarm-lab).
  2. Drift, Pip, and Lume respond with independent critiques.
  3. Celeste synthesizes: deduplicates flags, weights by model confidence.
  4. If there is consensus (≥2 models agree), the flag is high-confidence.
  5. The final output is posted with a recommendation: publish | revise | flag_for_human.
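The "≥2 models agree" rule can be approximated by deduplicating flags on the quoted text. A toy sketch, assuming each model's flags have already been reduced to one quoted passage per line:

```shell
# consensus_flags FILE...: print each flagged quote that appears in at
# least two of the per-model flag files (one quote per line per file).
consensus_flags() {
  sort "$@" | uniq -c | while read -r count quote; do
    if [ "$count" -ge 2 ]; then
      printf '%s\n' "$quote"
    fi
  done
}
```

Matching on exact quote text is deliberately strict; a real aggregator would likely fuzzy-match overlapping spans before counting agreement.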

Success Criteria

  • Strong pass: TPR ≥50%, FPR <30% → ship as default layer
  • Pass: TPR ≥30%, FPR <50% → ship as opt-in layer
  • Marginal: TPR 20–30%, FPR 50–70% → iterate on prompts, retest
  • Fail: TPR <20%, FPR >70% → abandon approach
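These criteria can be read as an ordered threshold check, strongest band first. The function below is one interpretation; the exact boundary handling between bands is an assumption, since the bands as stated do not partition every TPR/FPR combination.

```shell
# review_decision TPR FPR: map measured whole-percent rates onto the
# success-criteria bands, checking the strongest outcome first.
review_decision() {
  local tpr=$1 fpr=$2
  if [ "$tpr" -ge 50 ] && [ "$fpr" -lt 30 ]; then
    echo "strong_pass"   # ship as default layer
  elif [ "$tpr" -ge 30 ] && [ "$fpr" -lt 50 ]; then
    echo "pass"          # ship as opt-in layer
  elif [ "$tpr" -ge 20 ] && [ "$fpr" -le 70 ]; then
    echo "marginal"      # iterate on prompts, retest
  else
    echo "fail"          # abandon approach
  fi
}
```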

Scoring Rules

  • A flag is a true positive if it identifies a real error (even if the explanation is imperfect).
  • A flag is a false positive if the flagged content is actually correct.
  • Duplicate flags across models count once for TPR but inform consensus metrics.

Dependencies

  • Ollama running locally with models pulled: mistral:7b, tinyllama:1.1b, llama3.1:8b
  • jq and curl installed
  • Results stored in experiments/peer-review-results/
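A quick preflight for this dependency list might look like the helper below; require_tools is a hypothetical name, not part of the packaged scripts.

```shell
# require_tools NAME...: report every command that is not on PATH and
# return non-zero if anything is missing.
require_tools() {
  local missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_tools ollama jq curl && ollama pull mistral:7b` would only pull the model once the tooling check passes.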

Integration

When peer review passes validation:
  • Package as a Reef API endpoint: POST /review
  • Agents call it before publishing any analysis
  • Configurable: model selection, consensus threshold, categories
  • Log all reviews to #reef-logs with TPR tracking
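Given the configurable knobs listed above, the POST /review request body might look roughly like this. The field names are guesses, since the endpoint is not yet specified.

```json
{
  "text": "...analysis to review...",
  "models": ["mistral:7b", "llama3.1:8b"],
  "consensus_threshold": 2,
  "categories": ["factual", "logical", "overconfidence"]
}
```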

Category context

Code helpers, APIs, CLIs, browser automation, testing, and developer operations.

Source: Tencent SkillHub

Largest current source with strong distribution and engagement signals.

Package contents

Included in package
1 Docs
  • SKILL.md Primary doc