Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Multi-model peer review layer using local LLMs via Ollama to catch errors in cloud model output. Fan-out critiques to 2-3 local models, aggregate flags, synthesize consensus.

Use when: validating trade analyses, reviewing agent output quality, testing local model accuracy, checking any high-stakes Claude output before publishing or acting on it.

Don't use when: simple fact-checking (just search the web), tasks that don't benefit from multi-model consensus, time-critical decisions where 60s latency is unacceptable, reviewing trivial or low-stakes content.

Negative examples:
- "Check if this date is correct" → No. Just web search it.
- "Review my grocery list" → No. Not worth multi-model inference.
- "I need this answer in 5 seconds" → No. Peer review adds 30-60s latency.

Edge cases:
- Short text (<50 words) → Models may not find meaningful issues. Consider skipping.
- Highly technical domain → Local models may lack domain knowledge. Weight flags lower.
- Creative writing → Factual review doesn't apply well. Use only for logical consistency.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
Hypothesis: Local LLMs can catch ≥30% of real errors in cloud output with <50% false positive rate.
```
Cloud Model (Claude) produces analysis
            │
            ▼
┌────────────────────────┐
│  Peer Review Fan-Out   │
├────────────────────────┤
│ Drift (Mistral 7B)     │──► Critique A
│ Pip (TinyLlama 1.1B)   │──► Critique B
│ Lume (Llama 3.1 8B)    │──► Critique C
└────────────────────────┘
            │
            ▼
 Aggregator (consensus logic)
            │
            ▼
 Final: original + flagged issues
```
| Bot | Model | Role | Strengths |
|---|---|---|---|
| Drift 🌊 | Mistral 7B | Methodical analyst | Structured reasoning, catches logical gaps |
| Pip 🐣 | TinyLlama 1.1B | Fast checker | Quick sanity checks, low latency |
| Lume 💡 | Llama 3.1 8B | Deep thinker | Nuanced analysis, catches subtle issues |
| Script | Purpose |
|---|---|
| `scripts/peer-review.sh` | Send single input to all models, collect critiques |
| `scripts/peer-review-batch.sh` | Run peer review across a corpus of samples |
| `scripts/seed-test-corpus.sh` | Generate seeded error corpus for testing |
```bash
# Single file review
bash scripts/peer-review.sh <input_file> [output_dir]

# Batch review
bash scripts/peer-review-batch.sh <corpus_dir> [results_dir]

# Generate test corpus
bash scripts/seed-test-corpus.sh [count] [output_dir]
```

Scripts live at `workspace/scripts/`; they are not bundled in the skill to avoid duplication.
```
You are a skeptical reviewer. Analyze the following text for errors.

For each issue found, output JSON:
{"category": "factual|logical|missing|overconfidence|hallucinated_source", "quote": "...", "issue": "...", "confidence": 0-100}

If no issues found, output: {"issues": []}

TEXT:
---
{cloud_output}
---
```
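For orientation, here is a minimal sketch of how the fan-out in the architecture diagram could be wired up against Ollama's HTTP API on its default `localhost:11434` port, filling the `{cloud_output}` placeholder in the prompt above. The real logic lives in `scripts/peer-review.sh`; the file layout and output naming below are assumptions, not the bundled implementation.

```bash
#!/usr/bin/env bash
# Sketch of the fan-out step; the real implementation is scripts/peer-review.sh.
set -euo pipefail

input_file="$1"
output_dir="${2:-critiques}"
mkdir -p "$output_dir"

models=("mistral:7b" "tinyllama:1.1b" "llama3.1:8b")

# Fill the {cloud_output} placeholder in the reviewer prompt.
cloud_output="$(cat "$input_file")"
prompt="You are a skeptical reviewer. Analyze the following text for errors.
For each issue found, output JSON:
{\"category\": \"factual|logical|missing|overconfidence|hallucinated_source\", \"quote\": \"...\", \"issue\": \"...\", \"confidence\": 0-100}
If no issues found, output: {\"issues\": []}
TEXT:
---
${cloud_output}
---"

for model in "${models[@]}"; do
  # One non-streaming /api/generate call per model; Ollama returns the text in .response.
  jq -n --arg model "$model" --arg prompt "$prompt" \
        '{model: $model, prompt: $prompt, stream: false}' |
    curl -s http://localhost:11434/api/generate -d @- |
    jq -r '.response' > "$output_dir/${model//:/_}.json"
done
```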
| Category | Description | Example |
|---|---|---|
| factual | Wrong numbers, dates, names | "Bitcoin launched in 2010" |
| logical | Non-sequiturs, unsupported conclusions | "X is rising, therefore Y will fall" |
| missing | Important context omitted | Ignoring a major counterargument |
| overconfidence | Certainty without justification | "This will definitely happen" on a 55% event |
| hallucinated_source | Citing nonexistent sources | "According to a 2024 Reuters report..." |
1. Post analysis to #the-deep (or #swarm-lab)
2. Drift, Pip, and Lume respond with independent critiques
3. Celeste synthesizes: deduplicates flags, weights by model confidence
4. If consensus (≥2 models agree) → flag is high-confidence (see the sketch after this list)
5. Final output posted with recommendation: publish | revise | flag_for_human
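A toy version of the consensus rule in steps 3-4, assuming each per-model critique file parses as the `{"issues": [...]}` object from the prompt format above. The real Celeste synthesis also weights by model confidence, which this sketch omits.

```bash
# Toy consensus pass over the per-model critique files, using the quoted
# passage as a crude dedup key. Assumes each file parses as {"issues": [...]}.
jq -s '
  [ .[] | .issues[]? ]              # pool every flag from every model
  | group_by(.quote)                # merge flags that quote the same passage
  | map({
      quote: .[0].quote,
      category: .[0].category,
      models: length,
      high_confidence: (length >= 2)   # the >=2-model consensus threshold
    })
' critiques/*.json
```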
| Outcome | TPR | FPR | Decision |
|---|---|---|---|
| Strong pass | ≥50% | <30% | Ship as default layer |
| Pass | ≥30% | <50% | Ship as opt-in layer |
| Marginal | 20–30% | 50–70% | Iterate on prompts, retest |
| Fail | <20% | >70% | Abandon approach |
- Flag = true positive if it identifies a real error (even if explanation is imperfect)
- Flag = false positive if flagged content is actually correct
- Duplicate flags across models count once for TPR but inform consensus metrics
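As a worked example of the scoring, a sketch assuming flags have been hand-labeled with a `label` field of `tp` or `fp` in a `labeled-flags.json` array; the file, the field, and the seeded-error count are all hypothetical. FPR here means the share of emitted flags that were wrong, matching the usage in the decision table above.

```bash
# Hypothetical scoring pass: flags hand-labeled "tp"/"fp" in labeled-flags.json
# (both the file and the label field are illustrative, not part of the skill).
seeded_errors=20   # number of real errors planted in the test corpus
jq -r --argjson total "$seeded_errors" '
  (map(select(.label == "tp")) | length) as $tp
  | (map(select(.label == "fp")) | length) as $fp
  | ($tp + $fp) as $flags
  | "TPR: \($tp / $total * 100)%  FPR: \(if $flags == 0 then 0 else $fp / $flags * 100 end)%"
' labeled-flags.json
```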
- Ollama running locally with models pulled: `mistral:7b`, `tinyllama:1.1b`, `llama3.1:8b`
- `jq` and `curl` installed
- Results stored in `experiments/peer-review-results/`
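The model prerequisite can be satisfied with Ollama's standard CLI:

```bash
# Pull the three review models (one-time download).
ollama pull mistral:7b
ollama pull tinyllama:1.1b
ollama pull llama3.1:8b

# Sanity-check that the server is up and the models are available.
ollama list
```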
When peer review passes validation:
- Package as a Reef API endpoint: `POST /review`
- Agents call it before publishing any analysis
- Configurable: model selection, consensus threshold, categories
- Log all reviews to #reef-logs with TPR tracking
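Since the Reef endpoint is planned rather than built, the following is a purely illustrative shape for the call; the host and every field name are assumptions drawn from the configurability list above.

```bash
# Illustrative only: POST /review does not exist yet.
# Host and field names (text, models, consensus_threshold, categories) are assumptions.
curl -s -X POST http://reef.local/review \
  -H 'Content-Type: application/json' \
  -d '{
        "text": "Bitcoin launched in 2010 and will definitely double by June.",
        "models": ["mistral:7b", "llama3.1:8b"],
        "consensus_threshold": 2,
        "categories": ["factual", "overconfidence"]
      }'
```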