Tencent SkillHub · Developer Tools

Eval Skills

AI Agent Skill unit testing framework. A framework-agnostic toolkit for discovering, scaffolding, selecting, evaluating, and reporting on AI skills. Use this...



Item is unstable.

This item is timing out or returning errors right now. Review the source page and try again later.

Quick setup
  1. Wait for the source to recover or retry later.
  2. Review SKILL.md only after the source returns a real package.
  3. Do not rely on this source for automated install yet.

Requirements

  • Target platform: OpenClaw
  • Install method: Manual import
  • Extraction: Extract archive
  • Prerequisites: OpenClaw
  • Primary doc: SKILL.md

Package facts

  • Download mode: Manual review
  • Package format: ZIP package
  • Source platform: Tencent SkillHub
  • What's included: CHANGELOG.md, README.md, SKILL.md, benchmarks/coding-easy/benchmark.json, benchmarks/gaia-v1/benchmark.json, benchmarks/skill-quality/benchmark.json

Validation

  • Wait for the source to recover or retry later.
  • Review SKILL.md only after the download returns a real package.
  • Treat this source as transient until the upstream errors clear.

Install with your agent

Agent handoff

Because the item is currently unstable or timing out, use the source page and any available docs to guide the install.

  1. Open the source page via Review source status.
  2. If you can obtain the package, extract it into a folder your agent can access.
  3. Paste one of the prompts below and point your agent at the source page and extracted files.
New install

I tried to install a skill package from Yavira, but the item is currently unstable or timing out. Inspect the source page and any extracted docs, then tell me what you can confirm and any manual steps still required. Then review README.md for any prerequisites, environment setup, or post-install checks.

Upgrade existing

I tried to upgrade a skill package from Yavira, but the item is currently unstable or timing out. Compare the source page and any extracted docs with my current installation, then summarize what changed and what manual follow-up I still need. Then review README.md for any prerequisites, environment setup, or post-install checks.

Trust & source

Release facts

  • Source: Tencent SkillHub
  • Verification: Indexed source record
  • Version: 0.1.1

Documentation

Primary doc: SKILL.md (23 sections).

eval-skills

AI Agent Skill unit testing framework — a framework-agnostic toolkit for discovering, scaffolding, selecting, evaluating, and reporting on AI skills. This skill fills the L1 (Skill Unit Test) gap that LangSmith / DeepEval leave open: while those platforms focus on agent-level and trajectory-level evaluation (L2-L3), eval-skills targets the individual skill level, ensuring each building block meets quality standards before it ever enters an agent pipeline.

When to Use This Skill

  • Before deploying a new skill to production: run `eval` to verify it meets your quality gate.
  • When choosing between multiple candidate skills: run `select` to rank them on the same benchmark.
  • When a skill is upgraded: run `report diff` to detect regressions.
  • In CI/CD: use `--exit-on-fail` to block merges that degrade skill quality.
  • When bootstrapping a new skill: run `create` to generate a ready-to-fill skeleton.

1. Find Skills

Search for existing skills by keyword, tag, or adapter type.

```bash
eval-skills find \
  --query "web search" \
  --tag retrieval api \
  --adapter http \
  --min-completion 0.8 \
  --skills-dir ./skills \
  --limit 10
```

| Option | Description | Default |
| --- | --- | --- |
| `-q, --query <string>` | Keyword search (matches name, description, tags) | — |
| `-t, --tag <tags...>` | Filter by tags (intersection: skill must have ALL specified tags) | — |
| `-a, --adapter <type>` | Filter by adapter type (`http`, `subprocess`, `mcp`) | — |
| `--min-completion <rate>` | Minimum historical completion rate (0.0 ~ 1.0) | — |
| `--skills-dir <dir>` | Directory to scan for `skill.json` files | `./skills` |
| `--limit <n>` | Maximum number of results | `20` |

Results are ranked by search relevance (when `--query` is provided) or by historical completion rate (descending).

2. Create Skills

Generate a skill skeleton from a template to bootstrap development.

```bash
eval-skills create \
  --name my_api_skill \
  --from-template http_request \
  --output-dir ./skills \
  --description "Fetches weather data from OpenWeather API"
```

| Option | Description | Default |
| --- | --- | --- |
| `--name <name>` | Required. Skill name | — |
| `--from-template <tpl>` | Template type: `http_request`, `python_script`, `mcp_tool` | `http_request` |
| `--output-dir <dir>` | Output directory | `./skills` |
| `--description <text>` | Human-readable description embedded in `skill.json` | — |

Generated file structure:

```
skills/my_api_skill/
  skill.json            # Skill metadata (id, schemas, adapter config)
  adapter.config.json   # Adapter-specific configuration
  tests/
    basic.eval.json     # A starter benchmark with one sample task
  skill.py              # (python_script template only) JSON-RPC entrypoint
```

3. Evaluate Skills

Run benchmark evaluations against one or more skills. This is the core command.

```bash
eval-skills eval \
  --skills ./skills/calculator/skill.json ./skills/search/ \
  --benchmark coding-easy \
  --concurrency 4 \
  --timeout 30000 \
  --retries 2 \
  --runs 3 \
  --evaluator exact \
  --format json markdown html \
  --output-dir ./reports \
  --exit-on-fail --min-completion 0.8 \
  --store ./eval-skills.db
```

| Option | Description | Default |
| --- | --- | --- |
| `--skills <paths...>` | Required. Skill file(s) or directory(ies) | — |
| `--benchmark <id\|path>` | Built-in benchmark ID or path to `benchmark.json` | `coding-easy` |
| `--tasks <file>` | Custom tasks JSON file (replaces the benchmark) | — |
| `--concurrency <n>` | Number of parallel task executions | `4` |
| `--timeout <ms>` | Per-task timeout in milliseconds | `30000` |
| `--retries <n>` | Retry count on task failure (with incremental backoff) | `0` |
| `--runs <n>` | Repeat the evaluation N times for consistency scoring | `1` |
| `--evaluator <type>` | Default scorer type (see Scorer Types below) | `exact` |
| `--format <formats...>` | Output formats: json, markdown, html | `json markdown` |
| `--output-dir <dir>` | Report output directory | `./reports` |
| `--exit-on-fail` | Exit with code 1 if any skill falls below the threshold | disabled |
| `--min-completion <rate>` | Threshold for `--exit-on-fail` | `0.7` |
| `--dry-run` | Validate configuration only; do not execute tasks | disabled |
| `--benchmarks-dir <dir>` | Directory containing built-in benchmarks | `./benchmarks` |
| `--store <path>` | SQLite database path for persistent result storage | `./eval-skills.db` |
| `-c, --config <path>` | Path to `eval-skills.config.yaml` | auto-detected |

Evaluation flow:

  1. Load skills from the `--skills` paths (supports both a single `skill.json` and directories).
  2. Load benchmark tasks from `--benchmark` or `--tasks`.
  3. Build the cartesian product: skills x tasks x runs.
  4. Execute all task items concurrently (controlled by `--concurrency`, with timeout and retry).
  5. Score each result using the appropriate scorer.
  6. Aggregate results into a SkillCompletionReport per skill.
  7. Write reports to `--output-dir`.
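As a rough sketch of the cartesian-product and concurrency steps of the flow above (the runner and result shapes here are hypothetical stand-ins, not the tool's real internals):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical stand-ins for loaded skills and benchmark tasks.
skills = ["calculator", "search"]
tasks = ["task_001", "task_002", "task_003"]
runs = 2

def execute(item):
    skill, task, run = item
    # A real runner would invoke the skill's adapter here and score the output.
    return {"skill": skill, "task": task, "run": run, "score": 1.0}

# Cartesian product: skills x tasks x runs, executed with bounded concurrency.
items = list(product(skills, tasks, range(runs)))
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(execute, items))

print(len(results))  # 2 skills * 3 tasks * 2 runs = 12 task items
```

Note that the total work grows multiplicatively with `--runs`, which is why `--concurrency` and `--timeout` matter for large benchmarks.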

4. Select Skills

Filter and rank skills based on evaluation reports using a multi-dimensional strategy.

```bash
eval-skills select \
  --from ./skills \
  --reports ./reports/eval-result.json \
  --strategy ./strategy.yaml \
  --min-completion 0.8 \
  --top-k 5 \
  --output ./selected.json
```

| Option | Description | Default |
| --- | --- | --- |
| `--from <path>` | Required. Candidate skills directory or JSON file | — |
| `--reports <file>` | Evaluation reports JSON file | — |
| `--strategy <file>` | SelectStrategy YAML/JSON file | built-in default |
| `--min-completion <rate>` | Override the minimum completion rate filter | — |
| `--top-k <n>` | Return only the top K results | all |
| `--output <file>` | Write selected skills to a file | stdout |

Selection pipeline: Filter (by completion rate, error rate, latency, adapter type, required tags) -> Score -> Rank (by compositeScore, completionRate, latency, or tokenCost) -> TopK.

Example strategy.yaml:

```yaml
filters:
  minCompletionRate: 0.8
  maxErrorRate: 0.1
  maxLatencyP95Ms: 5000
  adapterTypes: [http, subprocess]
  requiredTags: [production-ready]
sortBy: compositeScore
order: desc
topK: 5
```
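The Filter -> Rank -> TopK stages can be sketched in a few lines of Python; the report records below are invented examples, and the real tool also runs a scoring stage between filtering and ranking:

```python
# Hypothetical per-skill report records; field names mirror the strategy file.
reports = [
    {"skill": "weather_v1", "completionRate": 0.90, "errorRate": 0.05, "latencyP95Ms": 1200},
    {"skill": "weather_v2", "completionRate": 0.70, "errorRate": 0.02, "latencyP95Ms": 800},
    {"skill": "weather_v3", "completionRate": 0.95, "errorRate": 0.15, "latencyP95Ms": 400},
]

# Filter (minCompletionRate: 0.8, maxErrorRate: 0.1) -> Rank (completionRate desc) -> TopK.
candidates = [r for r in reports if r["completionRate"] >= 0.8 and r["errorRate"] <= 0.1]
ranked = sorted(candidates, key=lambda r: r["completionRate"], reverse=True)
top_k = ranked[:5]

print([r["skill"] for r in top_k])  # only weather_v1 survives both filters
```

Here `weather_v2` fails the completion-rate filter and `weather_v3` fails the error-rate filter, so a high raw score alone is not enough to be selected.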

5. Run Pipeline

Execute the full end-to-end pipeline (Find -> Eval -> Select -> Report) in a single command.

```bash
eval-skills run \
  --query "math" \
  --benchmark coding-easy \
  --skills-dir ./skills \
  --top-k 3 \
  --min-completion 0.7 \
  --format json markdown \
  --output-dir ./reports
```

This command automates the entire process:

  1. Find: scans `--skills-dir` and optionally filters candidates by `--query`.
  2. Eval: evaluates all candidate skills against `--benchmark`.
  3. Select: filters and ranks results using `--min-completion`, `--top-k`, and an optional `--strategy`.
  4. Report: generates output files in every format requested via `--format`.

6. Generate & Compare Reports

Convert report format:

```bash
eval-skills report convert \
  --input ./reports/eval-result.json \
  --format html \
  --output ./reports/eval-result.html
```

Supported output formats: markdown, html.

Diff two reports (regression detection):

```bash
eval-skills report diff \
  ./reports/v1.json ./reports/v2.json \
  --label-a "v1.0" --label-b "v2.0" \
  --output ./reports/diff.md
```

Generates a side-by-side delta table per skill showing changes in completion rate, error rate, P95 latency, and composite score, with directional arrows.

7. Initialize Project

```bash
eval-skills init --dir .
```

Creates the project scaffold:

  • `eval-skills.config.yaml`: global configuration
  • `skills/`: directory for skill definitions
  • `benchmarks/`: directory for benchmark files
  • `reports/`: directory for evaluation output

8. Manage Configuration

```bash
# List all current configuration values
eval-skills config list

# Get a specific value (supports dot notation)
eval-skills config get llm.model

# Set a value (persisted to ~/.eval-skills/config.yaml)
eval-skills config set concurrency 8
eval-skills config set llm.model gpt-4o
eval-skills config set llm.temperature 0
```

Configuration is resolved in priority order:

  1. CLI flags (highest priority)
  2. `eval-skills.config.yaml` in the current directory
  3. `~/.eval-skills/config.yaml`
  4. Built-in defaults (`concurrency: 4`, `timeoutMs: 30000`, `outputDir: ./reports`)

Scorer Types

Each task in a benchmark specifies an evaluator type. The scorer compares the skill's actual output against the expected output.

| Type | Aliases | Description | Score Range |
| --- | --- | --- | --- |
| `exact_match` | `exact` | Strict equality comparison. Supports a `caseSensitive` option. | 0 or 1 |
| `contains` | — | Checks for the presence of all specified keywords in the output. Partial credit: matched_keywords / total_keywords. | 0.0 ~ 1.0 |
| `json_schema` | `schema` | Validates output against a JSON Schema (using Ajv). | 0 or 1 |
| `llm_judge` | — | Sends the output plus the expected rubric to an LLM (configurable model) for quality rating. | 0.0 ~ 1.0 |
| `custom` | — | Loads a custom scorer from `expectedOutput.customScorerPath`. | 0.0 ~ 1.0 |
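For instance, the partial-credit rule of the `contains` scorer (matched_keywords / total_keywords) can be sketched as follows; this illustrates the formula from the table, not the tool's actual implementation:

```python
def contains_score(output: str, keywords: list[str], case_sensitive: bool = False) -> float:
    """Partial-credit keyword scorer: matched_keywords / total_keywords."""
    if not case_sensitive:
        output = output.lower()
        keywords = [kw.lower() for kw in keywords]
    matched = sum(1 for kw in keywords if kw in output)
    return matched / len(keywords)

# One of two keywords present -> 0.5
print(contains_score("TypeScript compiles to JavaScript", ["JavaScript", "Microsoft"]))
```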

Evaluation Metrics

Every evaluation produces a SkillCompletionReport with these metrics:

| Metric | Description | Formula |
| --- | --- | --- |
| Completion Rate | Fraction of tasks that passed | pass_count / total_count |
| Partial Score | Mean score across all tasks | mean(task_scores) |
| Error Rate | Fraction of tasks that errored or timed out | (error_count + timeout_count) / total_count |
| Consistency Score | Stability across multiple runs (requires `--runs >= 2`) | 1 - stddev(per_run_completion_rates) |
| P50 / P95 / P99 Latency | Response time percentiles | Sorted percentile of latencyMs |
| Composite Score | Weighted overall quality score | 0.5 * CR + 0.2 * (1 - latP95_norm) + 0.3 * (1 - ER) |
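To make the formulas concrete, here is a small worked example; the per-run completion rates, error rate, and normalized P95 latency are invented inputs, and how the tool normalizes latency into [0, 1] is not specified here:

```python
from statistics import mean, pstdev

# Invented inputs for illustration.
per_run_completion_rates = [0.8, 0.9, 0.85]  # one completion rate per --runs repetition
error_rate = 0.1                             # (error_count + timeout_count) / total_count
lat_p95_norm = 0.2                           # P95 latency normalized into [0, 1]

completion_rate = mean(per_run_completion_rates)
consistency_score = 1 - pstdev(per_run_completion_rates)
composite = 0.5 * completion_rate + 0.2 * (1 - lat_p95_norm) + 0.3 * (1 - error_rate)

print(round(composite, 3))  # 0.5*0.85 + 0.2*0.8 + 0.3*0.9 = 0.855
```

The consistency score rewards skills whose completion rate is stable across runs: identical runs give stddev 0 and a perfect score of 1.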

Built-in Benchmarks

| ID | Domain | Tasks | Scoring | Description |
| --- | --- | --- | --- | --- |
| `coding-easy` | coding | 20 | mean / exact_match | Math expressions, string reversal, palindrome detection |
| `skill-quality` | tool-use | 5 | mean / contains | Metadata completeness, description quality, structure checks |
| `web-search-basic` | web | 8 | mean / contains + schema | Factual queries, keyword verification, structured output validation |
| `gaia-v1` | general | — | mean | Placeholder for GAIA benchmark Level 1 tasks |
| `toolbench-lite` | tool-use | — | mean | Placeholder for ToolBench single-tool scenarios |

Custom Benchmark

Create a benchmark.json file:

```json
{
  "id": "my-benchmark",
  "name": "My Custom Benchmark",
  "version": "1.0.0",
  "domain": "general",
  "scoringMethod": "mean",
  "maxLatencyMs": 30000,
  "metadata": { "source": "internal", "lastUpdated": "2026-02-28" },
  "tasks": [
    {
      "id": "task_001",
      "description": "Test basic addition",
      "inputData": { "expression": "2+3" },
      "expectedOutput": { "type": "exact", "value": "5" },
      "evaluator": { "type": "exact" },
      "timeoutMs": 10000,
      "tags": ["math"]
    },
    {
      "id": "task_002",
      "description": "Test keyword presence",
      "inputData": { "query": "TypeScript" },
      "expectedOutput": { "type": "contains", "keywords": ["JavaScript", "Microsoft"] },
      "evaluator": { "type": "contains", "caseSensitive": false },
      "timeoutMs": 15000,
      "tags": ["search"]
    }
  ]
}
```

Then run it:

```bash
eval-skills eval --skills ./my-skill/ --benchmark ./my-benchmark.json
```

Adapter Types

Skills communicate through adapters. The adapter type is specified in `skill.json` via `adapterType`.

| Adapter | Protocol | How it works | Key config |
| --- | --- | --- | --- |
| `http` | REST POST | Sends `POST { skillId, version, input }` to `skill.entrypoint`. Supports Bearer / API-Key auth via env vars. | `baseUrl`, `authType`, `authTokenEnvKey` |
| `subprocess` | JSON-RPC 2.0 over stdin/stdout | Spawns `skill.entrypoint` (e.g. `python3 skill.py`), writes a JSON-RPC request to stdin, reads the response from stdout. | `command`, `args` |
| `mcp` | MCP Protocol | (Phase 2) Native Model Context Protocol integration via `@modelcontextprotocol/sdk`. | — |
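As a rough sketch of a subprocess-adapter entrypoint (the exact JSON-RPC params/result shape is an assumption here; confirm the real contract in SKILL.md), a skill.py reads one JSON-RPC 2.0 request from stdin and writes the response to stdout:

```python
import json
import sys

def handle(request: dict) -> dict:
    """Answer one JSON-RPC 2.0 request (hypothetical params/result shape)."""
    expression = request["params"]["input"].get("expression", "")
    # Toy logic standing in for the skill's real work; never eval untrusted input.
    value = eval(expression, {"__builtins__": {}})
    return {"jsonrpc": "2.0", "id": request["id"], "result": {"result": str(value)}}

if __name__ == "__main__":
    line = sys.stdin.readline()
    if line.strip():
        sys.stdout.write(json.dumps(handle(json.loads(line))) + "\n")
```

Keeping the logic in a pure `handle` function makes the skill easy to unit-test without spawning a subprocess.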

Evaluating a Single Skill

```bash
# 1. Create a skill skeleton
eval-skills create --name my_calc --from-template python_script

# 2. Implement your logic in skills/my_calc/skill.py

# 3. Run evaluation against the coding-easy benchmark
eval-skills eval \
  --skills ./skills/my_calc/skill.json \
  --benchmark coding-easy \
  --runs 3 \
  --format json markdown

# 4. Review the report
cat ./reports/eval-result-*.md
```

Comparing Multiple Candidate Skills

```bash
# 1. Discover candidates
eval-skills find --query "weather" --skills-dir ./skills

# 2. Evaluate all candidates on the same benchmark
eval-skills eval \
  --skills ./skills/weather_v1 ./skills/weather_v2 ./skills/weather_v3 \
  --benchmark web-search-basic \
  --runs 3

# 3. Select the best
eval-skills select \
  --from ./skills \
  --reports ./reports/eval-result-*.json \
  --min-completion 0.8 \
  --top-k 2

# 4. Compare two versions
eval-skills report diff \
  ./reports/v1.json ./reports/v2.json \
  --label-a "weather_v1" --label-b "weather_v2"
```

Full Pipeline (One Command)

```bash
eval-skills run \
  --skills-dir ./skills \
  --benchmark coding-easy \
  --top-k 3 \
  --min-completion 0.7 \
  --format json markdown html \
  --output-dir ./reports
```

CI/CD Quality Gate

```bash
# In your CI pipeline: fail the build if the completion rate drops below 80%
eval-skills eval \
  --skills ./skills/production_skill \
  --benchmark coding-easy \
  --exit-on-fail \
  --min-completion 0.8 \
  --format json
```

Regression Detection

```bash
# Compare today's evaluation against the baseline
eval-skills report diff \
  ./reports/baseline.json ./reports/latest.json \
  --label-a "baseline" --label-b "latest" \
  --output ./reports/regression-check.md
```

Best Practices

  • Always use `--runs 3` or more when evaluating for production decisions. Single-run results can be noisy; the consistency score captures stability across runs.
  • Use `--exit-on-fail` in CI/CD pipelines to enforce quality gates. Set `--min-completion` to your acceptable threshold (recommended: 0.8 for production skills).
  • Create domain-specific custom benchmarks rather than relying solely on built-in ones. Your custom benchmark should reflect the real-world inputs your skill will encounter.
  • Use `report diff` after every skill upgrade to catch regressions early, comparing the new evaluation against a saved baseline report.
  • Use `--dry-run` before long evaluations to validate your configuration (skill paths, benchmark resolution, task count) without actually executing tasks.
  • Persist results with `--store` to track skill quality over time. The SQLite store enables historical trend queries.
  • Start with `--concurrency 1` when debugging a failing skill, then increase it for production benchmarking.
  • Tag your benchmark tasks to enable per-category analysis (e.g., filter by `math`, `string`, `edge-case`).

Skill JSON Schema

Every skill must provide a `skill.json` that conforms to this structure:

```json
{
  "id": "my_skill_v1",
  "name": "My Skill",
  "version": "1.0.0",
  "description": "Does something useful",
  "tags": ["utility", "math"],
  "inputSchema": {
    "type": "object",
    "properties": { "query": { "type": "string" } },
    "required": ["query"]
  },
  "outputSchema": {
    "type": "object",
    "properties": { "result": { "type": "string" } }
  },
  "adapterType": "subprocess",
  "entrypoint": "python3 skill.py",
  "metadata": {
    "author": "Your Name",
    "license": "MIT",
    "homepage": "https://github.com/you/my-skill"
  }
}
```

Validation rules:

  • `id`: lowercase alphanumeric with `_` or `-`, non-empty
  • `version`: semver format (X.Y.Z)
  • `adapterType`: one of `http`, `subprocess`, `mcp`, `langchain`, `custom`
  • `entrypoint`: non-empty string (a URL for `http`, a command for `subprocess`)

Global Options

These options are available on all commands:

| Option | Description |
| --- | --- |
| `-c, --config <path>` | Path to configuration file |
| `--json` | JSON output format (CI-friendly) |
| `--no-color` | Disable colored output |
| `-v, --verbose` | Verbose logging |
| `--version` | Show version |
| `-h, --help` | Show help |

Category context

Code helpers, APIs, CLIs, browser automation, testing, and developer operations.

Source: Tencent SkillHub

Largest current source with strong distribution and engagement signals.

Package contents

Included in package
3 docs · 3 config files
  • SKILL.md Primary doc
  • CHANGELOG.md Docs
  • README.md Docs
  • benchmarks/coding-easy/benchmark.json Config
  • benchmarks/gaia-v1/benchmark.json Config
  • benchmarks/skill-quality/benchmark.json Config