{
  "schemaVersion": "1.0",
  "item": {
    "slug": "eval-skills",
    "name": "Eval Skills",
    "source": "tencent",
    "type": "skill",
    "category": "开发工具",
    "sourceUrl": "https://clawhub.ai/isLinXu/eval-skills",
    "canonicalUrl": "https://clawhub.ai/isLinXu/eval-skills",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadMode": "redirect",
    "downloadUrl": "/downloads/eval-skills",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=eval-skills",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "installMethod": "Manual import",
    "extraction": "Extract archive",
    "prerequisites": [
      "OpenClaw"
    ],
    "packageFormat": "ZIP package",
    "includedAssets": [
      "CHANGELOG.md",
      "README.md",
      "SKILL.md",
      "benchmarks/coding-easy/benchmark.json",
      "benchmarks/gaia-v1/benchmark.json",
      "benchmarks/skill-quality/benchmark.json"
    ],
    "primaryDoc": "SKILL.md",
    "quickSetup": [
      "Download the package from Yavira.",
      "Extract the archive and review SKILL.md first.",
      "Import or place the package into your OpenClaw setup."
    ],
    "agentAssist": {
      "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
      "steps": [
        "Download the package from Yavira.",
        "Extract it into a folder your agent can access.",
        "Paste one of the prompts below and point your agent at the extracted folder."
      ],
      "prompts": [
        {
          "label": "New install",
          "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete."
        },
        {
          "label": "Upgrade existing",
          "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run."
        }
      ]
    },
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-04-30T16:55:25.780Z",
      "expiresAt": "2026-05-07T16:55:25.780Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
        "contentDisposition": "attachment; filename=\"network-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/eval-skills"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    },
    "downloadPageUrl": "https://openagent3.xyz/downloads/eval-skills",
    "agentPageUrl": "https://openagent3.xyz/skills/eval-skills/agent",
    "manifestUrl": "https://openagent3.xyz/skills/eval-skills/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/eval-skills/agent.md"
  },
  "agentAssist": {
    "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
    "steps": [
      "Download the package from Yavira.",
      "Extract it into a folder your agent can access.",
      "Paste one of the prompts below and point your agent at the extracted folder."
    ],
    "prompts": [
      {
        "label": "New install",
        "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete."
      },
      {
        "label": "Upgrade existing",
        "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run."
      }
    ]
  },
  "documentation": {
    "source": "clawhub",
    "primaryDoc": "SKILL.md",
    "sections": [
      {
        "title": "eval-skills",
        "body": "AI Agent Skill unit testing framework — a framework-agnostic toolkit for discovering, scaffolding, selecting, evaluating, and reporting on AI skills.\n\nThis skill fills the L1 (Skill Unit Test) gap that LangSmith / DeepEval leave open: while those platforms focus on agent-level and trajectory-level evaluation (L2-L3), eval-skills targets the individual skill level, ensuring each building block meets quality standards before it ever enters an agent pipeline."
      },
      {
        "title": "When to Use This Skill",
        "body": "Before deploying a new skill to production — run eval to verify it meets your quality gate.\nWhen choosing between multiple candidate skills — run select to rank them on the same benchmark.\nWhen a skill is upgraded — run report diff to detect regressions.\nIn CI/CD — use --exit-on-fail to block merges that degrade skill quality.\nWhen bootstrapping a new skill — run create to generate a ready-to-fill skeleton."
      },
      {
        "title": "1. Find Skills",
        "body": "Search for existing skills by keyword, tag, or adapter type.\n\neval-skills find \\\n  --query \"web search\" \\\n  --tag retrieval api \\\n  --adapter http \\\n  --min-completion 0.8 \\\n  --skills-dir ./skills \\\n  --limit 10\n\nOptionDescriptionDefault-q, --query <string>Keyword search (matches name, description, tags)—-t, --tag <tags...>Filter by tags (intersection: skill must have ALL specified tags)—-a, --adapter <type>Filter by adapter type (http, subprocess, mcp)—--min-completion <rate>Minimum historical completion rate (0.0 ~ 1.0)—--skills-dir <dir>Directory to scan for skill.json files./skills--limit <n>Maximum number of results20\n\nResults are ranked by search relevance (when --query is provided) or by historical completion rate (descending)."
      },
      {
        "title": "2. Create Skills",
        "body": "Generate a skill skeleton from a template to bootstrap development.\n\neval-skills create \\\n  --name my_api_skill \\\n  --from-template http_request \\\n  --output-dir ./skills \\\n  --description \"Fetches weather data from OpenWeather API\"\n\nOptionDescriptionDefault--name <name>Required. Skill name—--from-template <tpl>Template type: http_request, python_script, mcp_toolhttp_request--output-dir <dir>Output directory./skills--description <text>Human-readable description embedded in skill.json—\n\nGenerated file structure:\n\nskills/my_api_skill/\n  skill.json            # Skill metadata (id, schemas, adapter config)\n  adapter.config.json   # Adapter-specific configuration\n  tests/\n    basic.eval.json     # A starter benchmark with one sample task\n  skill.py              # (python_script template only) JSON-RPC entrypoint"
      },
      {
        "title": "3. Evaluate Skills",
        "body": "Run benchmark evaluations against one or more skills. This is the core command.\n\neval-skills eval \\\n  --skills ./skills/calculator/skill.json ./skills/search/ \\\n  --benchmark coding-easy \\\n  --concurrency 4 \\\n  --timeout 30000 \\\n  --retries 2 \\\n  --runs 3 \\\n  --evaluator exact \\\n  --format json markdown html \\\n  --output-dir ./reports \\\n  --exit-on-fail --min-completion 0.8 \\\n  --store ./eval-skills.db\n\nOptionDescriptionDefault--skills <paths...>Required. Skill file(s) or directory(ies)—--benchmark <id|path>Built-in benchmark ID or path to benchmark.jsoncoding-easy--tasks <file>Custom tasks JSON file (replaces benchmark)—--concurrency <n>Number of parallel task executions4--timeout <ms>Per-task timeout in milliseconds30000--retries <n>Retry count on task failure (with incremental backoff)0--runs <n>Repeat evaluation N times for consistency scoring1--evaluator <type>Default scorer type (see Scorer Types below)exact--format <formats...>Output formats: json, markdown, htmljson markdown--output-dir <dir>Report output directory./reports--exit-on-failExit with code 1 if any skill falls below thresholddisabled--min-completion <rate>Threshold for --exit-on-fail0.7--dry-runValidate configuration only; do not execute tasksdisabled--benchmarks-dir <dir>Directory containing built-in benchmarks./benchmarks--store <path>SQLite database path for persistent result storage./eval-skills.db-c, --config <path>Path to eval-skills.config.yamlauto-detected\n\nEvaluation flow:\n\nLoad skills from --skills paths (supports both single skill.json and directories)\nLoad benchmark tasks from --benchmark or --tasks\nBuild the cartesian product: skills x tasks x runs\nExecute all task items concurrently (controlled by --concurrency, with timeout and retry)\nScore each result using the appropriate scorer\nAggregate into SkillCompletionReport per skill\nWrite reports to --output-dir"
      },
      {
        "title": "4. Select Skills",
        "body": "Filter and rank skills based on evaluation reports using a multi-dimensional strategy.\n\neval-skills select \\\n  --from ./skills \\\n  --reports ./reports/eval-result.json \\\n  --strategy ./strategy.yaml \\\n  --min-completion 0.8 \\\n  --top-k 5 \\\n  --output ./selected.json\n\nOptionDescriptionDefault--from <path>Required. Candidate skills directory or JSON file—--reports <file>Evaluation reports JSON file—--strategy <file>SelectStrategy YAML/JSON filebuilt-in default--min-completion <rate>Override minimum completion rate filter—--top-k <n>Return only the top K resultsall--output <file>Write selected skills to filestdout\n\nSelection pipeline: Filter (by completion rate, error rate, latency, adapter type, required tags) -> Score -> Rank (by compositeScore, completionRate, latency, or tokenCost) -> TopK\n\nExample strategy.yaml:\n\nfilters:\n  minCompletionRate: 0.8\n  maxErrorRate: 0.1\n  maxLatencyP95Ms: 5000\n  adapterTypes: [http, subprocess]\n  requiredTags: [production-ready]\nsortBy: compositeScore\norder: desc\ntopK: 5"
      },
      {
        "title": "5. Run Pipeline",
        "body": "Execute the full end-to-end pipeline: Find -> Eval -> Select -> Report in a single command.\n\neval-skills run \\\n  --query \"math\" \\\n  --benchmark coding-easy \\\n  --skills-dir ./skills \\\n  --top-k 3 \\\n  --min-completion 0.7 \\\n  --format json markdown \\\n  --output-dir ./reports\n\nThis command automates the entire process:\n\nFind — scans --skills-dir and optionally filters by --query\nEval — evaluates all candidate skills against --benchmark\nSelect — filters and ranks results using --min-completion, --top-k, and optional --strategy\nReport — generates output files in all requested --formats"
      },
      {
        "title": "6. Generate & Compare Reports",
        "body": "Convert report format\n\neval-skills report convert \\\n  --input ./reports/eval-result.json \\\n  --format html \\\n  --output ./reports/eval-result.html\n\nSupported output formats: markdown, html.\n\nDiff two reports (regression detection)\n\neval-skills report diff \\\n  ./reports/v1.json ./reports/v2.json \\\n  --label-a \"v1.0\" --label-b \"v2.0\" \\\n  --output ./reports/diff.md\n\nGenerates a side-by-side delta table per skill showing changes in completion rate, error rate, P95 latency, and composite score with directional arrows."
      },
      {
        "title": "7. Initialize Project",
        "body": "eval-skills init --dir .\n\nCreates the project scaffold:\n\neval-skills.config.yaml — global configuration\nskills/ — directory for skill definitions\nbenchmarks/ — directory for benchmark files\nreports/ — directory for evaluation output"
      },
      {
        "title": "8. Manage Configuration",
        "body": "# List all current configuration values\neval-skills config list\n\n# Get a specific value (supports dot notation)\neval-skills config get llm.model\n\n# Set a value (persisted to ~/.eval-skills/config.yaml)\neval-skills config set concurrency 8\neval-skills config set llm.model gpt-4o\neval-skills config set llm.temperature 0\n\nConfiguration is resolved in priority order:\n\nCLI flags (highest priority)\neval-skills.config.yaml in current directory\n~/.eval-skills/config.yaml\nBuilt-in defaults (concurrency: 4, timeoutMs: 30000, outputDir: ./reports)"
      },
      {
        "title": "Scorer Types",
        "body": "Each task in a benchmark specifies an evaluator type. The scorer compares the skill's actual output against the expected output.\n\nTypeAliasesDescriptionScore Rangeexact_matchexactStrict equality comparison. Supports caseSensitive option.0 or 1contains—Checks for the presence of all specified keywords in the output. Partial credit: matched_keywords / total_keywords.0.0 ~ 1.0json_schemaschemaValidates output against a JSON Schema (using Ajv).0 or 1llm_judge—Sends the output + expected rubric to an LLM (configurable model) for quality rating.0.0 ~ 1.0custom—Loads a custom scorer from expectedOutput.customScorerPath.0.0 ~ 1.0"
      },
      {
        "title": "Evaluation Metrics",
        "body": "Every evaluation produces a SkillCompletionReport with these metrics:\n\nMetricDescriptionFormulaCompletion RateFraction of tasks that passedpass_count / total_countPartial ScoreMean score across all tasksmean(task_scores)Error RateFraction of tasks that errored or timed out(error_count + timeout_count) / total_countConsistency ScoreStability across multiple runs (requires --runs >= 2)1 - stddev(per_run_completion_rates)P50 / P95 / P99 LatencyResponse time percentilesSorted percentile of latencyMsComposite ScoreWeighted overall quality score0.5 * CR + 0.2 * (1 - latP95_norm) + 0.3 * (1 - ER)"
      },
      {
        "title": "Built-in Benchmarks",
        "body": "IDDomainTasksScoringDescriptioncoding-easycoding20mean / exact_matchMath expressions, string reversal, palindrome detectionskill-qualitytool-use5mean / containsMetadata completeness, description quality, structure checksweb-search-basicweb8mean / contains + schemaFactual queries, keyword verification, structured output validationgaia-v1general—meanPlaceholder for GAIA benchmark Level 1 taskstoolbench-litetool-use—meanPlaceholder for ToolBench single-tool scenarios"
      },
      {
        "title": "Custom Benchmark",
        "body": "Create a benchmark.json file:\n\n{\n  \"id\": \"my-benchmark\",\n  \"name\": \"My Custom Benchmark\",\n  \"version\": \"1.0.0\",\n  \"domain\": \"general\",\n  \"scoringMethod\": \"mean\",\n  \"maxLatencyMs\": 30000,\n  \"metadata\": { \"source\": \"internal\", \"lastUpdated\": \"2026-02-28\" },\n  \"tasks\": [\n    {\n      \"id\": \"task_001\",\n      \"description\": \"Test basic addition\",\n      \"inputData\": { \"expression\": \"2+3\" },\n      \"expectedOutput\": { \"type\": \"exact\", \"value\": \"5\" },\n      \"evaluator\": { \"type\": \"exact\" },\n      \"timeoutMs\": 10000,\n      \"tags\": [\"math\"]\n    },\n    {\n      \"id\": \"task_002\",\n      \"description\": \"Test keyword presence\",\n      \"inputData\": { \"query\": \"TypeScript\" },\n      \"expectedOutput\": { \"type\": \"contains\", \"keywords\": [\"JavaScript\", \"Microsoft\"] },\n      \"evaluator\": { \"type\": \"contains\", \"caseSensitive\": false },\n      \"timeoutMs\": 15000,\n      \"tags\": [\"search\"]\n    }\n  ]\n}\n\neval-skills eval --skills ./my-skill/ --benchmark ./my-benchmark.json"
      },
      {
        "title": "Adapter Types",
        "body": "Skills communicate through adapters. The adapter type is specified in skill.json via adapterType.\n\nAdapterProtocolHow it worksKey confighttpREST POSTSends POST { skillId, version, input } to skill.entrypoint. Supports Bearer / API-Key auth via env vars.baseUrl, authType, authTokenEnvKeysubprocessJSON-RPC 2.0 over stdin/stdoutSpawns skill.entrypoint (e.g. python3 skill.py), writes JSON-RPC request to stdin, reads response from stdout.command, argsmcpMCP Protocol(Phase 2) Native Model Context Protocol integration via @modelcontextprotocol/sdk.—"
      },
      {
        "title": "Evaluating a Single Skill",
        "body": "# 1. Create a skill skeleton\neval-skills create --name my_calc --from-template python_script\n\n# 2. Implement your logic in skills/my_calc/skill.py\n\n# 3. Run evaluation against the coding-easy benchmark\neval-skills eval \\\n  --skills ./skills/my_calc/skill.json \\\n  --benchmark coding-easy \\\n  --runs 3 \\\n  --format json markdown\n\n# 4. Review the report\ncat ./reports/eval-result-*.md"
      },
      {
        "title": "Comparing Multiple Candidate Skills",
        "body": "# 1. Discover candidates\neval-skills find --query \"weather\" --skills-dir ./skills\n\n# 2. Evaluate all candidates on the same benchmark\neval-skills eval \\\n  --skills ./skills/weather_v1 ./skills/weather_v2 ./skills/weather_v3 \\\n  --benchmark web-search-basic \\\n  --runs 3\n\n# 3. Select the best\neval-skills select \\\n  --from ./skills \\\n  --reports ./reports/eval-result-*.json \\\n  --min-completion 0.8 \\\n  --top-k 2\n\n# 4. Compare two versions\neval-skills report diff \\\n  ./reports/v1.json ./reports/v2.json \\\n  --label-a \"weather_v1\" --label-b \"weather_v2\""
      },
      {
        "title": "Full Pipeline (One Command)",
        "body": "eval-skills run \\\n  --skills-dir ./skills \\\n  --benchmark coding-easy \\\n  --top-k 3 \\\n  --min-completion 0.7 \\\n  --format json markdown html \\\n  --output-dir ./reports"
      },
      {
        "title": "CI/CD Quality Gate",
        "body": "# In your CI pipeline — fail the build if completion rate drops below 80%\neval-skills eval \\\n  --skills ./skills/production_skill \\\n  --benchmark coding-easy \\\n  --exit-on-fail \\\n  --min-completion 0.8 \\\n  --format json"
      },
      {
        "title": "Regression Detection",
        "body": "# Compare today's evaluation against the baseline\neval-skills report diff \\\n  ./reports/baseline.json ./reports/latest.json \\\n  --label-a \"baseline\" --label-b \"latest\" \\\n  --output ./reports/regression-check.md"
      },
      {
        "title": "Best Practices",
        "body": "Always use --runs 3 or more when evaluating for production decisions. Single-run results can be noisy; the consistency score captures stability across runs.\n\n\nUse --exit-on-fail in CI/CD pipelines to enforce quality gates. Set --min-completion to your acceptable threshold (recommended: 0.8 for production skills).\n\n\nCreate domain-specific custom benchmarks rather than relying solely on built-in ones. Your custom benchmark should reflect real-world inputs your skill will encounter.\n\n\nUse report diff after every skill upgrade to catch regressions early. Compare the new evaluation against a saved baseline report.\n\n\nUse --dry-run before long evaluations to validate your configuration (skill paths, benchmark resolution, task count) without actually executing tasks.\n\n\nPersist results with --store to track skill quality over time. The SQLite store enables historical trend queries.\n\n\nStart with --concurrency 1 when debugging a failing skill, then increase for production benchmarking.\n\n\nTag your benchmark tasks to enable per-category analysis (e.g., filter by math, string, edge-case)."
      },
      {
        "title": "Skill JSON Schema",
        "body": "Every skill must provide a skill.json that conforms to this structure:\n\n{\n  \"id\": \"my_skill_v1\",\n  \"name\": \"My Skill\",\n  \"version\": \"1.0.0\",\n  \"description\": \"Does something useful\",\n  \"tags\": [\"utility\", \"math\"],\n  \"inputSchema\": {\n    \"type\": \"object\",\n    \"properties\": { \"query\": { \"type\": \"string\" } },\n    \"required\": [\"query\"]\n  },\n  \"outputSchema\": {\n    \"type\": \"object\",\n    \"properties\": { \"result\": { \"type\": \"string\" } }\n  },\n  \"adapterType\": \"subprocess\",\n  \"entrypoint\": \"python3 skill.py\",\n  \"metadata\": {\n    \"author\": \"Your Name\",\n    \"license\": \"MIT\",\n    \"homepage\": \"https://github.com/you/my-skill\"\n  }\n}\n\nValidation rules:\n\nid: lowercase alphanumeric with _ or -, non-empty\nversion: semver format (X.Y.Z)\nadapterType: one of http, subprocess, mcp, langchain, custom\nentrypoint: non-empty string (URL for http, command for subprocess)"
      },
      {
        "title": "Global Options",
        "body": "These options are available on all commands:\n\nOptionDescription-c, --config <path>Path to configuration file--jsonJSON output format (CI-friendly)--no-colorDisable colored output-v, --verboseVerbose logging--versionShow version-h, --helpShow help"
      }
    ],
    "body": "eval-skills\n\nAI Agent Skill unit testing framework — a framework-agnostic toolkit for discovering, scaffolding, selecting, evaluating, and reporting on AI skills.\n\nThis skill fills the L1 (Skill Unit Test) gap that LangSmith / DeepEval leave open: while those platforms focus on agent-level and trajectory-level evaluation (L2-L3), eval-skills targets the individual skill level, ensuring each building block meets quality standards before it ever enters an agent pipeline.\n\nWhen to Use This Skill\nBefore deploying a new skill to production — run eval to verify it meets your quality gate.\nWhen choosing between multiple candidate skills — run select to rank them on the same benchmark.\nWhen a skill is upgraded — run report diff to detect regressions.\nIn CI/CD — use --exit-on-fail to block merges that degrade skill quality.\nWhen bootstrapping a new skill — run create to generate a ready-to-fill skeleton.\nCapabilities\n1. Find Skills\n\nSearch for existing skills by keyword, tag, or adapter type.\n\neval-skills find \\\n  --query \"web search\" \\\n  --tag retrieval api \\\n  --adapter http \\\n  --min-completion 0.8 \\\n  --skills-dir ./skills \\\n  --limit 10\n\nOption\tDescription\tDefault\n-q, --query <string>\tKeyword search (matches name, description, tags)\t—\n-t, --tag <tags...>\tFilter by tags (intersection: skill must have ALL specified tags)\t—\n-a, --adapter <type>\tFilter by adapter type (http, subprocess, mcp)\t—\n--min-completion <rate>\tMinimum historical completion rate (0.0 ~ 1.0)\t—\n--skills-dir <dir>\tDirectory to scan for skill.json files\t./skills\n--limit <n>\tMaximum number of results\t20\n\nResults are ranked by search relevance (when --query is provided) or by historical completion rate (descending).\n\n2. 
Create Skills\n\nGenerate a skill skeleton from a template to bootstrap development.\n\neval-skills create \\\n  --name my_api_skill \\\n  --from-template http_request \\\n  --output-dir ./skills \\\n  --description \"Fetches weather data from OpenWeather API\"\n\nOption\tDescription\tDefault\n--name <name>\tRequired. Skill name\t—\n--from-template <tpl>\tTemplate type: http_request, python_script, mcp_tool\thttp_request\n--output-dir <dir>\tOutput directory\t./skills\n--description <text>\tHuman-readable description embedded in skill.json\t—\n\nGenerated file structure:\n\nskills/my_api_skill/\n  skill.json            # Skill metadata (id, schemas, adapter config)\n  adapter.config.json   # Adapter-specific configuration\n  tests/\n    basic.eval.json     # A starter benchmark with one sample task\n  skill.py              # (python_script template only) JSON-RPC entrypoint\n\n3. Evaluate Skills\n\nRun benchmark evaluations against one or more skills. This is the core command.\n\neval-skills eval \\\n  --skills ./skills/calculator/skill.json ./skills/search/ \\\n  --benchmark coding-easy \\\n  --concurrency 4 \\\n  --timeout 30000 \\\n  --retries 2 \\\n  --runs 3 \\\n  --evaluator exact \\\n  --format json markdown html \\\n  --output-dir ./reports \\\n  --exit-on-fail --min-completion 0.8 \\\n  --store ./eval-skills.db\n\nOption\tDescription\tDefault\n--skills <paths...>\tRequired. 
Skill file(s) or directory(ies)\t—\n--benchmark <id|path>\tBuilt-in benchmark ID or path to benchmark.json\tcoding-easy\n--tasks <file>\tCustom tasks JSON file (replaces benchmark)\t—\n--concurrency <n>\tNumber of parallel task executions\t4\n--timeout <ms>\tPer-task timeout in milliseconds\t30000\n--retries <n>\tRetry count on task failure (with incremental backoff)\t0\n--runs <n>\tRepeat evaluation N times for consistency scoring\t1\n--evaluator <type>\tDefault scorer type (see Scorer Types below)\texact\n--format <formats...>\tOutput formats: json, markdown, html\tjson markdown\n--output-dir <dir>\tReport output directory\t./reports\n--exit-on-fail\tExit with code 1 if any skill falls below threshold\tdisabled\n--min-completion <rate>\tThreshold for --exit-on-fail\t0.7\n--dry-run\tValidate configuration only; do not execute tasks\tdisabled\n--benchmarks-dir <dir>\tDirectory containing built-in benchmarks\t./benchmarks\n--store <path>\tSQLite database path for persistent result storage\t./eval-skills.db\n-c, --config <path>\tPath to eval-skills.config.yaml\tauto-detected\n\nEvaluation flow:\n\nLoad skills from --skills paths (supports both single skill.json and directories)\nLoad benchmark tasks from --benchmark or --tasks\nBuild the cartesian product: skills x tasks x runs\nExecute all task items concurrently (controlled by --concurrency, with timeout and retry)\nScore each result using the appropriate scorer\nAggregate into SkillCompletionReport per skill\nWrite reports to --output-dir\n4. Select Skills\n\nFilter and rank skills based on evaluation reports using a multi-dimensional strategy.\n\neval-skills select \\\n  --from ./skills \\\n  --reports ./reports/eval-result.json \\\n  --strategy ./strategy.yaml \\\n  --min-completion 0.8 \\\n  --top-k 5 \\\n  --output ./selected.json\n\nOption\tDescription\tDefault\n--from <path>\tRequired. 
Candidate skills directory or JSON file\t—\n--reports <file>\tEvaluation reports JSON file\t—\n--strategy <file>\tSelectStrategy YAML/JSON file\tbuilt-in default\n--min-completion <rate>\tOverride minimum completion rate filter\t—\n--top-k <n>\tReturn only the top K results\tall\n--output <file>\tWrite selected skills to file\tstdout\n\nSelection pipeline: Filter (by completion rate, error rate, latency, adapter type, required tags) -> Score -> Rank (by compositeScore, completionRate, latency, or tokenCost) -> TopK\n\nExample strategy.yaml:\n\nfilters:\n  minCompletionRate: 0.8\n  maxErrorRate: 0.1\n  maxLatencyP95Ms: 5000\n  adapterTypes: [http, subprocess]\n  requiredTags: [production-ready]\nsortBy: compositeScore\norder: desc\ntopK: 5\n\n5. Run Pipeline\n\nExecute the full end-to-end pipeline: Find -> Eval -> Select -> Report in a single command.\n\neval-skills run \\\n  --query \"math\" \\\n  --benchmark coding-easy \\\n  --skills-dir ./skills \\\n  --top-k 3 \\\n  --min-completion 0.7 \\\n  --format json markdown \\\n  --output-dir ./reports\n\n\nThis command automates the entire process:\n\nFind — scans --skills-dir and optionally filters by --query\nEval — evaluates all candidate skills against --benchmark\nSelect — filters and ranks results using --min-completion, --top-k, and optional --strategy\nReport — generates output files in all requested --formats\n6. Generate & Compare Reports\nConvert report format\neval-skills report convert \\\n  --input ./reports/eval-result.json \\\n  --format html \\\n  --output ./reports/eval-result.html\n\n\nSupported output formats: markdown, html.\n\nDiff two reports (regression detection)\neval-skills report diff \\\n  ./reports/v1.json ./reports/v2.json \\\n  --label-a \"v1.0\" --label-b \"v2.0\" \\\n  --output ./reports/diff.md\n\n\nGenerates a side-by-side delta table per skill showing changes in completion rate, error rate, P95 latency, and composite score with directional arrows.\n\n7. 
Initialize Project\neval-skills init --dir .\n\n\nCreates the project scaffold:\n\neval-skills.config.yaml — global configuration\nskills/ — directory for skill definitions\nbenchmarks/ — directory for benchmark files\nreports/ — directory for evaluation output\n8. Manage Configuration\n# List all current configuration values\neval-skills config list\n\n# Get a specific value (supports dot notation)\neval-skills config get llm.model\n\n# Set a value (persisted to ~/.eval-skills/config.yaml)\neval-skills config set concurrency 8\neval-skills config set llm.model gpt-4o\neval-skills config set llm.temperature 0\n\n\nConfiguration is resolved in priority order:\n\nCLI flags (highest priority)\neval-skills.config.yaml in current directory\n~/.eval-skills/config.yaml\nBuilt-in defaults (concurrency: 4, timeoutMs: 30000, outputDir: ./reports)\nScorer Types\n\nEach task in a benchmark specifies an evaluator type. The scorer compares the skill's actual output against the expected output.\n\nType\tAliases\tDescription\tScore Range\nexact_match\texact\tStrict equality comparison. Supports caseSensitive option.\t0 or 1\ncontains\t—\tChecks for the presence of all specified keywords in the output. 
Partial credit: matched_keywords / total_keywords.\t0.0 ~ 1.0\njson_schema\tschema\tValidates output against a JSON Schema (using Ajv).\t0 or 1\nllm_judge\t—\tSends the output + expected rubric to an LLM (configurable model) for quality rating.\t0.0 ~ 1.0\ncustom\t—\tLoads a custom scorer from expectedOutput.customScorerPath.\t0.0 ~ 1.0\nEvaluation Metrics\n\nEvery evaluation produces a SkillCompletionReport with these metrics:\n\nMetric\tDescription\tFormula\nCompletion Rate\tFraction of tasks that passed\tpass_count / total_count\nPartial Score\tMean score across all tasks\tmean(task_scores)\nError Rate\tFraction of tasks that errored or timed out\t(error_count + timeout_count) / total_count\nConsistency Score\tStability across multiple runs (requires --runs >= 2)\t1 - stddev(per_run_completion_rates)\nP50 / P95 / P99 Latency\tResponse time percentiles\tSorted percentile of latencyMs\nComposite Score\tWeighted overall quality score\t0.5 * CR + 0.2 * (1 - latP95_norm) + 0.3 * (1 - ER)\nBuilt-in Benchmarks\nID\tDomain\tTasks\tScoring\tDescription\ncoding-easy\tcoding\t20\tmean / exact_match\tMath expressions, string reversal, palindrome detection\nskill-quality\ttool-use\t5\tmean / contains\tMetadata completeness, description quality, structure checks\nweb-search-basic\tweb\t8\tmean / contains + schema\tFactual queries, keyword verification, structured output validation\ngaia-v1\tgeneral\t—\tmean\tPlaceholder for GAIA benchmark Level 1 tasks\ntoolbench-lite\ttool-use\t—\tmean\tPlaceholder for ToolBench single-tool scenarios\nCustom Benchmark\n\nCreate a benchmark.json file:\n\n{\n  \"id\": \"my-benchmark\",\n  \"name\": \"My Custom Benchmark\",\n  \"version\": \"1.0.0\",\n  \"domain\": \"general\",\n  \"scoringMethod\": \"mean\",\n  \"maxLatencyMs\": 30000,\n  \"metadata\": { \"source\": \"internal\", \"lastUpdated\": \"2026-02-28\" },\n  \"tasks\": [\n    {\n      \"id\": \"task_001\",\n      \"description\": \"Test basic addition\",\n      \"inputData\": { 
\"expression\": \"2+3\" },\n      \"expectedOutput\": { \"type\": \"exact\", \"value\": \"5\" },\n      \"evaluator\": { \"type\": \"exact\" },\n      \"timeoutMs\": 10000,\n      \"tags\": [\"math\"]\n    },\n    {\n      \"id\": \"task_002\",\n      \"description\": \"Test keyword presence\",\n      \"inputData\": { \"query\": \"TypeScript\" },\n      \"expectedOutput\": { \"type\": \"contains\", \"keywords\": [\"JavaScript\", \"Microsoft\"] },\n      \"evaluator\": { \"type\": \"contains\", \"caseSensitive\": false },\n      \"timeoutMs\": 15000,\n      \"tags\": [\"search\"]\n    }\n  ]\n}\n\neval-skills eval --skills ./my-skill/ --benchmark ./my-benchmark.json\n\nAdapter Types\n\nSkills communicate through adapters. The adapter type is specified in skill.json via adapterType.\n\n| Adapter | Protocol | How it works | Key config |\n| --- | --- | --- | --- |\n| http | REST POST | Sends POST { skillId, version, input } to skill.entrypoint. Supports Bearer / API-Key auth via env vars. | baseUrl, authType, authTokenEnvKey |\n| subprocess | JSON-RPC 2.0 over stdin/stdout | Spawns skill.entrypoint (e.g. python3 skill.py), writes JSON-RPC request to stdin, reads response from stdout. | command, args |\n| mcp | MCP Protocol | (Phase 2) Native Model Context Protocol integration via @modelcontextprotocol/sdk. | — |\n\nWorkflow Examples\nEvaluating a Single Skill\n# 1. Create a skill skeleton\neval-skills create --name my_calc --from-template python_script\n\n# 2. Implement your logic in skills/my_calc/skill.py\n\n# 3. Run evaluation against the coding-easy benchmark\neval-skills eval \\\n  --skills ./skills/my_calc/skill.json \\\n  --benchmark coding-easy \\\n  --runs 3 \\\n  --format json markdown\n\n# 4. Review the report\ncat ./reports/eval-result-*.md\n\nComparing Multiple Candidate Skills\n# 1. Discover candidates\neval-skills find --query \"weather\" --skills-dir ./skills\n\n# 2. 
Evaluate all candidates on the same benchmark\neval-skills eval \\\n  --skills ./skills/weather_v1 ./skills/weather_v2 ./skills/weather_v3 \\\n  --benchmark web-search-basic \\\n  --runs 3\n\n# 3. Select the best\neval-skills select \\\n  --from ./skills \\\n  --reports ./reports/eval-result-*.json \\\n  --min-completion 0.8 \\\n  --top-k 2\n\n# 4. Compare two versions\neval-skills report diff \\\n  ./reports/v1.json ./reports/v2.json \\\n  --label-a \"weather_v1\" --label-b \"weather_v2\"\n\nFull Pipeline (One Command)\neval-skills run \\\n  --skills-dir ./skills \\\n  --benchmark coding-easy \\\n  --top-k 3 \\\n  --min-completion 0.7 \\\n  --format json markdown html \\\n  --output-dir ./reports\n\nCI/CD Quality Gate\n# In your CI pipeline — fail the build if completion rate drops below 80%\neval-skills eval \\\n  --skills ./skills/production_skill \\\n  --benchmark coding-easy \\\n  --exit-on-fail \\\n  --min-completion 0.8 \\\n  --format json\n\nRegression Detection\n# Compare today's evaluation against the baseline\neval-skills report diff \\\n  ./reports/baseline.json ./reports/latest.json \\\n  --label-a \"baseline\" --label-b \"latest\" \\\n  --output ./reports/regression-check.md\n\nBest Practices\n\nAlways use --runs 3 or more when evaluating for production decisions. Single-run results can be noisy; the consistency score captures stability across runs.\n\nUse --exit-on-fail in CI/CD pipelines to enforce quality gates. Set --min-completion to your acceptable threshold (recommended: 0.8 for production skills).\n\nCreate domain-specific custom benchmarks rather than relying solely on built-in ones. Your custom benchmark should reflect real-world inputs your skill will encounter.\n\nUse report diff after every skill upgrade to catch regressions early. 
Compare the new evaluation against a saved baseline report.\n\nUse --dry-run before long evaluations to validate your configuration (skill paths, benchmark resolution, task count) without actually executing tasks.\n\nPersist results with --store to track skill quality over time. The SQLite store enables historical trend queries.\n\nStart with --concurrency 1 when debugging a failing skill, then increase for production benchmarking.\n\nTag your benchmark tasks to enable per-category analysis (e.g., filter by math, string, edge-case).\n\nSkill JSON Schema\n\nEvery skill must provide a skill.json that conforms to this structure:\n\n{\n  \"id\": \"my_skill_v1\",\n  \"name\": \"My Skill\",\n  \"version\": \"1.0.0\",\n  \"description\": \"Does something useful\",\n  \"tags\": [\"utility\", \"math\"],\n  \"inputSchema\": {\n    \"type\": \"object\",\n    \"properties\": { \"query\": { \"type\": \"string\" } },\n    \"required\": [\"query\"]\n  },\n  \"outputSchema\": {\n    \"type\": \"object\",\n    \"properties\": { \"result\": { \"type\": \"string\" } }\n  },\n  \"adapterType\": \"subprocess\",\n  \"entrypoint\": \"python3 skill.py\",\n  \"metadata\": {\n    \"author\": \"Your Name\",\n    \"license\": \"MIT\",\n    \"homepage\": \"https://github.com/you/my-skill\"\n  }\n}\n\nValidation rules:\n\nid: lowercase alphanumeric with _ or -, non-empty\nversion: semver format (X.Y.Z)\nadapterType: one of http, subprocess, mcp, langchain, custom\nentrypoint: non-empty string (URL for http, command for subprocess)\n\nGlobal Options\n\nThese options are available on all commands:\n\n| Option | Description |\n| --- | --- |\n| -c, --config <path> | Path to configuration file |\n| --json | JSON output format (CI-friendly) |\n| --no-color | Disable colored output |\n| -v, --verbose | Verbose logging |\n| --version | Show version |\n| -h, --help | Show help |"
  },
  "trust": {
    "sourceLabel": "tencent",
    "provenanceUrl": "https://clawhub.ai/isLinXu/eval-skills",
    "publisherUrl": "https://clawhub.ai/isLinXu/eval-skills",
    "owner": "isLinXu",
    "version": "0.1.1",
    "license": null,
    "verificationStatus": "Indexed source record"
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/eval-skills",
    "downloadUrl": "https://openagent3.xyz/downloads/eval-skills",
    "agentUrl": "https://openagent3.xyz/skills/eval-skills/agent",
    "manifestUrl": "https://openagent3.xyz/skills/eval-skills/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/eval-skills/agent.md"
  }
}