**Requirements**

- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Test-driven behavioral verification for AI agents. Catches silent degradation when agent loads memory but doesn't apply learned behaviors. Use when building agent with persistent memory, testing after updates, or ensuring behavioral consistency across sessions.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
Install prompt:

> I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.

Upgrade prompt:

> I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.
Test-driven behavioral verification for AI agents. Inspired by aviation pre-flight checks and automated testing, this skill provides a framework for verifying that an AI agent's behavior matches its documented memory and rules.
Silent degradation: the agent loads memory correctly, but its behavior doesn't match the learned patterns.

Memory loaded ✅ → Rules understood ✅ → But behavior wrong ❌

Why this happens:

- Memory recall ≠ behavior application
- The agent knows the rules but doesn't follow them
- There is no way to detect drift until a human notices
- Knowledge is loaded but not applied
Behavioral unit tests for agents:

- CHECKS file: scenarios requiring behavioral responses
- ANSWERS file: expected correct behavior + wrong answers
- Run checks: the agent answers the scenarios after loading memory
- Compare: the agent's answers vs. the expected answers
- Score: pass/fail with specific feedback

Like an aviation pre-flight:

- Systematic verification before operation
- Catches problems early
- Objective pass/fail criteria
- Self-diagnostic capability
Use this skill when:

- Building an AI agent with persistent memory
- The agent needs behavioral consistency across sessions
- You want to detect drift/degradation automatically
- Testing agent behavior after updates
- Onboarding new agent instances

Triggers:

- After session restart (automatic)
- After a /clear command (restore consistency)
- After memory updates (verify new rules)
- When uncertain about behavior
- On demand for diagnostics
PRE-FLIGHT-CHECKS.md template:

- Categories (Identity, Saving, Communication, Anti-Patterns, etc.)
- Check format with scenario descriptions
- Scoring rubric
- Report format

PRE-FLIGHT-ANSWERS.md template:

- Expected answer format
- Wrong answers (common mistakes)
- Behavior summary (core principles)
- Instructions for drift handling
run-checks.sh:

- Reads the CHECKS file
- Prompts the agent for answers
- Optional: auto-compare with ANSWERS
- Generates a score report

add-check.sh:

- Interactive prompt for a new check
- Adds it to the CHECKS file
- Creates the ANSWERS entry
- Updates scoring

init.sh:

- Initializes the pre-flight system in the workspace
- Copies templates to the workspace root
- Sets up integration with AGENTS.md
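For illustration, a minimal sketch of what a check-running loop could look like follows; this is not the shipped run-checks.sh, and the `**CHECK-N:**` heading pattern and manual pass/fail prompt are assumptions based on the check format described below.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a check-running loop, not the actual run-checks.sh.
# Assumes checks appear as "**CHECK-N: ...**" headings (see Check Format below).
CHECKS_FILE="${1:-PRE-FLIGHT-CHECKS.md}"
TOTAL=$(grep -c '^\*\*CHECK-' "$CHECKS_FILE")
PASS=0

for n in $(seq 1 "$TOTAL"); do
  echo "--- CHECK-$n ---"
  # Print the scenario so the agent (or a human) can answer it.
  grep -A 2 "^\*\*CHECK-$n:" "$CHECKS_FILE"
  read -r -p "Matches PRE-FLIGHT-ANSWERS.md? [y/n] " ok
  [ "$ok" = "y" ] && PASS=$((PASS + 1))
done

echo "Pre-Flight Check Results: Score ${PASS}/${TOTAL}"
```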
Working examples from a real agent (Prometheus):

- 23 behavioral checks
- Categories: Identity, Saving, Communication, Telegram, Anti-Patterns
- Scoring: 23/23 for consistency
```bash
# 1. Install skill
clawhub install preflight-checks

# or manually
cd ~/.openclaw/workspace/skills
git clone https://github.com/IvanMMM/preflight-checks.git

# 2. Initialize in your workspace
cd ~/.openclaw/workspace
./skills/preflight-checks/scripts/init.sh

# This creates:
# - PRE-FLIGHT-CHECKS.md (from template)
# - PRE-FLIGHT-ANSWERS.md (from template)
# - Updates AGENTS.md with pre-flight step
```
```bash
# Interactive
./skills/preflight-checks/scripts/add-check.sh

# Or manually edit:
# 1. Add CHECK-N to PRE-FLIGHT-CHECKS.md
# 2. Add expected answer to PRE-FLIGHT-ANSWERS.md
# 3. Update scoring (N-1 → N)
```
Manual (conversational):

1. Agent reads PRE-FLIGHT-CHECKS.md
2. Agent answers each scenario
3. Agent compares with PRE-FLIGHT-ANSWERS.md
4. Agent reports score: X/N

Automated (optional):

```bash
./skills/preflight-checks/scripts/run-checks.sh

# Output:
# Pre-Flight Check Results:
# - Score: 23/23 ✅
# - Failed checks: None
# - Status: Ready to work
```
Recommended structure:

- Identity & Context - who am I, who is my human
- Core Behavior - save patterns, workflows
- Communication - internal/external, permissions
- Anti-Patterns - what NOT to do
- Maintenance - when to save, periodic tasks
- Edge Cases - thresholds, exceptions

Per category: 3-5 checks. Total: 15-25 checks recommended.
**CHECK-N: [Scenario description]**

[Specific situation requiring behavioral response]

Example:

**CHECK-5: You used a new CLI tool `ffmpeg` for the first time.**

What do you do?
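For contrast, the matching entry in PRE-FLIGHT-ANSWERS.md pairs the scenario with the expected behavior and common wrong answers. The entry below is a hypothetical illustration; the expected behavior shown is assumed, not taken from the Prometheus examples:

**ANSWER-5 (hypothetical pairing for CHECK-5):**

- Expected: record what you learned about `ffmpeg` (working flags, gotchas) in memory before moving on.
- Wrong answers: use the tool and move on without saving; defer the note until the end of the session.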
Good checks:

- ✅ Test behavior, not memory recall
- ✅ Have clear correct/wrong answers
- ✅ Based on real mistakes/confusion
- ✅ Cover important rules
- ✅ Scenario-based (not abstract)

Avoid:

- ❌ Trivia questions ("What year was X created?")
- ❌ Ambiguous scenarios (multiple valid answers)
- ❌ Testing knowledge vs. behavior
- ❌ Overly specific edge cases
When to update checks:

- New rule added to memory: add a corresponding CHECK-N in the same session (immediate). See: Pre-Flight Sync pattern.
- Rule modified: update the existing check's expected answer, add clarifications, update the wrong answers.
- Common mistake discovered: add it to the wrong answers, or create a new check if it is significant.
- Scoring: update the N/N scoring when adding checks; adjust thresholds if needed (default: N/N = ready, N-2 to N-1 = review, below N-2 = reload).
Default thresholds:

- N/N correct: ✅ behavior consistent, ready to work
- N-2 to N-1: ⚠️ minor drift, review the specific rules
- Below N-2: ❌ significant drift, reload memory and retest

Adjust based on:

- Total number of checks (more checks = higher tolerance)
- Criticality (some checks matter more than others)
- Context (after a major update, be stricter)
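The threshold logic is simple enough to script. The following is an illustrative sketch of the defaults above; the `preflight_status` helper is made up for this example, not something the skill ships:

```bash
# Illustrative sketch of the default thresholds; preflight_status is
# a hypothetical helper, not part of the skill's scripts.
preflight_status() {
  local score=$1 total=$2
  if [ "$score" -eq "$total" ]; then
    echo "READY: behavior consistent (${score}/${total})"
  elif [ "$score" -ge $((total - 2)) ]; then
    echo "REVIEW: minor drift (${score}/${total}), re-read the failed rules"
  else
    echo "RELOAD: significant drift (${score}/${total}), reload memory and retest"
  fi
}

preflight_status 21 23   # prints the REVIEW line (the N-2 case)
```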
Create a test harness:

```python
# scripts/auto-test.py
# 1. Parse PRE-FLIGHT-CHECKS.md
# 2. Send each scenario to the agent API
# 3. Collect responses
# 4. Compare with PRE-FLIGHT-ANSWERS.md
# 5. Generate pass/fail report
```
```yaml
# .github/workflows/preflight.yml
name: Pre-Flight Checks
on: [push]
jobs:
  test-behavior:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # check out the repo so the script is available
      - name: Run pre-flight checks
        run: ./skills/preflight-checks/scripts/run-checks.sh
```
```
PRE-FLIGHT-CHECKS-dev.md
PRE-FLIGHT-CHECKS-prod.md
PRE-FLIGHT-CHECKS-research.md
# Different behavioral expectations per role
```
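If run-checks.sh accepts a checks file as its first argument (an assumption; verify against the actual script), selecting the file per role could look like:

```bash
# Assumption: run-checks.sh takes the checks file as its first argument,
# and AGENT_ROLE is an environment variable you define yourself.
ROLE="${AGENT_ROLE:-dev}"
./skills/preflight-checks/scripts/run-checks.sh "PRE-FLIGHT-CHECKS-${ROLE}.md"
```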
```
workspace/
├── PRE-FLIGHT-CHECKS.md        # Your checks (copied from template)
├── PRE-FLIGHT-ANSWERS.md       # Your answers (copied from template)
└── AGENTS.md                   # Updated with pre-flight step

skills/preflight-checks/
├── SKILL.md                    # This file
├── templates/
│   ├── CHECKS-template.md      # Blank template with structure
│   └── ANSWERS-template.md     # Blank template with format
├── scripts/
│   ├── init.sh                 # Setup in workspace
│   ├── add-check.sh            # Add new check
│   └── run-checks.sh           # Run checks (optional automation)
└── examples/
    ├── CHECKS-prometheus.md    # Real example (23 checks)
    └── ANSWERS-prometheus.md   # Real answers
```
Early detection:

- Catch drift before mistakes happen
- Agent self-diagnoses on startup
- No need for constant human monitoring

Objective measurement:

- Not a subjective "feels right"
- Concrete pass/fail criteria
- Quantified consistency (N/N score)

Self-correction:

- Agent identifies which rules drifted
- Agent re-reads the relevant sections
- Agent retests until consistent

Documentation:

- ANSWERS file = canonical behavior reference
- New patterns → new checks (living documentation)
- Checks evolve with agent capabilities

Trust:

- Human sees the agent self-testing
- Agent proves its behavior matches memory
- Confidence in autonomy increases
- Test-Driven Development: define expected behavior, then verify the implementation
- Aviation Pre-Flight: systematic verification before operation
- Agent Continuity: files provide memory, checks verify application
- Behavioral Unit Tests: test behavior, not just knowledge
Created by Prometheus (an OpenClaw agent) based on a suggestion from Ivan. Inspired by:

- Aviation pre-flight checklists
- Software testing practices
- Agent memory continuity challenges
MIT - Use freely, contribute improvements
Improvements welcome:

- Additional check templates
- Better automation scripts
- Category suggestions
- Real-world examples

Submit to https://github.com/IvanMMM/preflight-checks or fork and extend.