Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Provides a comprehensive testing methodology for AI software, covering strategy design, unit, integration, and end-to-end tests with coverage and reporting g...
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.
The definitive testing methodology for AI agents. From test strategy to execution, coverage to reporting: everything you need to ship quality software.
Before writing a single test, design the strategy.
```yaml
project:
  name: ""
  type: web-app | api | mobile | library | cli | data-pipeline
  languages: [typescript, python, go, java]
  frameworks: [react, express, django, spring]

risk_profile:
  data_sensitivity: low | medium | high | critical  # PII, financial, health
  user_impact: internal | b2b | b2c | life-safety
  deployment_frequency: daily | weekly | monthly
  regulatory: [none, SOC2, HIPAA, PCI-DSS, GDPR]

test_scope:
  in_scope: []      # Features, services, components
  out_of_scope: []  # Explicitly excluded (with reason)

environments:
  dev: { url: "", db: "local" }
  staging: { url: "", db: "seeded" }
  prod: { url: "", smoke_only: true }
```
| Risk profile | Unit | Integration | E2E | Performance | Security | Accessibility |
|---|---|---|---|---|---|---|
| Internal tool | ✅ Core | ✅ API | ⚠️ Happy path | ❌ | ⚠️ Basic | ❌ |
| B2B SaaS | ✅ Full | ✅ Full | ✅ Critical flows | ✅ Load | ✅ OWASP Top 10 | ✅ WCAG AA |
| B2C high-traffic | ✅ Full | ✅ Full | ✅ Full | ✅ Stress + soak | ✅ Full | ✅ WCAG AA |
| Financial/Health | ✅ Full + mutation | ✅ Full + contract | ✅ Full + chaos | ✅ Full suite | ✅ Pen test | ✅ WCAG AAA |
```
      /    E2E    \     5-10%   → Critical user journeys only
     / Integration \    20-30%  → API contracts, service boundaries
    /  Unit Tests   \   60-70%  → Business logic, pure functions
```

- Anti-pattern: Ice cream cone → more E2E than unit tests. Slow, flaky, expensive. Fix by pushing test coverage DOWN the pyramid.
- Anti-pattern: Hourglass → lots of unit + E2E, no integration. Misses contract bugs between services.
Every unit test follows this structure:

```typescript
describe('PricingCalculator', () => {
  // Group by behavior, not by method
  describe('when customer has volume discount', () => {
    it('applies tiered pricing above threshold', () => {
      // ARRANGE → Set up the scenario
      const calculator = new PricingCalculator();
      const customer = createCustomer({ tier: 'enterprise', units: 150 });

      // ACT → Execute the behavior under test
      const price = calculator.calculate(customer);

      // ASSERT → Verify the outcome (ONE logical assertion)
      expect(price).toEqual({
        subtotal: 12000,
        discount: 1800, // 15% volume discount
        total: 10200,
      });
    });
  });
});
```
Format: `[unit] [scenario] [expected behavior]`

✅ Good:
- PricingCalculator applies 15% discount when units exceed 100
- UserService throws NotFoundError when user ID is invalid
- parseDate returns null for malformed ISO strings

❌ Bad: `test1`, `should work`, `calculates price`
- Business logic → pricing, rules, calculations, state machines
- Data transformations → parsers, formatters, serializers, mappers
- Edge cases → boundaries, null/undefined, empty collections, overflow
- Error handling → every catch block, every validation path
- Pure functions → easiest to test, highest ROI
- Framework internals (React rendering, Express routing)
- Simple getters/setters with no logic
- Third-party library behavior
- Implementation details (private methods, internal state)
| Dependency type | Strategy | Example |
|---|---|---|
| Database | Mock the repository/DAO | `jest.mock('./userRepo')` |
| HTTP API | Mock the client or use MSW | `msw.http.get('/api/users', ...)` |
| File system | Mock `fs` or use temp dirs | `jest.mock('fs/promises')` |
| Time/Date | Fake timers | `jest.useFakeTimers()` |
| Randomness | Seed or mock | `jest.spyOn(Math, 'random')` |
| Environment | Override env vars | `process.env.NODE_ENV = 'test'` |

Rule: Mock at boundaries, not internals. If you're mocking a class you own, your design might need refactoring.
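A minimal sketch of the mock-at-boundaries rule in Jest. The `./userRepo` / `./userService` module names and the `findById` signature are illustrative assumptions, not part of this package:

```typescript
// Hypothetical modules: ./userRepo is the boundary, ./userService is under test.
jest.mock('./userRepo'); // mock the boundary module, not internal classes

import { findById } from './userRepo';
import { getUserProfile } from './userService';

const mockedFindById = findById as jest.MockedFunction<typeof findById>;

test('returns a profile for a known user', async () => {
  // Control the boundary's behavior for this one scenario.
  mockedFindById.mockResolvedValue({ id: 'u1', name: 'Ada' });

  await expect(getUserProfile('u1')).resolves.toMatchObject({ name: 'Ada' });
});
```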
| Metric | Minimum | Good | Excellent |
|---|---|---|---|
| Line coverage | 70% | 85% | 95%+ |
| Branch coverage | 60% | 80% | 90%+ |
| Function coverage | 75% | 90% | 95%+ |
| Critical path coverage | 100% | 100% | 100% |

Warning: 100% coverage ≠ quality. Coverage measures what code ran, not what was verified. A test with no assertions has coverage but no value.
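If you use Jest, the "Good" column can be enforced mechanically: `coverageThreshold` fails the CI run whenever coverage drops below the floor. A sketch, where `./src/billing/` stands in for whatever your critical-path module is:

```typescript
// jest.config.ts — enforce the coverage floor in CI
import type { Config } from 'jest';

const config: Config = {
  collectCoverage: true,
  coverageThreshold: {
    global: { lines: 85, branches: 80, functions: 90 },
    // Hold critical paths to 100%, per the table above.
    './src/billing/': { lines: 100, branches: 100 },
  },
};

export default config;
```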
For every API endpoint, test:

```yaml
endpoint: POST /api/orders
tests:
  happy_path:
    - Valid request returns 201 with order ID
    - Response matches schema
    - Database record created correctly
    - Events/webhooks fired
  validation:
    - Missing required fields → 400 with field errors
    - Invalid data types → 400 with type errors
    - Business rule violations → 422 with explanation
  authentication:
    - No token → 401
    - Expired token → 401
    - Wrong role → 403
    - Valid token → proceeds
  edge_cases:
    - Duplicate request (idempotency) → same response
    - Concurrent requests → no race condition
    - Maximum payload size → 413 or graceful handling
    - Special characters in input → no injection
  error_handling:
    - Database down → 503 with retry hint
    - External service timeout → 504 or fallback
    - Rate limit exceeded → 429 with retry-after
```
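A sketch of the first two checklist groups as Supertest tests, assuming an Express-style `app` export; the payload shape and token handling are illustrative:

```typescript
import request from 'supertest';
import { app } from '../src/app'; // hypothetical app export

const token = 'test-jwt'; // stand-in; mint a real test token in setup

describe('POST /api/orders', () => {
  it('returns 201 with an order ID for a valid request', async () => {
    const res = await request(app)
      .post('/api/orders')
      .set('Authorization', `Bearer ${token}`)
      .send({ sku: 'sku_123', quantity: 2 }); // illustrative payload
    expect(res.status).toBe(201);
    expect(res.body.orderId).toEqual(expect.any(String));
  });

  it('returns 400 with field errors when required fields are missing', async () => {
    const res = await request(app)
      .post('/api/orders')
      .set('Authorization', `Bearer ${token}`)
      .send({});
    expect(res.status).toBe(400);
    expect(res.body.errors).toBeDefined();
  });
});
```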
When services communicate, test the contract:

```yaml
contract:
  consumer: order-service
  provider: payment-service
  interactions:
    - description: "Process payment"
      request:
        method: POST
        path: /payments
        body:
          amount: 99.99
          currency: USD
          order_id: "ord_123"
      response:
        status: 200
        body:
          payment_id: "pay_xxx"  # string, not null
          status: "completed"    # enum: completed|pending|failed

breaking_changes:  # NEVER do these without versioning
  - Remove a field from response
  - Change a field's type
  - Add a required field to request
  - Change the URL path
  - Change error response format
```
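The interaction above translates roughly to a Pact consumer test. This is a sketch assuming pact-js v3 (`@pact-foundation/pact`); the inline `fetch` stands in for your real payment client:

```typescript
import { PactV3, MatchersV3 } from '@pact-foundation/pact';

const { like } = MatchersV3;

const provider = new PactV3({ consumer: 'order-service', provider: 'payment-service' });

test('processes a payment', async () => {
  provider
    .uponReceiving('a payment request')
    .withRequest({
      method: 'POST',
      path: '/payments',
      body: { amount: 99.99, currency: 'USD', order_id: 'ord_123' },
    })
    .willRespondWith({
      status: 200,
      // Matchers pin the *type*: payment_id must be a string, never null.
      body: { payment_id: like('pay_xxx'), status: 'completed' },
    });

  await provider.executeTest(async (mockServer) => {
    // Point the client at the Pact mock server; real code would call your payment SDK.
    const res = await fetch(`${mockServer.url}/payments`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ amount: 99.99, currency: 'USD', order_id: 'ord_123' }),
    });
    expect((await res.json()).status).toBe('completed');
  });
});
```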
- Each test gets a clean state → use transactions that roll back, or truncate between tests
- Use factories, not fixtures → `createUser({ role: 'admin' })` beats hardcoded SQL dumps (see the factory sketch below)
- Test migrations → run migrate-up, migrate-down, migrate-up (roundtrip)
- Test constraints → unique violations, FK cascades, NOT NULL
- Test queries → especially complex JOINs, aggregations, window functions
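A minimal factory sketch, assuming `@faker-js/faker`; the `User` shape is illustrative. Defaults are realistic but random, and each test overrides only the fields it cares about:

```typescript
import { faker } from '@faker-js/faker';

interface User {
  id: string;
  email: string;
  role: 'member' | 'admin';
  plan: 'free' | 'enterprise';
}

// Build a valid user with random-but-realistic defaults; override per test.
export function createUser(overrides: Partial<User> = {}): User {
  return {
    id: faker.string.uuid(),
    email: faker.internet.email(),
    role: 'member',
    plan: 'free',
    ...overrides,
  };
}

// Usage: const admin = createUser({ role: 'admin', plan: 'enterprise' });
```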
Identify and test the flows that generate revenue or block users:

```yaml
critical_journeys:
  - name: "Sign up → First value"
    steps:
      - Visit landing page
      - Click sign up
      - Fill registration form
      - Verify email
      - Complete onboarding
      - Perform first key action
    max_duration: 3 minutes
  - name: "Purchase flow"
    steps:
      - Browse products
      - Add to cart
      - Enter shipping
      - Enter payment
      - Confirm order
      - Receive confirmation email
    max_duration: 2 minutes
  - name: "Login → Core task → Logout"
    steps:
      - Login (password + SSO + MFA variants)
      - Navigate to core feature
      - Complete primary workflow
      - Verify result
      - Logout
    max_duration: 1 minute
```
- Test user behavior, not implementation → click buttons by text/role, not by CSS class
- Use data-testid sparingly → only when no accessible selector exists
- Wait for state, not time → `waitFor(element)`, not `sleep(3000)`
- Isolate test data → each test creates its own users/data
- Run in CI with retries → 1 retry for flaky network; investigate if flake rate exceeds 5%
1. `getByRole('button', { name: 'Submit' })` → accessible, resilient
2. `getByLabelText('Email')` → form-specific, accessible
3. `getByText('Welcome back')` → content-based
4. `getByTestId('submit-btn')` → explicit test hook
5. `querySelector('.btn-primary')` → ❌ fragile, breaks on CSS changes
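Those principles together, as a short Playwright sketch against a hypothetical staging login page (URL and credentials are placeholders):

```typescript
import { test, expect } from '@playwright/test';

test('login → core task', async ({ page }) => {
  await page.goto('https://staging.example.com/login'); // placeholder URL

  // Select by accessible role and label, never by CSS class.
  await page.getByLabel('Email').fill('qa-user@example.com');
  await page.getByLabel('Password').fill('placeholder-password');
  await page.getByRole('button', { name: 'Sign in' }).click();

  // Wait for state, not time: the assertion retries until the element is visible.
  await expect(page.getByText('Welcome back')).toBeVisible();
});
```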
| Symptom | Likely cause | Fix |
|---|---|---|
| Passes locally, fails in CI | Timing/race condition | Add explicit waits; check CI resource limits |
| Fails intermittently | Shared state between tests | Isolate test data; reset state |
| Fails after deploy | Environment difference | Check env vars, API versions, feature flags |
| Fails at a specific time | Time-dependent logic | Mock dates/times; avoid time-sensitive assertions |
| Fails in parallel | Resource contention | Use unique ports/DBs per worker |

Rule: Quarantine flaky tests within 24 hours. A flaky test suite that everyone ignores is worse than no tests.
```yaml
performance_tests:
  smoke:
    vus: 5
    duration: 1m
    purpose: "Verify test works"
  load:
    vus: 100        # Expected concurrent users
    duration: 10m
    ramp_up: 2m
    purpose: "Normal traffic behavior"
    thresholds:
      p95_response: <500ms
      error_rate: <1%
  stress:
    vus: 300        # 3x expected load
    duration: 15m
    ramp_up: 5m
    purpose: "Find breaking point"
  soak:
    vus: 80
    duration: 2h
    purpose: "Memory leaks, connection exhaustion"
  spike:
    stages:
      - { vus: 50, duration: 2m }
      - { vus: 500, duration: 30s }  # Sudden spike
      - { vus: 50, duration: 2m }
    purpose: "Recovery behavior"
```
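The `load` stage above maps onto a k6 script roughly like this sketch (k6 scripts are JavaScript; the target URL is a placeholder, and the thresholds mirror the p95 < 500ms / error rate < 1% figures):

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },  // ramp up to expected load
    { duration: '10m', target: 100 }, // hold normal traffic
    { duration: '1m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // p95 response < 500ms
    http_req_failed: ['rate<0.01'],   // error rate < 1%
  },
};

export default function () {
  const res = http.get('https://staging.example.com/api/health'); // placeholder URL
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```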
| Metric | Web app | API | Background job |
|---|---|---|---|
| Response time (p50) | <200ms | <100ms | N/A |
| Response time (p95) | <1s | <500ms | N/A |
| Response time (p99) | <3s | <1s | N/A |
| Throughput | >100 rps | >500 rps | >1000/min |
| Error rate | <0.1% | <0.1% | <0.5% |
| CPU usage | <70% | <70% | <90% |
| Memory growth | <5%/hr | <2%/hr | <10%/hr |
```yaml
db_performance:
  query_tests:
    - name: "Dashboard aggregate query"
      baseline: 50ms
      max_acceptable: 200ms
      with_1M_rows: measure
      with_10M_rows: measure
  index_verification:
    - Run EXPLAIN ANALYZE on all critical queries
    - Verify no sequential scans on tables >10K rows
    - Check index usage statistics weekly
  connection_pool:
    - Test at max connections
    - Verify graceful handling when pool exhausted
    - Monitor connection wait time
```
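The query budget can be guarded by a test. A sketch using node-postgres with an illustrative query; wall-clock timing is noisy, so a check like this belongs in the dedicated performance stage against the seeded dataset, not the unit lane:

```typescript
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.TEST_DATABASE_URL });

test('dashboard aggregate stays under the 200ms budget', async () => {
  const start = Date.now();
  // Illustrative stand-in for the real dashboard aggregate query.
  await pool.query('SELECT account_id, count(*) FROM events GROUP BY account_id');
  const elapsed = Date.now() - start;

  expect(elapsed).toBeLessThan(200); // max_acceptable from the plan above
});

afterAll(() => pool.end());
```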
```yaml
security_tests:
  A01_broken_access_control:
    - [ ] Horizontal privilege escalation (access other user's data)
    - [ ] Vertical privilege escalation (access admin functions)
    - [ ] IDOR (Insecure Direct Object References)
    - [ ] Missing function-level access control
    - [ ] CORS misconfiguration
  A02_cryptographic_failures:
    - [ ] Sensitive data in transit (TLS 1.2+)
    - [ ] Sensitive data at rest (encryption)
    - [ ] Password hashing (bcrypt/argon2, not MD5/SHA)
    - [ ] No secrets in code/logs/URLs
  A03_injection:
    - [ ] SQL injection (parameterized queries)
    - [ ] NoSQL injection
    - [ ] Command injection (OS commands)
    - [ ] XSS (stored, reflected, DOM-based)
    - [ ] Template injection (SSTI)
  A04_insecure_design:
    - [ ] Rate limiting on auth endpoints
    - [ ] Account lockout after N failures
    - [ ] CAPTCHA on public forms
    - [ ] Business logic abuse scenarios
  A05_security_misconfiguration:
    - [ ] Default credentials removed
    - [ ] Error messages don't leak stack traces
    - [ ] Security headers set (CSP, HSTS, X-Frame-Options)
    - [ ] Directory listing disabled
    - [ ] Unnecessary HTTP methods disabled
  A07_auth_failures:
    - [ ] Brute force protection
    - [ ] Session fixation
    - [ ] Session timeout
    - [ ] JWT validation (signature, expiry, issuer)
    - [ ] MFA bypass attempts
```
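A sketch of the first A01 item (horizontal privilege escalation) with Supertest; `loginAs`, the app export, and the seeded order ID are hypothetical test fixtures:

```typescript
import request from 'supertest';
import { app } from '../src/app';         // hypothetical app export
import { loginAs } from './helpers/auth'; // hypothetical test helper

// A01: user A must never be able to read user B's order.
test("denies access to another user's order", async () => {
  const tokenA = await loginAs('user-a@example.com');

  const res = await request(app)
    .get('/api/orders/ord_owned_by_user_b') // seeded to belong to user B
    .set('Authorization', `Bearer ${tokenA}`);

  // 403 or 404 are both acceptable; 200 means an IDOR hole.
  expect([403, 404]).toContain(res.status);
});
```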
Test every user input with:

```yaml
injection_payloads:
  sql: ["' OR 1=1--", "'; DROP TABLE users;--", "1 UNION SELECT * FROM users"]
  xss: ["<script>alert(1)</script>", "<img onerror=alert(1) src=x>", "javascript:alert(1)"]
  path_traversal: ["../../etc/passwd", "..\\..\\windows\\system32", "%2e%2e%2f"]
  command: ["; ls -la", "| cat /etc/passwd", "$(whoami)", "`id`"]

boundary_values:
  strings: ["", " ", "a" * 10000, null, undefined, "emoji: 🎯", "unicode: é à ü", "rtl: مرحبا"]
  numbers: [0, -1, 2147483647, -2147483648, NaN, Infinity, 0.1 + 0.2]
  arrays: [[], [null], Array(10000)]
  dates: ["1970-01-01", "2099-12-31", "invalid-date", "2024-02-29", "2023-02-29"]
```
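Payload tables plug directly into parameterized tests. A Jest `test.each` sketch against a hypothetical search endpoint: every payload should be rejected or safely escaped, never crash the server or leak a raw database error:

```typescript
import request from 'supertest';
import { app } from '../src/app'; // hypothetical app export

const sqlPayloads = ["' OR 1=1--", "'; DROP TABLE users;--", '1 UNION SELECT * FROM users'];

test.each(sqlPayloads)('search endpoint survives payload %s', async (payload) => {
  const res = await request(app).get('/api/search').query({ q: payload });

  expect(res.status).toBeLessThan(500);             // no crash
  expect(res.text).not.toMatch(/syntax error|sql/i); // no leaked DB errors
});
```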
| Need | JavaScript/TS | Python | Go | Java |
|---|---|---|---|---|
| Unit | Vitest / Jest | pytest | testing + testify | JUnit 5 |
| API | Supertest | httpx + pytest | net/http/httptest | RestAssured |
| E2E (browser) | Playwright | Playwright | chromedp | Selenium |
| Performance | k6 | Locust | vegeta | Gatling |
| Contract | Pact | Pact | Pact | Pact |
| Security | ZAP + custom | Bandit + custom | gosec | SpotBugs |
```yaml
pipeline:
  stage_1_fast:         # <2 min, blocks PR
    - Lint + type check
    - Unit tests
    - Security: dependency scan (npm audit / safety)
  stage_2_thorough:     # <10 min, blocks merge
    - Integration tests
    - Contract tests
    - Security: SAST scan
    - Coverage report + threshold check
  stage_3_confidence:   # <30 min, blocks deploy
    - E2E critical journeys
    - Visual regression (if applicable)
    - Security: container scan
  stage_4_post_deploy:  # After deploy to staging
    - Smoke tests against staging
    - Performance baseline check
    - Security: DAST scan (ZAP)
  stage_5_production:   # After prod deploy
    - Smoke tests (critical paths only)
    - Synthetic monitoring enabled
    - Canary metrics watching
```
```yaml
test_data_strategy:
  unit_tests:
    approach: factories            # Builder pattern, create exactly what you need
    example: "createUser({ role: 'admin', plan: 'enterprise' })"
  integration_tests:
    approach: seeded_database
    reset: per_test_suite          # Transaction rollback or truncate
    sensitive_data: anonymized     # Never use real PII
  e2e_tests:
    approach: api_setup            # Create data via API before test
    cleanup: after_each            # Delete created data
    isolation: unique_identifiers  # Timestamp or UUID in test data
  performance_tests:
    approach: representative_dataset
    volume: 10x_production         # Test with more data than prod
    generation: faker_libraries    # Realistic but synthetic
```
```yaml
metrics:
  test_suite_health:
    total_tests: 0
    passing: 0
    failing: 0
    skipped: 0  # >5% skipped = tech debt alarm
    flaky: 0    # >2% flaky = quarantine immediately
  coverage:
    line: "0%"
    branch: "0%"
    critical_paths: "0%"  # Must be 100%
  execution:
    unit_duration: "0s"         # Target: <30s
    integration_duration: "0s"  # Target: <5m
    e2e_duration: "0s"          # Target: <15m
    total_ci_time: "0s"         # Target: <20m
  defect_metrics:
    bugs_found_in_test: 0
    bugs_escaped_to_prod: 0
    escape_rate: "0%"  # Target: <5%
    mttr: "0h"         # Mean time to resolve
  trends:              # Track weekly
    new_tests_added: 0
    tests_deleted: 0   # Healthy deletion = removing redundant tests
    coverage_delta: "+0%"
    flake_rate_delta: "+0%"
```
| Dimension | Weight | Scoring |
|---|---|---|
| Test coverage | 20% | <60% = 0, 60-70% = 5, 70-80% = 10, 80-90% = 15, 90%+ = 20 |
| Critical path coverage | 20% | <100% = 0, 100% = 20 |
| Defect escape rate | 15% | >10% = 0, 5-10% = 5, 2-5% = 10, <2% = 15 |
| Test suite speed | 10% | >30m = 0, 20-30m = 3, 10-20m = 7, <10m = 10 |
| Flake rate | 10% | >5% = 0, 2-5% = 3, 1-2% = 7, <1% = 10 |
| Security test coverage | 10% | None = 0, Basic = 3, OWASP Top 10 = 7, Full = 10 |
| Documentation | 5% | None = 0, Basic = 2, Complete = 5 |
| Automation ratio | 10% | <50% = 0, 50-70% = 3, 70-90% = 7, 90%+ = 10 |

Scoring: 0-40 = 🔴 Critical | 41-60 = 🟡 Needs Work | 61-80 = 🟢 Good | 81-100 = 🏆 Excellent
```yaml
accessibility_checklist:
  level_a:   # Minimum compliance
    - [ ] All images have alt text
    - [ ] All form inputs have labels
    - [ ] Color is not the only visual indicator
    - [ ] Page has proper heading hierarchy (h1 → h2 → h3)
    - [ ] All functionality available via keyboard
    - [ ] Focus is visible and logical
    - [ ] No content flashes >3 times/second
  level_aa:  # Standard compliance (recommended)
    - [ ] Color contrast ratio ≥4.5:1 (normal text)
    - [ ] Color contrast ratio ≥3:1 (large text)
    - [ ] Text resizable to 200% without loss
    - [ ] Skip navigation links
    - [ ] Consistent navigation across pages
    - [ ] Error suggestions provided
    - [ ] ARIA landmarks for page regions

tools:
  - axe-core (automated, catches ~30% of issues)
  - Lighthouse accessibility audit
  - Manual keyboard navigation test
  - Screen reader testing (VoiceOver/NVDA)
```
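The automated axe-core pass covers roughly a third of these checks; keyboard and screen-reader testing stay manual. A sketch using `@axe-core/playwright` against a placeholder URL:

```typescript
import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

test('dashboard has no WCAG A/AA violations', async ({ page }) => {
  await page.goto('https://staging.example.com/dashboard'); // placeholder URL

  // Scan the rendered page against the WCAG A and AA rule sets.
  const results = await new AxeBuilder({ page })
    .withTags(['wcag2a', 'wcag2aa'])
    .analyze();

  expect(results.violations).toEqual([]);
});
```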
```yaml
compatibility_tests:
  when_updating_api:
    - [ ] All existing fields still present in response
    - [ ] No field type changes (string → number)
    - [ ] New required request fields have defaults
    - [ ] Deprecated fields still work (with warning header)
    - [ ] Error format unchanged
    - [ ] Pagination behavior unchanged
    - [ ] Rate limits not reduced

versioning_strategy:
  - URL versioning: /v1/users, /v2/users
  - "Header versioning: Accept: application/vnd.api+json;version=2"
  - Sunset header for deprecated versions
  - Minimum 6-month deprecation notice
```
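One way to pin "fields still present, types unchanged" in code is to freeze the response schema, sketched here with zod; the order shape and route are illustrative. Parsing a live response with the frozen schema fails CI the moment a field disappears or changes type:

```typescript
import { z } from 'zod';
import request from 'supertest';
import { app } from '../src/app'; // hypothetical app export

// Frozen v1 contract: change it only by shipping a new API version.
const OrderV1 = z.object({
  orderId: z.string(),
  total: z.number(),
  status: z.enum(['pending', 'completed', 'failed']),
});

test('GET /v1/orders/:id still satisfies the v1 contract', async () => {
  const res = await request(app).get('/v1/orders/ord_123');
  expect(res.status).toBe(200);

  // Throws (failing the test) on missing fields or type drift.
  OrderV1.parse(res.body);
});
```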
```yaml
chaos_tests:
  network:
    - Service dependency goes down → graceful degradation?
    - Network latency increases 10x → timeout handling?
    - DNS resolution fails → fallback behavior?
  infrastructure:
    - Database primary fails → replica promotion?
    - Cache (Redis) goes down → DB fallback works?
    - Disk fills up → alerting + graceful failure?
  application:
    - Memory pressure → OOM handling?
    - CPU saturation → request queuing?
    - Certificate expiry → monitoring alert?
  data:
    - Corrupt message in queue → dead letter + alert?
    - Schema migration fails mid-way → rollback works?
    - Clock skew between services → idempotency holds?
```
1. Review requirements → identify test scenarios before code is written (shift-left)
2. Write test cases → cover happy path, edge cases, error cases, security
3. Review PR tests → are tests meaningful? Do they test behavior, not implementation?
4. Run full suite → unit + integration + E2E for affected areas
5. Report findings → use the test report template above
1. Write the failing test first → reproduce the bug as a test
2. Verify the fix makes the test pass → the test IS the proof
3. Check for regression → run related test suites
4. Add to the regression suite → bug tests prevent re-introduction
```yaml
weekly_review:
  monday:
    - Review flaky test quarantine → fix or delete
    - Check coverage trends → declining = tech debt
    - Review escaped defects → update test strategy
  friday:
    - Update test health dashboard
    - Clean up obsolete tests
    - Document new testing patterns discovered
    - Plan next week's testing focus
```
"Create test strategy for [project/feature]" β Full strategy brief "Write unit tests for [function/class]" β AAA pattern tests with edge cases "Test this API endpoint: [method] [path]" β Full API test checklist "Review these tests for quality" β Test code review with scoring "Generate performance test plan" β k6/Locust test design "Security test [feature/endpoint]" β OWASP-based test checklist "Create test report for [release]" β Formatted test report "What's our test health?" β Dashboard with metrics and recommendations "Find gaps in our test coverage" β Analysis with prioritized recommendations "Help debug this flaky test" β Root cause analysis with fix suggestions "Set up CI test pipeline" β Stage-by-stage pipeline config "Accessibility audit [page/component]" β WCAG checklist with findings