Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Complete performance engineering system — profiling, optimization, load testing, capacity planning, and performance culture. Use when diagnosing slow applica...
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.
From "it's slow" to "here's why and here's the fix" — a complete methodology for measuring, diagnosing, optimizing, and preventing performance problems.
Before touching anything, define the problem.

```yaml
# performance-brief.yaml
investigation:
  reported_by: ""
  reported_date: ""
  system: ""              # service/app name
  environment: ""         # production, staging, dev

problem_statement:
  symptom: ""             # "API response time increased 3x"
  impact: ""              # "15% of users seeing timeouts"
  since_when: ""          # "After deploy v2.14 on Feb 20"
  affected_scope: ""      # "All endpoints" | "Only /search" | "Users in EU"

baselines:
  target_p50: ""          # e.g., "200ms"
  target_p95: ""          # e.g., "500ms"
  target_p99: ""          # e.g., "1000ms"
  current_p50: ""
  current_p95: ""
  current_p99: ""
  throughput_target: ""   # e.g., "1000 rps"
  error_rate_target: ""   # e.g., "<0.1%"

constraints:
  budget: ""              # time/money for optimization
  risk_tolerance: ""      # "Can we change the schema?" "Can we add caching?"
  deadline: ""            # "Must fix before Black Friday"

hypothesis:
  primary: ""             # "N+1 queries in the new recommendation engine"
  secondary: ""           # "Connection pool exhaustion under load"
  evidence: ""            # "Slow query log shows 200+ queries per request"
```
Set budgets BEFORE building, not after complaints:

| Metric | Web App | API | Mobile | Batch Job |
|---|---|---|---|---|
| P50 response | <200ms | <100ms | <300ms | N/A |
| P95 response | <500ms | <250ms | <800ms | N/A |
| P99 response | <1s | <500ms | <1.5s | N/A |
| Error rate | <0.1% | <0.01% | <0.5% | <0.001% |
| Time to Interactive | <3s | N/A | <2s | N/A |
| Memory per request | <50MB | <20MB | <100MB | <1GB |
| CPU per request | <100ms | <50ms | <200ms | N/A |
| Throughput | 100+ rps | 500+ rps | N/A | items/min |
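Budgets are only useful if everyone computes the percentiles the same way. A minimal sketch, assuming you collect raw per-request latencies in milliseconds (the budget numbers below mirror the API column of the table; the function names are illustrative, not part of the packaged skill):

```js
// percentile-budget-check.js: illustrative sketch, not part of the packaged skill.
// Computes p50/p95/p99 from raw latency samples and compares them to a budget.

function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
}

// Example budget (API column from the table above)
const budget = { p50: 100, p95: 250, p99: 500 };

function checkBudget(latenciesMs) {
  const results = {};
  for (const [name, limit] of Object.entries(budget)) {
    const value = percentile(latenciesMs, Number(name.slice(1)));
    results[name] = { value: Math.round(value), limit, ok: value <= limit };
  }
  return results;
}

// Usage with fake data: 1000 requests, mostly fast, with a slow tail
const samples = Array.from({ length: 1000 }, () => 50 + Math.random() * 300);
console.table(checkBudget(samples));
```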
Never optimize without measuring first. Never measure without a hypothesis.
```
Is it slow?
├── YES → Where is time spent?
│   ├── CPU-bound → Profile CPU (flame graph)
│   │   ├── Hot function found → Optimize algorithm/data structure
│   │   └── Spread evenly → Architecture problem (too many layers)
│   ├── I/O-bound → Profile I/O
│   │   ├── Database → Query analysis (Phase 4)
│   │   ├── Network → Connection profiling
│   │   ├── Disk → I/O scheduler + buffering
│   │   └── External API → Caching + async + circuit breaker
│   ├── Memory-bound → Profile allocations
│   │   ├── GC pressure → Reduce allocations, pool objects
│   │   ├── Memory leak → Heap snapshot comparison
│   │   └── Cache thrashing → Resize or eviction policy
│   └── Concurrency-bound → Profile locks/contention
│       ├── Lock contention → Reduce critical section, lock-free structures
│       ├── Thread starvation → Pool sizing
│       └── Deadlock → Lock ordering analysis
└── NO → Define "fast enough" (see budgets above)
```
Node.js

```bash
# Built-in profiler (V8)
node --prof app.js
node --prof-process isolate-*.log > profile.txt

# Inspector-based (connect Chrome DevTools)
node --inspect app.js
# Open chrome://inspect → Profiler → Start

# Clinic.js (best overall Node.js profiler)
npx clinic doctor -- node app.js
npx clinic flame -- node app.js       # Flame graph
npx clinic bubbleprof -- node app.js  # Async bottlenecks

# 0x (flame graphs)
npx 0x app.js
```

Python

```python
# cProfile (built-in)
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
# ... code to profile ...
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20

# Line profiler (pip install line-profiler)
# Add @profile decorator, then:
#   kernprof -l -v script.py

# py-spy (sampling profiler, no code changes)
#   pip install py-spy
#   py-spy top --pid <PID>
#   py-spy record -o profile.svg --pid <PID>   # Flame graph

# Scalene (CPU + memory + GPU)
#   pip install scalene
#   scalene script.py
```

Go

```go
// Built-in pprof
import (
    "net/http"
    _ "net/http/pprof"
    "runtime/pprof"
)

// HTTP server (add to existing server)
// Access: http://localhost:6060/debug/pprof/
go func() {
    http.ListenAndServe(":6060", nil)
}()

// CLI analysis:
//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
//   go tool pprof -http=:8080 profile.out   # Web UI
```

Java

```bash
# async-profiler (best for JVM)
# https://github.com/async-profiler/async-profiler
./asprof -d 30 -f profile.html <PID>

# JFR (built-in since JDK 11)
java -XX:StartFlightRecording=duration=60s,filename=rec.jfr MyApp
jfr print --events CPULoad rec.jfr

# jstack (thread dump)
jstack <PID> > threads.txt
```
Leak Detection Pattern (any language)

```
1. Take heap snapshot at T0
2. Run suspected operation N times
3. Force GC
4. Take heap snapshot at T1
5. Compare: objects that grew = potential leak
6. Check: are they reachable? From where? (retention path)
```

Node.js Memory

```js
// Heap snapshot
const v8 = require('v8');
const fs = require('fs');

function takeSnapshot(label) {
  const snapshotStream = v8.writeHeapSnapshot();
  console.log(`Heap snapshot written to ${snapshotStream}`);
}

// Process memory monitoring
setInterval(() => {
  const mem = process.memoryUsage();
  console.log({
    rss_mb: (mem.rss / 1048576).toFixed(1),
    heap_used_mb: (mem.heapUsed / 1048576).toFixed(1),
    heap_total_mb: (mem.heapTotal / 1048576).toFixed(1),
    external_mb: (mem.external / 1048576).toFixed(1),
  });
}, 10000);
```

Python Memory

```python
# tracemalloc (built-in)
import tracemalloc

tracemalloc.start()
# ... code ...
snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics('lineno')
for stat in top[:10]:
    print(stat)

# objgraph (pip install objgraph)
import objgraph
objgraph.show_most_common_types(limit=20)
objgraph.show_growth(limit=10)  # Call twice to see what's growing
```
| Problem | Bad O() | Fix | Good O() |
|---|---|---|---|
| Search unsorted array | O(n) | Sort + binary search, or use Set/Map | O(log n) or O(1) |
| Nested loop matching | O(n²) | Hash map lookup | O(n) |
| Repeated string concat | O(n²) | StringBuilder/join array | O(n) |
| Sorting already-sorted data | O(n log n) | Check if sorted first | O(n) |
| Finding duplicates | O(n²) | Set-based detection | O(n) |
| Frequent min/max of changing data | O(n) per query | Heap/priority queue | O(log n) |
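To make the second row concrete, here is a minimal before/after sketch (Node.js; the `orders`/`users` shapes and field names are assumptions for illustration) that replaces an O(n²) nested-loop match with an O(n) Map lookup:

```js
// match-orders.js: illustrative sketch; field names are assumptions.

// BAD: O(n * m), scans all users for every order
function attachUsersSlow(orders, users) {
  return orders.map(order => ({
    ...order,
    user: users.find(u => u.id === order.userId),
  }));
}

// GOOD: O(n + m), build a Map once, then constant-time lookups
function attachUsersFast(orders, users) {
  const byId = new Map(users.map(u => [u.id, u]));
  return orders.map(order => ({ ...order, user: byId.get(order.userId) }));
}

// Usage
const users = [{ id: 1, name: 'Ada' }, { id: 2, name: 'Lin' }];
const orders = [{ id: 'a', userId: 2 }, { id: 'b', userId: 1 }];
console.log(attachUsersFast(orders, users));
```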
```yaml
# Sizing formula
pool_size: min(available_cores * 2 + effective_spindle_count, max_connections / num_instances)

# Rules of thumb:
# - PostgreSQL: connections = cores * 2 + 1 (per pgBouncer docs)
# - MySQL: keep total connections < 150 for most workloads
# - HTTP clients: match to concurrent request volume
# - Redis: usually 5-10 per instance is enough

# Warning signs of pool problems:
# - "connection timeout" errors under load
# - Response time spikes at regular intervals
# - Idle connections holding resources
# - Connection count hitting max_connections
```
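A small sketch of the sizing formula above (Node.js; `spindles`, `maxConnections`, and `instances` are placeholder inputs you would read from your own environment):

```js
// pool-size.js: illustrative sketch of the sizing formula above.
const os = require('os');

function suggestedPoolSize({ spindles = 1, maxConnections, instances }) {
  const cores = os.cpus().length;
  const byCpu = cores * 2 + spindles;                            // core-based heuristic
  const byServerLimit = Math.floor(maxConnections / instances);  // don't exhaust the DB
  return Math.max(1, Math.min(byCpu, byServerLimit));
}

// Example: 8-core app servers, Postgres max_connections=100, 4 app instances
console.log(suggestedPoolSize({ maxConnections: 100, instances: 4 }));
```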
```js
// BAD: Sequential when independent
const user = await getUser(id);
const orders = await getOrders(id);
const prefs = await getPreferences(id);
// Total: user_time + orders_time + prefs_time

// GOOD: Parallel when independent
const [user, orders, prefs] = await Promise.all([
  getUser(id),
  getOrders(id),
  getPreferences(id),
]);
// Total: max(user_time, orders_time, prefs_time)

// GOOD: Controlled concurrency for many items
// (npm: p-limit, p-map, or manual semaphore)
import pLimit from 'p-limit';
const limit = pLimit(10); // Max 10 concurrent
const results = await Promise.all(
  items.map(item => limit(() => processItem(item)))
);
```

```python
# Python: asyncio for I/O-bound
import asyncio

async def fetch_all(ids):
    # Parallel
    tasks = [fetch_one(id) for id in ids]
    return await asyncio.gather(*tasks)

# Python: ProcessPoolExecutor for CPU-bound
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(cpu_intensive_fn, items))
```
SYMPTOM: Response time scales linearly with result count
DETECTION: Enable query logging, count queries per request

```python
# Bad: N+1
users = db.query("SELECT * FROM users LIMIT 100")
for user in users:
    orders = db.query(f"SELECT * FROM orders WHERE user_id = {user.id}")
# Result: 1 + 100 = 101 queries

# Fix 1: JOIN
#   SELECT u.*, o.* FROM users u
#   LEFT JOIN orders o ON o.user_id = u.id
#   LIMIT 100

# Fix 2: Batch load (better for large datasets)
users = db.query("SELECT * FROM users LIMIT 100")
user_ids = [u.id for u in users]
orders = db.query(f"SELECT * FROM orders WHERE user_id IN ({','.join(map(str, user_ids))})")
# Result: 2 queries regardless of count

# Fix 3: ORM eager loading
#   Drizzle:    .with(users.orders)
#   SQLAlchemy: joinedload(User.orders)
#   Prisma:     include: { orders: true }
```
For every slow query:

```
□ Run EXPLAIN ANALYZE (not just EXPLAIN)
□ Check: is it doing a sequential scan on a large table?
□ Check: is the row estimate accurate? (bad stats = bad plan)
□ Check: are there implicit type casts preventing index use?
□ Check: is it sorting more data than needed? (add LIMIT earlier)
□ Check: is it joining in the right order?
□ Check: can a covering index eliminate table lookups?
□ Check: is the query running during peak hours? (schedule if batch)
```
```sql
-- PostgreSQL EXPLAIN output reading guide:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...;

-- Key metrics to check:
-- 1. Actual time vs estimated time (large gap = stale stats → ANALYZE)
-- 2. Rows actual vs estimated (>10x off = bad stats)
-- 3. Seq Scan on large table (>10K rows) = needs index
-- 4. Sort with external merge = needs more work_mem or index
-- 5. Nested Loop with large outer = consider hash/merge join
-- 6. Buffers shared hit vs read (low hit ratio = needs more shared_buffers)
```
```yaml
# load-test-plan.yaml
test_name: ""
target: ""    # URL/endpoint
date: ""

scenarios:
  - name: "Baseline"
    description: "Normal traffic pattern"
    vus: 50             # Virtual users
    duration: "5m"
    ramp_up: "30s"
    think_time: "1-3s"  # Pause between requests

  - name: "Peak"
    description: "2x normal traffic (expected peak)"
    vus: 100
    duration: "10m"
    ramp_up: "1m"

  - name: "Stress"
    description: "Find the breaking point"
    vus_start: 50
    vus_end: 500
    step_duration: "2m"  # Add users every 2 min
    step_size: 50

  - name: "Soak"
    description: "Memory leaks, connection exhaustion"
    vus: 50
    duration: "2h"

pass_criteria:
  p95_response_ms: 500
  error_rate_pct: 0.1
  throughput_rps: 200
```
```js
// load-test.js (run: k6 run load-test.js)
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

const errorRate = new Rate('errors');
const responseTime = new Trend('response_time');

export const options = {
  stages: [
    { duration: '30s', target: 20 },  // Ramp up
    { duration: '3m', target: 20 },   // Steady
    { duration: '30s', target: 50 },  // Peak
    { duration: '3m', target: 50 },   // Steady peak
    { duration: '30s', target: 0 },   // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95% under 500ms
    errors: ['rate<0.01'],             // <1% error rate
  },
};

export default function () {
  const res = http.get('https://api.example.com/endpoint');
  check(res, {
    'status 200': (r) => r.status === 200,
    'response < 500ms': (r) => r.timings.duration < 500,
  });
  errorRate.add(res.status !== 200);
  responseTime.add(res.timings.duration);
  sleep(Math.random() * 2 + 1);  // 1-3s think time
}
```
```
METRIC   │ GOOD    │ NEEDS WORK │ POOR   │ HOW TO FIX
─────────┼─────────┼────────────┼────────┼──────────────────────────────
LCP      │ <2.5s   │ 2.5-4s     │ >4s    │ Optimize largest image/text
FID/INP  │ <100ms  │ 100-300ms  │ >300ms │ Break up long tasks, defer JS
CLS      │ <0.1    │ 0.1-0.25   │ >0.25  │ Set dimensions, font-display

LCP FIXES (in priority order):
1. Preload the LCP image: <link rel="preload" as="image" href="...">
2. Use responsive images: srcset with correct sizes
3. Serve WebP/AVIF (30-50% smaller)
4. Remove render-blocking CSS/JS from <head>
5. Use CDN for static assets
6. Server-side render the above-fold content

INP FIXES:
1. Break long tasks (>50ms) with requestIdleCallback or setTimeout(0)
2. Use web workers for CPU-intensive work
3. Debounce/throttle event handlers
4. Defer non-critical JS: <script defer> or dynamic import()
5. Avoid layout thrashing (batch DOM reads, then batch writes)

CLS FIXES:
1. Always set width/height on <img> and <video>
2. Use aspect-ratio CSS for dynamic content
3. Reserve space for ads/embeds
4. Use font-display: swap with size-adjusted fallback
5. Never insert content above existing content
```
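To make INP fix 1 concrete, here is a minimal sketch (plain browser JavaScript; `items` and `processItem` are placeholders for your own data and handler) that splits a long loop into small batches and yields to the main thread between them:

```js
// chunked-work.js: illustrative sketch; processItem/items are hypothetical.
// Processes a large array in small batches, yielding to the event loop so
// input handlers can run in between (keeps individual tasks short).

function processInChunks(items, processItem, chunkSize = 100) {
  return new Promise((resolve) => {
    let i = 0;
    function runChunk() {
      const end = Math.min(i + chunkSize, items.length);
      for (; i < end; i++) processItem(items[i]);
      if (i < items.length) {
        setTimeout(runChunk, 0); // yield to the main thread before the next batch
      } else {
        resolve();
      }
    }
    runChunk();
  });
}

// Usage (hypothetical):
//   await processInChunks(bigArray, renderRow);
```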
```
VERTICAL SCALING (scale up):
✓ Quick fix, no code changes
✓ Database servers (often best first move)
✓ Memory-bound workloads
✗ Diminishing returns past 8-16 cores
✗ Single point of failure
✗ Expensive at high end

HORIZONTAL SCALING (scale out):
✓ Stateless services (APIs, workers)
✓ Read-heavy workloads (read replicas)
✓ Geographic distribution
✗ Requires stateless design
✗ Adds complexity (load balancing, session management)
✗ Not all workloads parallelize

SCALING CHECKLIST:
□ Can we optimize the code first? (cheapest option)
□ Can we add caching? (often 10-100x improvement)
□ Can we add a read replica? (if read-heavy)
□ Can we queue and process async? (if latency-tolerant)
□ Can we scale vertically? (if CPU/memory bound)
□ Do we need horizontal scaling? (if all above exhausted)
```
```yaml
# Kubernetes HPA example
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # Scale at 70% CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 1m before scaling up
      policies:
        - type: Percent
          value: 50                     # Max 50% increase per step
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5m before scaling down
      policies:
        - type: Percent
          value: 25                     # Max 25% decrease per step
          periodSeconds: 120
```
```yaml
# capacity-model.yaml
service: ""
last_updated: ""

current_state:
  daily_requests: 0
  peak_rps: 0
  avg_response_ms: 0
  instances: 0
  cpu_peak_pct: 0
  memory_peak_pct: 0
  db_connections_peak: 0
  storage_used_gb: 0

growth_model:
  request_growth_monthly_pct: 0   # e.g., 15%
  storage_growth_monthly_gb: 0
  seasonal_peak_multiplier: 0     # e.g., 3x for Black Friday

projections:
  # Formula: current * (1 + growth_rate)^months * seasonal_multiplier
  3_month:
    daily_requests: 0
    peak_rps: 0
    instances_needed: 0
    storage_gb: 0
    estimated_cost: ""
  6_month:
    daily_requests: 0
    peak_rps: 0
    instances_needed: 0
    storage_gb: 0
    estimated_cost: ""
  12_month:
    daily_requests: 0
    peak_rps: 0
    instances_needed: 0
    storage_gb: 0
    estimated_cost: ""

headroom_rules:
  cpu: "Scale when sustained >70% for 5m"
  memory: "Scale when >80%"
  storage: "Alert when >75%, expand when >85%"
  db_connections: "Alert when >80% of max"
```
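The projection formula in the template compounds monthly growth and then applies the seasonal multiplier. A small sketch of filling in those cells (Node.js; the input numbers are made-up examples, not real data):

```js
// capacity-projection.js: illustrative sketch of the formula in the template.

function project(current, monthlyGrowthPct, months, seasonalMultiplier = 1) {
  // current * (1 + growth_rate)^months * seasonal_multiplier
  return current * Math.pow(1 + monthlyGrowthPct / 100, months) * seasonalMultiplier;
}

// Example inputs (hypothetical): 1200 peak rps today, 15% monthly growth,
// 3x seasonal peak expected around month 12.
const peakRps = 1200;
for (const months of [3, 6, 12]) {
  const seasonal = months === 12 ? 3 : 1;
  console.log(`${months} months: ~${Math.round(project(peakRps, 15, months, seasonal))} rps`);
}
```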
```yaml
# .github/workflows/perf-gate.yml
name: Performance Gate
on: pull_request

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run benchmarks
        run: |
          # Run your benchmark suite
          npm run benchmark -- --json > bench-results.json

      - name: Compare with baseline
        run: |
          # Compare against main branch baseline
          node scripts/compare-benchmarks.js \
            --baseline benchmarks/baseline.json \
            --current bench-results.json \
            --threshold 10   # Fail if >10% regression

      - name: Load test (on staging)
        if: github.base_ref == 'main'
        run: |
          k6 run --out json=load-results.json tests/load-test.js
          # Check thresholds automatically via k6

      - name: Bundle size check
        run: |
          npm run build
          node scripts/check-bundle-size.js \
            --max-size 250KB \
            --max-increase 5%
```
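The workflow assumes a `scripts/compare-benchmarks.js` helper that is not shown here. A hedged sketch of what such a script might look like, assuming both JSON files map benchmark names to a mean time in milliseconds:

```js
// scripts/compare-benchmarks.js: illustrative sketch, not shipped with the skill.
// Assumes both files look like: { "benchmarkName": { "mean_ms": 12.3 }, ... }
const fs = require('fs');

function arg(name, fallback) {
  const i = process.argv.indexOf(`--${name}`);
  return i === -1 ? fallback : process.argv[i + 1];
}

const baseline = JSON.parse(fs.readFileSync(arg('baseline'), 'utf8'));
const current = JSON.parse(fs.readFileSync(arg('current'), 'utf8'));
const thresholdPct = Number(arg('threshold', '10'));

let failed = false;
for (const [name, base] of Object.entries(baseline)) {
  const cur = current[name];
  if (!cur) continue; // benchmark removed or renamed; skip
  const deltaPct = ((cur.mean_ms - base.mean_ms) / base.mean_ms) * 100;
  const mark = deltaPct > thresholdPct ? 'REGRESSION' : 'ok';
  console.log(`${name}: ${base.mean_ms}ms -> ${cur.mean_ms}ms (${deltaPct.toFixed(1)}%) ${mark}`);
  if (deltaPct > thresholdPct) failed = true;
}

process.exit(failed ? 1 : 0);
```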
```
AUTOMATED CHECKS (run on every PR):
□ Unit benchmarks: critical path functions < threshold
□ Bundle size: total and per-chunk limits
□ Lighthouse CI: Core Web Vitals pass
□ Query count: no N+1 regressions (count queries per test)
□ Memory: no leak patterns in test suite

WEEKLY CHECKS (cron job):
□ Production p50/p95/p99 trends (compare to 4-week average)
□ Error rate trends
□ Database slow query log review
□ Infrastructure cost vs traffic ratio
□ Cache hit rates

MONTHLY REVIEW:
□ Capacity model update
□ Performance budget review
□ Top 10 slowest endpoints → optimization candidates
□ Cost-performance analysis
□ Load test full suite against staging
```
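The N+1 guard in the PR checks can be written as an ordinary test assertion. A minimal sketch (Jest-style Node.js test; `db.onQuery` and `listUsersWithOrders` are hypothetical names standing in for your own query hook and handler):

```js
// query-count.test.js: illustrative sketch; the db hook and handler are hypothetical.
// Counts queries issued during one request and fails if the number grows.

test('listing users does not regress into N+1 queries', async () => {
  let queryCount = 0;
  const unsubscribe = db.onQuery(() => { queryCount++; }); // hypothetical query hook

  await listUsersWithOrders({ limit: 100 });

  unsubscribe();
  // 1 query for users + 1 batched query for orders, regardless of result size
  expect(queryCount).toBeLessThanOrEqual(2);
});
```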
Score your system (0-100):

```
MEASUREMENT (25 points):
□ (5) Performance budgets defined for all key metrics
□ (5) Real User Monitoring (RUM) in production
□ (5) Alerting on p95 degradation
□ (5) Dashboards visible to team
□ (5) Regular load testing

PREVENTION (25 points):
□ (5) Performance gates in CI/CD
□ (5) Bundle size limits enforced
□ (5) Query count checks in tests
□ (5) Code review includes perf review
□ (5) Capacity planning model maintained

OPTIMIZATION (25 points):
□ (5) Caching strategy documented
□ (5) Database indexes reviewed quarterly
□ (5) No known N+1 queries
□ (5) Connection pools properly sized
□ (5) Async patterns used for I/O

OPERATIONS (25 points):
□ (5) Auto-scaling configured and tested
□ (5) Slow query logging enabled
□ (5) Memory leak monitoring
□ (5) Performance incident runbook exists
□ (5) Monthly performance review
```
```
1. PREMATURE OPTIMIZATION
   Problem: Optimizing before measuring
   Fix: Profile first, optimize the measured bottleneck

2. MICRO-BENCHMARKING IN ISOLATION
   Problem: Function is fast alone but slow in context (cache, contention)
   Fix: Always benchmark in realistic conditions with realistic data

3. OPTIMIZING THE WRONG LAYER
   Problem: Tuning app code when the DB is the bottleneck
   Fix: Use distributed tracing to find the actual bottleneck

4. CACHING EVERYTHING
   Problem: Cache invalidation bugs, stale data, memory pressure
   Fix: Cache selectively using the decision matrix (Phase 3)

5. PREMATURE HORIZONTAL SCALING
   Problem: Adding instances when a single instance is underoptimized
   Fix: Vertical optimization first, scale second

6. IGNORING TAIL LATENCY
   Problem: p50 is fine but p99 is terrible
   Fix: Investigate outliers — they're often the most important users

7. LOAD TESTING IN DEV
   Problem: Dev environment doesn't match production
   Fix: Load test against staging with production-like data

8. OPTIMIZING COLD PATHS
   Problem: Spending time on rarely-executed code
   Fix: Profile in production to find actual hot paths
```
| Task | Recommended Tool | Alternative |
|---|---|---|
| HTTP benchmarking | k6 | wrk, ab, hey |
| CPU profiling (Node) | clinic flame | 0x, --prof |
| CPU profiling (Python) | py-spy | Scalene, cProfile |
| CPU profiling (Go) | pprof | go tool trace |
| CPU profiling (Java) | async-profiler | JFR, VisualVM |
| Memory profiling | language-specific (see Phase 2) | |
| CLI benchmarking | hyperfine | time |
| Bundle analysis | webpack-bundle-analyzer | source-map-explorer |
| Web performance | Lighthouse | WebPageTest |
| DB query analysis | EXPLAIN ANALYZE | pgMustard, pganalyze |
| Distributed tracing | Jaeger, Zipkin | OpenTelemetry |
| APM | Datadog, New Relic | Grafana + Prometheus |
| Continuous profiling | Pyroscope | Parca |
"Profile this function" → CPU profiling with flame graph "Why is this endpoint slow" → Full investigation brief + profiling "Load test the API" → k6 test design and execution "Check for memory leaks" → Heap snapshot comparison workflow "Optimize this query" → EXPLAIN ANALYZE + index recommendations "Review frontend perf" → Core Web Vitals audit + bundle analysis "Plan capacity for 10x" → Capacity model with projections "Set up perf monitoring" → CI/CD gates + dashboards + alerts "Find the bottleneck" → Profiling decision tree walkthrough "Score our performance" → Performance review checklist (0-100) "Compare before and after" → Benchmark comparison methodology "Reduce bundle size" → Bundle analysis + reduction strategies