Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Prometheus monitoring patterns, cardinality management, alerting best practices, and PromQL traps.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
Install prompt:
"I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."

Upgrade prompt:
"I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
Cardinality management
- Every unique label combination creates a new time series → user_id as a label kills Prometheus
- Avoid high-cardinality labels: user IDs, email addresses, request IDs, timestamps, UUIDs
- Check cardinality via the prometheus_tsdb_head_series metric → above 1M series needs attention
- Use histograms for latency, not per-request labels → buckets are fixed cardinality
- Relabeling can drop dangerous labels before ingestion: labeldrop in the scrape config (sketched below)
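A minimal sketch of the labeldrop approach; the job name, target, and the user_id label are placeholder assumptions:

```yaml
scrape_configs:
  - job_name: myapp                    # hypothetical job name
    static_configs:
      - targets: ['myapp:8080']        # placeholder target
    metric_relabel_configs:
      # labeldrop matches label *names* against the regex and strips
      # matching labels before samples are ingested.
      - action: labeldrop
        regex: user_id
```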
Histograms vs. summaries
- Histograms: use for SLOs, aggregatable across instances, buckets defined upfront
- Summaries: use when you need exact percentiles; they cannot aggregate across instances
- Histogram bucket boundaries must be defined before data arrives → wrong buckets = wrong percentiles
- Default buckets (.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10) assume HTTP latency → adjust for your use case (see the sketch below)
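A sketch of the aggregation property, assuming an illustrative http_request_duration_seconds histogram: summing buckets across instances before taking the quantile is only valid because every instance shares the same fixed le boundaries.

```yaml
groups:
  - name: latency                      # hypothetical rule group
    rules:
      # Sum buckets across instances first, then compute the quantile;
      # fixed `le` boundaries are what make this aggregation legal.
      - record: job:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```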
rate() pitfalls
- rate() requires a range selector at least 4x the scrape interval → rate(metric[1m]) with a 30s scrape misses data
- rate() is per-second, increase() is the total over the range → don't confuse them
- Counters reset on restart → rate() handles this, a raw delta doesn't
- irate() uses only the last two samples → too spiky for alerting; use rate() for alerts (example below)
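A sketch assuming a 30s scrape interval, so a 2m window satisfies the 4x rule; the metric and rule names are illustrative:

```yaml
groups:
  - name: traffic
    rules:
      # 2m window = 4x a 30s scrape interval, so rate() always sees
      # enough samples. rate() yields per-second values.
      - record: job:http_requests:rate2m
        expr: sum by (job) (rate(http_requests_total[2m]))
      # increase() over the same window gives the total instead.
      - record: job:http_requests:increase2m
        expr: sum by (job) (increase(http_requests_total[2m]))
```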
Alerting best practices
- Alert on symptoms, not causes → "high latency", not "high CPU"
- The for clause prevents flapping: for: 5m means the condition must hold for 5 minutes before firing
- A missing for clause fires immediately on the first match → noisy
- Alerts need a runbook_url label → on-call needs to know what to do, not just that something's wrong
- Test alerts with promtool check rules → syntax errors discovered at 3am are bad (sketch below)
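A sketch of a symptom-based alert combining these points; the threshold, names, and runbook URL are placeholder assumptions:

```yaml
groups:
  - name: slo-alerts
    rules:
      - alert: HighRequestLatency        # symptom the user feels, not CPU
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m                           # must hold 5 minutes before firing
        labels:
          severity: page
          runbook_url: https://wiki.example.com/runbooks/high-latency  # placeholder
        annotations:
          summary: "p95 latency above 500ms on {{ $labels.job }}"
```

Validating the file with promtool check rules catches syntax errors at review time rather than at 3am.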
PromQL traps
- and is intersection by labels, not boolean AND → results must have matching label sets
- or fills in missing series; it doesn't do boolean OR on values
- A selector without a metric name ({job="..."} alone) is expensive → it scans all metrics
- offset goes back in time: metric offset 1h is the value from 1 hour ago
- Comparison operators filter series: http_requests > 100 drops series below 100; it doesn't return a boolean (see the sketch below)
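A sketch combining a comparison filter with and; all names are assumptions. on (instance) restricts the label match, so only error series whose instance also reports up == 1 survive.

```yaml
groups:
  - name: promql-example
    rules:
      - alert: ErrorsOnLiveInstance
        # `> 0.05` filters series rather than returning a boolean;
        # `and on (instance)` keeps only left-hand series whose
        # instance label also appears in `up == 1`.
        expr: rate(http_errors_total[5m]) > 0.05 and on (instance) up == 1
        for: 10m
```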
Scrape configuration
- honor_labels: true trusts source labels → use only when the source is authoritative (e.g., Pushgateway)
- scrape_timeout must be less than scrape_interval → otherwise scrapes overlap
- Static configs don't reload without a restart → use file_sd or service discovery for dynamic targets (sketched below)
- Disabling TLS verification (insecure_skip_verify) should be temporary, never permanent
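A sketch of file-based service discovery; the paths and intervals are example values:

```yaml
scrape_configs:
  - job_name: dynamic-services
    scrape_interval: 30s
    scrape_timeout: 10s                # must stay below scrape_interval
    file_sd_configs:
      # Prometheus watches these files and picks up target changes
      # without a restart.
      - files:
          - /etc/prometheus/targets/*.json
```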
Pushgateway
- Pushgateway is for batch jobs, not services → services should expose /metrics directly
- Metrics persist until deleted → stale metrics from dead jobs confuse dashboards
- Add job and instance labels to distinguish sources → the default grouping hides failures
- Delete metrics when a job completes: curl -X DELETE http://pushgateway/metrics/job/myjob (scrape setup sketched below)
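A sketch of the scraping side, with a placeholder address; this is the one case where honor_labels: true is usually right, because the pushed job/instance labels are authoritative:

```yaml
scrape_configs:
  - job_name: pushgateway
    # Keep the job/instance labels the batch jobs pushed instead of
    # overwriting them with the Pushgateway's own identity.
    honor_labels: true
    static_configs:
      - targets: ['pushgateway.example.com:9091']   # placeholder address
```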
Recording rules
- Pre-compute expensive queries: record: job:request_duration_seconds:rate5m
- Naming convention: level:metric:operations → helps identify what rules produce
- Recording rules update every evaluation interval → not instant; plan for a slight delay
- Reduce cardinality with recording rules: aggregate away labels you don't need for alerting (example below)
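A sketch of the naming convention while aggregating away a label; the metric is illustrative:

```yaml
groups:
  - name: precompute
    rules:
      # level:metric:operations → job-level, http_requests, rate over 5m.
      # `without (instance)` drops the per-instance dimension entirely.
      - record: job:http_requests:rate5m
        expr: sum without (instance) (rate(http_requests_total[5m]))
```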
Federation and remote write
- Federation pulls from other Prometheus servers → use sparingly; it adds latency
- Remote write for long-term storage → Prometheus local storage is not durable
- Remote write can buffer during outages → but the buffer is finite; extended outages lose data (sketch below)
- Prometheus is not highly available by default → run two instances scraping the same targets
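A sketch of a remote_write block with explicit queue sizing; the endpoint and numbers are placeholder assumptions, and no queue setting survives an indefinitely long outage:

```yaml
remote_write:
  - url: https://metrics-store.example.com/api/v1/write   # placeholder endpoint
    queue_config:
      capacity: 10000              # samples buffered per shard during outages
      max_shards: 50
      max_samples_per_send: 2000
```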
Operational issues
- TSDB corruption on unclean shutdown → use --storage.tsdb.wal-compression and monitor disk space
- Memory grows with series count → each series costs ~3KB of RAM (see the alert sketch below)
- Compaction pauses during high load → leave 40% disk headroom
- Scrape targets stuck in "Unknown" → check the network, the firewall, and that the target actually exposes /metrics
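Since memory tracks head series, a self-monitoring alert is a cheap guardrail; the 1M threshold echoes the cardinality guidance above and is an assumed starting point:

```yaml
groups:
  - name: prometheus-self
    rules:
      - alert: TSDBHeadSeriesHigh
        # At ~3KB per series, 1M head series is roughly 3GB of RAM
        # for series overhead alone.
        expr: prometheus_tsdb_head_series > 1e6
        for: 15m
        labels:
          severity: warn
```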
Labeling best practices
- Use labels for dimensions you'll filter or aggregate by → environment, service, instance
- Keep label values low-cardinality → tens or hundreds, not thousands
- Consistent naming: snake_case, prefixed with the domain: http_requests_total, node_cpu_seconds_total
- The le label is reserved for histogram buckets → don't use it for anything else