
Prometheus

Prometheus monitoring patterns, cardinality management, alerting best practices, and PromQL traps.




Install for OpenClaw

Quick setup
  1. Download the package from Yavira.
  2. Extract the archive and review SKILL.md first.
  3. Import or place the package into your OpenClaw setup.

Requirements

  • Target platform: OpenClaw
  • Install method: Manual import
  • Extraction: Extract archive
  • Prerequisites: OpenClaw
  • Primary doc: SKILL.md

Package facts

  • Download mode: Yavira redirect
  • Package format: ZIP package
  • Source platform: Tencent SkillHub
  • What's included: SKILL.md

Validation

  • Use the Yavira download entry.
  • Review SKILL.md after the package is downloaded.
  • Confirm the extracted package contains the expected setup assets.

Install with your agent

Agent handoff

Hand the extracted package to your coding agent with a concrete install brief, rather than walking through the install steps yourself.

  1. Download the package from Yavira.
  2. Extract it into a folder your agent can access.
  3. Paste one of the prompts below and point your agent at the extracted folder.
New install

I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.

Upgrade existing

I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.

Trust & source

Release facts

  • Source: Tencent SkillHub
  • Verification: Indexed source record
  • Version: 1.0.0

Documentation

Primary doc: SKILL.md (11 sections)

Cardinality Explosions

  • Every unique label combination creates a new time series; a user_id label will kill Prometheus.
  • Avoid high-cardinality labels: user IDs, email addresses, request IDs, timestamps, UUIDs.
  • Check cardinality with the prometheus_tsdb_head_series metric; above 1M series needs attention.
  • Use histograms for latency, not per-request labels; buckets are fixed cardinality.
  • Relabeling can drop dangerous labels before ingestion: labeldrop in the scrape config.
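The labeldrop relabeling mentioned above can be sketched as a scrape job like the following (the job name, target, and user_id label are placeholders for your own setup):

```yaml
scrape_configs:
  - job_name: api            # placeholder job
    static_configs:
      - targets: ['api:9100']
    metric_relabel_configs:
      # Drop the high-cardinality label before ingestion;
      # series that are otherwise identical will be merged.
      - action: labeldrop
        regex: user_id
```

Note that metric_relabel_configs runs after the scrape but before storage, so the dropped label never creates series in the TSDB.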

Histogram vs Summary

  • Histograms: use for SLOs; aggregatable across instances; buckets defined upfront.
  • Summaries: use when you need exact percentiles; cannot be aggregated across instances.
  • Histogram bucket boundaries must be defined before data arrives; wrong buckets mean wrong percentiles.
  • Default buckets (.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10) assume HTTP latency in seconds; adjust them for your use case.
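The cross-instance aggregation that only histograms support looks roughly like this (the metric name is illustrative):

```promql
# p99 latency aggregated across every instance of a job.
# This works because histogram buckets sum cleanly; summary
# quantiles cannot be combined this way.
histogram_quantile(0.99,
  sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
)
```

Keeping the le label in the sum is mandatory: it carries the bucket boundaries that histogram_quantile interpolates over.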

Rate and Increase

  • rate() requires a range selector of at least 4x the scrape interval; rate(metric[1m]) with a 30s scrape misses data.
  • rate() is per-second; increase() is the total over the range. Don't confuse them.
  • Counters reset on restart; rate() handles this, a raw delta does not.
  • irate() uses only the last two samples: too spiky for alerting. Use rate() for alerts.
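Side by side, with a window wide enough for a 30s scrape interval (the counter name is a placeholder):

```promql
# Per-second rate averaged over 5 minutes: use this in alerts.
rate(http_requests_total[5m])

# Total increase over the same window (roughly rate * 300 seconds).
increase(http_requests_total[5m])

# Instantaneous rate from the last two samples only:
# fine for dashboards, too spiky for alerting.
irate(http_requests_total[5m])
```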

Alerting Mistakes

  • Alert on symptoms, not causes: "high latency", not "high CPU".
  • The for clause prevents flapping: for: 5m means the condition must hold for 5 minutes before firing.
  • A missing for clause fires immediately on the first match, which is noisy.
  • Alerts need a runbook_url annotation; on-call needs to know what to do, not just that something is wrong.
  • Test alert rules with promtool check rules; syntax errors discovered at 3am are bad.
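A rule that follows these points might look like the sketch below; the alert name, threshold, metric, and runbook URL are all placeholders:

```yaml
groups:
  - name: latency-alerts
    rules:
      - alert: HighRequestLatency
        # Symptom-based: p99 latency, not CPU.
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m            # condition must hold 5 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "p99 latency above 500ms for 5 minutes"
          runbook_url: "https://example.com/runbooks/high-latency"
```

Validate before deploying: promtool check rules alerts.yml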

PromQL Traps

  • and is intersection by label sets, not boolean AND; results must have matching labels.
  • or fills in missing series; it does not perform a boolean OR on values.
  • {} without a metric name is expensive: it scans all metrics.
  • offset goes back in time: metric offset 1h is the value from 1 hour ago.
  • Comparison operators filter series: http_requests > 100 drops series at or below 100; it does not return a boolean.
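A few of these traps in query form (metric names are illustrative):

```promql
# 'and' is a set intersection on label sets, not boolean logic:
# keeps left-hand series whose instance also matches on the right.
http_errors_total and on (instance) up

# Comparison operators filter series; append 'bool' to get a
# 0/1 value for every series instead of dropping them.
http_requests_total > bool 100

# 'offset' follows the selector: the value as of one hour ago.
http_requests_total offset 1h
```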

Scrape Configuration

  • honor_labels: true trusts source labels; use it only when the source is authoritative (e.g., Pushgateway).
  • scrape_timeout must be less than scrape_interval, otherwise scrapes overlap.
  • Static configs only change on a config reload; use file_sd or service discovery for dynamic targets.
  • Disabling TLS verification (insecure_skip_verify) should be temporary, never permanent.
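A scrape job using file-based service discovery, as suggested above, might be sketched like this (paths and names are placeholders):

```yaml
scrape_configs:
  - job_name: dynamic-targets
    scrape_interval: 30s
    scrape_timeout: 10s            # must stay below scrape_interval
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
        refresh_interval: 1m       # target files are re-read on change
```

Target files are plain JSON or YAML lists of targets with labels, so any external system can rewrite them without touching Prometheus itself.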

Pushgateway Pitfalls

  • Pushgateway is for batch jobs, not services; services should expose /metrics themselves.
  • Pushed metrics persist until deleted; stale metrics from dead jobs confuse dashboards.
  • Add job and instance labels to distinguish sources; the default grouping hides failures.
  • Delete metrics when a job completes: curl -X DELETE http://pushgateway/metrics/job/myjob

Recording Rules

  • Pre-compute expensive queries: record: job:request_duration_seconds:rate5m
  • Naming convention is level:metric:operations, which helps identify what a rule produces.
  • Recording rules update every evaluation interval, not instantly; plan for a slight delay.
  • Reduce cardinality with recording rules: aggregate away labels you don't need for alerting.
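The recording rule named above could be defined roughly as follows; the underlying metric and the 30s interval are assumptions:

```yaml
groups:
  - name: request-rates
    interval: 30s    # evaluation interval; results lag by up to this much
    rules:
      - record: job:request_duration_seconds:rate5m
        # Aggregates away the instance label, keeping only job:
        # alerts query this cheap, low-cardinality series instead.
        expr: sum by (job) (rate(request_duration_seconds_sum[5m]))
```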

Federation and Remote Write

  • Federation pulls metrics from other Prometheus servers; use it sparingly, as it adds latency.
  • Use remote write for long-term storage; Prometheus local storage is not durable.
  • Remote write can buffer during outages, but the buffer is finite: extended outages lose data.
  • Prometheus is not highly available by default; run two instances scraping the same targets.
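A minimal remote write sketch; the endpoint URL is a placeholder and the queue numbers are illustrative starting points, not recommendations:

```yaml
remote_write:
  - url: https://metrics.example.com/api/v1/write
    queue_config:
      # The send buffer is in memory and finite: tune capacity and
      # shard counts for the outage window you need to survive.
      capacity: 10000
      min_shards: 1
      max_shards: 50
```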

Common Operational Issues

  • TSDB corruption on unclean shutdown: use --storage.tsdb.wal-compression and monitor disk space.
  • Memory grows with series count; each series costs roughly 3KB of RAM.
  • Compaction pauses during high load; leave 40% disk headroom.
  • Scrape targets stuck in "Unknown": check network, firewall, and that the target actually exposes /metrics.
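Two quick health checks tied to the points above, runnable in the expression browser:

```promql
# In-memory series count; per the cardinality section, values
# above roughly 1M warrant attention (and ~3KB RAM per series).
prometheus_tsdb_head_series

# Targets whose last scrape failed; a target stuck "Unknown"
# has never been scraped successfully and will not appear as 1.
up == 0
```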

Label Best Practices

  • Use labels for dimensions you'll filter or aggregate by: environment, service, instance.
  • Keep label values low-cardinality: tens or hundreds, not thousands.
  • Consistent naming: snake_case, prefixed with the domain, e.g. http_requests_total, node_cpu_seconds_total.
  • The le label is reserved for histogram buckets; don't use it for other purposes.

Category context

Agent frameworks, memory systems, reasoning layers, and model-native orchestration.

Source: Tencent SkillHub

Largest current source with strong distribution and engagement signals.

Package contents

Included in package
1 doc
  • SKILL.md (primary doc)