Requirements

- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Prometheus monitoring — scrape configuration, service discovery, recording rules, alert rules, and production deployment for infrastructure and application metrics.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.
Production Prometheus setup covering scrape configuration, service discovery, recording rules, alert rules, and operational best practices for infrastructure and application monitoring.
| Scenario | Example |
|---|---|
| Set up metrics collection | New service needs Prometheus scraping |
| Configure service discovery | K8s pods, file-based, or static targets |
| Create recording rules | Pre-compute expensive PromQL queries |
| Design alert rules | SLO-based alerts for availability and latency |
| Production deployment | HA setup with retention and storage planning |
| Troubleshoot scraping | Targets down, metrics missing, relabeling issues |
```
Applications ──(/metrics)──→ Prometheus Server ──→ AlertManager ──→ Slack/PD
      ↑                            │
client libraries                   ├──→ Grafana (dashboards)
 (prom client)                     └──→ Thanos/Cortex (long-term storage)
```
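Applications expose metrics at `/metrics` in Prometheus's plain-text exposition format, normally via an official client library (the `prom client` in the diagram). As a sketch of what that format actually looks like on the wire, here is a stdlib-only Python helper that renders one counter family; the function name and sample metrics are illustrative assumptions, not part of this skill's templates:

```python
# Sketch of the Prometheus text exposition format returned by /metrics.
# In production, use an official client library (e.g. prometheus_client)
# instead of hand-rolling this; the helper below is only illustrative.

def render_counter(name: str, help_text: str, samples) -> str:
    """Render one counter family in Prometheus exposition format.

    samples: list of (label_pairs, value), where label_pairs is a
    tuple of (label, value) string pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

samples = [
    ((("method", "GET"), ("status", "200")), 1027),
    ((("method", "POST"), ("status", "500")), 3),
]
body = render_counter("http_requests_total", "Total HTTP requests.", samples)
print(body)
```

Counter names end in `_total` and every label set becomes a distinct time series, which is why the anti-pattern table below warns against high-cardinality labels.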
```bash
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    region: us-west-2

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: node-exporter
    static_configs:
      - targets: ["node1:9100", "node2:9100", "node3:9100"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Application metrics (TLS)
  - job_name: my-app
    scheme: https
    metrics_path: /metrics
    tls_config:
      ca_file: /etc/prometheus/ca.crt
    static_configs:
      - targets: ["app1:9090", "app2:9090"]
```
```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

Pod annotations to enable scraping:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
```
```yaml
scrape_configs:
  - job_name: file-sd
    file_sd_configs:
      - files: ["/etc/prometheus/targets/*.json"]
        refresh_interval: 5m
```

`targets/production.json`:

```json
[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": { "env": "production", "service": "api" }
  }
]
```
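File-based discovery targets are plain JSON, so any inventory tooling can generate them. A minimal Python sketch, assuming a simple host list and a fixed port; since Prometheus watches the file for changes, it writes to a temp file and renames so a half-written file is never visible:

```python
import json
import os
import tempfile

def write_targets(path: str, hosts: list, env: str, service: str) -> str:
    """Render one file_sd target group and write it atomically.

    The inventory shape (host list + fixed :9090 port) is an assumption
    for illustration; adapt to your own source of truth.
    """
    groups = [{
        "targets": [f"{h}:9090" for h in hosts],
        "labels": {"env": env, "service": service},
    }]
    payload = json.dumps(groups, indent=2)
    # Write-then-rename so Prometheus's file watch never sees a partial file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    os.replace(tmp, path)  # atomic on POSIX
    return payload

out = write_targets("production.json", ["app1", "app2"], "production", "api")
```

Regenerating the file from CI or a config-management run is enough; no Prometheus reload is needed for file_sd changes.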
| Method | Best For | Dynamic |
|---|---|---|
| `static_configs` | Fixed infrastructure, dev | No |
| `file_sd_configs` | CM-managed inventories | Yes (file watch) |
| `kubernetes_sd_configs` | K8s workloads | Yes (API watch) |
| `consul_sd_configs` | Consul service mesh | Yes (Consul watch) |
| `ec2_sd_configs` | AWS EC2 instances | Yes (API poll) |
Pre-compute expensive queries for dashboard and alert performance:

```yaml
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      - record: job:http_error_rate:ratio
        expr: job:http_errors:rate5m / job:http_requests:rate5m
      - record: job:http_duration:p95
        expr: >
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      - record: instance:node_cpu:utilization
        expr: >
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:node_memory:utilization
        expr: >
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
      - record: instance:node_disk:utilization
        expr: >
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
```
Recording rule names follow the `level:metric_name:operations` convention:

| Part | Example | Meaning |
|---|---|---|
| level | `job:`, `instance:` | Aggregation level |
| metric_name | `http_requests` | Base metric |
| operations | `:rate5m`, `:ratio` | Applied functions |
```yaml
# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "{{ $labels.job }} down for >1 minute"
      - alert: HighErrorRate
        expr: job:http_error_rate:ratio > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} for {{ $labels.job }}"
      - alert: HighP95Latency
        expr: job:http_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency {{ $value }}s for {{ $labels.job }}"

  - name: resources
    rules:
      - alert: HighCPU
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "CPU {{ $value }}% on {{ $labels.instance }}"
      - alert: HighMemory
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Memory {{ $value }}% on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Disk {{ $value }}% on {{ $labels.instance }}"
```
| Severity | Threshold | Response |
|---|---|---|
| critical | Service down, data loss risk | Page on-call immediately |
| warning | Degraded, approaching limit | Investigate within hours |
| info | Notable but not urgent | Review in next business day |
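These severity labels are what AlertManager routes on. A minimal `alertmanager.yml` sketch mapping them to the responses above; the receiver names (`pagerduty-oncall`, `slack-alerts`) and integration details are assumptions to adapt to your notification setup:

```yaml
# alertmanager.yml — routing sketch keyed on the severity label.
route:
  receiver: default
  group_by: [alertname, job]
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall   # page on-call immediately
    - matchers: ['severity="warning"']
      receiver: slack-alerts       # investigate within hours

receivers:
  - name: default
  - name: pagerduty-oncall        # add pagerduty_configs here
  - name: slack-alerts            # add slack_configs here
```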
```bash
# Validate config syntax
promtool check config prometheus.yml

# Validate rule files
promtool check rules /etc/prometheus/rules/*.yml

# Test a query
promtool query instant http://localhost:9090 'up'

# Reload config without restart
curl -X POST http://localhost:9090/-/reload
```
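Beyond syntax checks, `promtool test rules` can unit-test alerting logic against synthetic series. A sketch, assuming the `ServiceDown` rule above and a hypothetical `tests.yml` file name:

```yaml
# tests.yml — run with: promtool test rules tests.yml
rule_files:
  - /etc/prometheus/rules/alert_rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="my-app", instance="app1:9090"}'
        values: "0 0 0"   # target down for three evaluations
    alert_rule_test:
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-app
              instance: app1:9090
            exp_annotations:
              summary: "app1:9090 is down"
              description: "my-app down for >1 minute"
```

Running rule tests in CI catches broken PromQL before it silences a production alert.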
| Practice | Detail |
|---|---|
| Naming: `prefix_name_unit` | Snake_case, `_total` for counters, `_seconds`/`_bytes` for units |
| Scrape intervals 15–60s | Shorter wastes resources and storage |
| Recording rules for dashboards | Pre-compute anything queried repeatedly |
| Monitor Prometheus itself | `prometheus_tsdb_*`, `scrape_duration_seconds` |
| HA deployment | 2+ instances scraping same targets |
| Retention planning | Match `--storage.tsdb.retention.time` to disk capacity |
| Federation for scale | Global Prometheus aggregates from regional instances |
| Long-term storage | Thanos or Cortex for >30d retention |
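For the federation pattern, the global instance scrapes each regional instance's `/federate` endpoint. A sketch of the global scrape config; the regional target names are assumptions, and pulling only `job:`-prefixed recording-rule series keeps the federated set small:

```yaml
# Global Prometheus: pull pre-aggregated series from regional instances.
scrape_configs:
  - job_name: federate
    honor_labels: true           # keep the regional labels as-is
    metrics_path: /federate
    params:
      "match[]":
        - '{__name__=~"job:.*"}' # only recording-rule aggregates
    static_configs:
      - targets: ["prometheus-us-west:9090", "prometheus-us-east:9090"]
```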
| Problem | Diagnosis | Fix |
|---|---|---|
| Target shows DOWN | Check /targets page for error | Fix firewall, verify endpoint, check TLS |
| Metrics missing | Query `up{job="x"}` | Verify scrape config, check /metrics endpoint |
| High cardinality | `prometheus_tsdb_head_series` growing | Drop high-cardinality labels with `metric_relabel_configs` |
| Storage filling up | Check `prometheus_tsdb_storage_*` | Reduce retention, add disk, enable compaction |
| Slow queries | Check `prometheus_engine_query_duration_seconds` | Add recording rules, reduce range, limit series |
| Config not applied | Check `prometheus_config_last_reload_successful` | Fix syntax, POST /-/reload |
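For the high-cardinality fix, `metric_relabel_configs` runs after the scrape but before ingestion, so offending labels or series never reach the TSDB. A sketch, assuming a hypothetical `request_id` label and `debug_*` metric family:

```yaml
scrape_configs:
  - job_name: my-app
    static_configs:
      - targets: ["app1:9090"]
    metric_relabel_configs:
      # Drop one high-cardinality label, keeping the rest of the series
      - regex: request_id
        action: labeldrop
      # Drop entire noisy metric families by name
      - source_labels: [__name__]
        regex: "debug_.*"
        action: drop
```

Note that `labeldrop` merges series that differed only by the dropped label, so use it only on labels that do not carry aggregation meaning.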
| Anti-Pattern | Why | Do Instead |
|---|---|---|
| Scrape interval < 5s | Overwhelms targets and storage | Use 15–60s intervals |
| High-cardinality labels (user ID, request ID) | Explodes TSDB series count | Use logs for high-cardinality data |
| Alert without `for` duration | Fires on transient spikes | Always set `for: 1m` minimum |
| Skip recording rules | Dashboards compute expensive queries every load | Pre-compute with recording rules |
| Store secrets in prometheus.yml | Config often in Git | Use file-based secrets or env substitution |
| Ignore `up` metric | Miss targets silently going down | Alert on `up == 0` for all jobs |
| Single Prometheus instance in prod | Single point of failure | Run 2+ replicas with shared targets |
| Unbounded retention | Disk fills, Prometheus crashes | Set explicit `--storage.tsdb.retention.time` |
| Template | Description |
|---|---|
| `templates/prometheus.yml` | Full config with static, file-based, and K8s discovery |
| `templates/alert-rules.yml` | 25+ alert rules by category |
| `templates/recording-rules.yml` | Pre-computed metrics for HTTP, latency, resources, SLOs |