
Observability & Reliability Engineering

Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, buil...


Install for OpenClaw

Quick setup
  1. Download the package from Yavira.
  2. Extract the archive and review SKILL.md first.
  3. Import or place the package into your OpenClaw setup.

Requirements

Target platform
OpenClaw
Install method
Manual import
Extraction
Extract archive
Prerequisites
OpenClaw
Primary doc
SKILL.md

Package facts

Download mode
Yavira redirect
Package format
ZIP package
Source platform
Tencent SkillHub
What's included
README.md, SKILL.md

Validation

  • Use the Yavira download entry.
  • Review SKILL.md after the package is downloaded.
  • Confirm the extracted package contains the expected setup assets.

Install with your agent

Agent handoff

Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.

  1. Download the package from Yavira.
  2. Extract it into a folder your agent can access.
  3. Paste one of the prompts below and point your agent at the extracted folder.
New install

I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.

Upgrade existing

I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.

Trust & source

Release facts

Source
Tencent SkillHub
Verification
Indexed source record
Version
1.0.0

Documentation

Primary doc: SKILL.md (62 sections)

Observability & Reliability Engineering

Complete system for building observable, reliable services, from structured logging to incident response to SLO-driven development.

Quick Health Check (/16)

Score your current observability posture:

| Signal | Healthy (2) | Weak (1) | Missing (0) |
|--------|-------------|----------|-------------|
| Structured logging | JSON logs with trace_id correlation | Logs exist but unstructured | Console.log / print statements |
| Metrics collection | RED/USE metrics with dashboards | Some metrics, no dashboards | No metrics |
| Distributed tracing | Full request path with sampling | Partial traces, key services only | No tracing |
| Alerting | SLO-based alerts with runbooks | Threshold alerts, some runbooks | No alerts or all-noise |
| Incident response | Defined process with roles + post-mortems | Ad-hoc response, some docs | "Whoever notices fixes it" |
| SLOs defined | SLOs with error budgets tracked weekly | Informal availability targets | No reliability targets |
| On-call rotation | Structured rotation with escalation | Informal "call someone" | No on-call |
| Cost management | Observability budget tracked monthly | Some awareness of costs | No idea what you spend |

  • 12-16: Production-grade. Focus on optimization.
  • 8-11: Foundation exists. Fill the gaps systematically.
  • 4-7: Significant risk. Prioritize alerting + incident response.
  • 0-3: Flying blind. Start with Phase 1 immediately.
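The scoring above is simple enough to automate in a team survey. The following is a minimal sketch, assuming each of the eight signals is scored 0, 1, or 2; the function name `health_verdict` and the signal keys are illustrative and not part of the skill package.

```python
from typing import Dict

# Score bands from the Quick Health Check: (minimum total, verdict).
BANDS = [
    (12, "Production-grade. Focus on optimization."),
    (8, "Foundation exists. Fill the gaps systematically."),
    (4, "Significant risk. Prioritize alerting + incident response."),
    (0, "Flying blind. Start with Phase 1 immediately."),
]


def health_verdict(scores: Dict[str, int]) -> str:
    """Sum per-signal scores (0, 1, or 2 each) and return the matching band."""
    for signal, value in scores.items():
        if value not in (0, 1, 2):
            raise ValueError(f"{signal}: score must be 0, 1, or 2, got {value}")
    total = sum(scores.values())
    for minimum, verdict in BANDS:
        if total >= minimum:
            return f"{total}/16: {verdict}"
    raise AssertionError("unreachable: the lowest band starts at 0")


if __name__ == "__main__":
    print(health_verdict({
        "structured_logging": 2, "metrics": 2, "tracing": 1, "alerting": 1,
        "incident_response": 1, "slos": 0, "on_call": 1, "cost": 0,
    }))
```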

Log Architecture

Application → Structured JSON → Log Router → Storage → Query Engine
                                    ↓
                              Alert Pipeline

Required Fields (Every Log Line)

| Field | Type | Purpose | Example |
|-------|------|---------|---------|
| timestamp | ISO-8601 UTC | When | 2026-02-22T18:30:00.123Z |
| level | enum | Severity | info, warn, error, fatal |
| service | string | Which service | payment-api |
| version | string | Which deploy | v2.3.1 |
| environment | string | Which env | production |
| message | string | What happened | Payment processed successfully |
| trace_id | string | Request correlation | abc123def456 |
| span_id | string | Operation within trace | span_789 |
| duration_ms | number | How long | 142 |

Contextual Fields (Add Per Domain)

    # HTTP request context
    http:
      method: POST
      path: /api/v1/orders
      status: 201
      client_ip: 203.0.113.42  # Anonymize in logs if needed
      user_agent: "Mozilla/5.0..."
      request_id: "req_abc123"

    # Business context
    business:
      user_id: "usr_456"
      tenant_id: "tenant_789"
      order_id: "ord_012"
      action: "checkout"
      amount_cents: 4999
      currency: "USD"

    # Error context
    error:
      type: "PaymentDeclinedError"
      message: "Card declined: insufficient funds"
      code: "CARD_DECLINED"
      stack: "..."  # Only in non-production or DEBUG level
      retry_count: 2
      retryable: true

Log Level Decision Tree

Is the process about to crash?
  → FATAL (exit after logging)
Did an operation fail that needs human attention?
  → ERROR (page someone or create ticket)
Did something unexpected happen but we recovered?
  → WARN (review in daily triage)
Is this a normal business event worth recording?
  → INFO (audit trail, business metrics)
Is this useful for debugging but noisy in production?
  → DEBUG (off in prod, on in staging)
Is this only useful when stepping through code?
  → TRACE (never in production)
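The tree is evaluated top to bottom, so the first question answered "yes" wins. A minimal sketch of that ordering follows; the function and parameter names are hypothetical and chosen only to mirror the questions above.

```python
def choose_level(about_to_crash: bool = False,
                 needs_human_action: bool = False,
                 unexpected_but_recovered: bool = False,
                 business_event: bool = False,
                 debug_only: bool = False) -> str:
    """Return the log level for the first decision-tree question answered yes."""
    if about_to_crash:
        return "FATAL"
    if needs_human_action:
        return "ERROR"
    if unexpected_but_recovered:
        return "WARN"
    if business_event:
        return "INFO"
    if debug_only:
        return "DEBUG"
    # Nothing above applied: only useful when stepping through code.
    return "TRACE"
```

A failed operation that is also a business event is logged once, at ERROR, because the earlier question takes priority.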

Log Level Rules

  • ERROR means action required: if no one needs to act on it, it's WARN
  • INFO is for business events, not internal implementation details
  • No logging inside tight loops: aggregate and log a summary
  • Log at boundaries: API entry/exit, queue consume/publish, DB calls
  • Never log secrets: API keys, tokens, passwords, PII (see scrubbing below)

PII & Secret Scrubbing

    scrub_patterns:
      # Always redact
      - field_patterns: ["password", "secret", "token", "api_key", "authorization"]
        action: replace_with_redacted
      # Hash for correlation without exposure
      - field_patterns: ["email", "phone", "ssn", "national_id"]
        action: sha256_hash
      # Mask partially
      - field_patterns: ["credit_card", "card_number"]
        action: mask_last_4  # "****-****-****-1234"
      # IP anonymization
      - field_patterns: ["client_ip", "ip_address"]
        action: zero_last_octet  # 203.0.113.0
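One way to apply these four actions before a record reaches the log sink is sketched below. It assumes field names are matched by case-insensitive substring and that records are nested dictionaries; the `scrub` function and the `[REDACTED]` placeholder are illustrative choices, not something the skill package prescribes.

```python
import hashlib
from typing import Any, Dict, Tuple

REDACT = ("password", "secret", "token", "api_key", "authorization")
HASH = ("email", "phone", "ssn", "national_id")
MASK = ("credit_card", "card_number")
IP = ("client_ip", "ip_address")


def _matches(key: str, patterns: Tuple[str, ...]) -> bool:
    """Case-insensitive substring match of a field name against patterns."""
    lowered = key.lower()
    return any(pattern in lowered for pattern in patterns)


def scrub(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of record with sensitive fields scrubbed, recursing into dicts."""
    out: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, dict):
            out[key] = scrub(value)
        elif _matches(key, REDACT):
            out[key] = "[REDACTED]"
        elif _matches(key, HASH):
            out[key] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        elif _matches(key, MASK):
            digits = [c for c in str(value) if c.isdigit()]
            out[key] = "****-****-****-" + "".join(digits[-4:])
        elif _matches(key, IP):
            parts = str(value).split(".")
            # Only IPv4 is zeroed here; anything else is redacted outright.
            out[key] = ".".join(parts[:3] + ["0"]) if len(parts) == 4 else "[REDACTED]"
        else:
            out[key] = value
    return out
```

Substring matching errs on the side of over-scrubbing (a field named `tokens_used` would be redacted too), which is usually the safer failure mode. An unsalted SHA-256 of a low-entropy value such as a phone number can be reversed by brute force, so a keyed hash may be preferable where that matters.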

Logger Setup (By Language)

Node.js (Pino):

    import pino from 'pino';
    import { AsyncLocalStorage } from 'node:async_hooks';
    import { randomUUID } from 'node:crypto';

    // Per-request context, merged into every log line by the mixin below
    const als = new AsyncLocalStorage<Record<string, string>>();

    const logger = pino({
      level: process.env.LOG_LEVEL || 'info',
      formatters: {
        level: (label) => ({ level: label }),
      },
      mixin: () => als.getStore() ?? {},
      redact: ['req.headers.authorization', '*.password', '*.token'],
      timestamp: pino.stdTimeFunctions.isoTime,
    });

    // Middleware: inject context (assumes `app` is an existing Express app)
    app.use((req, res, next) => {
      const ctx = {
        trace_id: req.get('x-trace-id') ?? randomUUID(),
        request_id: randomUUID(),
        service: 'payment-api',
        version: process.env.APP_VERSION ?? 'unknown',
      };
      als.run(ctx, () => next());
    });

Python (structlog):

    import structlog

    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso", utc=True),
            structlog.processors.JSONRenderer(),
        ],
    )
    log = structlog.get_logger()

    # Bind context per-request:
    structlog.contextvars.bind_contextvars(trace_id=trace_id, user_id=user_id)

Go (zerolog):

    log := zerolog.New(os.Stdout).With().
        Timestamp().
        Str("service", "payment-api").
        Str("version", version).
        Logger()

    // Per-request:
    reqLog := log.With().Str("trace_id", traceID).Logger()

Log Storage Decision

| Volume | Solution | Retention | Cost |
|--------|----------|-----------|------|
| <10 GB/day | Loki + Grafana | 30 days hot, 90 days cold | Low |
| 10-100 GB/day | Elasticsearch / OpenSearch | 14 days hot, 90 days S3 | Medium |
| 100+ GB/day | ClickHouse or Datadog | 7 days hot, 30 days archive | High |
| Budget-constrained | Loki + S3 backend | 90 days all cold | Very low |

10 Logging Anti-Patterns

| # | Anti-Pattern | Fix |
|---|--------------|-----|
| 1 | log.error(err) with no context | Always include: what operation, what input, what state |
| 2 | Logging request/response bodies | Log only in DEBUG; redact sensitive fields |
| 3 | String concatenation in log messages | Use structured fields: log.info("processed", { order_id, amount }) |
| 4 | Catch-and-log-and-rethrow | Log at the boundary where you handle it, not every layer |
| 5 | Different log formats per service | Standardize schema across all services |
| 6 | No log rotation / retention policy | Set max size + TTL; archive to cold storage |
| 7 | Logging inside hot paths | Aggregate: log summary every N items or every interval |
| 8 | Missing correlation IDs | Propagate trace_id from first entry point through all services |
| 9 | Boolean log levels (verbose: true) | Use standard levels with configurable minimum |
| 10 | Logging PII in plain text | Implement scrubbing at the logger level |

The RED Method (Request-Driven Services)

For every service endpoint, track:

| Metric | What | Prometheus Example |
|--------|------|--------------------|
| Rate | Requests per second | http_requests_total{method, path, status} |
| Errors | Failed requests per second | http_requests_total{status=~"5.."} / total |
| Duration | Latency distribution | http_request_duration_seconds{method, path} (histogram) |

The USE Method (Infrastructure Resources)

For every resource (CPU, memory, disk, network):

| Metric | What | Example |
|--------|------|---------|
| Utilization | % resource busy | CPU usage 78% |
| Saturation | Queue depth / backpressure | 12 requests queued |
| Errors | Resource errors | 3 disk I/O errors |

Golden Signals (Google SRE)

| Signal | Meaning | Source |
|--------|---------|--------|
| Latency | Time to serve requests | RED Duration |
| Traffic | Demand on the system | RED Rate |
| Errors | Rate of failed requests | RED Errors |
| Saturation | How "full" the service is | USE Saturation |

Metric Types & When to Use Each

| Type | Use Case | Example |
|------|----------|---------|
| Counter | Things that only go up | Total requests, errors, bytes sent |
| Gauge | Current value that goes up/down | Active connections, queue depth, temperature |
| Histogram | Distribution of values | Request latency, response size |
| Summary | Pre-calculated percentiles | Client-side latency (when you need exact percentiles) |

Rule: Use histograms over summaries in most cases, because they're aggregatable across instances.

Naming Conventions

    # Pattern: <namespace>_<subsystem>_<name>_<unit>
    http_server_request_duration_seconds
    http_server_requests_total
    db_pool_connections_active
    queue_messages_pending
    cache_hit_ratio

    # Rules:
    # 1. Use snake_case
    # 2. Include unit suffix (_seconds, _bytes, _total)
    # 3. _total suffix for counters
    # 4. Don't include label names in metric name
    # 5. Use base units (seconds not milliseconds, bytes not kilobytes)

Label Design Rules

| Rule | Why | Example |
|------|-----|---------|
| Keep cardinality <100 per label | High cardinality kills performance | status="200" not status="200 OK" |
| No user IDs as labels | Unbounded cardinality | Use log correlation instead |
| No request paths with IDs | /api/users/123 creates millions of series | Normalize: /api/users/:id |
| Max 5-7 labels per metric | Each combo = a time series | {method, path, status, service} |
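The path normalization rule can be sketched as a small helper applied before the path is used as a label value. This version assumes identifiers are either all-digit segments or UUIDs; the helper name `normalize_path` is hypothetical, and where a web framework exposes the matched route template, using that directly is usually more reliable.

```python
import re

# A path segment counts as an ID if it is all digits or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def normalize_path(path: str) -> str:
    """Drop the query string and replace ID-like segments with ':id'."""
    segments = path.split("?", 1)[0].split("/")
    return "/".join(":id" if _ID_SEGMENT.match(s) else s for s in segments)
```

With this in place, `/api/users/123` and `/api/users/456` both map to the single series label `/api/users/:id`.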

Instrumentation Checklist

    application_metrics:
      # HTTP layer
      - http_request_duration_seconds: histogram {method, path, status}
      - http_request_size_bytes: histogram {method, path}
      - http_response_size_bytes: histogram {method, path}
      - http_requests_in_flight: gauge

      # Business logic
      - orders_processed_total: counter {status, payment_method}
      - order_value_dollars: histogram {payment_method}
      - user_signups_total: counter {source}

      # Dependencies
      - db_query_duration_seconds: histogram {query_type, table}
      - db_connections_active: gauge {pool}
      - db_connections_idle: gauge {pool}
      - cache_requests_total: counter {result: hit|miss}
      - external_api_duration_seconds: histogram {service, endpoint}
      - external_api_errors_total: counter {service, error_type}

      # Queue / async
      - queue_messages_published_total: counter {queue}
      - queue_messages_consumed_total: counter {queue, status}
      - queue_processing_duration_seconds: histogram {queue}
      - queue_depth: gauge {queue}
      - queue_consumer_lag: gauge {queue, consumer_group}

    infrastructure_metrics:
      # Node exporter / cAdvisor provides these automatically
      - cpu_usage_percent: gauge {instance}
      - memory_usage_bytes: gauge {instance}
      - disk_usage_bytes: gauge {instance, mount}
      - disk_io_seconds: counter {instance, device}
      - network_bytes: counter {instance, direction}
      - container_cpu_usage: gauge {pod, container}
      - container_memory_usage: gauge {pod, container}

Stack Recommendations

| Component | Options | Recommendation |
|-----------|---------|----------------|
| Collection | Prometheus, OTEL Collector, Datadog Agent | Prometheus (free) or OTEL Collector (vendor-neutral) |
| Storage | Prometheus, Thanos, Mimir, VictoriaMetrics | VictoriaMetrics (best cost/perf) or Mimir (Grafana ecosystem) |
| Visualization | Grafana, Datadog, New Relic | Grafana (free, extensible) |
| Alerting | Alertmanager, Grafana Alerting, PagerDuty | Alertmanager + PagerDuty routing |

Trace Architecture

Client Request
  → API Gateway (root span)
    → Auth Service (child span)
    → Order Service (child span)
      → Database Query (child span)
      → Payment Service (child span)
        → Stripe API (child span)
      → Notification Service (child span)
        → Email Provider (child span)

OpenTelemetry Setup

Auto-instrumentation (Node.js):

    // tracing.ts: import BEFORE anything else
    import { NodeSDK } from '@opentelemetry/sdk-node';
    import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
    import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

    const sdk = new NodeSDK({
      traceExporter: new OTLPTraceExporter({
        url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
      }),
      instrumentations: [getNodeAutoInstrumentations({
        '@opentelemetry/instrumentation-http': {
          // Skip health probes. Older releases used `ignoreIncomingPaths` instead.
          ignoreIncomingRequestHook: (req) => ['/health', '/ready'].includes(req.url ?? ''),
        },
        '@opentelemetry/instrumentation-express': { enabled: true },
      })],
      serviceName: process.env.OTEL_SERVICE_NAME || 'payment-api',
    });

    sdk.start();

Custom spans for business logic:

    import { trace, SpanStatusCode } from '@opentelemetry/api';

    const tracer = trace.getTracer('payment-service');

    async function processPayment(order: Order) {
      return tracer.startActiveSpan('process-payment', async (span) => {
        span.setAttributes({
          'order.id': order.id,
          'order.amount_cents': order.amountCents,
          'payment.method': order.paymentMethod,
        });
        try {
          const result = await chargeCard(order);
          span.setAttributes({ 'payment.status': result.status });
          return result;
        } catch (err) {
          // `err` is `unknown` under strict TypeScript, so narrow it first
          const error = err instanceof Error ? err : new Error(String(err));
          span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
          span.recordException(error);
          throw err;
        } finally {
          span.end();
        }
      });
    }

Sampling Strategies

| Strategy | When | Config |
|----------|------|--------|
| Always On | Dev/staging, low traffic (<100 rps) | ratio: 1.0 |
| Probabilistic | Moderate traffic (100-1000 rps) | ratio: 0.1 (10%) |
| Rate-limited | High traffic (>1000 rps) | max_traces_per_second: 100 |
| Tail-based | Want all errors + slow requests | Collector-side: keep if error OR duration > p99 |
| Parent-based | Respect upstream decisions | If parent sampled, child sampled |

Recommendation: Start with parent-based + probabilistic (10%). Add tail-based at the collector to capture all errors.
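The decision logic behind the parent-based, probabilistic, and tail-based rows can be illustrated in a few lines. This is a simplified sketch and not the OpenTelemetry SDK's implementation: it assumes a 32-character hex trace ID and derives the head-sampling decision from its low 64 bits, so every service reaches the same answer for the same trace. All function names are hypothetical.

```python
from typing import Optional


def head_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic probabilistic decision based on the low 64 bits of the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bucket = int(trace_id[-16:], 16)
    return bucket < int(ratio * 2**64)


def should_sample(trace_id: str, ratio: float,
                  parent_sampled: Optional[bool] = None) -> bool:
    """Parent-based: follow the upstream decision when there is one."""
    if parent_sampled is not None:
        return parent_sampled
    return head_sample(trace_id, ratio)


def tail_keep(has_error: bool, duration_ms: float, p99_ms: float) -> bool:
    """Tail-based rule applied once the trace is complete: keep errors and slow requests."""
    return has_error or duration_ms > p99_ms
```

Head sampling decides before the request runs and is cheap; the tail rule needs the finished trace, which is why it lives at the collector.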

Context Propagation

| Header | Standard | Format |
|--------|----------|--------|
| traceparent | W3C Trace Context | 00-{trace_id}-{span_id}-{flags} |
| tracestate | W3C Trace Context | Vendor-specific key-value pairs |
| b3 | Zipkin B3 | {trace_id}-{span_id}-{sampled} |

Rule: Use W3C Trace Context (traceparent) as primary. Support B3 for legacy Zipkin systems.
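In practice the OpenTelemetry propagators handle these headers, but parsing `traceparent` by hand shows what the four fields carry. The sketch below handles the version `00` layout only and rejects the all-zero IDs that the W3C specification marks invalid; `TraceContext` and `parse_traceparent` are illustrative names.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent/span-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))
```

A caller that gets `None` back should start a new trace rather than propagate the broken header.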

Trace Storage

| Volume | Solution | Retention |
|--------|----------|-----------|
| <50 GB/day | Jaeger + Elasticsearch | 7 days |
| 50-500 GB/day | Tempo + S3 | 14 days |
| 500+ GB/day | Tempo + S3 with aggressive sampling | 7 days |
| Budget-constrained | Jaeger + Badger (local disk) | 3 days |

SLI Selection by Service Type

| Service Type | Primary SLI | Secondary SLI | Measurement |
|--------------|-------------|---------------|-------------|
| API / Web | Availability + Latency | Error rate | Server-side + synthetic |
| Data pipeline | Freshness + Correctness | Throughput | Pipeline timestamps + checksums |
| Storage | Durability + Availability | Latency | Checksums + uptime monitoring |
| Streaming | Throughput + Latency | Message loss rate | Consumer lag + e2e latency |
| Batch jobs | Success rate + Freshness | Duration | Job scheduler metrics |

SLO Definition Template

    slo:
      name: "Payment API Availability"
      service: payment-api
      owner: payments-team

      sli:
        type: availability
        definition: "Proportion of non-5xx responses"
        measurement: |
          sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="payment-api"}[5m]))

      target: 99.95%  # 21.9 min downtime per average month (21.6 min per exact 30 days)
      window: rolling_30d

      error_budget:
        total_minutes: 21.9  # average-month figure; an exact 30-day window gives 21.6

        burn_rate_alerts:
          - severity: critical
            burn_rate: 14.4x  # 2% of budget consumed in 1 hour (all of it in ~2 days)
            short_window: 5m
            long_window: 1h
          - severity: warning
            burn_rate: 6x  # Budget consumed in 5 days
            short_window: 30m
            long_window: 6h
          - severity: ticket
            burn_rate: 1x  # Budget consumed in 30 days
            short_window: 6h
            long_window: 3d

        consequences:
          budget_remaining_above_50pct: "Normal development velocity"
          budget_remaining_20_to_50pct: "Prioritize reliability work"
          budget_remaining_below_20pct: "Feature freeze; reliability only"
          budget_exhausted: "All hands on reliability until budget recovers"
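The arithmetic behind the budget and burn-rate figures is worth checking by hand. The sketch below assumes availability is measured as a request ratio and that the burn rate is the observed error ratio divided by the allowed error ratio; the function names are illustrative. For a 99.95% target over an exact 30-day window the budget works out to 21.6 minutes, while the 21.9 figure quoted in this document corresponds to an average calendar month of about 30.44 days.

```python
def error_budget_minutes(slo_target: float, window_days: float = 30.0) -> float:
    """Allowed downtime in minutes for a target such as 0.9995."""
    return (1.0 - slo_target) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Hours until the whole window's budget is gone at a constant burn rate."""
    return window_days * 24 / rate


def should_alert(short_window_ratio: float, long_window_ratio: float,
                 slo_target: float, threshold: float) -> bool:
    """Multi-window rule: both windows must exceed the burn-rate threshold."""
    return (burn_rate(short_window_ratio, slo_target) >= threshold
            and burn_rate(long_window_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    print(round(error_budget_minutes(0.9995), 1))  # 30-day budget in minutes
    print(round(hours_to_exhaustion(14.4), 1))     # hours to exhaust at 14.4x
    print(round(hours_to_exhaustion(6), 1))        # hours to exhaust at 6x
```

Requiring both windows keeps the alert from firing on a brief spike (the long window stays low) and lets it clear quickly once the problem stops (the short window drops).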

Common SLO Targets

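| Service Tier | Availability | p50 Latency | p99 Latency | Monthly Downtime |
|--------------|--------------|-------------|-------------|------------------|
| Tier 0 (payments, auth) | 99.99% | <100ms | <500ms | 4.3 min |
| Tier 1 (core API) | 99.95% | <200ms | <1s | 21.9 min |
| Tier 2 (non-critical) | 99.9% | <500ms | <2s | 43.8 min |
| Tier 3 (internal tools) | 99.5% | <1s | <5s | 3.6 hours |
| Batch / pipeline | 99% (success rate) | N/A | N/A | N/A |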

Error Budget Tracking

    # Weekly error budget review template
    error_budget_review:
      week: "2026-W08"
      service: payment-api
      slo_target: 99.95%

      budget:
        total_minutes_this_period: 21.9
        consumed_minutes: 8.2
        remaining_minutes: 13.7
        remaining_percent: 62.6%

      incidents_consuming_budget:
        - date: "2026-02-18"
          duration_minutes: 5.1
          cause: "Database connection pool exhaustion"
          preventable: true
          action: "Increase pool size + add saturation alert"
        - date: "2026-02-20"
          duration_minutes: 3.1
          cause: "Upstream payment provider timeout"
          preventable: false
          action: "Add circuit breaker with fallback"

      velocity_decision: "Normal: 62.6% budget remaining"

      reliability_work_this_week:
        - "Add connection pool saturation alert"
        - "Implement circuit breaker for payment provider"
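The review's numbers and its velocity decision follow mechanically from the budget and the incident durations. A minimal sketch, assuming the thresholds from the SLO template's consequences (above 50%, 20 to 50%, below 20%, exhausted); `budget_status` is a hypothetical helper name.

```python
from typing import Tuple


def budget_status(total_minutes: float, consumed_minutes: float) -> Tuple[float, float, str]:
    """Return (remaining minutes, remaining percent, velocity decision)."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    remaining = total_minutes - consumed_minutes
    percent = remaining / total_minutes * 100
    if percent <= 0:
        decision = "All hands on reliability until budget recovers"
    elif percent < 20:
        decision = "Feature freeze; reliability only"
    elif percent <= 50:
        decision = "Prioritize reliability work"
    else:
        decision = "Normal development velocity"
    return round(remaining, 1), round(percent, 1), decision


if __name__ == "__main__":
    # Two incidents of 5.1 and 3.1 minutes against a 21.9-minute budget.
    print(budget_status(21.9, 5.1 + 3.1))
```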

Alert Quality Principles

  • Every alert must be actionable: if no one needs to act, it's not an alert
  • Every alert needs a runbook, linked directly in the alert annotation
  • Symptom-based over cause-based: alert on "users can't checkout", not "CPU high"
  • Multi-window burn rate, not static thresholds (see SLO alerts above)
  • Alert on absence, not just presence: "no orders in 15 min" catches silent failures

Alert Severity Levels

| Severity | Response Time | Channel | Who | Example |
|----------|---------------|---------|-----|---------|
| P0 - Critical | <5 min | Page (PagerDuty/Opsgenie) | On-call engineer | Payment system down |
| P1 - High | <30 min | Page during business hours, Slack 24/7 | On-call | Error rate >5% for 10 min |
| P2 - Medium | <4 hours | Slack channel | Team | p99 latency degraded 2x |
| P3 - Low | Next business day | Ticket auto-created | Team backlog | Disk usage >80% |
| Info | N/A | Dashboard only | No one | Deploy completed |

Alerting Anti-Patterns

| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| Static CPU/memory thresholds | Noisy, not user-impacting | Use SLO-based burn rate alerts |
| Alert per instance | 50 instances = 50 alerts for same issue | Aggregate: alert on service-level error rate |
| No deduplication | Same alert fires 100 times | Group by service + alert name; set repeat interval |
| Missing runbook | Engineer gets paged, doesn't know what to do | Every alert links to a runbook |
| Threshold too sensitive | Fires on brief spikes | Use for: 5m to require sustained condition |
| Too many P0s | Alert fatigue → ignoring real incidents | Audit monthly; demote or remove noisy alerts |

Alert Template (Prometheus Alertmanager)

    groups:
      - name: payment-api-slo
        rules:
          - alert: PaymentAPIHighErrorRate
            expr: |
              (
                sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total{service="payment-api"}[5m]))
              ) > 0.01
            for: 5m
            labels:
              severity: critical
              service: payment-api
              team: payments
            annotations:
              summary: "Payment API error rate {{ $value | humanizePercentage }} (>1%)"
              description: "5xx error rate has exceeded 1% for 5 minutes"
              runbook: "https://wiki.internal/runbooks/payment-api-errors"
              dashboard: "https://grafana.internal/d/payment-api"

          - alert: PaymentAPINoTraffic
            expr: |
              sum(rate(http_requests_total{service="payment-api"}[15m])) == 0
            for: 5m
            labels:
              severity: critical
              service: payment-api
            annotations:
              summary: "Payment API receiving zero traffic for 5 minutes"
              runbook: "https://wiki.internal/runbooks/payment-api-no-traffic"

          - alert: PaymentAPILatencyHigh
            expr: |
              histogram_quantile(0.99,
                sum(rate(http_request_duration_seconds_bucket{service="payment-api"}[5m])) by (le)
              ) > 2
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Payment API p99 latency {{ $value }}s (>2s for 10min)"
              runbook: "https://wiki.internal/runbooks/payment-api-latency"

Runbook Template

    # Runbook: PaymentAPIHighErrorRate

    ## What This Alert Means
    The payment API is returning >1% 5xx errors over a 5-minute window.
    Users are likely failing to complete checkouts.

    ## Impact
    - Users cannot process payments
    - Revenue loss: ~$X per minute (based on average traffic)
    - SLO: Payment API availability (target: 99.95%)

    ## Immediate Actions
    1. Check the error dashboard: [link]
    2. Check recent deploys: `kubectl rollout history deployment/payment-api`
    3. Check upstream dependencies:
       - Database: [dashboard link]
       - Stripe API: [status page]
       - Redis cache: [dashboard link]
    4. Check application logs:
       kubectl logs -l app=payment-api --since=10m | jq 'select(.level=="error")'

    ## Common Causes & Fixes
    | Cause | Diagnosis | Fix |
    |-------|-----------|-----|
    | Bad deploy | Errors started at deploy time | `kubectl rollout undo deployment/payment-api` |
    | DB connection exhaustion | `db_connections_active` at max | Restart pods (rolling) + increase pool size |
    | Stripe outage | Stripe status page red | Enable fallback payment processor |
    | Memory leak | Memory climbing, OOMKilled events | Rolling restart + investigate |

    ## Escalation
    - If unresolved after 15 min: page payment team lead
    - If revenue impact >$10K: page VP Engineering
    - If Stripe outage: communicate to support team for customer messaging

    ## Resolution
    - Confirm error rate <0.1% for 10 min
    - Post in #incidents: root cause + duration + impact
    - Schedule post-mortem if downtime >5 min

Dashboard Hierarchy

L1: Executive / Business Dashboard (non-technical stakeholders)
  ↓
L2: Service Overview Dashboard (on-call, quick triage)
  ↓
L3: Service Deep-Dive Dashboard (debugging specific service)
  ↓
L4: Infrastructure Dashboard (resource-level details)

L1: Business Dashboard

    panels:
      - title: "Revenue per Minute"
        type: stat
        query: "sum(rate(orders_total{status='completed'}[5m])) * avg(order_value_dollars)"

      - title: "Active Users (5min)"
        type: stat
        query: "count(count by (user_id) (http_requests_total{...}[5m]))"

      - title: "Checkout Success Rate"
        type: gauge
        query: "sum(rate(checkout_total{status='success'}[1h])) / sum(rate(checkout_total[1h]))"
        thresholds: [95, 98, 99.5]

      - title: "Error Budget Remaining"
        type: gauge
        query: "1 - (error_budget_consumed / error_budget_total)"

L2: Service Overview Dashboard

Every service gets one of these with identical layout:

    row_1_traffic:
      - "Request Rate (rps)": timeseries, by status code
      - "Error Rate (%)": timeseries, threshold line at SLO
      - "Active Requests": gauge

    row_2_latency:
      - "Latency Distribution": heatmap
      - "p50 / p95 / p99": timeseries, threshold lines
      - "Latency by Endpoint": table, sorted by p99

    row_3_dependencies:
      - "Downstream Latency": timeseries per dependency
      - "Downstream Error Rate": timeseries per dependency
      - "Database Query Duration": timeseries by query type

    row_4_resources:
      - "CPU Usage": timeseries per pod
      - "Memory Usage": timeseries per pod
      - "Pod Restarts": stat

    row_5_business:
      - "Business Metric 1": service-specific
      - "Business Metric 2": service-specific

Dashboard Rules

  • Time range default: last 1 hour, since most debugging happens in recent time
  • Variable selectors at top: environment, service, instance
  • Consistent color coding: green=good, yellow=degraded, red=bad across all dashboards
  • Link alerts to dashboards: every alert annotation includes the dashboard URL
  • No more than 15 panels per dashboard; split into L3 if needed
  • Include an "as of" timestamp so screenshots in incidents are unambiguous
  • Dashboard as code: store Grafana JSON in git, provision via API

Incident Severity Classification

| Severity | Criteria | Response | Communication |
|----------|----------|----------|---------------|
| SEV-1 | Service down, data loss risk, security breach | All hands, war room | Status page update every 15 min |
| SEV-2 | Degraded service, SLO at risk, partial outage | On-call + backup | Status page update every 30 min |
| SEV-3 | Minor degradation, workaround exists | On-call during hours | Internal Slack update |
| SEV-4 | Cosmetic, low impact | Next sprint | None |

Incident Roles

| Role | Responsibility | Who |
|------|----------------|-----|
| Incident Commander (IC) | Owns the incident. Coordinates. Makes decisions. | On-call lead |
| Technical Lead | Diagnoses and fixes. Communicates technical status to IC. | Senior engineer |
| Communications Lead | Updates status page, Slack, stakeholders. | Product/support |
| Scribe | Documents timeline, actions, decisions in real-time. | Anyone available |

Incident Response Workflow

1. DETECT
   - Alert fires → on-call paged
   - Customer report → support escalates
   - Internal discovery → engineer reports

2. TRIAGE (first 5 minutes)
   - Confirm the issue is real (not false alert)
   - Classify severity (SEV-1 through SEV-4)
   - Open incident channel: #inc-YYYY-MM-DD-short-description
   - Assign roles (IC, Tech Lead, Comms)

3. MITIGATE (next 5-30 minutes)
   - Goal: STOP THE BLEEDING, not find root cause
   - Options (try in order):
     a. Rollback last deploy
     b. Scale up / restart pods
     c. Toggle feature flag off
     d. Redirect traffic / enable fallback
     e. Manual data fix
   - Document every action with timestamp

4. STABILIZE
   - Confirm mitigation is working (metrics back to normal)
   - Monitor for 15-30 min for recurrence
   - Update status page: "Monitoring fix"

5. RESOLVE
   - Confirm all metrics healthy for 30+ min
   - Update status page: "Resolved"
   - Schedule post-mortem (within 48 hours for SEV-1/2)
   - Send internal summary to stakeholders

Incident Channel Template

📋 Incident: Payment API 5xx Errors
🔴 Severity: SEV-2
🕐 Started: 2026-02-22 14:23 UTC
👤 IC: @alice
🔧 Tech Lead: @bob
📢 Comms: @charlie

Status: MITIGATING
Impact: ~5% of checkout requests failing
Customer-facing: Yes

Timeline:
14:23 - Alert fired: PaymentAPIHighErrorRate
14:25 - IC assigned: @alice, confirmed real via dashboard
14:28 - Tech Lead: error logs show connection pool exhaustion post-deploy
14:31 - Rolled back deployment v2.3.1 → v2.3.0
14:35 - Error rate dropping, monitoring
14:50 - Error rate <0.1%, marking resolved

Blameless Post-Mortem Template

    post_mortem:
      title: "Payment API Connection Pool Exhaustion"
      date: "2026-02-22"
      severity: SEV-2
      duration: 27 minutes (14:23-14:50 UTC)
      authors: ["@alice", "@bob"]
      reviewers: ["@engineering-leads"]
      status: action_items_in_progress

      summary: |
        A deployment at 14:15 introduced a connection leak in the payment API.
        Connection pool was exhausted by 14:23, causing 5xx errors for ~5% of
        checkout requests. Rolled back at 14:31; recovered by 14:50.

      impact:
        user_impact: "~340 users saw checkout failures over 27 minutes"
        revenue_impact: "$2,100 estimated (based on average order value × failed checkouts)"
        slo_impact: "Consumed 5.1 min of 21.9 min monthly error budget (23%)"
        data_impact: "No data loss. 12 orders failed; users could retry successfully."

      timeline:
        - time: "14:15"
          event: "Deploy v2.3.1 rolled out (3/3 pods updated)"
        - time: "14:23"
          event: "PaymentAPIHighErrorRate alert fired"
        - time: "14:25"
          event: "IC assigned, confirmed via dashboard"
        - time: "14:28"
          event: "Root cause identified: new ORM query not releasing connections"
        - time: "14:31"
          event: "Rollback initiated: v2.3.1 → v2.3.0"
        - time: "14:35"
          event: "Error rate declining"
        - time: "14:50"
          event: "Resolved: error rate <0.1% sustained"

      root_cause: |
        The v2.3.1 deploy introduced a new database query in the order validation
        path. The query used a raw connection instead of the pool's managed client,
        so connections were acquired but never released. Under load, the pool
        exhausted within 8 minutes.

      contributing_factors:
        - "No integration test for connection pool behavior under load"
        - "Connection pool saturation metric existed but had no alert"
        - "Code review didn't catch raw connection usage"

      what_went_well:
        - "Alert fired within 8 minutes of deploy"
        - "IC assigned in 2 minutes"
        - "Root cause identified in 3 minutes (clear in logs)"
        - "Rollback executed cleanly"

      what_went_wrong:
        - "8-minute detection gap after deploy"
        - "No canary deployment to catch before full rollout"
        - "Connection pool saturation had no alert"

      action_items:
        - action: "Add connection pool saturation alert (>80% for 2 min)"
          owner: "@bob"
          priority: P1
          due: "2026-02-25"
          status: in_progress
          ticket: "ENG-1234"
        - action: "Enable canary deployments for payment-api"
          owner: "@alice"
          priority: P1
          due: "2026-03-01"
          ticket: "ENG-1235"
        - action: "Add linting rule: no raw DB connections in application code"
          owner: "@charlie"
          priority: P2
          due: "2026-03-07"
          ticket: "ENG-1236"
        - action: "Load test payment-api connection pool in staging"
          owner: "@bob"
          priority: P2
          due: "2026-03-07"
          ticket: "ENG-1237"

      lessons_learned:
        - "Resource saturation metrics need alerts, not just dashboards"
        - "Canary deployments are mandatory for Tier 0 services"
        - "ORM abstractions don't guarantee connection safety; review raw queries"

Post-Mortem Meeting Agenda (60 minutes)

  1. (5 min) Context setting: IC reads the summary
  2. (15 min) Timeline walkthrough: what happened, when, by whom
  3. (15 min) Root cause deep-dive: 5 Whys exercise
  4. (5 min) What went well: celebrate good response
  5. (15 min) Action items: assign owners, priorities, due dates
  6. (5 min) Wrap-up: review date for action item check-in

5 Whys Exercise

Problem: 5xx errors in payment API
Why 1: Database connections were exhausted
Why 2: A new query acquired connections without releasing them
Why 3: The query used a raw connection instead of the pool manager
Why 4: The ORM's raw query API doesn't auto-release (by design)
Why 5: We don't have a linting rule or code review checklist item for this

Root cause: Missing guard against raw connection usage in application code
Systemic fix: Linting rule + connection pool saturation alerting

On-Call Structure

    on_call:
      rotation: weekly
      handoff_day: Monday 10:00 UTC

      primary:
        response_time: 5 minutes (SEV-1/2), 30 minutes (SEV-3)
        escalation_after: 15 minutes no-ack
      secondary:
        response_time: 15 minutes (SEV-1), 1 hour (SEV-2/3)
        escalation_after: 30 minutes no-ack
      manager_escalation:
        trigger: SEV-1 unresolved after 30 minutes

      handoff_checklist:
        - Review open incidents and active alerts
        - Check error budget status for all services
        - Read post-mortems from previous week
        - Verify PagerDuty schedule and contact info
        - Test alert routing (send test page)

On-Call Health Metrics

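| Metric | Healthy | Needs Attention | Unhealthy |
|--------|---------|-----------------|-----------|
| Pages per week | <5 | 5-15 | >15 |
| After-hours pages per week | <2 | 2-5 | >5 |
| False positive rate | <10% | 10-30% | >30% |
| Mean time to acknowledge | <5 min | 5-15 min | >15 min |
| Mean time to resolve | <30 min | 30-120 min | >120 min |
| Toil ratio (manual vs automated) | <30% | 30-60% | >60% |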
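These thresholds lend themselves to an automated weekly check. The sketch below encodes each row as a (healthy below, unhealthy above) pair; the metric keys and the `classify` function are hypothetical names, and boundary values fall into "needs attention" as the ranges in the table suggest.

```python
from typing import Dict, Tuple

# (healthy if below, unhealthy if above); anything in between needs attention.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "pages_per_week": (5, 15),
    "after_hours_pages_per_week": (2, 5),
    "false_positive_rate_pct": (10, 30),
    "mtta_minutes": (5, 15),
    "mttr_minutes": (30, 120),
    "toil_ratio_pct": (30, 60),
}


def classify(metric: str, value: float) -> str:
    """Map an on-call metric value to healthy / needs attention / unhealthy."""
    healthy_below, unhealthy_above = THRESHOLDS[metric]
    if value < healthy_below:
        return "healthy"
    if value > unhealthy_above:
        return "unhealthy"
    return "needs attention"


if __name__ == "__main__":
    # Example week: 7 pages, 3 of them after hours, 2 false positives.
    print(classify("pages_per_week", 7))
    print(classify("after_hours_pages_per_week", 3))
    print(classify("false_positive_rate_pct", 2 / 7 * 100))
```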

Weekly On-Call Review Template

    on_call_review:
      week: "2026-W08"
      engineer: "@bob"

      incidents:
        total: 7
        sev_1: 0
        sev_2: 1
        sev_3: 4
        false_positives: 2
        after_hours: 3

      time_spent:
        incident_response: "4.5 hours"
        toil_automation: "2 hours"
        runbook_updates: "1 hour"

      improvements_made:
        - "Silenced noisy disk alert on dev servers"
        - "Added auto-remediation for pod restart threshold"

      improvements_needed:
        - "Cache expiry alert fires every Tuesday at 03:00; needs investigation"
        - "Payment retry logic needs circuit breaker (caused 3 alerts)"

      handoff_notes: |
        Watch payment-api p99 latency; it's been creeping up since Wednesday.
        Stripe changed their sandbox endpoints; staging may throw errors.

Chaos Principles

  • Start with a hypothesis: "If X fails, the system should Y"
  • Run in production (start small: one instance, one AZ)
  • Minimize blast radius with automatic rollback
  • Build confidence incrementally: staging → canary → production

Chaos Experiment Template

    chaos_experiment:
      name: "Payment DB failover"
      hypothesis: "If the primary database becomes unavailable, traffic should failover to the replica within 30 seconds with <1% error rate spike"

      steady_state:
        - metric: "checkout_success_rate"
          expected: ">99.5%"
        - metric: "db_query_duration_p99"
          expected: "<200ms"

      injection:
        type: "network_partition"
        target: "payment-db-primary"
        duration: "5 minutes"
        blast_radius: "single AZ"

      abort_conditions:
        - "checkout_success_rate < 95% for > 60 seconds"
        - "revenue_per_minute drops > 50%"
        - "any SEV-1 incident declared"

      results:
        failover_time: "22 seconds"
        error_spike: "0.3% for 25 seconds"
        hypothesis_confirmed: true

      follow_up_actions:
        - "Document failover behavior in runbook"
        - "Add failover time as SLI (target: <30s)"

Chaos Engineering Maturity Levels

Level | What You Test | Tools
1: Manual | Kill a pod, see what happens | kubectl delete pod
2: Automated | Scheduled pod kills, network delays | Chaos Monkey, Litmus
3: Game Days | Multi-failure scenarios with team exercise | Custom scripts + coordination
4: Continuous | Automated chaos in production with auto-rollback | Gremlin, Chaos Mesh

Cost Drivers (Ranked)

# | Driver | Typical % of Bill | Optimization
1 | Log volume | 40-60% | Reduce verbosity, drop DEBUG, sample repetitive
2 | Metric cardinality | 15-25% | Drop unused metrics, limit labels
3 | Trace volume | 10-20% | Sampling, tail-based sampling
4 | Retention | 10-15% | Tiered storage (hot → warm → cold)
5 | Query cost | 5-10% | Optimize dashboard queries, set max scan limits

Cost Reduction Checklist

cost_optimization:
  logs:
    - action: "Drop DEBUG/TRACE in production"
      savings: "30-50% of log volume"
    - action: "Sample health check logs (1:100)"
      savings: "5-15% of log volume"
    - action: "Deduplicate identical error bursts"
      savings: "10-20% during incidents"
    - action: "Move logs older than 7 days to S3/cold storage"
      savings: "60-80% of storage cost"
    - action: "Drop request/response body logging"
      savings: "20-40% of log volume"
  metrics:
    - action: "Audit unused metrics (no dashboard, no alert)"
      savings: "10-30% of series"
    - action: "Reduce histogram bucket count (default 11 → 8)"
      savings: "~27% of histogram series"
    - action: "Remove high-cardinality labels"
      savings: "Variable; can be massive"
    - action: "Increase scrape interval for non-critical metrics (15s → 60s)"
      savings: "75% of data points for those metrics"
  traces:
    - action: "Implement tail-based sampling"
      savings: "80-95% of trace volume"
    - action: "Drop internal health check traces"
      savings: "5-20% of trace volume"
    - action: "Reduce span attribute size (truncate long strings)"
      savings: "10-30% of trace storage"
  general:
    - action: "Review and right-size retention policies quarterly"
    - action: "Set query timeouts and result limits on dashboards"
    - action: "Use recording rules for expensive queries"
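Two of the savings figures in the checklist follow from simple arithmetic, which the sketch below reproduces. It assumes the "~27%" figure counts only the explicit bucket series; the alternative calculation that also counts an `+Inf` bucket plus `_sum` and `_count` series is an assumption about Prometheus-style histograms and may not match every backend.

```python
def fractional_reduction(before: float, after: float) -> float:
    """Fraction saved when a quantity drops from `before` to `after`."""
    return (before - after) / before


# Histogram buckets: 11 → 8 explicit buckets per histogram.
bucket_saving = fractional_reduction(11, 8)  # 3/11, roughly 27%

# Assumption: each histogram also emits +Inf, _sum and _count series,
# so the per-histogram series count goes from 14 to 11.
series_saving = fractional_reduction(11 + 3, 8 + 3)  # 3/14, roughly 21%

# Scrape interval: 15s → 60s means 4 samples per minute become 1.
samples_per_minute_before = 60 / 15
samples_per_minute_after = 60 / 60
scrape_saving = fractional_reduction(
    samples_per_minute_before, samples_per_minute_after
)  # 0.75
```

The same helper can be used to sanity-check any other "from X to Y" optimization before committing to a savings estimate.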

Monthly Cost Review Template

observability_cost_review:
  month: "February 2026"
  total_cost: "$X,XXX"
  breakdown:
    logs: { volume: "X TB", cost: "$X", pct: "X%" }
    metrics: { series: "X million", cost: "$X", pct: "X%" }
    traces: { volume: "X TB", cost: "$X", pct: "X%" }
    infrastructure: { instances: X, cost: "$X", pct: "X%" }
  cost_per:
    request: "$0.000X"
    service: "$X average"
    engineer: "$X per engineer"
  optimizations_applied: []
  optimizations_planned: []
  budget_status: "on_track | over_budget | under_budget"

Correlation: Connecting the Three Pillars

Every log line includes: trace_id, span_id
Every trace span includes: service, operation
Every metric includes: service label

Correlation paths:

1. Alert fires (metric) → Click → Dashboard (metric) → Filter by time window → Trace search (same service + time) → Find failing trace → Logs (filter by trace_id) → See exact error
2. Support ticket (user report) → Find request_id in logs → Extract trace_id → View full trace → Identify slow span → Check span's service metrics → Confirm pattern
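The "Logs (filter by trace_id)" step only works if every log line carries the IDs. The following is a minimal standard-library sketch of that idea; the field names beyond `trace_id`, `span_id`, and `service`, the helper names, and the sample IDs are all illustrative assumptions, and a real setup would take the IDs from the active tracing context.

```python
import json


def make_log_line(service, message, trace_id, span_id, level="INFO"):
    """Render one structured log line carrying the correlation IDs."""
    return json.dumps({
        "level": level,
        "service": service,
        "message": message,
        "trace_id": trace_id,
        "span_id": span_id,
    })


def logs_for_trace(lines, trace_id):
    """Return the parsed log records that belong to the given trace."""
    records = (json.loads(line) for line in lines)
    return [r for r in records if r.get("trace_id") == trace_id]


lines = [
    make_log_line("payment-api", "charge started", "trace-a", "span-1"),
    make_log_line("payment-api", "card declined", "trace-a", "span-2", level="ERROR"),
    make_log_line("cart-api", "cart loaded", "trace-b", "span-9"),
]
failing = logs_for_trace(lines, "trace-a")
```

In practice the filter runs as a query in the log backend; the point of the sketch is that the join key has to be present in every record for the path from trace to logs to work.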

Synthetic Monitoring

synthetic_checks:
  - name: "Checkout flow"
    type: browser
    frequency: 5m
    locations: [us-east, eu-west, ap-southeast]
    steps:
      - navigate: "https://app.example.com/products"
      - click: "Add to Cart"
      - click: "Checkout"
      - assert: "Order confirmation page loads in <3s"
    alert_on: "2 consecutive failures from same location"
  - name: "API health"
    type: api
    frequency: 1m
    endpoints:
      - url: "https://api.example.com/health"
        expected_status: 200
        max_latency_ms: 500
      - url: "https://api.example.com/v1/products?limit=1"
        expected_status: 200
        max_latency_ms: 1000
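The `alert_on` rule above ("2 consecutive failures from same location") can be sketched as a per-location streak counter. This is an illustration under the assumption that check results arrive in chronological order as `(location, ok)` pairs; `locations_to_alert` is a hypothetical helper.

```python
from collections import defaultdict


def locations_to_alert(results, threshold=2):
    """Return the set of locations whose current failure streak has
    reached `threshold`.

    results: iterable of (location, ok) pairs in chronological order.
    """
    streak = defaultdict(int)
    alerting = set()
    for location, ok in results:
        if ok:
            streak[location] = 0          # a success clears the streak
            alerting.discard(location)
        else:
            streak[location] += 1
            if streak[location] >= threshold:
                alerting.add(location)
    return alerting


results = [
    ("us-east", False),
    ("eu-west", False),
    ("us-east", False),   # second consecutive failure from us-east
    ("eu-west", True),
]
alerting = locations_to_alert(results)
```

Tracking streaks per location keeps a single regional network blip from paging anyone, while a repeated failure from the same vantage point still alerts.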

Feature Flag Observability

# Correlate feature flags with metrics
feature_flag_monitoring:
  - flag: "new_checkout_flow"
    metrics_to_compare:
      - "checkout_conversion_rate"  # by flag variant
      - "checkout_error_rate"
      - "checkout_latency_p99"
    alerts:
      - "If error rate for new variant > 2x control, auto-disable flag"
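The auto-disable rule above compares error rates between flag variants. Below is a rough sketch of that comparison; the `min_requests` guard is an added assumption (not in the original rule) to avoid acting on tiny samples, and `should_disable` is a hypothetical helper.

```python
def should_disable(control_errors, control_requests,
                   variant_errors, variant_requests,
                   ratio=2.0, min_requests=100):
    """Return True if the variant's error rate exceeds `ratio` times
    the control's error rate.

    min_requests is an illustrative guard: with too little traffic in
    either group the comparison is skipped.
    """
    if control_requests < min_requests or variant_requests < min_requests:
        return False
    # Cross-multiplied form of
    #   variant_errors / variant_requests > ratio * control_errors / control_requests
    # which avoids division and floating-point rounding at the boundary.
    return variant_errors * control_requests > ratio * control_errors * variant_requests


disable = should_disable(
    control_errors=10, control_requests=1000,
    variant_errors=25, variant_requests=1000,
)
```

With a 1.0% control error rate and a 2.5% variant error rate the rule fires; at exactly 2x it does not, because the rule uses a strict "greater than".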

Observability Maturity Model

Dimension | Level 1 | Level 2 | Level 3 | Level 4
Logging | Unstructured logs | Structured JSON, centralized | Correlated with traces | Automated log analysis
Metrics | Basic infra metrics | RED/USE for services | SLO-based with error budgets | Predictive (anomaly detection)
Tracing | No tracing | Key services instrumented | Full distributed tracing | Trace-driven testing
Alerting | Static thresholds | Multi-signal alerts | Burn-rate based on SLOs | Auto-remediation
Incident Response | Ad hoc | Defined process + roles | Post-mortems with action tracking | Chaos engineering in prod
Culture | "Ops team handles it" | Shared ownership (you build it, you run it) | SLO-driven development velocity | Reliability as a feature

Quality Scoring Rubric (0-100)

Dimension | Weight | 0 | 5 | 10
Logging quality | 15% | Unstructured, no correlation | Structured JSON, missing fields | Full schema, trace correlation, PII scrubbing
Metrics coverage | 15% | No metrics | RED or USE, not both | RED + USE + business metrics + custom
Tracing completeness | 10% | No tracing | Key services | Full path, sampling strategy, tail-based
SLO maturity | 15% | No reliability targets | Informal targets | SLOs with error budgets, burn-rate alerts, weekly review
Alert quality | 15% | Noisy/missing | Actionable, some runbooks | SLO-based, full runbooks, low false positive
Incident response | 10% | Ad hoc | Defined process | Full process, roles, post-mortems, chaos engineering
Dashboard design | 10% | No dashboards | Basic panels | Hierarchical L1-L4, consistent, linked to alerts
Cost efficiency | 10% | Unknown cost | Tracked | Optimized, reviewed monthly, within budget

Score bands:
90-100: World-class. Teach others.
70-89: Production-ready. Fill specific gaps.
50-69: Functional but fragile.
<50: Significant reliability risk.
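The rubric can be turned into a single number by weighting each dimension's 0-10 score. The sketch below assumes scores between the 0, 5, and 10 anchors are allowed and interpolated by the assessor; the dimension keys and function names are illustrative.

```python
# Weights in percent, copied from the rubric; they sum to 100.
WEIGHTS = {
    "logging_quality": 15,
    "metrics_coverage": 15,
    "tracing_completeness": 10,
    "slo_maturity": 15,
    "alert_quality": 15,
    "incident_response": 10,
    "dashboard_design": 10,
    "cost_efficiency": 10,
}


def total_score(scores: dict) -> float:
    """Combine per-dimension scores (0-10) into a 0-100 total."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the rubric dimensions")
    for name, value in scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"{name} must be between 0 and 10")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 10


def band(score: float) -> str:
    """Map a 0-100 total onto the four score bands."""
    if score >= 90:
        return "World-class"
    if score >= 70:
        return "Production-ready"
    if score >= 50:
        return "Functional but fragile"
    return "Significant reliability risk"


midpoint = total_score({name: 5 for name in WEIGHTS})
```

A team that sits at the middle anchor in every dimension lands on the boundary of the "Functional but fragile" band, which suggests the rubric treats the 5 column as a floor for basic functionality.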

10 Observability Commandments

1. Structured or it didn't happen: unstructured logs are technical debt
2. Correlate everything: trace_id connects logs, traces, and metrics
3. Alert on symptoms, not causes: users don't care about CPU, they care about latency
4. Every alert gets a runbook: no runbook = no alert
5. SLOs drive velocity: error budgets decide when to ship vs stabilize
6. Dashboards have hierarchy: executives don't need pod CPU graphs
7. Blameless post-mortems always: blame prevents learning
8. Cost is a feature: observability that bankrupts you isn't observability
9. You build it, you run it: the team that ships code owns its observability
10. Practice failure: chaos engineering builds confidence

12 Natural Language Commands

Command | What It Does
"Audit our observability" | Run the /16 health check, score each dimension, prioritize gaps
"Design logging for [service]" | Generate structured log schema with context fields for the service
"Set up metrics for [service]" | Create RED + USE + business metric instrumentation plan
"Create SLOs for [service]" | Define SLIs, targets, error budgets, and burn-rate alert rules
"Design alerts for [service]" | Create alert rules with severity, thresholds, and runbook templates
"Build dashboard for [service]" | Design L2 service overview dashboard with panel specifications
"Write a runbook for [alert]" | Generate structured runbook with diagnosis steps and fixes
"Run post-mortem for [incident]" | Generate blameless post-mortem document with timeline and action items
"Set up on-call for [team]" | Design rotation, escalation policy, handoff checklist
"Plan chaos experiment for [scenario]" | Design experiment with hypothesis, injection, abort conditions
"Optimize observability costs" | Audit current spend, identify top savings, create reduction plan
"Design tracing for [system]" | Create OpenTelemetry instrumentation plan with sampling strategy

