Requirements
- Target platform: OpenClaw
- Install method: manual import (extract the archive)
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, buil...
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
Install brief: "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete."

Upgrade brief: "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run."
Complete system for building observable, reliable services: from structured logging to incident response to SLO-driven development.
Score your current observability posture:

| Signal | Healthy (2) | Weak (1) | Missing (0) |
|---|---|---|---|
| Structured logging | JSON logs with trace_id correlation | Logs exist but unstructured | Console.log / print statements |
| Metrics collection | RED/USE metrics with dashboards | Some metrics, no dashboards | No metrics |
| Distributed tracing | Full request path with sampling | Partial traces, key services only | No tracing |
| Alerting | SLO-based alerts with runbooks | Threshold alerts, some runbooks | No alerts or all-noise |
| Incident response | Defined process with roles + post-mortems | Ad-hoc response, some docs | "Whoever notices fixes it" |
| SLOs defined | SLOs with error budgets tracked weekly | Informal availability targets | No reliability targets |
| On-call rotation | Structured rotation with escalation | Informal "call someone" | No on-call |
| Cost management | Observability budget tracked monthly | Some awareness of costs | No idea what you spend |

- 12-16: Production-grade. Focus on optimization.
- 8-11: Foundation exists. Fill the gaps systematically.
- 4-7: Significant risk. Prioritize alerting + incident response.
- 0-3: Flying blind. Start with Phase 1 immediately.
Application → Structured JSON → Log Router → Storage → Query Engine → Alert Pipeline
| Field | Type | Purpose | Example |
|---|---|---|---|
| timestamp | ISO-8601 UTC | When | 2026-02-22T18:30:00.123Z |
| level | enum | Severity | info, warn, error, fatal |
| service | string | Which service | payment-api |
| version | string | Which deploy | v2.3.1 |
| environment | string | Which env | production |
| message | string | What happened | Payment processed successfully |
| trace_id | string | Request correlation | abc123def456 |
| span_id | string | Operation within trace | span_789 |
| duration_ms | number | How long | 142 |
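As a sketch, this base schema can be pinned down as a shared type so services can't drift from it. The interface name and which fields are optional are assumptions for illustration, not part of the skill:

```typescript
// Shared base log schema; every service log line should satisfy this shape.
// Field names mirror the table above; context blocks (http, business, error)
// would extend this per service.
interface BaseLogRecord {
  timestamp: string;            // ISO-8601 UTC, e.g. "2026-02-22T18:30:00.123Z"
  level: 'trace' | 'debug' | 'info' | 'warn' | 'error' | 'fatal';
  service: string;              // e.g. "payment-api"
  version: string;              // e.g. "v2.3.1"
  environment: string;          // e.g. "production"
  message: string;
  trace_id: string;
  span_id?: string;             // optional: not every line sits inside a span
  duration_ms?: number;         // optional: only for timed operations
}
```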
```yaml
# HTTP request context
http:
  method: POST
  path: /api/v1/orders
  status: 201
  client_ip: 203.0.113.42   # Anonymize in logs if needed
  user_agent: "Mozilla/5.0..."
  request_id: "req_abc123"

# Business context
business:
  user_id: "usr_456"
  tenant_id: "tenant_789"
  order_id: "ord_012"
  action: "checkout"
  amount_cents: 4999
  currency: "USD"

# Error context
error:
  type: "PaymentDeclinedError"
  message: "Card declined: insufficient funds"
  code: "CARD_DECLINED"
  stack: "..."   # Only in non-production or DEBUG level
  retry_count: 2
  retryable: true
```
- Is the process about to crash? → FATAL (exit after logging)
- Did an operation fail that needs human attention? → ERROR (page someone or create a ticket)
- Did something unexpected happen but we recovered? → WARN (review in daily triage)
- Is this a normal business event worth recording? → INFO (audit trail, business metrics)
- Is this useful for debugging but noisy in production? → DEBUG (off in prod, on in staging)
- Is this only useful when stepping through code? → TRACE (never in production)
- ERROR means action required: if no one needs to act on it, it's WARN
- INFO is for business events, not internal implementation details
- No logging inside tight loops: aggregate and log a summary (see the sketch below)
- Log at boundaries: API entry/exit, queue consume/publish, DB calls
- Never log secrets: API keys, tokens, passwords, PII (see scrubbing below)
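A minimal sketch of the loop-aggregation rule, assuming a pino-style logger; `Item` and `handle` are hypothetical stand-ins for real per-item work:

```typescript
import pino from 'pino';

type Item = { id: string };
const log = pino();

// Hypothetical per-item work; stands in for a real handler.
async function handle(item: Item): Promise<void> {
  if (!item.id) throw new Error('missing id');
}

async function processBatch(items: Item[]) {
  let ok = 0;
  const failures: Array<{ id: string; error: string }> = [];
  for (const item of items) {
    try {
      await handle(item);
      ok++;
    } catch (err) {
      failures.push({ id: item.id, error: String(err) }); // collect, don't log here
    }
  }
  // One structured summary line at the boundary, not N lines in the loop.
  log.info(
    { processed: ok, failed: failures.length, sample: failures.slice(0, 5) },
    'batch processed'
  );
}
```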
```yaml
scrub_patterns:
  # Always redact
  - field_patterns: ["password", "secret", "token", "api_key", "authorization"]
    action: replace_with_redacted
  # Hash for correlation without exposure
  - field_patterns: ["email", "phone", "ssn", "national_id"]
    action: sha256_hash
  # Mask partially
  - field_patterns: ["credit_card", "card_number"]
    action: mask_last_4   # "****-****-****-1234"
  # IP anonymization
  - field_patterns: ["client_ip", "ip_address"]
    action: zero_last_octet   # 203.0.113.0
```
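A minimal sketch of how those four scrub actions could be applied at the logger boundary, using Node's built-in `crypto` for hashing. The substring matching and the helper name are assumptions, not the skill's prescribed implementation:

```typescript
import { createHash } from 'node:crypto';

// Pattern lists mirror the config above. Matching is by field-name substring.
const REDACT = ['password', 'secret', 'token', 'api_key', 'authorization'];
const HASH = ['email', 'phone', 'ssn', 'national_id'];
const MASK = ['credit_card', 'card_number'];
const IP = ['client_ip', 'ip_address'];

function scrub(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    const k = key.toLowerCase();
    if (REDACT.some((p) => k.includes(p))) {
      out[key] = '[REDACTED]';
    } else if (HASH.some((p) => k.includes(p))) {
      out[key] = createHash('sha256').update(String(value)).digest('hex');
    } else if (MASK.some((p) => k.includes(p))) {
      out[key] = '****-****-****-' + String(value).slice(-4);
    } else if (IP.some((p) => k.includes(p))) {
      out[key] = String(value).replace(/\.\d+$/, '.0'); // zero the last octet
    } else if (value && typeof value === 'object' && !Array.isArray(value)) {
      out[key] = scrub(value as Record<string, unknown>); // recurse into nested context
    } else {
      out[key] = value;
    }
  }
  return out;
}
```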
Node.js (Pino):

```typescript
import pino from 'pino';
import { AsyncLocalStorage } from 'node:async_hooks';

const als = new AsyncLocalStorage<Record<string, string>>();

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  mixin: () => als.getStore() ?? {},
  redact: ['req.headers.authorization', '*.password', '*.token'],
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Middleware: inject context
app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    service: 'payment-api',
    version: process.env.APP_VERSION,
  };
  als.run(ctx, () => next());
});
```

Python (structlog):

```python
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ],
)
log = structlog.get_logger()

# Bind context per-request:
structlog.contextvars.bind_contextvars(trace_id=trace_id, user_id=user_id)
```

Go (zerolog):

```go
log := zerolog.New(os.Stdout).With().
    Timestamp().
    Str("service", "payment-api").
    Str("version", version).
    Logger()

// Per-request:
reqLog := log.With().Str("trace_id", traceID).Logger()
```
| Volume | Solution | Retention | Cost |
|---|---|---|---|
| <10 GB/day | Loki + Grafana | 30 days hot, 90 days cold | Low |
| 10-100 GB/day | Elasticsearch / OpenSearch | 14 days hot, 90 days S3 | Medium |
| 100+ GB/day | ClickHouse or Datadog | 7 days hot, 30 days archive | High |
| Budget-constrained | Loki + S3 backend | 90 days all cold | Very low |
| # | Anti-Pattern | Fix |
|---|---|---|
| 1 | log.error(err) with no context | Always include: what operation, what input, what state |
| 2 | Logging request/response bodies | Log only in DEBUG; redact sensitive fields |
| 3 | String concatenation in log messages | Use structured fields: log.info("processed", { order_id, amount }) |
| 4 | Catch-and-log-and-rethrow | Log at the boundary where you handle it, not every layer |
| 5 | Different log formats per service | Standardize the schema across all services |
| 6 | No log rotation / retention policy | Set max size + TTL; archive to cold storage |
| 7 | Logging inside hot paths | Aggregate: log a summary every N items or every interval |
| 8 | Missing correlation IDs | Propagate trace_id from the first entry point through all services |
| 9 | Boolean log levels (verbose: true) | Use standard levels with a configurable minimum |
| 10 | Logging PII in plain text | Implement scrubbing at the logger level |
For every service endpoint, track:

| Metric | What | Prometheus Example |
|---|---|---|
| Rate | Requests per second | http_requests_total{method, path, status} |
| Errors | Failed requests per second | http_requests_total{status=~"5.."} / total |
| Duration | Latency distribution | http_request_duration_seconds{method, path} (histogram) |
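A sketch of RED instrumentation, assuming an Express app and the prom-client library; the metric names match the table, while the bucket choices and route handling are illustrative:

```typescript
import express from 'express';
import client from 'prom-client';

const requests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
});
const duration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5], // illustrative buckets
});

const app = express();
app.use((req, res, next) => {
  const end = duration.startTimer();
  res.on('finish', () => {
    // Use the route template (not the raw URL) to keep cardinality bounded.
    const labels = {
      method: req.method,
      path: req.route?.path ?? req.path,
      status: String(res.statusCode),
    };
    requests.inc(labels);
    end(labels);
  });
  next();
});

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.send(await client.register.metrics());
});
```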
For every resource (CPU, memory, disk, network):

| Metric | What | Example |
|---|---|---|
| Utilization | % resource busy | CPU usage 78% |
| Saturation | Queue depth / backpressure | 12 requests queued |
| Errors | Resource errors | 3 disk I/O errors |
| Signal | Meaning | Source |
|---|---|---|
| Latency | Time to serve requests | RED Duration |
| Traffic | Demand on the system | RED Rate |
| Errors | Rate of failed requests | RED Errors |
| Saturation | How "full" the service is | USE Saturation |
| Type | Use Case | Example |
|---|---|---|
| Counter | Things that only go up | Total requests, errors, bytes sent |
| Gauge | Current value that goes up/down | Active connections, queue depth, temperature |
| Histogram | Distribution of values | Request latency, response size |
| Summary | Pre-calculated percentiles | Client-side latency (when you need exact percentiles) |

Rule: Use histograms over summaries in most cases; they're aggregatable across instances.
```
# Pattern: <namespace>_<subsystem>_<name>_<unit>
http_server_request_duration_seconds
http_server_requests_total
db_pool_connections_active
queue_messages_pending
cache_hit_ratio

# Rules:
# 1. Use snake_case
# 2. Include unit suffix (_seconds, _bytes, _total)
# 3. _total suffix for counters
# 4. Don't include label names in metric name
# 5. Use base units (seconds not milliseconds, bytes not kilobytes)
```
| Rule | Why | Example |
|---|---|---|
| Keep cardinality <100 per label | High cardinality kills performance | status="200" not status="200 OK" |
| No user IDs as labels | Unbounded cardinality | Use log correlation instead |
| No request paths with IDs | /api/users/123 creates millions of series | Normalize: /api/users/:id |
| Max 5-7 labels per metric | Each combo = a time series | {method, path, status, service} |
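A sketch of the path-normalization rule from the table; the regex patterns are illustrative and would need tuning to your API's ID formats:

```typescript
// Normalize raw paths before using them as metric labels so IDs don't
// explode cardinality. Patterns are examples, not exhaustive.
function normalizePath(path: string): string {
  return path
    .replace(/\/\d+/g, '/:id') // numeric IDs
    .replace(
      /\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi,
      '/:uuid'
    ) // UUIDs
    .replace(/\/(usr|ord|req)_[A-Za-z0-9]+/g, '/:token'); // prefixed IDs like usr_456
}

// normalizePath('/api/users/123/orders/ord_012') -> '/api/users/:id/orders/:token'
```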
```yaml
application_metrics:
  # HTTP layer
  - http_request_duration_seconds: histogram {method, path, status}
  - http_request_size_bytes: histogram {method, path}
  - http_response_size_bytes: histogram {method, path}
  - http_requests_in_flight: gauge

  # Business logic
  - orders_processed_total: counter {status, payment_method}
  - order_value_dollars: histogram {payment_method}
  - user_signups_total: counter {source}

  # Dependencies
  - db_query_duration_seconds: histogram {query_type, table}
  - db_connections_active: gauge {pool}
  - db_connections_idle: gauge {pool}
  - cache_requests_total: counter {result: hit|miss}
  - external_api_duration_seconds: histogram {service, endpoint}
  - external_api_errors_total: counter {service, error_type}

  # Queue / async
  - queue_messages_published_total: counter {queue}
  - queue_messages_consumed_total: counter {queue, status}
  - queue_processing_duration_seconds: histogram {queue}
  - queue_depth: gauge {queue}
  - queue_consumer_lag: gauge {queue, consumer_group}

infrastructure_metrics:
  # Node exporter / cAdvisor provides these automatically
  - cpu_usage_percent: gauge {instance}
  - memory_usage_bytes: gauge {instance}
  - disk_usage_bytes: gauge {instance, mount}
  - disk_io_seconds: counter {instance, device}
  - network_bytes: counter {instance, direction}
  - container_cpu_usage: gauge {pod, container}
  - container_memory_usage: gauge {pod, container}
```
| Component | Options | Recommendation |
|---|---|---|
| Collection | Prometheus, OTEL Collector, Datadog Agent | Prometheus (free) or OTEL Collector (vendor-neutral) |
| Storage | Prometheus, Thanos, Mimir, VictoriaMetrics | VictoriaMetrics (best cost/perf) or Mimir (Grafana ecosystem) |
| Visualization | Grafana, Datadog, New Relic | Grafana (free, extensible) |
| Alerting | Alertmanager, Grafana Alerting, PagerDuty | Alertmanager + PagerDuty routing |
```
Client Request
→ API Gateway (root span)
   → Auth Service (child span)
   → Order Service (child span)
      → Database Query (child span)
      → Payment Service (child span)
         → Stripe API (child span)
   → Notification Service (child span)
      → Email Provider (child span)
```
Auto-instrumentation (Node.js):

```typescript
// tracing.ts - import BEFORE anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-http': { ignoreIncomingPaths: ['/health', '/ready'] },
    '@opentelemetry/instrumentation-express': { enabled: true },
  })],
  serviceName: process.env.OTEL_SERVICE_NAME || 'payment-api',
});
sdk.start();
```

Custom spans for business logic:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

async function processPayment(order: Order) {
  return tracer.startActiveSpan('process-payment', async (span) => {
    span.setAttributes({
      'order.id': order.id,
      'order.amount_cents': order.amountCents,
      'payment.method': order.paymentMethod,
    });
    try {
      const result = await chargeCard(order);
      span.setAttributes({ 'payment.status': result.status });
      return result;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}
```
| Strategy | When | Config |
|---|---|---|
| Always On | Dev/staging, low traffic (<100 rps) | ratio: 1.0 |
| Probabilistic | Moderate traffic (100-1000 rps) | ratio: 0.1 (10%) |
| Rate-limited | High traffic (>1000 rps) | max_traces_per_second: 100 |
| Tail-based | Want all errors + slow requests | Collector-side: keep if error OR duration > p99 |
| Parent-based | Respect upstream decisions | If parent sampled, child sampled |

Recommendation: Start with parent-based + probabilistic (10%). Add tail-based at the collector to capture all errors.
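The recommended parent-based + 10% probabilistic combination might be wired up like this in OpenTelemetry JS (a sketch; tail-based sampling happens in the collector, not the SDK, and this config would merge with the exporter setup shown earlier):

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

// Respect upstream sampling decisions; sample 10% of new root traces.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});
sdk.start();
```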
| Header | Standard | Format |
|---|---|---|
| traceparent | W3C Trace Context | 00-{trace_id}-{span_id}-{flags} |
| tracestate | W3C Trace Context | Vendor-specific key-value pairs |
| b3 | Zipkin B3 | {trace_id}-{span_id}-{sampled} |

Rule: Use W3C Trace Context (traceparent) as primary. Support B3 for legacy Zipkin systems.
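A sketch of validating and parsing a traceparent header per the W3C format above; the helper name is hypothetical:

```typescript
// Parse a W3C traceparent header: {version}-{trace_id}-{span_id}-{flags}.
// Returns null on malformed input; per spec, all-zero IDs are invalid.
function parseTraceparent(header: string) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m || m[2] === '0'.repeat(32) || m[3] === '0'.repeat(16)) return null;
  return {
    version: m[1],
    traceId: m[2],
    spanId: m[3],
    sampled: (parseInt(m[4], 16) & 0x01) === 1, // flags bit 0 = sampled
  };
}

// parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
// -> { version: '00', traceId: '4bf9...', spanId: '00f0...', sampled: true }
```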
| Volume | Solution | Retention |
|---|---|---|
| <50 GB/day | Jaeger + Elasticsearch | 7 days |
| 50-500 GB/day | Tempo + S3 | 14 days |
| 500+ GB/day | Tempo + S3 with aggressive sampling | 7 days |
| Budget-constrained | Jaeger + Badger (local disk) | 3 days |
| Service Type | Primary SLI | Secondary SLI | Measurement |
|---|---|---|---|
| API / Web | Availability + Latency | Error rate | Server-side + synthetic |
| Data pipeline | Freshness + Correctness | Throughput | Pipeline timestamps + checksums |
| Storage | Durability + Availability | Latency | Checksums + uptime monitoring |
| Streaming | Throughput + Latency | Message loss rate | Consumer lag + e2e latency |
| Batch jobs | Success rate + Freshness | Duration | Job scheduler metrics |
```yaml
slo:
  name: "Payment API Availability"
  service: payment-api
  owner: payments-team

  sli:
    type: availability
    definition: "Proportion of non-5xx responses"
    measurement: |
      sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
      /
      sum(rate(http_requests_total{service="payment-api"}[5m]))

  target: 99.95%   # 21.9 min downtime/month
  window: rolling_30d

  error_budget:
    total_minutes: 21.9   # per 30 days
    burn_rate_alerts:
      - severity: critical
        burn_rate: 14.4x   # Budget consumed in ~2 days
        short_window: 5m
        long_window: 1h
      - severity: warning
        burn_rate: 6x      # Budget consumed in 5 days
        short_window: 30m
        long_window: 6h
      - severity: ticket
        burn_rate: 1x      # Budget consumed in 30 days
        short_window: 6h
        long_window: 3d

  consequences:
    budget_remaining_above_50pct: "Normal development velocity"
    budget_remaining_20_to_50pct: "Prioritize reliability work"
    budget_remaining_below_20pct: "Feature freeze; reliability only"
    budget_exhausted: "All hands on reliability until budget recovers"
```
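The burn-rate numbers above are plain arithmetic: burn rate is the observed error rate divided by the budget rate (1 minus the SLO target). A sketch, with illustrative inputs:

```typescript
// Burn-rate arithmetic for the example SLO above. Inputs are illustrative,
// not output of any real monitoring API.
const sloTarget = 0.9995;          // 99.95%
const budgetRate = 1 - sloTarget;  // allowed error rate: 0.0005

function burnRate(observedErrorRate: number): number {
  return observedErrorRate / budgetRate;
}

function hoursToExhaustion(observedErrorRate: number, windowDays = 30): number {
  return (windowDays * 24) / burnRate(observedErrorRate);
}

burnRate(0.0072);          // 14.4x: a 0.72% error rate burns budget 14.4x too fast
hoursToExhaustion(0.0072); // 50 hours (~2 days) until the 30-day budget is gone
```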
| Service Tier | Availability | p50 Latency | p99 Latency | Monthly Downtime |
|---|---|---|---|---|
| Tier 0 (payments, auth) | 99.99% | <100ms | <500ms | 4.3 min |
| Tier 1 (core API) | 99.95% | <200ms | <1s | 21.9 min |
| Tier 2 (non-critical) | 99.9% | <500ms | <2s | 43.8 min |
| Tier 3 (internal tools) | 99.5% | <1s | <5s | 3.6 hours |
| Batch / pipeline | 99% (success rate) | N/A | N/A | N/A |
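The downtime column follows directly from the availability target; a sketch of the arithmetic, using an average-length month (the table's values match to within rounding):

```typescript
// Downtime allowance implied by an availability target, using an average
// month of 365.25 / 12 = 30.44 days; that's how 99.95% maps to ~21.9 minutes.
function monthlyDowntimeMinutes(availability: number): number {
  const minutesPerMonth = (365.25 / 12) * 24 * 60; // ~43,830
  return (1 - availability) * minutesPerMonth;
}

monthlyDowntimeMinutes(0.9999); // ~4.4
monthlyDowntimeMinutes(0.9995); // ~21.9
monthlyDowntimeMinutes(0.999);  // ~43.8
monthlyDowntimeMinutes(0.995);  // ~219 (~3.6 hours)
```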
```yaml
# Weekly error budget review template
error_budget_review:
  week: "2026-W08"
  service: payment-api
  slo_target: 99.95%

  budget:
    total_minutes_this_period: 21.9
    consumed_minutes: 8.2
    remaining_minutes: 13.7
    remaining_percent: 62.6%

  incidents_consuming_budget:
    - date: "2026-02-18"
      duration_minutes: 5.1
      cause: "Database connection pool exhaustion"
      preventable: true
      action: "Increase pool size + add saturation alert"
    - date: "2026-02-20"
      duration_minutes: 3.1
      cause: "Upstream payment provider timeout"
      preventable: false
      action: "Add circuit breaker with fallback"

  velocity_decision: "Normal; 62.6% budget remaining"

  reliability_work_this_week:
    - "Add connection pool saturation alert"
    - "Implement circuit breaker for payment provider"
```
- Every alert must be actionable: if no one needs to act, it's not an alert
- Every alert needs a runbook, linked directly in the alert annotation
- Symptom-based over cause-based: alert on "users can't checkout", not "CPU high"
- Multi-window burn rate, not static thresholds (see SLO alerts above)
- Alert on absence, not just presence: "no orders in 15 min" catches silent failures
| Severity | Response Time | Channel | Who | Example |
|---|---|---|---|---|
| P0 - Critical | <5 min | Page (PagerDuty/Opsgenie) | On-call engineer | Payment system down |
| P1 - High | <30 min | Page during business hours, Slack 24/7 | On-call | Error rate >5% for 10 min |
| P2 - Medium | <4 hours | Slack channel | Team | p99 latency degraded 2x |
| P3 - Low | Next business day | Ticket auto-created | Team backlog | Disk usage >80% |
| Info | N/A | Dashboard only | No one | Deploy completed |
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Static CPU/memory thresholds | Noisy, not user-impacting | Use SLO-based burn rate alerts |
| Alert per instance | 50 instances = 50 alerts for the same issue | Aggregate: alert on service-level error rate |
| No deduplication | Same alert fires 100 times | Group by service + alert name; set a repeat interval |
| Missing runbook | Engineer gets paged, doesn't know what to do | Every alert links to a runbook |
| Threshold too sensitive | Fires on brief spikes | Use `for: 5m` to require a sustained condition |
| Too many P0s | Alert fatigue leads to ignoring real incidents | Audit monthly; demote or remove noisy alerts |
```yaml
groups:
  - name: payment-api-slo
    rules:
      - alert: PaymentAPIHighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="payment-api"}[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: critical
          service: payment-api
          team: payments
        annotations:
          summary: "Payment API error rate {{ $value | humanizePercentage }} (>1%)"
          description: "5xx error rate has exceeded 1% for 5 minutes"
          runbook: "https://wiki.internal/runbooks/payment-api-errors"
          dashboard: "https://grafana.internal/d/payment-api"

      - alert: PaymentAPINoTraffic
        expr: |
          sum(rate(http_requests_total{service="payment-api"}[15m])) == 0
        for: 5m
        labels:
          severity: critical
          service: payment-api
        annotations:
          summary: "Payment API receiving zero traffic for 5 minutes"
          runbook: "https://wiki.internal/runbooks/payment-api-no-traffic"

      - alert: PaymentAPILatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="payment-api"}[5m])) by (le)
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Payment API p99 latency {{ $value }}s (>2s for 10min)"
          runbook: "https://wiki.internal/runbooks/payment-api-latency"
```
```
L1: Executive / Business Dashboard (non-technical stakeholders)
  ↓
L2: Service Overview Dashboard (on-call, quick triage)
  ↓
L3: Service Deep-Dive Dashboard (debugging a specific service)
  ↓
L4: Infrastructure Dashboard (resource-level details)
```
```yaml
panels:
  - title: "Revenue per Minute"
    type: stat
    query: "sum(rate(orders_total{status='completed'}[5m])) * avg(order_value_dollars)"
  - title: "Active Users (5min)"
    type: stat
    query: "count(count by (user_id) (http_requests_total{...}[5m]))"
  - title: "Checkout Success Rate"
    type: gauge
    query: "sum(rate(checkout_total{status='success'}[1h])) / sum(rate(checkout_total[1h]))"
    thresholds: [95, 98, 99.5]
  - title: "Error Budget Remaining"
    type: gauge
    query: "1 - (error_budget_consumed / error_budget_total)"
```
Every service gets one of these with an identical layout:

```yaml
row_1_traffic:
  - "Request Rate (rps)"        # timeseries, by status code
  - "Error Rate (%)"            # timeseries, threshold line at SLO
  - "Active Requests"           # gauge
row_2_latency:
  - "Latency Distribution"      # heatmap
  - "p50 / p95 / p99"           # timeseries, threshold lines
  - "Latency by Endpoint"       # table, sorted by p99
row_3_dependencies:
  - "Downstream Latency"        # timeseries per dependency
  - "Downstream Error Rate"     # timeseries per dependency
  - "Database Query Duration"   # timeseries by query type
row_4_resources:
  - "CPU Usage"                 # timeseries per pod
  - "Memory Usage"              # timeseries per pod
  - "Pod Restarts"              # stat
row_5_business:
  - "Business Metric 1"         # service-specific
  - "Business Metric 2"         # service-specific
```
- Time range default: last 1 hour; most debugging happens in recent time
- Variable selectors at top: environment, service, instance
- Consistent color coding: green=good, yellow=degraded, red=bad across all dashboards
- Link alerts to dashboards: every alert annotation includes a dashboard URL
- No more than 15 panels per dashboard; split into L3 if needed
- Include an "as of" timestamp so screenshots in incidents are unambiguous
- Dashboard as code: store Grafana JSON in git, provision via API
| Severity | Criteria | Response | Communication |
|---|---|---|---|
| SEV-1 | Service down, data loss risk, security breach | All hands, war room | Status page update every 15 min |
| SEV-2 | Degraded service, SLO at risk, partial outage | On-call + backup | Status page update every 30 min |
| SEV-3 | Minor degradation, workaround exists | On-call during hours | Internal Slack update |
| SEV-4 | Cosmetic, low impact | Next sprint | None |
| Role | Responsibility | Who |
|---|---|---|
| Incident Commander (IC) | Owns the incident. Coordinates. Makes decisions. | On-call lead |
| Technical Lead | Diagnoses and fixes. Communicates technical status to IC. | Senior engineer |
| Communications Lead | Updates status page, Slack, stakeholders. | Product/support |
| Scribe | Documents timeline, actions, decisions in real time. | Anyone available |
```
1. DETECT
   - Alert fires → on-call paged
   - Customer report → support escalates
   - Internal discovery → engineer reports

2. TRIAGE (first 5 minutes)
   - Confirm the issue is real (not a false alert)
   - Classify severity (SEV-1 through SEV-4)
   - Open incident channel: #inc-YYYY-MM-DD-short-description
   - Assign roles (IC, Tech Lead, Comms)

3. MITIGATE (next 5-30 minutes)
   - Goal: STOP THE BLEEDING, not find root cause
   - Options (try in order):
     a. Rollback last deploy
     b. Scale up / restart pods
     c. Toggle feature flag off
     d. Redirect traffic / enable fallback
     e. Manual data fix
   - Document every action with timestamp

4. STABILIZE
   - Confirm mitigation is working (metrics back to normal)
   - Monitor for 15-30 min for recurrence
   - Update status page: "Monitoring fix"

5. RESOLVE
   - Confirm all metrics healthy for 30+ min
   - Update status page: "Resolved"
   - Schedule post-mortem (within 48 hours for SEV-1/2)
   - Send internal summary to stakeholders
```
```
Incident: Payment API 5xx Errors
Severity: SEV-2
Started: 2026-02-22 14:23 UTC
IC: @alice
Tech Lead: @bob
Comms: @charlie

Status: MITIGATING
Impact: ~5% of checkout requests failing
Customer-facing: Yes

Timeline:
14:23 - Alert fired: PaymentAPIHighErrorRate
14:25 - IC assigned: @alice, confirmed real via dashboard
14:28 - Tech Lead: error logs show connection pool exhaustion post-deploy
14:31 - Rolled back deployment v2.3.1 → v2.3.0
14:35 - Error rate dropping, monitoring
14:50 - Error rate <0.1%, marking resolved
```
```yaml
post_mortem:
  title: "Payment API Connection Pool Exhaustion"
  date: "2026-02-22"
  severity: SEV-2
  duration: "27 minutes (14:23-14:50 UTC)"
  authors: ["@alice", "@bob"]
  reviewers: ["@engineering-leads"]
  status: action_items_in_progress

  summary: |
    A deployment at 14:15 introduced a connection leak in the payment API.
    Connection pool was exhausted by 14:23, causing 5xx errors for ~5% of
    checkout requests. Rolled back at 14:31; recovered by 14:50.

  impact:
    user_impact: "~340 users saw checkout failures over 27 minutes"
    revenue_impact: "$2,100 estimated (based on average order value × failed checkouts)"
    slo_impact: "Consumed 5.1 min of 21.9 min monthly error budget (23%)"
    data_impact: "No data loss. 12 orders failed; users could retry successfully."

  timeline:
    - time: "14:15"
      event: "Deploy v2.3.1 rolled out (3/3 pods updated)"
    - time: "14:23"
      event: "PaymentAPIHighErrorRate alert fired"
    - time: "14:25"
      event: "IC assigned, confirmed via dashboard"
    - time: "14:28"
      event: "Root cause identified: new ORM query not releasing connections"
    - time: "14:31"
      event: "Rollback initiated: v2.3.1 → v2.3.0"
    - time: "14:35"
      event: "Error rate declining"
    - time: "14:50"
      event: "Resolved: error rate <0.1% sustained"

  root_cause: |
    The v2.3.1 deploy introduced a new database query in the order validation
    path. The query used a raw connection instead of the pool's managed client,
    so connections were acquired but never released. Under load, the pool
    exhausted within 8 minutes.

  contributing_factors:
    - "No integration test for connection pool behavior under load"
    - "Connection pool saturation metric existed but had no alert"
    - "Code review didn't catch raw connection usage"

  what_went_well:
    - "Alert fired within 8 minutes of deploy"
    - "IC assigned in 2 minutes"
    - "Root cause identified in 3 minutes (clear in logs)"
    - "Rollback executed cleanly"

  what_went_wrong:
    - "8-minute detection gap after deploy"
    - "No canary deployment to catch before full rollout"
    - "Connection pool saturation had no alert"

  action_items:
    - action: "Add connection pool saturation alert (>80% for 2 min)"
      owner: "@bob"
      priority: P1
      due: "2026-02-25"
      status: in_progress
      ticket: "ENG-1234"
    - action: "Enable canary deployments for payment-api"
      owner: "@alice"
      priority: P1
      due: "2026-03-01"
      ticket: "ENG-1235"
    - action: "Add linting rule: no raw DB connections in application code"
      owner: "@charlie"
      priority: P2
      due: "2026-03-07"
      ticket: "ENG-1236"
    - action: "Load test payment-api connection pool in staging"
      owner: "@bob"
      priority: P2
      due: "2026-03-07"
      ticket: "ENG-1237"

  lessons_learned:
    - "Resource saturation metrics need alerts, not just dashboards"
    - "Canary deployments are mandatory for Tier 0 services"
    - "ORM abstractions don't guarantee connection safety; review raw queries"
```
1. (5 min) Context setting: IC reads the summary
2. (15 min) Timeline walkthrough: what happened, when, by whom
3. (15 min) Root cause deep-dive: 5 Whys exercise
4. (5 min) What went well: celebrate good response
5. (15 min) Action items: assign owners, priorities, due dates
6. (5 min) Wrap-up: set a review date for the action item check-in
Problem: 5xx errors in payment API

- Why 1: Database connections were exhausted
- Why 2: A new query acquired connections without releasing them
- Why 3: The query used a raw connection instead of the pool manager
- Why 4: The ORM's raw query API doesn't auto-release (by design)
- Why 5: We don't have a linting rule or code review checklist item for this

Root cause: Missing guard against raw connection usage in application code
Systemic fix: Linting rule + connection pool saturation alerting
```yaml
on_call:
  rotation: weekly
  handoff_day: "Monday 10:00 UTC"

  primary:
    response_time: "5 minutes (SEV-1/2), 30 minutes (SEV-3)"
    escalation_after: "15 minutes no-ack"
  secondary:
    response_time: "15 minutes (SEV-1), 1 hour (SEV-2/3)"
    escalation_after: "30 minutes no-ack"
  manager_escalation:
    trigger: "SEV-1 unresolved after 30 minutes"

  handoff_checklist:
    - Review open incidents and active alerts
    - Check error budget status for all services
    - Read post-mortems from previous week
    - Verify PagerDuty schedule and contact info
    - Test alert routing (send test page)
```
| Metric | Healthy | Needs Attention | Unhealthy |
|---|---|---|---|
| Pages per week | <5 | 5-15 | >15 |
| After-hours pages per week | <2 | 2-5 | >5 |
| False positive rate | <10% | 10-30% | >30% |
| Mean time to acknowledge | <5 min | 5-15 min | >15 min |
| Mean time to resolve | <30 min | 30-120 min | >120 min |
| Toil ratio (manual vs automated) | <30% | 30-60% | >60% |
```yaml
on_call_review:
  week: "2026-W08"
  engineer: "@bob"

  incidents:
    total: 7
    sev_1: 0
    sev_2: 1
    sev_3: 4
    false_positives: 2
    after_hours: 3

  time_spent:
    incident_response: "4.5 hours"
    toil_automation: "2 hours"
    runbook_updates: "1 hour"

  improvements_made:
    - "Silenced noisy disk alert on dev servers"
    - "Added auto-remediation for pod restart threshold"

  improvements_needed:
    - "Cache expiry alert fires every Tuesday at 03:00; needs investigation"
    - "Payment retry logic needs circuit breaker (caused 3 alerts)"

  handoff_notes: |
    Watch payment-api p99 latency; it's been creeping up since Wednesday.
    Stripe changed their sandbox endpoints; staging may throw errors.
```
- Start with a hypothesis: "If X fails, the system should Y"
- Run in production (start small: one instance, one AZ)
- Minimize blast radius with automatic rollback
- Build confidence incrementally: staging → canary → production
```yaml
chaos_experiment:
  name: "Payment DB failover"
  hypothesis: "If the primary database becomes unavailable, traffic should failover to the replica within 30 seconds with <1% error rate spike"

  steady_state:
    - metric: "checkout_success_rate"
      expected: ">99.5%"
    - metric: "db_query_duration_p99"
      expected: "<200ms"

  injection:
    type: "network_partition"
    target: "payment-db-primary"
    duration: "5 minutes"
    blast_radius: "single AZ"

  abort_conditions:
    - "checkout_success_rate < 95% for > 60 seconds"
    - "revenue_per_minute drops > 50%"
    - "any SEV-1 incident declared"

  results:
    failover_time: "22 seconds"
    error_spike: "0.3% for 25 seconds"
    hypothesis_confirmed: true
    follow_up_actions:
      - "Document failover behavior in runbook"
      - "Add failover time as SLI (target: <30s)"
```
| Level | What You Test | Tools |
|---|---|---|
| 1: Manual | Kill a pod, see what happens | kubectl delete pod |
| 2: Automated | Scheduled pod kills, network delays | Chaos Monkey, Litmus |
| 3: Game Days | Multi-failure scenarios with team exercise | Custom scripts + coordination |
| 4: Continuous | Automated chaos in production with auto-rollback | Gremlin, Chaos Mesh |
| # | Driver | Typical % of Bill | Optimization |
|---|---|---|---|
| 1 | Log volume | 40-60% | Reduce verbosity, drop DEBUG, sample repetitive logs |
| 2 | Metric cardinality | 15-25% | Drop unused metrics, limit labels |
| 3 | Trace volume | 10-20% | Sampling, tail-based sampling |
| 4 | Retention | 10-15% | Tiered storage (hot → warm → cold) |
| 5 | Query cost | 5-10% | Optimize dashboard queries, set max scan limits |
```yaml
cost_optimization:
  logs:
    - action: "Drop DEBUG/TRACE in production"
      savings: "30-50% of log volume"
    - action: "Sample health check logs (1:100)"
      savings: "5-15% of log volume"
    - action: "Deduplicate identical error bursts"
      savings: "10-20% during incidents"
    - action: "Move logs older than 7 days to S3/cold storage"
      savings: "60-80% of storage cost"
    - action: "Drop request/response body logging"
      savings: "20-40% of log volume"

  metrics:
    - action: "Audit unused metrics (no dashboard, no alert)"
      savings: "10-30% of series"
    - action: "Reduce histogram bucket count (default 11 → 8)"
      savings: "~27% of histogram series"
    - action: "Remove high-cardinality labels"
      savings: "Variable; can be massive"
    - action: "Increase scrape interval for non-critical metrics (15s → 60s)"
      savings: "75% of data points for those metrics"

  traces:
    - action: "Implement tail-based sampling"
      savings: "80-95% of trace volume"
    - action: "Drop internal health check traces"
      savings: "5-20% of trace volume"
    - action: "Reduce span attribute size (truncate long strings)"
      savings: "10-30% of trace storage"

  general:
    - action: "Review and right-size retention policies quarterly"
    - action: "Set query timeouts and result limits on dashboards"
    - action: "Use recording rules for expensive queries"
```
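A sketch of the "sample health check logs (1:100)" action, using a deterministic counter rather than random sampling so the rate is exact; the route list and helper name are assumptions:

```typescript
// Keep every Nth health-check log line; log everything else unconditionally.
const SAMPLED_PATHS = new Set(['/health', '/ready']);
const counters = new Map<string, number>();

function shouldLog(path: string, sampleEvery = 100): boolean {
  if (!SAMPLED_PATHS.has(path)) return true;
  const n = (counters.get(path) ?? 0) + 1;
  counters.set(path, n);
  return n % sampleEvery === 1; // keep the 1st, 101st, 201st, ...
}
```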
```yaml
observability_cost_review:
  month: "February 2026"
  total_cost: "$X,XXX"

  breakdown:
    logs: { volume: "X TB", cost: "$X", pct: "X%" }
    metrics: { series: "X million", cost: "$X", pct: "X%" }
    traces: { volume: "X TB", cost: "$X", pct: "X%" }
    infrastructure: { instances: X, cost: "$X", pct: "X%" }

  cost_per:
    request: "$0.000X"
    service: "$X average"
    engineer: "$X per engineer"

  optimizations_applied: []
  optimizations_planned: []

  budget_status: "on_track | over_budget | under_budget"
```
Every log line includes trace_id and span_id; every trace span includes service and operation; every metric includes a service label.

Correlation paths:

1. Alert fires (metric) → click through to dashboard (metric) → filter by time window → trace search (same service + time) → find failing trace → logs (filter by trace_id) → see exact error
2. Support ticket (user report) → find request_id in logs → extract trace_id → view full trace → identify slow span → check that span's service metrics → confirm pattern
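A sketch of the first correlation link: pulling trace_id/span_id from the active OpenTelemetry span into every pino log line, complementing the Pino setup shown earlier:

```typescript
import pino from 'pino';
import { trace } from '@opentelemetry/api';

// The mixin runs on every log call: if the call happens inside a traced
// request, the active span's context is attached automatically.
const logger = pino({
  mixin() {
    const ctx = trace.getActiveSpan()?.spanContext();
    return ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {};
  },
});

logger.info('payment processed'); // carries trace_id + span_id when inside a span
```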
```yaml
synthetic_checks:
  - name: "Checkout flow"
    type: browser
    frequency: 5m
    locations: [us-east, eu-west, ap-southeast]
    steps:
      - navigate: "https://app.example.com/products"
      - click: "Add to Cart"
      - click: "Checkout"
      - assert: "Order confirmation page loads in <3s"
    alert_on: "2 consecutive failures from same location"

  - name: "API health"
    type: api
    frequency: 1m
    endpoints:
      - url: "https://api.example.com/health"
        expected_status: 200
        max_latency_ms: 500
      - url: "https://api.example.com/v1/products?limit=1"
        expected_status: 200
        max_latency_ms: 1000
```
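A sketch of the API-type check as a standalone script (Node 18+ for global fetch; assumes an ESM module for top-level await, and the exit-code convention is an assumption, not part of the skill):

```typescript
// Minimal synthetic API check matching the config above: status + latency.
async function checkEndpoint(url: string, expectedStatus: number, maxLatencyMs: number) {
  const start = performance.now();
  const res = await fetch(url, { signal: AbortSignal.timeout(maxLatencyMs * 2) });
  const latency = performance.now() - start;
  const ok = res.status === expectedStatus && latency <= maxLatencyMs;
  return { url, status: res.status, latency_ms: Math.round(latency), ok };
}

const results = await Promise.all([
  checkEndpoint('https://api.example.com/health', 200, 500),
  checkEndpoint('https://api.example.com/v1/products?limit=1', 200, 1000),
]);

// Non-zero exit signals failure to the scheduler; an alert hook would go here.
if (results.some((r) => !r.ok)) process.exitCode = 1;
```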
```yaml
# Correlate feature flags with metrics
feature_flag_monitoring:
  - flag: "new_checkout_flow"
    metrics_to_compare:
      - "checkout_conversion_rate"   # by flag variant
      - "checkout_error_rate"
      - "checkout_latency_p99"
    alerts:
      - "If error rate for new variant > 2x control, auto-disable flag"
```
| Dimension | Level 1 | Level 2 | Level 3 | Level 4 |
|---|---|---|---|---|
| Logging | Unstructured logs | Structured JSON, centralized | Correlated with traces | Automated log analysis |
| Metrics | Basic infra metrics | RED/USE for services | SLO-based with error budgets | Predictive (anomaly detection) |
| Tracing | No tracing | Key services instrumented | Full distributed tracing | Trace-driven testing |
| Alerting | Static thresholds | Multi-signal alerts | Burn-rate based on SLOs | Auto-remediation |
| Incident Response | Ad hoc | Defined process + roles | Post-mortems with action tracking | Chaos engineering in prod |
| Culture | "Ops team handles it" | Shared ownership (you build it, you run it) | SLO-driven development velocity | Reliability as a feature |
| Dimension | Weight | 0 | 5 | 10 |
|---|---|---|---|---|
| Logging quality | 15% | Unstructured, no correlation | Structured JSON, missing fields | Full schema, trace correlation, PII scrubbing |
| Metrics coverage | 15% | No metrics | RED or USE, not both | RED + USE + business metrics + custom |
| Tracing completeness | 10% | No tracing | Key services | Full path, sampling strategy, tail-based |
| SLO maturity | 15% | No reliability targets | Informal targets | SLOs with error budgets, burn-rate alerts, weekly review |
| Alert quality | 15% | Noisy/missing | Actionable, some runbooks | SLO-based, full runbooks, low false positive rate |
| Incident response | 10% | Ad hoc | Defined process | Full process, roles, post-mortems, chaos engineering |
| Dashboard design | 10% | No dashboards | Basic panels | Hierarchical L1-L4, consistent, linked to alerts |
| Cost efficiency | 10% | Unknown cost | Tracked | Optimized, reviewed monthly, within budget |

- 90-100: World-class. Teach others.
- 70-89: Production-ready. Fill specific gaps.
- 50-69: Functional but fragile.
- <50: Significant reliability risk.
- Structured or it didn't happen: unstructured logs are technical debt
- Correlate everything: trace_id connects logs, traces, and metrics
- Alert on symptoms, not causes: users don't care about CPU, they care about latency
- Every alert gets a runbook: no runbook = no alert
- SLOs drive velocity: error budgets decide when to ship vs stabilize
- Dashboards have hierarchy: executives don't need pod CPU graphs
- Blameless post-mortems, always: blame prevents learning
- Cost is a feature: observability that bankrupts you isn't observability
- You build it, you run it: the team that ships code owns its observability
- Practice failure: chaos engineering builds confidence
| Command | What It Does |
|---|---|
| "Audit our observability" | Run the /16 health check, score each dimension, prioritize gaps |
| "Design logging for [service]" | Generate a structured log schema with context fields for the service |
| "Set up metrics for [service]" | Create a RED + USE + business metric instrumentation plan |
| "Create SLOs for [service]" | Define SLIs, targets, error budgets, and burn-rate alert rules |
| "Design alerts for [service]" | Create alert rules with severity, thresholds, and runbook templates |
| "Build dashboard for [service]" | Design an L2 service overview dashboard with panel specifications |
| "Write a runbook for [alert]" | Generate a structured runbook with diagnosis steps and fixes |
| "Run post-mortem for [incident]" | Generate a blameless post-mortem document with timeline and action items |
| "Set up on-call for [team]" | Design rotation, escalation policy, handoff checklist |
| "Plan chaos experiment for [scenario]" | Design an experiment with hypothesis, injection, abort conditions |
| "Optimize observability costs" | Audit current spend, identify top savings, create a reduction plan |
| "Design tracing for [system]" | Create an OpenTelemetry instrumentation plan with sampling strategy |
This skill gives you the methodology. For industry-specific implementation patterns:

- SaaS companies: AfrexAI SaaS Context Pack ($47) includes SaaS-specific SLOs, multi-tenant monitoring, and usage-based billing observability
- Fintech: AfrexAI Fintech Context Pack ($47) covers compliance audit logging, transaction monitoring, fraud detection signals
- Healthcare: AfrexAI Healthcare Context Pack ($47) covers HIPAA audit trails, PHI access logging, uptime requirements
- afrexai-devops-engine: CI/CD, infrastructure, deployment strategies
- afrexai-api-architect: API design, security, versioning
- afrexai-database-engineering: schema design, query optimization, migrations
- afrexai-code-reviewer: code review methodology with SPEAR framework
- afrexai-prompt-engineering: system prompt design, testing, optimization

Browse all AfrexAI skills: clawhub.com | Full storefront