Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Structured logging, distributed tracing, and metrics collection patterns for building observable systems. Use when implementing logging infrastructure, setting up distributed tracing with OpenTelemetry, designing metrics collection (RED/USE methods), configuring alerting and dashboards, or reviewing observability practices. Covers structured JSON logging, context propagation, trace sampling, Prometheus/Grafana stack, alert design, and PII/secret scrubbing.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.
Patterns for building observable systems across the three pillars: logs, metrics, and traces.
| Pillar | Purpose | Question It Answers | Example |
|--------|---------|---------------------|---------|
| Logs | What happened | Why did this request fail? | `{"level":"error","msg":"payment declined","user_id":"u_82"}` |
| Metrics | How much / how fast | Is latency increasing? | `http_request_duration_seconds{route="/api/orders"} 0.342` |
| Traces | Request flow | Where is the bottleneck? | Span: api-gateway → auth → order-service → db |

Each pillar is strongest when correlated. Embed `trace_id` in every log line to jump from a log entry to the full distributed trace.
Always emit logs as structured JSON, never free-text strings.
| Field | Purpose | Required |
|-------|---------|----------|
| timestamp | ISO-8601 with milliseconds | Yes |
| level | Severity (DEBUG … FATAL) | Yes |
| service | Originating service name | Yes |
| message | Human-readable description | Yes |
| trace_id | Distributed trace correlation | Yes |
| span_id | Current span within trace | Yes |
| correlation_id | Business-level correlation (order ID) | When applicable |
| error | Structured error object | On errors |
| context | Request-specific metadata | Recommended |
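For illustration, a log entry following this schema might look like the sample below; the specific values are hypothetical.

```json
{
  "timestamp": "2024-05-14T10:32:01.123Z",
  "level": "error",
  "service": "order-service",
  "message": "payment charge failed",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "correlation_id": "ORD-1234",
  "error": { "type": "CardDeclined", "code": "CARD_DECLINED" },
  "context": { "method": "POST", "path": "/api/orders" }
}
```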
Attach context at the middleware level so downstream logs inherit it automatically:

```typescript
import crypto from 'node:crypto';
import { AsyncLocalStorage } from 'node:async_hooks';

// Holds per-request context for the duration of each request.
const asyncLocalStorage = new AsyncLocalStorage<Record<string, unknown>>();

app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  asyncLocalStorage.run(ctx, () => next());
});
```
| Library | Language | Strengths | Perf |
|---------|----------|-----------|------|
| Pino | Node.js | Fastest Node logger, low overhead | Excellent |
| structlog | Python | Composable processors, context binding | Good |
| zerolog | Go | Zero-allocation JSON logging | Excellent |
| zap | Go | High performance, typed fields | Excellent |
| tracing | Rust | Spans + events, async-aware | Excellent |

Choose a logger that outputs structured JSON natively. Avoid loggers that require post-processing.
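As a minimal sketch of native structured JSON with pino, tying in the AsyncLocalStorage middleware above; the service name and option values are illustrative:

```typescript
import pino from 'pino';

const logger = pino({
  // Default to INFO in production; LOG_LEVEL overrides per environment.
  level: process.env.LOG_LEVEL ?? 'info',
  // Stamp every line with the originating service and an ISO-8601 timestamp.
  base: { service: 'order-service' },
  timestamp: pino.stdTimeFunctions.isoTime,
  // Merge the per-request context (trace_id, request_id, ...) captured by the
  // AsyncLocalStorage middleware into every log entry automatically.
  mixin() {
    return asyncLocalStorage.getStore() ?? {};
  },
});

logger.info({ order_id: 'ORD-1234' }, 'order placed successfully');
```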
| Level | When to Use | Example |
|-------|-------------|---------|
| FATAL | App cannot continue, process will exit | Database connection pool exhausted |
| ERROR | Operation failed, needs attention | Payment charge failed: CARD_DECLINED |
| WARN | Unexpected but recoverable | Retry 2/3 for upstream timeout |
| INFO | Normal business events | Order ORD-1234 placed successfully |
| DEBUG | Developer troubleshooting | Cache miss for key user:82:preferences |
| TRACE | Very fine-grained (rarely in prod) | Entering validateAddress with payload |

Rules:
- Production default = INFO and above.
- If you log an ERROR, someone should act on it.
- Every FATAL should trigger an alert.
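Because production defaults to INFO, it helps to raise verbosity temporarily without a redeploy. One possible approach is an internal-only admin route, assuming the Express app and pino logger shown above:

```typescript
// Internal-only endpoint to change verbosity at runtime,
// e.g. PUT /admin/log-level {"level":"debug"} while troubleshooting.
app.put('/admin/log-level', express.json(), (req, res) => {
  const allowed = ['fatal', 'error', 'warn', 'info', 'debug', 'trace'];
  if (!allowed.includes(req.body.level)) {
    return res.status(400).json({ error: 'invalid level' });
  }
  logger.level = req.body.level;
  res.json({ level: logger.level });
});
```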
Always prefer OpenTelemetry over vendor-specific SDKs:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```
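Note that the SDK should start before application modules are imported (for example by loading the tracing setup with `node --require`); otherwise auto-instrumentation cannot patch the libraries it targets.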
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total_cents', order.totalCents);

      await validateInventory(order);
      await chargePayment(order);

      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}
```
- Use W3C Trace Context (traceparent header), the default in OTel
- Propagate across HTTP, gRPC, and message queues
- For async workers: serialise `traceparent` into the job payload (see the sketch below)
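A minimal sketch of carrying trace context through a queue with the OTel API; the enqueue/worker pair and payload shape are assumptions for illustration:

```typescript
import { context, propagation } from '@opentelemetry/api';

// Producer: serialise the active trace context (traceparent/tracestate)
// into the job payload alongside the business data.
function enqueueOrderJob(queue: { publish(msg: object): void }, orderId: string) {
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier);
  queue.publish({ orderId, otel: carrier });
}

// Worker: restore the context before doing any traced work,
// so spans created here join the original trace.
function handleOrderJob(job: { orderId: string; otel: Record<string, string> }) {
  const parentCtx = propagation.extract(context.active(), job.otel);
  return context.with(parentCtx, () => {
    // ... process job.orderId inside the propagated trace context
  });
}
```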
| Strategy | Use When |
|----------|----------|
| Always On | Low-traffic services, debugging |
| Probabilistic (N%) | General production use |
| Rate-limited (N/sec) | High-throughput services |
| Tail-based | When you need all error traces |

Always sample 100% of error traces regardless of strategy.
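As an example of the probabilistic strategy, head sampling can be configured on the NodeSDK from the setup above; the 10% ratio is an assumed value:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

// Sample ~10% of new traces; child spans follow the parent's decision,
// so a trace is either fully sampled or fully dropped.
const sdk = new NodeSDK({
  serviceName: 'order-service',
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});
```

Tail-based sampling, which is what lets you keep every error trace, is typically implemented in the OTel Collector rather than in the SDK.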
Monitor these three for every service endpoint:

| Metric | What It Measures | Prometheus Example |
|--------|------------------|--------------------|
| Rate | Requests/sec | `rate(http_requests_total[5m])` |
| Errors | Failed request ratio | `rate(http_requests_total{status=~"5.."}[5m])` |
| Duration | Response time | `histogram_quantile(0.99, http_request_duration_seconds)` |
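One way to expose these from a Node service is prom-client; a sketch assuming an Express app, where the bucket boundaries, label names, and `/metrics` route are illustrative choices:

```typescript
import express from 'express';
import client from 'prom-client';

const app = express();

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

// Record rate, errors (via the status label), and duration for every request.
app.use((req, res, next) => {
  const stopTimer = httpRequestDuration.startTimer();
  res.on('finish', () => {
    const labels = {
      method: req.method,
      route: req.route?.path ?? req.path,
      status: String(res.statusCode),
    };
    httpRequestsTotal.inc(labels);
    stopTimer(labels);
  });
  next();
});

// Expose metrics for Prometheus to scrape.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
```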
For infrastructure components (CPU, memory, disk, network):

| Metric | What It Measures | Example |
|--------|------------------|---------|
| Utilization | % resource busy | CPU usage at 78% |
| Saturation | Work queued/waiting | 12 requests queued in thread pool |
| Errors | Error events on resource | 3 disk I/O errors in last minute |
| Tool | Category | Best For |
|------|----------|----------|
| Prometheus | Metrics | Pull-based metrics, alerting rules |
| Grafana | Visualisation | Dashboards for metrics, logs, traces |
| Jaeger | Tracing | Distributed trace visualisation |
| Loki | Logs | Log aggregation (pairs with Grafana) |
| OpenTelemetry | Collection | Vendor-neutral telemetry collection |

Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies the cost.
| Severity | Response Time | Example |
|----------|---------------|---------|
| P1 | Immediate | Service fully down, data loss |
| P2 | < 30 min | Error rate > 5%, latency p99 > 5s |
| P3 | Business hours | Disk > 80%, cert expiring in 7 days |
| P4 | Best effort | Non-critical deprecation warning |
- Alert on symptoms, not causes: "error rate > 5%", not "pod restarted"
- Multi-window, multi-burn-rate: catch both sudden spikes and slow burns
- Require runbook links: every alert must link to diagnosis and remediation
- Review monthly: delete or tune alerts that never fire or always fire
- Group related alerts: use inhibition rules to suppress child alerts
- Set appropriate thresholds: if an alert fires daily and is ignored, raise the threshold or delete it
- Total requests/sec across all services
- Global error rate (%) with trendline
- p50 / p95 / p99 latency
- Active alert count by severity
- Deployment markers overlaid on graphs
- RED metrics for each endpoint
- Dependency health (upstream/downstream success rates)
- Resource utilisation (CPU, memory, connections)
- Top errors table with count and last seen
Every service must have:
- Structured JSON logging with a consistent schema
- Correlation / trace IDs propagated on all requests
- RED metrics exposed for every external endpoint
- Health check endpoints (`/healthz` and `/readyz`, see the sketch after this list)
- Distributed tracing with OpenTelemetry
- Dashboards for RED metrics and resource utilisation
- Alerts for error rate, latency, and saturation, with runbook links
- Log level configurable at runtime without redeployment
- PII scrubbing verified and tested
- Retention policies defined for logs, metrics, and traces
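For the health-check item, a minimal Express sketch; `checkDatabase()` is a hypothetical stand-in for whatever dependency checks the service actually needs:

```typescript
// Liveness: the process is running and able to respond.
app.get('/healthz', (_req, res) => {
  res.status(200).json({ status: 'ok' });
});

// Readiness: critical dependencies are reachable; returning 503 takes the
// instance out of load balancer rotation until it recovers.
app.get('/readyz', async (_req, res) => {
  try {
    await checkDatabase(); // hypothetical dependency check
    res.status(200).json({ status: 'ready' });
  } catch {
    res.status(503).json({ status: 'not ready' });
  }
});
```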
| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| Logging PII | Privacy/compliance violation | Mask or exclude PII; use token references |
| Excessive logging | Storage costs balloon, signal drowns | Log business events, not data flow |
| Unstructured logs | Cannot query or alert on fields | Use structured JSON with a consistent schema |
| String interpolation | Breaks structured fields, injection risk | Pass fields as metadata, not in the message |
| Missing correlation IDs | Cannot trace across services | Generate and propagate trace_id everywhere |
| Alert storms | On-call fatigue, real issues buried | Use grouping, inhibition, deduplication |
| Metrics with high cardinality | Prometheus OOM, dashboard timeouts | Never use user ID or request ID as a label |
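For the PII row, one option is redaction at the logger itself; a pino sketch where the redacted paths are illustrative and should be extended to match your own log schema:

```typescript
import pino from 'pino';

// Redact known sensitive paths before each line is serialised.
const logger = pino({
  redact: {
    paths: ['password', 'token', 'user.email', 'headers.authorization', 'card.number'],
    censor: '[REDACTED]',
  },
});

logger.info({ user: { id: 'u_82', email: 'a@example.com' } }, 'user signed in');
// the email field is emitted as "[REDACTED]"
```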
- NEVER log passwords, tokens, API keys, or secrets, even at DEBUG level
- NEVER use console.log / print in production; use a structured logger
- NEVER use user IDs, emails, or request IDs as metric labels; cardinality will explode
- NEVER create alerts without a runbook link; unactionable alerts erode trust
- NEVER rely on logs alone; you need metrics and traces for full observability
- NEVER log request/response bodies by default; opt-in only, with PII redaction
- NEVER ignore log volume; set budgets and alert when a service exceeds its daily quota
- NEVER skip context propagation in async flows; broken traces are worse than no traces