Tencent SkillHub · AI

Logging & Observability

Structured logging, distributed tracing, and metrics collection patterns for building observable systems. Use when implementing logging infrastructure, setting up distributed tracing with OpenTelemetry, designing metrics collection (RED/USE methods), configuring alerting and dashboards, or reviewing observability practices. Covers structured JSON logging, context propagation, trace sampling, Prometheus/Grafana stack, alert design, and PII/secret scrubbing.




Install for OpenClaw

Quick setup
  1. Download the package from Yavira.
  2. Extract the archive and review SKILL.md first.
  3. Import or place the package into your OpenClaw setup.

Requirements

Target platform
OpenClaw
Install method
Manual import
Extraction
Extract archive
Prerequisites
OpenClaw
Primary doc
SKILL.md

Package facts

Download mode
Yavira redirect
Package format
ZIP package
Source platform
Tencent SkillHub
What's included
README.md, SKILL.md

Validation

  • Use the Yavira download entry.
  • Review SKILL.md after the package is downloaded.
  • Confirm the extracted package contains the expected setup assets.

Install with your agent

Agent handoff

Hand the extracted package to your coding agent with a concrete install brief rather than working out the steps manually.

  1. Download the package from Yavira.
  2. Extract it into a folder your agent can access.
  3. Paste one of the prompts below and point your agent at the extracted folder.
New install

I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.

Upgrade existing

I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.

Trust & source

Release facts

Source
Tencent SkillHub
Verification
Indexed source record
Version
0.1.0

Documentation

Primary doc: SKILL.md (21 sections)

Logging & Observability

Patterns for building observable systems across the three pillars: logs, metrics, and traces.

Three Pillars

| Pillar | Purpose | Question It Answers | Example |
| --- | --- | --- | --- |
| Logs | What happened | Why did this request fail? | `{"level":"error","msg":"payment declined","user_id":"u_82"}` |
| Metrics | How much / how fast | Is latency increasing? | `http_request_duration_seconds{route="/api/orders"} 0.342` |
| Traces | Request flow | Where is the bottleneck? | Span: api-gateway → auth → order-service → db |

Each pillar is strongest when correlated. Embed `trace_id` in every log line so you can jump from a log entry to the full distributed trace.

Structured Logging

Always emit logs as structured JSON, never free-text strings.

Required Fields

| Field | Purpose | Required |
| --- | --- | --- |
| `timestamp` | ISO-8601 with milliseconds | Yes |
| `level` | Severity (DEBUG … FATAL) | Yes |
| `service` | Originating service name | Yes |
| `message` | Human-readable description | Yes |
| `trace_id` | Distributed trace correlation | Yes |
| `span_id` | Current span within trace | Yes |
| `correlation_id` | Business-level correlation (e.g. order ID) | When applicable |
| `error` | Structured error object | On errors |
| `context` | Request-specific metadata | Recommended |
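As an illustration of the schema above, the required fields can be assembled by a small helper. This is a sketch, not a real library API: `buildLogEntry` and `LogContext` are hypothetical names, and a production logger would also handle the `error` object and PII scrubbing.

```typescript
interface LogContext {
  trace_id: string;
  span_id: string;
  correlation_id?: string;
}

// Builds one structured JSON log line covering the required fields.
function buildLogEntry(
  level: string,
  service: string,
  message: string,
  ctx: LogContext,
  extra: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    timestamp: new Date().toISOString(), // ISO-8601 with milliseconds
    level,
    service,
    message,
    trace_id: ctx.trace_id,
    span_id: ctx.span_id,
    // Business-level correlation only when applicable:
    ...(ctx.correlation_id ? { correlation_id: ctx.correlation_id } : {}),
    context: extra,
  });
}

const line = buildLogEntry('INFO', 'order-service', 'Order placed', {
  trace_id: '4bf92f3577b34da6a3ce929d0e0e4736',
  span_id: '00f067aa0ba902b7',
  correlation_id: 'ORD-1234',
});
```

Because every line shares this shape, log queries and alerts can filter on fields instead of regexes over free text.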

Context Enrichment

Attach context at the middleware level so downstream logs inherit it automatically:

```javascript
app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  asyncLocalStorage.run(ctx, () => next());
});
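A runnable sketch of what the middleware pattern buys you: a logger that pulls whatever context was stored for the current async scope, using Node's built-in `AsyncLocalStorage`. The `log` helper is hypothetical, not part of any framework.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

const asyncLocalStorage = new AsyncLocalStorage<Record<string, unknown>>();

// Every log call merges in the ambient request context, so handlers
// never have to pass trace_id around by hand.
function log(level: string, message: string): string {
  const ctx = asyncLocalStorage.getStore() ?? {};
  return JSON.stringify({ level, message, ...ctx });
}

// Simulates the middleware wrapping a request handler:
const line = asyncLocalStorage.run(
  { trace_id: 'abc123', path: '/api/orders' },
  () => log('info', 'order received'),
);
```

The handler never mentions `trace_id` or `path`, yet both appear in `line`; any `await`s inside the `run` callback keep the same store.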

Library Recommendations

| Library | Language | Strengths | Perf |
| --- | --- | --- | --- |
| Pino | Node.js | Fastest Node logger, low overhead | Excellent |
| structlog | Python | Composable processors, context binding | Good |
| zerolog | Go | Zero-allocation JSON logging | Excellent |
| zap | Go | High performance, typed fields | Excellent |
| tracing | Rust | Spans + events, async-aware | Excellent |

Choose a logger that outputs structured JSON natively. Avoid loggers that require post-processing.

Log Levels

| Level | When to Use | Example |
| --- | --- | --- |
| FATAL | App cannot continue, process will exit | Database connection pool exhausted |
| ERROR | Operation failed, needs attention | Payment charge failed: CARD_DECLINED |
| WARN | Unexpected but recoverable | Retry 2/3 for upstream timeout |
| INFO | Normal business events | Order ORD-1234 placed successfully |
| DEBUG | Developer troubleshooting | Cache miss for key user:82:preferences |
| TRACE | Very fine-grained (rarely in prod) | Entering validateAddress with payload |

Rules:

- Production default = INFO and above.
- If you log an ERROR, someone should act on it.
- Every FATAL should trigger an alert.
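The level ordering implies a simple threshold filter. Here is a minimal sketch of one that can be adjusted at runtime (the names `shouldLog`/`setLevel` are illustrative; real loggers like Pino expose an equivalent `level` setting):

```typescript
const LEVELS = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL'] as const;
type Level = (typeof LEVELS)[number];

// Production default per the rules above.
let currentLevel: Level = 'INFO';

// Changing this at runtime (e.g. via an admin endpoint or signal handler)
// avoids a redeploy just to see DEBUG output during an incident.
function setLevel(level: Level): void {
  currentLevel = level;
}

function shouldLog(level: Level): boolean {
  return LEVELS.indexOf(level) >= LEVELS.indexOf(currentLevel);
}
```

This is the mechanism behind the checklist item "log level configurable at runtime without redeployment".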

OpenTelemetry Setup

Always prefer OpenTelemetry over vendor-specific SDKs:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

Span Creation

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total_cents', order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
      span.recordException(err as Error);
      throw err;
    } finally {
      span.end();
    }
  });
}
```

Context Propagation

- Use W3C Trace Context (the `traceparent` header): the default in OTel
- Propagate across HTTP, gRPC, and message queues
- For async workers: serialise `traceparent` into the job payload
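For the async-worker case, the `traceparent` value can be carried inside the job payload. A hand-rolled sketch of the round trip follows; in a real service you would use OTel's propagation API rather than parsing the header yourself, and the `enqueueJob`/`dequeueJob` names are hypothetical.

```typescript
// W3C Trace Context format: version-traceid-spanid-flags,
// e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
interface TraceContext {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

function parseTraceparent(header: string): TraceContext | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return { traceId: m[1], spanId: m[2], sampled: (parseInt(m[3], 16) & 0x01) === 1 };
}

// Producer: copy the incoming header into the job payload.
function enqueueJob(payload: object, traceparent: string): string {
  return JSON.stringify({ ...payload, traceparent });
}

// Worker: restore the context before doing any work, so its spans
// join the original trace instead of starting a new, disconnected one.
function dequeueJob(raw: string): { payload: object; ctx: TraceContext | null } {
  const { traceparent, ...payload } = JSON.parse(raw);
  return { payload, ctx: parseTraceparent(traceparent) };
}
```

The key design point: the producer serialises the context at enqueue time, because the worker may run minutes later on a different host where no ambient context exists.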

Trace Sampling

| Strategy | Use When |
| --- | --- |
| Always On | Low-traffic services, debugging |
| Probabilistic (N%) | General production use |
| Rate-limited (N/sec) | High-throughput services |
| Tail-based | When you need all error traces |

Always sample 100% of error traces regardless of strategy.
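The "always sample errors" rule composes with probabilistic sampling roughly like this. A sketch under two stated assumptions: the hash function is illustrative (real samplers use the trace ID's random bits directly), and error status is only known at span end, which is why collectors implement this as tail-based sampling.

```typescript
// Deterministic probabilistic sampling: hash the trace ID so every
// service makes the same keep/drop decision for a given trace,
// keeping traces whole across service boundaries.
function hashToUnit(traceId: string): number {
  let h = 0;
  for (const ch of traceId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h / 0xffffffff; // map to [0, 1]
}

function shouldSample(traceId: string, rate: number, isError: boolean): boolean {
  if (isError) return true; // keep 100% of error traces, regardless of rate
  return hashToUnit(traceId) < rate;
}
```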

RED Method (Request-Driven)

Monitor these three for every service endpoint:

| Metric | What It Measures | Prometheus Example |
| --- | --- | --- |
| Rate | Requests/sec | `rate(http_requests_total[5m])` |
| Errors | Failed request ratio | `rate(http_requests_total{status=~"5.."}[5m])` |
| Duration | Response time | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` |
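As a plain-code illustration of what those PromQL queries compute, here is a sketch over an in-memory window of requests. The `ReqSample` shape is hypothetical, and the p99 uses nearest-rank; Prometheus instead interpolates between histogram buckets, so its value is an estimate.

```typescript
interface ReqSample {
  status: number;
  durationSeconds: number;
}

// Rate: requests per second over the window.
function rate(reqs: ReqSample[], windowSeconds: number): number {
  return reqs.length / windowSeconds;
}

// Errors: fraction of requests with a 5xx status.
function errorRatio(reqs: ReqSample[]): number {
  if (reqs.length === 0) return 0;
  return reqs.filter((r) => r.status >= 500 && r.status < 600).length / reqs.length;
}

// Duration: nearest-rank p99 over observed latencies.
function p99(reqs: ReqSample[]): number {
  const sorted = reqs.map((r) => r.durationSeconds).sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil(0.99 * sorted.length) - 1);
  return sorted[idx];
}
```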

USE Method (Resource-Driven)

For infrastructure components (CPU, memory, disk, network):

| Metric | What It Measures | Example |
| --- | --- | --- |
| Utilization | % resource busy | CPU usage at 78% |
| Saturation | Work queued/waiting | 12 requests queued in thread pool |
| Errors | Error events on resource | 3 disk I/O errors in last minute |

Monitoring Stack

| Tool | Category | Best For |
| --- | --- | --- |
| Prometheus | Metrics | Pull-based metrics, alerting rules |
| Grafana | Visualisation | Dashboards for metrics, logs, traces |
| Jaeger | Tracing | Distributed trace visualisation |
| Loki | Logs | Log aggregation (pairs with Grafana) |
| OpenTelemetry | Collection | Vendor-neutral telemetry collection |

Recommendation: start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when the operational overhead justifies the cost.

Severity Levels

| Severity | Response Time | Example |
| --- | --- | --- |
| P1 | Immediate | Service fully down, data loss |
| P2 | < 30 min | Error rate > 5%, latency p99 > 5s |
| P3 | Business hours | Disk > 80%, cert expiring in 7 days |
| P4 | Best effort | Non-critical deprecation warning |

Alert Fatigue Prevention

- Alert on symptoms, not causes: "error rate > 5%", not "pod restarted"
- Multi-window, multi-burn-rate: catch both sudden spikes and slow burns
- Require runbook links: every alert must link to diagnosis and remediation
- Review monthly: delete or tune alerts that never fire or always fire
- Group related alerts: use inhibition rules to suppress child alerts
- Set appropriate thresholds: if an alert fires daily and is ignored, raise the threshold or delete it
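The multi-window, multi-burn-rate item can be sketched numerically. The thresholds below follow the commonly cited pattern for a 99.9% SLO (burn rate 14.4 over 1h, confirmed by a 5m window); treat the exact numbers as an assumption to tune for your own SLO and windows.

```typescript
// Burn rate = observed error ratio / error budget.
// For a 99.9% SLO the budget is 0.001, so a burn rate of 14.4
// would consume the entire 30-day budget in about 2 days.
const ERROR_BUDGET = 0.001;

function burnRate(errorRatio: number): number {
  return errorRatio / ERROR_BUDGET;
}

// Page only when BOTH windows burn fast: the long window proves the
// problem is sustained (not a blip), the short window proves it is
// still happening (so the page stops once the problem is fixed).
function shouldPage(errorRatio1h: number, errorRatio5m: number): boolean {
  return burnRate(errorRatio1h) > 14.4 && burnRate(errorRatio5m) > 14.4;
}
```

This is exactly how the rule catches spikes without paging on noise: a brief spike fails the long-window test, and a recovered incident fails the short-window test.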

Overview Dashboard ("War Room")

- Total requests/sec across all services
- Global error rate (%) with trendline
- p50 / p95 / p99 latency
- Active alerts count by severity
- Deployment markers overlaid on graphs

Service Dashboard (Per-Service)

- RED metrics for each endpoint
- Dependency health (upstream/downstream success rates)
- Resource utilisation (CPU, memory, connections)
- Top errors table with count and last seen

Observability Checklist

Every service must have:

- Structured JSON logging with consistent schema
- Correlation / trace IDs propagated on all requests
- RED metrics exposed for every external endpoint
- Health check endpoints (`/healthz` and `/readyz`)
- Distributed tracing with OpenTelemetry
- Dashboards for RED metrics and resource utilisation
- Alerts for error rate, latency, and saturation, with runbook links
- Log level configurable at runtime without redeployment
- PII scrubbing verified and tested
- Retention policies defined for logs, metrics, and traces
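The health-check item in the list above distinguishes liveness from readiness; the routing logic reduces to a few lines. A sketch, with `dependenciesReady` standing in as a placeholder for real dependency checks (DB connection, cache warm-up, etc.):

```typescript
// /healthz: liveness — is the process up at all? Failing it gets the
//   process restarted, so it must not depend on external systems.
// /readyz: readiness — can this instance serve traffic right now?
//   Failing it removes the instance from the load balancer only.
function healthStatus(path: string, dependenciesReady: boolean): number {
  if (path === '/healthz') return 200;
  if (path === '/readyz') return dependenciesReady ? 200 : 503;
  return 404;
}
```

The split matters: if `/healthz` also checked dependencies, a database outage would restart every pod instead of just draining traffic from them.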

Anti-Patterns

| Anti-Pattern | Problem | Fix |
| --- | --- | --- |
| Logging PII | Privacy/compliance violation | Mask or exclude PII; use token references |
| Excessive logging | Storage costs balloon, signal drowns | Log business events, not data flow |
| Unstructured logs | Cannot query or alert on fields | Use structured JSON with consistent schema |
| String interpolation | Breaks structured fields, injection risk | Pass fields as metadata, not in message |
| Missing correlation IDs | Cannot trace across services | Generate and propagate `trace_id` everywhere |
| Alert storms | On-call fatigue, real issues buried | Use grouping, inhibition, deduplication |
| Metrics with high cardinality | Prometheus OOM, dashboard timeouts | Never use user ID or request ID as label |
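A minimal sketch of what the "Logging PII" fix might look like at log time: recursive scrubbing by field name. This is one assumed approach, not a complete solution; real deployments usually combine a denylist like this with pattern-based detectors (e.g. for emails or card numbers in free text).

```typescript
// Field names treated as sensitive; extend for your own domain.
const SENSITIVE_KEYS = new Set(['password', 'token', 'api_key', 'email', 'ssn']);

// Walks a value about to be logged and masks sensitive fields,
// recursing through nested objects and arrays.
function scrub(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(scrub);
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value as Record<string, unknown>)) {
      out[k] = SENSITIVE_KEYS.has(k.toLowerCase()) ? '[REDACTED]' : scrub(v);
    }
    return out;
  }
  return value;
}
```

Run this in the logger's serialisation path rather than at call sites, so "PII scrubbing verified and tested" from the checklist holds for every log line, not just the careful ones.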

NEVER Do

- NEVER log passwords, tokens, API keys, or secrets, even at DEBUG level
- NEVER use console.log / print in production; use a structured logger
- NEVER use user IDs, emails, or request IDs as metric labels; cardinality will explode
- NEVER create alerts without a runbook link; unactionable alerts erode trust
- NEVER rely on logs alone; you need metrics and traces for full observability
- NEVER log request/response bodies by default; opt-in only, with PII redaction
- NEVER ignore log volume; set budgets and alert when a service exceeds its daily quota
- NEVER skip context propagation in async flows; broken traces are worse than no traces

Category context

Agent frameworks, memory systems, reasoning layers, and model-native orchestration.

Source: Tencent SkillHub

Largest current source with strong distribution and engagement signals.

Package contents

Included in package
2 Docs
  • SKILL.md Primary doc
  • README.md Docs