Requirements
- Target platform
- OpenClaw
- Install method
- Manual import
- Extraction
- Extract archive
- Prerequisites
- OpenClaw
- Primary doc
- SKILL.md
Automatically detects, assesses, and safely mitigates incidents in OpenClaw production agents, providing detailed reports and verified recovery.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
Use this skill when handling incidents, degraded automations, or gateway/memory instability in production.
emergency_recovery handle_incident cron_guard memory_guard gateway_guard
This skill is runbook-only and must operate under least privilege.
Allowed read sources:
- OpenClaw cron state (openclaw cron list --json)
- Service health/status (systemctl is-active <service>)
- Recent logs for incident window (journalctl -u <service> --since ... --no-pager)
- Workspace incident artifacts (/root/.openclaw/workspace/docs/ops/, /root/.openclaw/workspace/memory/)
Allowed remediation actions (safe set):
- Retry a failed job once when failure is transient.
- Controlled restart of the impacted service only (openclaw-gateway, openclaw, or explicitly named target from incident evidence).
- Disable/enable only the directly impacted cron job when loop-failing.
- Add/adjust guardrails in runbook/config docs (non-secret, reversible).
Disallowed actions:
- No credential rotation/deletion.
- No firewall/network policy mutations.
- No package installs/upgrades during incident handling.
- No bulk cron rewrites unrelated to the incident.
- No edits to unrelated services/components.
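The read sources and safe set above could be enforced with a small allowlist. The following is a minimal sketch, not the skill's actual implementation: the function names, the action labels in `ALLOWED_ACTIONS`, and the choice to restrict services to the two named in the text are all assumptions made for illustration. The `since` value is supplied by the caller because the source leaves it unspecified.

```python
# Services the text names as restart targets; reused here as an illustrative
# allowlist for read-only checks. A target "explicitly named from incident
# evidence" would have to be added deliberately (assumption of this sketch).
KNOWN_SERVICES = {"openclaw-gateway", "openclaw"}

# Hypothetical labels for the four safe remediation actions listed above.
ALLOWED_ACTIONS = {
    "retry_job_once",
    "restart_impacted_service",
    "toggle_impacted_cron_job",
    "adjust_guardrail_docs",
}


def evidence_commands(service: str, since: str) -> list[list[str]]:
    """Build argv lists for the read-only evidence commands named above."""
    if service not in KNOWN_SERVICES:
        raise ValueError(f"service not in allowlist: {service!r}")
    if not since.strip():
        raise ValueError("an incident window start is required")
    return [
        ["openclaw", "cron", "list", "--json"],
        ["systemctl", "is-active", service],
        ["journalctl", "-u", service, "--since", since, "--no-pager"],
    ]


def is_allowed_action(action: str) -> bool:
    """Anything outside the safe set is rejected by default."""
    return action in ALLOWED_ACTIONS
```

Returning argv lists rather than shell strings keeps the service name from being interpreted by a shell if the lists are later passed to something like `subprocess.run`.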
Require explicit human approval before:
- Restarting any production service more than once.
- Editing cron schedules/timezones.
- Disabling a job for more than one cycle.
- Any action with user-visible impact beyond the failing component.
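The approval conditions above can be expressed as a single gate function. This is a sketch under stated assumptions: `ProposedAction`, its field names, and the `kind` labels are hypothetical and do not come from the skill package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    kind: str                      # hypothetical labels, e.g. "restart_service"
    restarts_so_far: int = 0       # restarts already done for this service in this incident
    disable_cycles: int = 0        # cycles a cron job would stay disabled
    user_visible_beyond_component: bool = False


def needs_human_approval(action: ProposedAction) -> bool:
    """Return True when any of the four approval conditions applies."""
    if action.kind == "restart_service" and action.restarts_so_far >= 1:
        return True  # this would be a second (or later) restart
    if action.kind == "edit_cron_schedule":
        return True  # schedule/timezone edits always need approval
    if action.kind == "disable_cron_job" and action.disable_cycles > 1:
        return True  # disabled for more than one cycle
    return action.user_visible_beyond_component
```

A first restart of the impacted service passes the gate, while a second restart of the same service does not.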
- Detect and classify severity (info, degraded, critical).
- Collect evidence first (status, logs, last run, error streak).
- Propose smallest remediation from allowed set.
- Execute only approved/safe remediation.
- Verify stabilization window (at least one successful cycle).
- Publish concise incident report.
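The first and fifth steps above (severity classification and the stabilization check) lend themselves to a short sketch. The source names the three severity levels but not the rules for assigning them, so the thresholds below are assumptions chosen only to make the example concrete, and both function names are hypothetical.

```python
def classify_severity(service_active: bool, error_streak: int) -> str:
    """Map collected evidence to one of the three severity levels.

    Assumed rule: an inactive service is critical, two or more
    consecutive failures is degraded, anything else is info.
    """
    if not service_active:
        return "critical"
    if error_streak >= 2:
        return "degraded"
    return "info"


def is_stabilized(cycles_since_fix: list[bool]) -> bool:
    """Check the stabilization window after remediation.

    The text requires at least one successful cycle; this sketch is
    slightly stricter and also requires the most recent cycle to be
    a success, so a success followed by a failure does not count.
    """
    return bool(cycles_since_fix) and cycles_since_fix[-1]
```

An empty list of cycles is treated as not yet stabilized, which keeps the check from reporting success before any evidence exists.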
- Never hide persistent failures as success.
- Never expose secrets/tokens in logs or reports.
- Prefer reversible actions and document rollback path.
- Keep blast radius minimal and explicitly stated.
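One way to honor the rule about secrets is to redact log excerpts before they reach a report. The sketch below is illustrative only: the key names in the pattern are assumptions about what a secret might look like in a log line, and a real deployment would need a pattern list matched to its own log formats.

```python
import re

# Assumed key names; matches "key=value" or "key: value" pairs.
_SECRET_PATTERN = re.compile(
    r"\b(token|secret|password|api[_-]?key)\b(\s*[=:]\s*)(\S+)",
    re.IGNORECASE,
)


def redact(line: str) -> str:
    """Replace the value of any secret-looking key with a placeholder."""
    return _SECRET_PATTERN.sub(r"\1\2[REDACTED]", line)
```

The key name and separator are kept so the excerpt still shows which field was present, while the value itself never appears in the report.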
Always include:
- Incident id/time window
- Root signal and blast radius
- Actions executed (and approvals)
- Evidence (status, key metric, short log excerpt)
- Final state (resolved, degraded, open)
- Next check time
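The required report fields above map naturally onto a small record type. This is a sketch assuming a plain-text report; the class name, field names, and output layout are hypothetical, and only the list of fields and the three final states come from the text.

```python
from dataclasses import dataclass

FINAL_STATES = {"resolved", "degraded", "open"}


@dataclass
class IncidentReport:
    incident_id: str
    time_window: str
    root_signal: str
    blast_radius: str
    actions: list[str]    # each entry should note whether it was approved
    evidence: list[str]   # status, key metric, short (redacted) log excerpt
    final_state: str
    next_check: str

    def render(self) -> str:
        """Render the report as plain text, rejecting unknown final states."""
        if self.final_state not in FINAL_STATES:
            raise ValueError(f"unknown final state: {self.final_state!r}")
        lines = [
            f"Incident: {self.incident_id} ({self.time_window})",
            f"Root signal: {self.root_signal}",
            f"Blast radius: {self.blast_radius}",
            "Actions executed (and approvals):",
            *[f"  - {a}" for a in self.actions],
            "Evidence:",
            *[f"  - {e}" for e in self.evidence],
            f"Final state: {self.final_state}",
            f"Next check: {self.next_check}",
        ]
        return "\n".join(lines)
```

Making every field mandatory means a report cannot be constructed with any of the required items left out.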
- "Gateway is flapping, recover safely."
- "Cron timed out, stabilize and prove fix."
- "Memory guard firing repeatedly, root-cause and patch."