# Send SRE & Incident Management Platform to your agent
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
## Fast path
- Download the package from Yavira.
- Extract it into a folder your agent can access.
- Paste one of the prompts below and point your agent at the extracted folder.
## Suggested prompts
### New install

```text
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.
```
### Upgrade existing

```text
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.
```
## Machine-readable fields
```json
{
  "schemaVersion": "1.0",
  "item": {
    "slug": "afrexai-sre-platform",
    "name": "SRE & Incident Management Platform",
    "source": "tencent",
    "type": "skill",
    "category": "其他",
    "sourceUrl": "https://clawhub.ai/1kalin/afrexai-sre-platform",
    "canonicalUrl": "https://clawhub.ai/1kalin/afrexai-sre-platform",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadUrl": "/downloads/afrexai-sre-platform",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=afrexai-sre-platform",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "packageFormat": "ZIP package",
    "primaryDoc": "SKILL.md",
    "includedAssets": [
      "README.md",
      "SKILL.md"
    ],
    "downloadMode": "redirect",
    "sourceHealth": {
      "source": "tencent",
      "slug": "afrexai-sre-platform",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-04-29T10:26:33.836Z",
      "expiresAt": "2026-05-06T10:26:33.836Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=afrexai-sre-platform",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=afrexai-sre-platform",
        "contentDisposition": "attachment; filename=\"afrexai-sre-platform-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null,
        "slug": "afrexai-sre-platform"
      },
      "scope": "item",
      "summary": "Item download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this item.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/afrexai-sre-platform"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    }
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/afrexai-sre-platform",
    "downloadUrl": "https://openagent3.xyz/downloads/afrexai-sre-platform",
    "agentUrl": "https://openagent3.xyz/skills/afrexai-sre-platform/agent",
    "manifestUrl": "https://openagent3.xyz/skills/afrexai-sre-platform/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/afrexai-sre-platform/agent.md"
  }
}
```
## Documentation

### SRE & Incident Management Platform

Complete Site Reliability Engineering system — from SLO definition through incident response, chaos engineering, and operational excellence. Zero dependencies.

### Phase 1: Reliability Assessment

Before building anything, assess where you are.

### Service Catalog Entry

```yaml
service:
  name: ""
  tier: ""  # critical | important | standard | experimental
  owner_team: ""
  oncall_rotation: ""
  dependencies:
    upstream: []    # services we call
    downstream: []  # services that call us
  data_classification: ""  # public | internal | confidential | restricted
  deployment_frequency: ""  # daily | weekly | biweekly | monthly
  architecture: ""  # monolith | microservice | serverless | hybrid
  language: ""
  infra: ""  # k8s | ECS | Lambda | VM | bare-metal
  traffic_pattern: ""  # steady | diurnal | spiky | seasonal
  peak_rps: 0
  storage_gb: 0
  monthly_cost_usd: 0
```

### Maturity Assessment (Score 1-5 per dimension)

| Dimension | 1 (Ad-hoc) | 3 (Defined) | 5 (Optimized) | Score |
| --- | --- | --- | --- | --- |
| SLOs | No SLOs defined | SLOs exist, reviewed quarterly | Data-driven SLOs, auto error budgets | |
| Monitoring | Basic health checks | Golden signals + dashboards | Full observability, anomaly detection | |
| Incident Response | No runbooks, hero culture | Documented process, postmortems | Automated detection, structured ICS | |
| Automation | Manual deployments | CI/CD pipeline, some automation | Self-healing, auto-scaling, GitOps | |
| Chaos Engineering | No testing | Basic failure injection | Continuous chaos in production | |
| Capacity Planning | Reactive scaling | Quarterly forecasting | Predictive auto-scaling | |
| Toil Management | >50% toil | Toil tracked, reduction plans | <25% toil, systematic elimination | |
| On-Call Health | Burnout, 24/7 individuals | Rotation exists, escalation paths | Balanced load, <2 pages/shift | |

Score interpretation:

- 8-16: Firefighting mode — start with SLOs + incident process
- 17-24: Foundation built — add chaos engineering + toil reduction
- 25-32: Maturing — optimize error budgets + capacity planning
- 33-40: Advanced — focus on predictive reliability + culture
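To make the band lookup mechanical, the scoring can be sketched in a few lines (the function name and sample scores are illustrative, not part of the package):

```python
def maturity_band(scores):
    """Map the eight 1-5 dimension scores to the interpretation bands above."""
    if len(scores) != 8 or not all(1 <= s <= 5 for s in scores):
        raise ValueError("expected eight scores between 1 and 5")
    total = sum(scores)  # totals range from 8 to 40
    if total <= 16:
        band = "Firefighting mode"
    elif total <= 24:
        band = "Foundation built"
    elif total <= 32:
        band = "Maturing"
    else:
        band = "Advanced"
    return total, band

print(maturity_band([2, 3, 2, 3, 1, 2, 2, 3]))  # (18, 'Foundation built')
```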

### SLI Selection by Service Type

| Service Type | Primary SLI | Secondary SLIs |
| --- | --- | --- |
| API/Backend | Request success rate | Latency p50/p95/p99, throughput |
| Frontend/Web | Page load (LCP) | FID/INP, CLS, error rate |
| Data Pipeline | Freshness | Correctness, completeness, throughput |
| Storage | Durability | Availability, latency |
| Streaming | Processing latency | Throughput, ordering, data loss rate |
| Batch Job | Success rate | Duration, SLA compliance |
| ML Model | Prediction latency | Accuracy drift, feature freshness |

### SLI Specification Template

```yaml
sli:
  name: "request_success_rate"
  description: "Proportion of valid requests served successfully"
  type: "availability"  # availability | latency | quality | freshness
  measurement:
    good_events: "HTTP responses with status < 500"
    total_events: "All HTTP requests excluding health checks"
    source: "load balancer access logs"
    aggregation: "sum(good) / sum(total) over rolling 28-day window"
  exclusions:
    - "Health check endpoints (/healthz, /readyz)"
    - "Synthetic monitoring traffic"
    - "Requests from blocked IPs"
    - "4xx responses (client errors)"
```

### SLO Target Selection Guide

| Nines | Uptime % | Downtime/month | Appropriate for |
| --- | --- | --- | --- |
| 2 nines | 99% | 7h 18m | Internal tools, dev environments |
| 2.5 | 99.5% | 3h 39m | Non-critical services, backoffice |
| 3 nines | 99.9% | 43m 50s | Standard production services |
| 3.5 | 99.95% | 21m 55s | Important customer-facing services |
| 4 nines | 99.99% | 4m 23s | Critical services, payments, auth |
| 5 nines | 99.999% | 26s | Life-safety, financial clearing |

Rules for setting targets:

- Start lower than you think — you can always tighten
- SLO < SLA (always have buffer — typically 0.1-0.5% margin)
- Internal SLO < External SLO (catch problems before customers do)
- Each nine costs ~10x more to achieve
- If you can't measure it, you can't SLO it
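The downtime column follows directly from the target. A quick sketch of the conversion (the table above uses a ~30.44-day calendar month, while the SLO templates below use a 28-day rolling window):

```python
def downtime_per_window(target_pct, window_days=28.0):
    """Allowed downtime in minutes for an SLO target over a window."""
    return (100.0 - target_pct) / 100.0 * window_days * 24 * 60

# 99.9% over 28 days -> roughly the 40-minute error budget used below
print(round(downtime_per_window(99.9), 1))
# 99% over an average calendar month -> about 7h 18m (438 min)
print(round(downtime_per_window(99.0, 30.4375)))
```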

### SLO Document Template

```yaml
slo:
  service: ""
  sli: ""
  target: 99.9  # percentage
  window: "28d"  # rolling window
  error_budget: 0.1  # 100% - target
  error_budget_minutes: 40  # per 28-day window

  burn_rate_alerts:
    - name: "fast_burn"
      burn_rate: 14.4  # exhausts budget in ~2 days
      short_window: "5m"
      long_window: "1h"
      severity: "page"
    - name: "medium_burn"
      burn_rate: 6.0   # exhausts budget in ~5 days
      short_window: "30m"
      long_window: "6h"
      severity: "page"
    - name: "slow_burn"
      burn_rate: 1.0   # exhausts budget in exactly 28 days
      short_window: "6h"
      long_window: "3d"
      severity: "ticket"

  review_cadence: "monthly"
  owner: ""
  stakeholders: []

  escalation_when_budget_exhausted:
    - "Halt non-critical deployments"
    - "Redirect engineering to reliability work"
    - "Escalate to VP Engineering if no improvement in 48h"
```

### Error Budget Policy

```yaml
error_budget_policy:
  service: ""

  budget_states:
    healthy:
      condition: "remaining_budget > 50%"
      actions:
        - "Normal development velocity"
        - "Feature work prioritized"
        - "Chaos experiments allowed"

    warning:
      condition: "remaining_budget 25-50%"
      actions:
        - "Increase monitoring scrutiny"
        - "Review recent changes for risk"
        - "Limit risky deployments to business hours"
        - "No chaos experiments"

    critical:
      condition: "remaining_budget 0-25%"
      actions:
        - "Feature freeze — reliability work only"
        - "All deployments require SRE approval"
        - "Mandatory rollback plan for every change"
        - "Daily error budget review"

    exhausted:
      condition: "remaining_budget <= 0"
      actions:
        - "Complete deployment freeze"
        - "All engineering redirected to reliability"
        - "VP Engineering notified"
        - "Postmortem required for budget exhaustion"
        - "Freeze maintained until budget recovers to 10%"

  exceptions:
    - "Security patches always allowed"
    - "Regulatory compliance changes always allowed"
    - "Data loss prevention always allowed"

  reset: "Rolling 28-day window (no manual resets)"
```

### Burn Rate Calculation

```text
Burn rate = (error rate observed) / (error rate allowed by SLO)

Example:
- SLO: 99.9% (error budget = 0.1%)
- Current error rate: 0.5%
- Burn rate = 0.5% / 0.1% = 5x

At 5x burn rate → budget exhausted in 28d / 5 = 5.6 days
```
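The same arithmetic as the example, as a small helper (names are illustrative):

```python
def burn_rate(observed_error_rate_pct, slo_target_pct):
    """Observed error rate divided by the rate the SLO allows."""
    allowed = 100.0 - slo_target_pct  # e.g. 0.1 for a 99.9% target
    return observed_error_rate_pct / allowed

def days_to_exhaustion(rate, window_days=28):
    """How long a full budget lasts at a constant burn rate."""
    return window_days / rate

rate = round(burn_rate(0.5, 99.9), 6)
print(rate, round(days_to_exhaustion(rate), 1))  # 5.0 5.6
```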

### Error Budget Dashboard

Track weekly:

| Metric | Current | Trend | Status |
| --- | --- | --- | --- |
| Budget remaining (%) | | ↑↓→ | 🟢🟡🔴 |
| Budget consumed this week | | | |
| Burn rate (1h / 6h / 24h) | | | |
| Incidents consuming budget | | | |
| Top error contributor | | | |
| Projected exhaustion date | | | |

### Four Golden Signals

| Signal | What to Measure | Alert When |
| --- | --- | --- |
| Latency | p50, p95, p99 response time | p99 > 2x baseline for 5 min |
| Traffic | Requests/sec, concurrent users | >30% drop (indicates upstream issue) OR >50% spike |
| Errors | 5xx rate, timeout rate, exception rate | Error rate > SLO burn rate threshold |
| Saturation | CPU, memory, disk, connections, queue depth | >80% sustained for 10 min |

### USE Method (Infrastructure)

For every resource, track:

- Utilization: % of capacity used (0-100%)
- Saturation: queue depth / wait time (0 = no waiting)
- Errors: error count / error rate

### RED Method (Services)

For every service, track:

- Rate: requests per second
- Errors: failed requests per second
- Duration: latency distribution

### Alert Design Rules

- Every alert must have a runbook link — no exceptions
- Every alert must be actionable — if you can't act on it, delete it
- Symptoms over causes — alert on "users can't check out", not "database CPU high"
- Multi-window, multi-burn-rate — avoid single-threshold alerts
- Page only for customer impact — everything else is a ticket
- Alert fatigue = death — review alert volume monthly; target <5 pages/week per service

### Alert Severity Guide

| Severity | Response Time | Notification | Examples |
| --- | --- | --- | --- |
| P0/Page | <5 min | PagerDuty + phone | SLO burn rate critical, data loss, security breach |
| P1/Urgent | <30 min | Slack + PagerDuty | Degraded service, elevated errors, capacity warning |
| P2/Ticket | Next business day | Ticket auto-created | Slow burn, non-critical component down |
| P3/Log | Weekly review | Dashboard only | Informational, trend detection |

### Structured Log Standard

```json
{
  "timestamp": "2026-02-17T11:24:00.000Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123",
  "span_id": "def456",
  "message": "Payment processing failed",
  "error_type": "TimeoutException",
  "error_message": "Gateway timeout after 30s",
  "http_method": "POST",
  "http_path": "/api/v1/payments",
  "http_status": 504,
  "duration_ms": 30012,
  "customer_id": "cust_xxx",
  "payment_id": "pay_yyy",
  "amount_cents": 4999,
  "retry_count": 2,
  "environment": "production",
  "host": "payment-api-7b4d9-xk2p1",
  "region": "us-east-1"
}
```

### Severity Classification Matrix

| Failure Mode | Impact: 1 User | Impact: <25% Users | Impact: >25% Users | Impact: All Users |
| --- | --- | --- | --- | --- |
| Core function down | SEV3 | SEV2 | SEV1 | SEV1 |
| Degraded performance | SEV4 | SEV3 | SEV2 | SEV1 |
| Non-core feature down | SEV4 | SEV3 | SEV3 | SEV2 |
| Cosmetic/minor | SEV4 | SEV4 | SEV3 | SEV3 |

Auto-escalation triggers:

- Any data loss → SEV1 minimum
- Security breach with PII → SEV1
- Revenue-impacting → SEV1 or SEV2
- SLA breach imminent → auto-escalate one level
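The matrix plus the data-loss/PII auto-escalation triggers can be encoded directly (the key names are illustrative shorthand for the row and column labels):

```python
IMPACT_COLS = {"one_user": 0, "lt_25pct": 1, "gt_25pct": 2, "all_users": 3}
MATRIX = {
    "core_down": ["SEV3", "SEV2", "SEV1", "SEV1"],
    "degraded":  ["SEV4", "SEV3", "SEV2", "SEV1"],
    "non_core":  ["SEV4", "SEV3", "SEV3", "SEV2"],
    "cosmetic":  ["SEV4", "SEV4", "SEV3", "SEV3"],
}

def classify(failure, impact, data_loss=False, pii_breach=False):
    """Look up the matrix, then apply the auto-escalation triggers."""
    sev = MATRIX[failure][IMPACT_COLS[impact]]
    if data_loss or pii_breach:
        sev = "SEV1"  # any data loss or PII breach is SEV1 minimum
    return sev

print(classify("degraded", "gt_25pct"))                  # SEV2
print(classify("non_core", "one_user", data_loss=True))  # SEV1
```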

### Incident Command System (ICS)

| Role | Responsibility | Assigned |
| --- | --- | --- |
| Incident Commander (IC) | Owns resolution, makes decisions, manages timeline | |
| Communications Lead | Status updates, stakeholder comms, customer-facing | |
| Operations Lead | Hands-on-keyboard, executing fixes | |
| Subject Matter Expert | Deep knowledge of affected system | |
| Scribe | Documenting timeline, actions, decisions | |

IC Rules:

- IC does NOT debug — IC coordinates
- IC makes final decisions when team disagrees
- IC can escalate severity at any time
- IC owns handoff if rotation changes
- IC calls end-of-incident

### Incident Response Workflow

```text
DETECT → TRIAGE → RESPOND → MITIGATE → RESOLVE → REVIEW

Step 1: DETECT (0-5 min)
├── Alert fires OR user report received
├── On-call acknowledges within SLA
└── Quick assessment: is this real? What severity?

Step 2: TRIAGE (5-15 min)
├── Classify severity using matrix above
├── Assign IC and roles
├── Open incident channel (#inc-YYYY-MM-DD-title)
├── Post initial status update
└── Start timeline document

Step 3: RESPOND (15 min - ongoing)
├── IC briefs team: "Here's what we know, here's what we don't"
├── Operations Lead begins investigation
├── Check: recent deployments? Config changes? Dependency issues?
├── Parallel investigation tracks if needed
└── 15-minute check-ins for SEV1, 30-min for SEV2

Step 4: MITIGATE (ASAP)
├── Priority: STOP THE BLEEDING
├── Options (fastest first):
│   ├── Rollback last deployment
│   ├── Feature flag disable
│   ├── Traffic shift / failover
│   ├── Scale up / circuit breaker
│   └── Manual data fix
├── Mitigated ≠ Resolved — temporary fix is OK
└── Update status: "Impact mitigated, root cause investigation ongoing"

Step 5: RESOLVE
├── Root cause identified and fixed
├── Verification: SLIs back to normal for 30+ minutes
├── All-clear communicated
└── IC declares incident resolved

Step 6: REVIEW (within 5 business days)
├── Blameless postmortem written
├── Action items assigned with owners and deadlines
├── Postmortem review meeting
└── Action items tracked to completion
```

### Communication Templates

Initial notification (internal):

```text
🔴 INCIDENT: [Title]
Severity: SEV[X]
Impact: [Who/what is affected]
Status: Investigating
IC: [Name]
Channel: #inc-[date]-[slug]
Next update: [time]
```

Customer-facing status:

```text
[Service] - Investigating increased error rates

We are currently investigating reports of [symptom].
Some users may experience [user-visible impact].
Our team is actively working on a resolution.
We will provide an update within [time].
```

Resolution notification:

```text
✅ RESOLVED: [Title]
Duration: [X hours Y minutes]
Impact: [Summary]
Root cause: [One sentence]
Postmortem: [Link] (within 5 business days)
```

### Blameless Postmortem Template

```yaml
postmortem:
  title: ""
  date: ""
  severity: ""  # SEV1-4
  duration: ""  # total incident duration
  authors: []
  reviewers: []
  status: "draft"  # draft | in-review | final

  summary: |
    One paragraph: what happened, what was the impact, how was it resolved.

  impact:
    users_affected: 0
    duration_minutes: 0
    revenue_impact_usd: 0
    slo_budget_consumed_pct: 0
    data_loss: false
    customer_tickets: 0

  timeline:
    - time: ""
      event: ""
      # Chronological, every significant event
      # Include detection time, escalation, mitigation attempts

  root_cause: |
    Technical explanation of WHY it happened.
    Go deep — surface causes are not root causes.

  contributing_factors:
    - ""  # What made it worse or delayed resolution?

  detection:
    how_detected: ""  # alert | user report | manual check
    time_to_detect_minutes: 0
    could_have_detected_sooner: ""

  resolution:
    how_resolved: ""
    time_to_mitigate_minutes: 0
    time_to_resolve_minutes: 0

  what_went_well:
    - ""  # Explicitly call out what worked

  what_went_wrong:
    - ""

  where_we_got_lucky:
    - ""  # Things that could have made it worse

  action_items:
    - id: "AI-001"
      type: ""  # prevent | detect | mitigate | process
      description: ""
      owner: ""
      priority: ""  # P0 | P1 | P2
      deadline: ""
      status: "open"  # open | in-progress | done
      ticket: ""
```

### Root Cause Analysis Methods

Five Whys (simple incidents):

```text
Why did users see errors? → API returned 500s
Why did API return 500s? → Database connection pool exhausted
Why was pool exhausted? → Long-running query held connections
Why was query long-running? → Missing index on new column
Why was index missing? → Migration didn't include index; no query performance review in CI

→ Root cause: No automated query performance check in deployment pipeline
→ Action: Add query plan analysis to CI for migration PRs
```

Fishbone / Ishikawa (complex incidents):

```text
Categories to investigate:
├── People: Training? Fatigue? Communication?
├── Process: Runbook? Escalation? Change management?
├── Technology: Bug? Config? Capacity? Dependency?
├── Environment: Network? Cloud provider? Third party?
├── Monitoring: Detection gap? Alert fatigue? Dashboard gap?
└── Testing: Test coverage? Load testing? Chaos testing?
```

Contributing Factor Categories:

| Category | Questions |
| --- | --- |
| Trigger | What change or event started it? |
| Propagation | Why did it spread? Why wasn't it contained? |
| Detection | Why wasn't it caught earlier? |
| Resolution | What slowed the fix? |
| Process | What process gaps contributed? |

### Postmortem Review Meeting (60 min)

1. Timeline walk-through (15 min)
   - Author presents chronology
   - Attendees add context ("I remember seeing X at this point")
2. Root cause deep-dive (15 min)
   - Do we agree on root cause?
   - Are there additional contributing factors?
3. Action item review (20 min)
   - Are these the RIGHT actions?
   - Are they prioritized correctly?
   - Do owners agree on deadlines?
4. Process improvements (10 min)
   - Could we have detected this sooner?
   - Could we have resolved this faster?
   - What would have prevented this entirely?

### Chaos Maturity Model

| Level | Name | Activities |
| --- | --- | --- |
| 0 | None | No chaos testing |
| 1 | Exploratory | Manual fault injection in staging |
| 2 | Systematic | Scheduled chaos experiments in staging |
| 3 | Production | Controlled chaos in production (Game Days) |
| 4 | Continuous | Automated chaos in production with safety controls |

### Chaos Experiment Template

```yaml
experiment:
  name: ""
  hypothesis: "When [fault], the system will [expected behavior]"

  steady_state:
    metrics:
      - name: ""
        baseline: ""
        acceptable_range: ""

  method:
    fault_type: ""  # network | compute | storage | dependency | data
    target: ""      # which service/component
    blast_radius: ""  # single pod | single AZ | percentage of traffic
    duration: ""

  safety:
    abort_conditions:
      - "SLO burn rate exceeds 10x"
      - "Customer-visible errors detected"
      - "Alert fires that we didn't expect"
    rollback_plan: ""
    required_approvals: []

  results:
    outcome: ""  # confirmed | disproved | inconclusive
    observations: []
    action_items: []
```

### Chaos Experiment Library

| Category | Experiment | Validates |
| --- | --- | --- |
| Network | Add 200ms latency to DB calls | Timeout handling, circuit breakers |
| Network | Drop 5% of packets to downstream | Retry logic, error handling |
| Network | DNS resolution failure | Caching, fallback, error messages |
| Compute | Kill random pod every 10 min | Auto-restart, load balancing |
| Compute | CPU stress to 95% on 1 node | Auto-scaling, graceful degradation |
| Compute | Fill disk to 95% | Disk monitoring, log rotation, alerts |
| Storage | Increase DB latency 5x | Connection pool handling, timeouts |
| Storage | Simulate cache failure (Redis down) | Cache-aside pattern, DB fallback |
| Dependency | Block external API (payment provider) | Circuit breaker, queuing, retry |
| Dependency | Return 429s from auth service | Rate limit handling, backoff |
| Data | Clock skew on subset of nodes | Timestamp handling, ordering |
| Scale | 10x traffic spike over 5 minutes | Auto-scaling speed, queue depth |

### Game Day Runbook

```text
PRE-GAME (1 week before):
□ Experiment designed and reviewed
□ Steady-state metrics identified
□ Abort conditions defined
□ All participants briefed
□ Rollbacks tested in staging
□ Stakeholders notified

GAME DAY:
□ Verify steady state (15 min baseline)
□ Announce in #engineering: "Chaos Game Day starting"
□ Inject fault
□ Observe and document
□ If abort condition hit → rollback immediately
□ Run for planned duration
□ Remove fault
□ Verify recovery to steady state

POST-GAME (same day):
□ Results documented
□ Surprises noted
□ Action items created
□ Share findings in team meeting
```

### Toil Identification

Definition: Work that is manual, repetitive, automatable, tactical, and without enduring value, and that scales linearly with service growth.

### Toil Inventory Template

```yaml
toil_item:
  name: ""
  category: ""  # deployment | scaling | config | data | access | monitoring | recovery
  frequency: ""  # daily | weekly | monthly | per-incident
  time_per_occurrence_min: 0
  occurrences_per_month: 0
  total_hours_per_month: 0
  teams_affected: []
  automation_difficulty: ""  # low | medium | high
  automation_value: 0  # hours saved per month
  priority_score: 0  # value / difficulty
```
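One way to compute `priority_score = value / difficulty` from the template fields, assuming a simple 1-3 weight for the difficulty labels (the weights and sample items are illustrative):

```python
DIFFICULTY_WEIGHT = {"low": 1, "medium": 2, "high": 3}  # assumed scale

def toil_priority(hours_saved_per_month, automation_difficulty):
    """priority_score = automation value / difficulty, per the template."""
    return hours_saved_per_month / DIFFICULTY_WEIGHT[automation_difficulty]

items = [
    ("manual deploys", 12, "medium"),
    ("cert renewals", 3, "low"),
    ("data fixes", 8, "high"),
]
# Highest priority score first: automate the biggest win per unit of effort.
ranked = sorted(items, key=lambda i: toil_priority(i[1], i[2]), reverse=True)
print([name for name, *_ in ranked])
```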

### Toil Reduction Priority Matrix

| Value | Low Effort | Medium Effort | High Effort |
| --- | --- | --- | --- |
| High Value (>10 hrs/mo) | DO FIRST | DO SECOND | PLAN |
| Med Value (2-10 hrs/mo) | DO SECOND | PLAN | EVALUATE |
| Low Value (<2 hrs/mo) | QUICK WIN | SKIP | SKIP |

### Common Toil Targets (Ranked by Impact)

1. Manual deployments → CI/CD pipeline + GitOps
2. Access provisioning → Self-service + auto-approval for low-risk
3. Certificate renewals → Auto-renewal (cert-manager, Let's Encrypt)
4. Scaling decisions → HPA + predictive auto-scaling
5. Log investigation → Structured logging + correlation + dashboards
6. Data fixes → Self-service admin tools + validation at ingestion
7. Config changes → Config-as-code + automated rollout
8. Incident response → Automated runbooks for known issues
9. Capacity reporting → Automated dashboards + forecasting
10. On-call triage → Noise reduction + auto-remediation for known patterns

### Toil Budget Rule

Target: <25% of SRE time spent on toil. Track monthly. If above 25%, prioritize automation over all feature work.

### Capacity Model Template

```yaml
capacity_model:
  service: ""
  bottleneck_resource: ""  # CPU | memory | storage | connections | bandwidth

  current_state:
    peak_utilization_pct: 0
    headroom_pct: 0
    cost_per_month_usd: 0

  growth_forecast:
    metric: ""  # MAU | requests/sec | storage_gb
    current: 0
    monthly_growth_pct: 0
    projected_6mo: 0
    projected_12mo: 0

  scaling_strategy:
    type: ""  # horizontal | vertical | hybrid
    auto_scaling: true
    min_instances: 0
    max_instances: 0
    scale_up_threshold: 80  # % utilization
    scale_down_threshold: 30
    cooldown_seconds: 300

  cost_projection:
    current_monthly: 0
    projected_6mo_monthly: 0
    projected_12mo_monthly: 0
```
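The `growth_forecast` projections are compound growth; a short sketch (the sample numbers are illustrative):

```python
def project(current, monthly_growth_pct, months):
    """Compound the monthly growth rate forward, as in growth_forecast."""
    return current * (1 + monthly_growth_pct / 100) ** months

peak_rps = 400
print(round(project(peak_rps, 8, 6)))   # projected_6mo at 8%/month
print(round(project(peak_rps, 8, 12)))  # projected_12mo at 8%/month
```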

### Capacity Planning Cadence

| Frequency | Action |
| --- | --- |
| Daily | Review auto-scaling events, check for anomalies |
| Weekly | Review utilization trends, spot-check headroom |
| Monthly | Update growth model, review cost projections |
| Quarterly | Full capacity review, budget planning, architecture check |
| Pre-launch | Load test to 2x expected peak, verify scaling |

### Load Testing Benchmarks

| Scenario | Method | Duration | Target |
| --- | --- | --- | --- |
| Baseline | Steady load at current peak | 30 min | Establish metrics |
| Growth | 2x current peak | 15 min | Verify scaling works |
| Spike | 10x normal in 60 seconds | 5 min | Circuit breakers hold |
| Soak | 1.5x normal load | 4 hours | No memory leaks, degradation |
| Stress | Ramp until failure | Until break | Find actual limits |

### On-Call Health Metrics

| Metric | Healthy | Warning | Critical |
| --- | --- | --- | --- |
| Pages per shift | <2 | 2-5 | >5 |
| Off-hours pages | <1/week | 1-3/week | >3/week |
| Time to acknowledge | <5 min | 5-15 min | >15 min |
| Time to mitigate | <30 min | 30-60 min | >60 min |
| False positive rate | <10% | 10-30% | >30% |
| Escalation rate | <20% | 20-40% | >40% |
| On-call satisfaction | >4/5 | 3-4/5 | <3/5 |

### On-Call Rotation Best Practices

- Minimum rotation size: 5 people (one week on, four weeks off)
- No back-to-back weeks unless team is too small (fix the team size)
- Follow-the-sun for global teams (no one pages at 3 AM if avoidable)
- Primary + secondary on-call always
- Handoff document at rotation change — open issues, recent deploys, known risks
- Compensation — on-call pay, time off in lieu, or equivalent

### On-Call Handoff Template

```markdown
## On-Call Handoff: [Date]

### Open Issues
- [Issue]: [Status, next steps]

### Recent Changes (last 7 days)
- [Deployment/config change]: [Risk level, rollback plan]

### Known Risks
- [Event/condition]: [What to watch for]

### Scheduled Maintenance
- [When]: [What, duration, rollback plan]

### Runbook Updates
- [Any new/updated runbooks since last rotation]
```

### Runbook Template

```yaml
runbook:
  title: ""
  alert_name: ""  # exact alert that triggers this
  last_updated: ""
  owner: ""

  overview: |
    What this alert means in plain English.

  impact: |
    What users/systems are affected and how.

  diagnosis:
    - step: "Check service health"
      command: ""
      expected: ""
      if_unexpected: ""
    - step: "Check recent deployments"
      command: ""
      expected: ""
      if_unexpected: "Rollback: [command]"
    - step: "Check dependencies"
      command: ""
      expected: ""
      if_unexpected: ""

  mitigation:
    - option: "Rollback"
      when: "Recent deployment suspected"
      steps: []
    - option: "Scale up"
      when: "Traffic spike"
      steps: []
    - option: "Failover"
      when: "Single component failure"
      steps: []

  escalation:
    after_minutes: 30
    contact: ""
    context_to_provide: ""
```

### Weekly SRE Review (30 min)

1. SLO Status (5 min)
   - Budget remaining per service
   - Any burn rate alerts this week?
2. Incident Review (10 min)
   - Incidents this week: count, severity, duration
   - Open postmortem action items: status check
3. On-Call Health (5 min)
   - Pages this week (total, off-hours, false positives)
   - Any on-call feedback?
4. Reliability Work (10 min)
   - Automation shipped this week
   - Toil reduced (hours saved)
   - Chaos experiments run
   - Capacity concerns

### Monthly Reliability Report

```yaml
monthly_report:
  period: ""

  slo_summary:
    services_meeting_slo: 0
    services_breaching_slo: 0
    worst_performing: ""

  incidents:
    total: 0
    by_severity: { SEV1: 0, SEV2: 0, SEV3: 0, SEV4: 0 }
    mttr_minutes: 0
    mttd_minutes: 0
    repeat_incidents: 0

  error_budget:
    services_in_healthy: 0
    services_in_warning: 0
    services_in_critical: 0
    services_exhausted: 0

  toil:
    hours_spent: 0
    hours_automated_away: 0
    pct_of_sre_time: 0

  on_call:
    total_pages: 0
    off_hours_pages: 0
    false_positive_pct: 0
    avg_ack_time_min: 0

  action_items:
    open: 0
    completed_this_month: 0
    overdue: 0

  highlights: []
  concerns: []
  next_month_priorities: []
```

### Production Readiness Review Checklist

Before any new service goes to production:

| Category | Check | Status |
| --- | --- | --- |
| SLOs | SLIs defined and measured | |
| SLOs | SLO targets set with stakeholder agreement | |
| SLOs | Error budget policy documented | |
| Monitoring | Golden signals dashboarded | |
| Monitoring | Alerting configured with runbooks | |
| Monitoring | Structured logging implemented | |
| Monitoring | Distributed tracing enabled | |
| Incidents | On-call rotation established | |
| Incidents | Escalation paths documented | |
| Incidents | Runbooks for top 5 failure modes | |
| Capacity | Load tested to 2x expected peak | |
| Capacity | Auto-scaling configured and tested | |
| Capacity | Resource limits set (CPU, memory) | |
| Resilience | Graceful degradation implemented | |
| Resilience | Circuit breakers for dependencies | |
| Resilience | Retry with exponential backoff | |
| Resilience | Timeout configured for all external calls | |
| Deploy | Rollback tested and documented | |
| Deploy | Canary/blue-green deployment ready | |
| Deploy | Feature flags for risky features | |
| Security | Authentication and authorization | |
| Security | Secrets in vault (not env vars) | |
| Security | Dependencies scanned | |
| Data | Backup and restore tested | |
| Data | Data retention policy defined | |
| Docs | Architecture diagram current | |
| Docs | API documentation published | |
| Docs | Operational runbook complete | |

### Self-Healing Automation

```yaml
auto_remediation:
  - trigger: "pod_crash_loop"
    condition: "restart_count > 3 in 10 min"
    action: "Delete pod, let scheduler reschedule"
    escalate_if: "Still crashing after 3 auto-remediations"

  - trigger: "disk_usage_high"
    condition: "disk_usage > 85%"
    action: "Run log cleanup script, archive old data"
    escalate_if: "Still above 85% after cleanup"

  - trigger: "connection_pool_exhausted"
    condition: "available_connections = 0"
    action: "Kill idle connections, increase pool temporarily"
    escalate_if: "Pool exhausted again within 1 hour"

  - trigger: "certificate_expiring"
    condition: "days_until_expiry < 14"
    action: "Trigger cert renewal"
    escalate_if: "Renewal fails"
```

### Multi-Region Reliability

| Strategy | Complexity | RTO | Cost |
| --- | --- | --- | --- |
| Active-passive | Low | Minutes | 1.5x |
| Active-active read | Medium | Seconds | 1.8x |
| Active-active full | High | Near-zero | 2-3x |
| Cell-based | Very high | Per-cell | 2-4x |

Decision guide:

- SLO < 99.9% → Single region with good backups
- SLO 99.9-99.95% → Active-passive with automated failover
- SLO > 99.95% → Active-active (read or full)
- SLO > 99.99% → Cell-based architecture
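The decision guide, checked from strictest to loosest (the return labels are illustrative shorthand for the strategies above):

```python
def region_strategy(slo_pct):
    """Pick a multi-region strategy from the SLO target."""
    if slo_pct > 99.99:
        return "cell-based"
    if slo_pct > 99.95:
        return "active-active"
    if slo_pct >= 99.9:
        return "active-passive"
    return "single region + backups"

print([region_strategy(s) for s in (99.5, 99.9, 99.99, 99.995)])
```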

### Reliability Culture Indicators

Healthy signals:

- Postmortems are blameless and well-attended
- Error budgets are respected (feature freeze actually happens)
- On-call is shared fairly and compensated
- Toil is tracked and reducing quarter-over-quarter
- Chaos experiments happen regularly
- Teams own their reliability (not just SRE)

Warning signs:

"Hero culture" — same person always saves the day
Postmortems are blame-focused or skipped
Error budget exhaustion doesn't change behavior
On-call is dreaded, same 2 people always paged
"We'll fix reliability after this feature ships" (always)
SRE team is just an ops team with a new name

### Quality Scoring Rubric (0-100)

| Dimension | Weight | 0-2 | 3-4 | 5 |
| --- | --- | --- | --- | --- |
| SLO Coverage | 20% | No SLOs | SLOs for critical services | All services with SLOs, error budgets, reviews |
| Monitoring | 15% | Basic health checks | Golden signals + dashboards | Full observability stack + anomaly detection |
| Incident Response | 15% | Ad-hoc, no process | ICS roles, runbooks, postmortems | Structured ICS, blameless culture, action tracking |
| Automation | 15% | Manual everything | CI/CD + some automation | Self-healing, GitOps, <25% toil |
| Chaos Engineering | 10% | None | Staging experiments | Continuous production chaos with safety |
| Capacity Planning | 10% | Reactive | Quarterly forecasting | Predictive, auto-scaling, cost-optimized |
| On-Call Health | 10% | Burnout, hero culture | Fair rotation, <5 pages/shift | Balanced, compensated, <2 pages/shift |
| Documentation | 5% | Nothing written | Runbooks exist | Complete, current, tested runbooks |

### Natural Language Commands

"Assess reliability for [service]" → Run maturity assessment
"Define SLOs for [service]" → Walk through SLI selection + SLO setting
"Check error budget for [service]" → Calculate current budget status
"Start incident for [description]" → Create incident channel, assign IC, begin workflow
"Write postmortem for [incident]" → Generate structured postmortem
"Plan chaos experiment for [service]" → Design experiment with hypothesis
"Audit toil for [team]" → Inventory and prioritize toil
"Review on-call health" → Analyze page volume, satisfaction, fairness
"Production readiness review for [service]" → Run full checklist
"Monthly reliability report" → Generate comprehensive report
"Design runbook for [alert]" → Create structured runbook
"Plan capacity for [service] growing at [X%]" → Build capacity model
## Trust
- Source: tencent
- Verification: Indexed source record
- Publisher: 1kalin
- Version: 1.0.0
## Source health
- Status: healthy
- Item download looks usable.
- Yavira can redirect you to the upstream package for this item.
- Health scope: item
- Reason: direct_download_ok
- Checked at: 2026-04-29T10:26:33.836Z
- Expires at: 2026-05-06T10:26:33.836Z
- Recommended action: Download for OpenClaw
## Links
- [Detail page](https://openagent3.xyz/skills/afrexai-sre-platform)
- [Send to Agent page](https://openagent3.xyz/skills/afrexai-sre-platform/agent)
- [JSON manifest](https://openagent3.xyz/skills/afrexai-sre-platform/agent.json)
- [Markdown brief](https://openagent3.xyz/skills/afrexai-sre-platform/agent.md)
- [Download page](https://openagent3.xyz/downloads/afrexai-sre-platform)