Skip to content

Monitoring & Observability

UC-OPS-002: System Health Monitoring

Purpose: Detect and alert on service degradation.

Property Value
Actor System
Trigger Metric threshold breach
Priority P0

Main Success Scenario:

  1. Prometheus scrapes metrics (CPU, Memory, Request Latency).
  2. CPU usage > 80% for 5 minutes.
  3. AlertManager triggers PagerDuty incident.
  4. On-call engineer notified via SMS/Slack.

Acceptance Criteria:

  1. [ ] Alerts actionable (include runbooks).
  2. [ ] False positive rate < 10%.