Monitoring & Observability
UC-OPS-002: System Health Monitoring
Purpose: Detect and alert on service degradation.
| Property | Value |
|---|---|
| Actor | System |
| Trigger | Metric threshold breach |
| Priority | P0 |
Main Success Scenario:
- Prometheus scrapes metrics (CPU, Memory, Request Latency).
- CPU usage > 80% for 5 minutes.
- AlertManager triggers PagerDuty incident.
- On-call engineer notified via SMS/Slack.
Acceptance Criteria:
- [ ] Alerts actionable (include runbooks).
- [ ] False positive rate < 10%.