Ops & Knowledge Overview
Document Purpose: This document outlines the operational procedures, knowledge management systems, and monitoring infrastructure that keep Cazo aligned with Site Reliability Engineering (SRE) best practices. It equips the team to maintain high availability and recover rapidly from failures.
Executive Summary
Operational excellence at Cazo is built on three pillars: Observable Systems, Automated Recovery, and Shared Knowledge. We treat operations as a software problem, automating repetitive toil so engineers can focus on reliability.
Incident Response Workflow
We follow a structured incident response process to minimize downtime during critical outages (e.g., Booking System Failure).
```mermaid
stateDiagram-v2
    [*] --> Detected: Alert Triggered (e.g. Booking API 500s)
    Detected --> Acknowledged: On-Call Engineer Acks
    Detected --> Escalated: No Ack in 15m
    Acknowledged --> Investigating: Diagnosis Started
    Escalated --> Investigating: Senior Eng Paged
    Investigating --> Mitigating: Root Cause Found
    Investigating --> Escalated: >30m Resolution Time
    Mitigating --> Resolved: Fix/Rollback Deployed
    Mitigating --> Rollback: Fix Failed
    Rollback --> Mitigating: Retry Alternative
    Resolved --> PostMortem: Required for Sev-1
    PostMortem --> [*]: RCA & Action Items Filed

    note right of Detected
        Sev-1: <15m Response (System Down)
        Sev-2: <4h Response (Feature Broken)
        Sev-3: Next Business Day (Minor Bug)
    end note
```
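The two time-based transitions in the diagram (no ack in 15m, >30m resolution time) lend themselves to a simple timer check. Below is a minimal sketch, assuming a hypothetical incident record with a state and an entry timestamp; the names and thresholds mirror the diagram but are illustrative, not taken from our codebase.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class IncidentState(Enum):
    DETECTED = "detected"
    ACKNOWLEDGED = "acknowledged"
    ESCALATED = "escalated"
    INVESTIGATING = "investigating"


ACK_TIMEOUT = timedelta(minutes=15)         # Detected -> Escalated if no ack
RESOLUTION_TIMEOUT = timedelta(minutes=30)  # Investigating -> Escalated


def next_state(state: IncidentState, entered_at: datetime,
               now: Optional[datetime] = None) -> IncidentState:
    """Apply the time-based transitions from the state diagram above."""
    now = now or datetime.now(timezone.utc)
    elapsed = now - entered_at
    if state is IncidentState.DETECTED and elapsed > ACK_TIMEOUT:
        return IncidentState.ESCALATED
    if state is IncidentState.INVESTIGATING and elapsed > RESOLUTION_TIMEOUT:
        return IncidentState.ESCALATED
    return state  # No timeout hit; stay put until a human acts.
```

A periodic job (e.g., a cron-style worker) would call this against open incidents and page the next tier whenever the state flips to `ESCALATED`.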
1. Operational Capabilities by Use Case
| Category | Operational Capability | Description |
|---|---|---|
| Core Operations | Tenant Isolation | Ensuring one salon's data never leaks to another (row-level security). |
| Core Operations | Smart Rostering | Auto-scaling compute for "Friday/Saturday" peak booking hours. |
| Security Ops | PII Redaction | Auto-masking customer phones/emails in logs (GDPR/DPDP compliance); see the redaction sketch below. |
| Security Ops | Audit Trails | Immutable logs of every staff action (e.g., who canceled the booking?). |
| Quality Ops | Feedback Loop | Auto-flagging low-star ratings for "Manager Review". |
| Quality Ops | Bot Monitoring | Tracking "Handover Rate" (how often the AI fails to understand the user). |
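PII redaction in logs can be implemented as a logging filter that rewrites records before any handler sees them. The sketch below is a minimal illustration, assuming emails and phone numbers are the PII of interest; the regexes and filter name are ours for illustration, not a production-grade redactor.

```python
import logging
import re

# Illustrative patterns only; production redaction needs locale-aware rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")


class PIIRedactionFilter(logging.Filter):
    """Mask emails and phone numbers before a record reaches any handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        msg = EMAIL_RE.sub("[EMAIL]", msg)
        msg = PHONE_RE.sub("[PHONE]", msg)
        record.msg, record.args = msg, None  # Freeze the redacted message.
        return True


logger = logging.getLogger("booking")
logger.addFilter(PIIRedactionFilter())
logger.warning("Booking failed for jane@example.com, callback +91 98765 43210")
```

Attaching the filter at the logger level (rather than per handler) keeps redaction in one place regardless of how many sinks the logs fan out to.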
2. Monitoring & Observability
2.1 Dashboard Overview
We use a unified Grafana/Datadog dashboard to visualize system health; an instrumentation sketch follows the metric list below.
- Business Metrics: Real-time Booking Value ($), Active Sessions.
- System Metrics: API Latency (p95), Database CPU, Error Rate (%).
- AI Metrics: Token Usage, Latency per Turn, Fallback Rate.
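For the system metrics above, a minimal instrumentation sketch using the `prometheus_client` library (assuming a Prometheus-style exporter feeding Grafana) is shown below; the metric names, labels, and port are illustrative assumptions, not our actual metric schema.

```python
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; real names should follow team conventions.
API_LATENCY = Histogram("api_request_seconds", "API latency", ["endpoint"])
API_ERRORS = Counter("api_errors_total", "Failed API requests", ["endpoint"])


def handle_booking_request() -> None:
    """Toy handler showing where latency and error metrics are recorded."""
    with API_LATENCY.labels(endpoint="/bookings").time():
        try:
            time.sleep(0.05)  # Stand-in for real request handling.
        except Exception:
            API_ERRORS.labels(endpoint="/bookings").inc()
            raise


if __name__ == "__main__":
    start_http_server(8000)  # Expose /metrics for Prometheus scraping.
    handle_booking_request()
```

Percentiles such as p95 are then derived from the histogram buckets at query time, so the application only ever records raw observations.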
2.2 Alert Tiers
| Severity | Definition | Alert Channel | Response SLA |
|---|---|---|---|
| SEV-1 (Critical) | Core Booking/Payment flow is down. | PagerDuty (Phone Call) | 15 minutes |
| SEV-2 (High) | Non-critical feature (e.g., Reporting) broken. | Slack (#ops-alerts) | 4 hours |
| SEV-3 (Low) | Minor UI glitch or non-blocking bug. | Jira Ticket | 24 hours |
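The tiers above can be kept in one table-driven dispatch function so routing rules never drift from the documented SLAs. This is a minimal sketch with hypothetical notifier stubs (`page_oncall`, `post_slack`, `file_jira`) standing in for the real PagerDuty/Slack/Jira integrations:

```python
from dataclasses import dataclass
from typing import Callable


def page_oncall(summary: str) -> None:   # Placeholder for PagerDuty call.
    print(f"PAGE: {summary}")


def post_slack(summary: str) -> None:    # Placeholder for #ops-alerts post.
    print(f"SLACK: {summary}")


def file_jira(summary: str) -> None:     # Placeholder for Jira ticket.
    print(f"JIRA: {summary}")


@dataclass(frozen=True)
class Route:
    notify: Callable[[str], None]
    sla_hours: float


ROUTES = {
    "SEV-1": Route(page_oncall, sla_hours=0.25),  # 15 minutes
    "SEV-2": Route(post_slack, sla_hours=4),
    "SEV-3": Route(file_jira, sla_hours=24),
}


def dispatch(severity: str, summary: str) -> None:
    route = ROUTES[severity]  # KeyError on unknown severity is deliberate.
    route.notify(f"[{severity}] {summary} (SLA: {route.sla_hours}h)")


dispatch("SEV-1", "Booking API returning 500s")
```

Failing loudly on an unknown severity is a deliberate choice: a mistyped severity should surface immediately rather than silently dropping an alert.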
3. On-Call Responsibility Matrix
| Role | Responsibility | Escalation Path |
|---|---|---|
| Primary On-Call | Triage alerts; respond within 15 minutes for Sev-1. | -> Secondary On-Call |
| Secondary On-Call | Deep-dive into complex system failures. | -> CTO / Vendor Support |
| Support Lead | Handle non-technical user escalations. | -> Primary On-Call (if bug confirmed) |
4. Knowledge Management Repository
| Category | Location | Content / Purpose |
|---|---|---|
| Engineering Internal | Notion / Confluence | ADRs, API Specs, Env Configs, Runbooks. |
| User Help Center | Public Docs / Intercom | "How-to" guides, Video tutorials, FAQ. |
5. Continuous Improvement
- Weekly Ops Review: Walk through all SEV-1/SEV-2 incidents from the past week and track post-mortem action items to closure.
- Chaos Engineering: Scheduled "Game Days" to test resilience (e.g., simulate DB failure); a minimal fault-injection sketch follows.
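Fault injection for a Game Day can start very small. Below is a minimal sketch, assuming a hypothetical `get_db_connection` wrapper around the real connection factory; the failure rate, exception type, and environment flag are illustrative, and this should only ever be enabled in staging.

```python
import os
import random


class InjectedDBFailure(ConnectionError):
    """Deliberate fault raised during Game Day exercises."""


# Guard rails: chaos is off unless explicitly enabled via the environment.
CHAOS_ENABLED = os.getenv("CHAOS_ENABLED") == "1"
FAILURE_RATE = 0.1  # Illustrative: fail 10% of connection attempts.


def get_db_connection():
    """Hypothetical wrapper around the real connection factory."""
    if CHAOS_ENABLED and random.random() < FAILURE_RATE:
        raise InjectedDBFailure("Game Day: simulated database outage")
    return object()  # Stand-in for the real DB connection.
```

Running a Game Day with this switch flipped verifies that retries, timeouts, and the alert tiers in Section 2.2 actually fire the way the runbooks claim.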