Ops & Knowledge Overview
Document Purpose: This document outlines the operational procedures, knowledge management systems, and monitoring infrastructure that keep Cazo aligned with Site Reliability Engineering (SRE) best practices. It equips the team to maintain high availability and recover rapidly from failures.
Executive Summary
Operational excellence at Cazo is built on three pillars: Observable Systems, Automated Recovery, and Shared Knowledge. We treat operations as a software problem, automating repetitive toil so engineers can focus on reliability.
Incident Response Workflow
We follow a structured incident response process to minimize downtime during critical outages (e.g., Booking System Failure).
```mermaid
stateDiagram-v2
    [*] --> Detected: Alert Triggered (e.g. Booking API 500s)
    Detected --> Acknowledged: On-Call Engineer Acks
    Detected --> Escalated: No Ack in 15m
    Acknowledged --> Investigating: Diagnosis Started
    Escalated --> Investigating: Senior Eng Paged
    Investigating --> Mitigating: Root Cause Found
    Investigating --> Escalated: >30m Resolution Time
    Mitigating --> Resolved: Fix/Rollback Deployed
    Mitigating --> Rollback: Fix Failed
    Rollback --> Mitigating: Retry Alternative
    Resolved --> PostMortem: Required for Sev-1
    PostMortem --> [*]: RCA & Action Items Filed

    note right of Detected
        Sev-1: <15m Response (System Down)
        Sev-2: <4h Response (Feature Broken)
        Sev-3: Next Business Day (Minor Bug)
    end note
```
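The two time-based transitions in the diagram (no ack in 15m, >30m resolution time) lend themselves to a simple timer check. Below is a minimal sketch, assuming a hypothetical incident record with a state and an entry timestamp; the names and thresholds mirror the diagram but are illustrative, not taken from our codebase.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class IncidentState(Enum):
    DETECTED = "detected"
    ACKNOWLEDGED = "acknowledged"
    ESCALATED = "escalated"
    INVESTIGATING = "investigating"


ACK_TIMEOUT = timedelta(minutes=15)         # Detected -> Escalated if no ack
RESOLUTION_TIMEOUT = timedelta(minutes=30)  # Investigating -> Escalated


def next_state(state: IncidentState, entered_at: datetime,
               now: Optional[datetime] = None) -> IncidentState:
    """Apply the time-based transitions from the state diagram above."""
    now = now or datetime.now(timezone.utc)
    elapsed = now - entered_at
    if state is IncidentState.DETECTED and elapsed > ACK_TIMEOUT:
        return IncidentState.ESCALATED
    if state is IncidentState.INVESTIGATING and elapsed > RESOLUTION_TIMEOUT:
        return IncidentState.ESCALATED
    return state  # No timeout hit; stay put until a human acts.
```

A periodic job (e.g., a cron-style worker) would call this against open incidents and page the next tier whenever the state flips to `ESCALATED`.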
1. Operational Capabilities by Use Case
| Category | Operational Capability | Description |
|---|---|---|
| Core Operations | Tenant Isolation | Ensuring one salon's data never leaks to another (row-level security). |
| Core Operations | Smart Rostering | Auto-scaling compute for "Friday/Saturday" peak booking hours. |
| Security Ops | PII Redaction | Auto-masking customer phones/emails in logs (GDPR/DPDP compliance); see the redaction sketch below. |
| Security Ops | Audit Trails | Immutable logs of every staff action (e.g., who canceled the booking?). |
| Quality Ops | Feedback Loop | Auto-flagging low-star ratings for "Manager Review". |
| Quality Ops | Bot Monitoring | Tracking "Handover Rate" (how often the AI fails to understand the user). |
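PII redaction in logs can be implemented as a logging filter that rewrites records before any handler sees them. The sketch below is a minimal illustration, assuming emails and phone numbers are the PII of interest; the regexes and filter name are ours for illustration, not a production-grade redactor.

```python
import logging
import re

# Illustrative patterns only; production redaction needs locale-aware rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")


class PIIRedactionFilter(logging.Filter):
    """Mask emails and phone numbers before a record reaches any handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        msg = EMAIL_RE.sub("[EMAIL]", msg)
        msg = PHONE_RE.sub("[PHONE]", msg)
        record.msg, record.args = msg, None  # Freeze the redacted message.
        return True


logger = logging.getLogger("booking")
logger.addFilter(PIIRedactionFilter())
logger.warning("Booking failed for jane@example.com, callback +91 98765 43210")
```

Attaching the filter at the logger level (rather than per handler) keeps redaction in one place regardless of how many sinks the logs fan out to.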
2. Monitoring & Observability
2.1 Dashboard Overview
We use a unified Grafana/Datadog dashboard to visualize system health; an instrumentation sketch follows the metric list below.
- Business Metrics: Real-time Booking Value ($), Active Sessions.
- System Metrics: API Latency (p95), Database CPU, Error Rate (%).
- AI Metrics: Token Usage, Latency per Turn, Fallback Rate.
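For the system metrics above, a minimal instrumentation sketch using the `prometheus_client` library (assuming a Prometheus-style exporter feeding Grafana) is shown below; the metric names, labels, and port are illustrative assumptions, not our actual metric schema.

```python
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; real names should follow team conventions.
API_LATENCY = Histogram("api_request_seconds", "API latency", ["endpoint"])
API_ERRORS = Counter("api_errors_total", "Failed API requests", ["endpoint"])


def handle_booking_request() -> None:
    """Toy handler showing where latency and error metrics are recorded."""
    with API_LATENCY.labels(endpoint="/bookings").time():
        try:
            time.sleep(0.05)  # Stand-in for real request handling.
        except Exception:
            API_ERRORS.labels(endpoint="/bookings").inc()
            raise


if __name__ == "__main__":
    start_http_server(8000)  # Expose /metrics for Prometheus scraping.
    handle_booking_request()
```

Percentiles such as p95 are then derived from the histogram buckets at query time, so the application only ever records raw observations.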
2.2 Alert Tiers
| Severity | Definition | Alert Channel | Response SLA |
|---|---|---|---|
| SEV-1 (Critical) | Core Booking/Payment flow is down. | PagerDuty (Phone Call) | 15 minutes |
| SEV-2 (High) | Non-critical feature (e.g., Reporting) broken. | Slack (#ops-alerts) | 4 hours |
| SEV-3 (Low) | Minor UI glitch or non-blocking bug. | Jira Ticket | 24 hours |
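The tiers above can be kept in one table-driven dispatch function so routing rules never drift from the documented SLAs. This is a minimal sketch with hypothetical notifier stubs (`page_oncall`, `post_slack`, `file_jira`) standing in for the real PagerDuty/Slack/Jira integrations:

```python
from dataclasses import dataclass
from typing import Callable


def page_oncall(summary: str) -> None:   # Placeholder for PagerDuty call.
    print(f"PAGE: {summary}")


def post_slack(summary: str) -> None:    # Placeholder for #ops-alerts post.
    print(f"SLACK: {summary}")


def file_jira(summary: str) -> None:     # Placeholder for Jira ticket.
    print(f"JIRA: {summary}")


@dataclass(frozen=True)
class Route:
    notify: Callable[[str], None]
    sla_hours: float


ROUTES = {
    "SEV-1": Route(page_oncall, sla_hours=0.25),  # 15 minutes
    "SEV-2": Route(post_slack, sla_hours=4),
    "SEV-3": Route(file_jira, sla_hours=24),
}


def dispatch(severity: str, summary: str) -> None:
    route = ROUTES[severity]  # KeyError on unknown severity is deliberate.
    route.notify(f"[{severity}] {summary} (SLA: {route.sla_hours}h)")


dispatch("SEV-1", "Booking API returning 500s")
```

Failing loudly on an unknown severity is a deliberate choice: a mistyped severity should surface immediately rather than silently dropping an alert.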
3. On-Call Responsibility Matrix
| Role | Responsibility | Escalation Path |
|---|---|---|
| Primary On-Call | Triage alerts; respond within 15 minutes for Sev-1. | -> Secondary On-Call |
| Secondary On-Call | Deep-dive into complex system failures. | -> CTO / Vendor Support |
| Support Lead | Handle non-technical user escalations. | -> Primary On-Call (if bug confirmed) |
4. Knowledge Management Repository
| Category | Location | Content / Purpose |
|---|---|---|
| Engineering Internal | Notion / Confluence | ADRs, API Specs, Env Configs, Runbooks. |
| User Help Center | Public Docs / Intercom | "How-to" guides, Video tutorials, FAQ. |
5. Continuous Improvement
- Weekly Ops Review: Walk through all SEV-1/SEV-2 incidents from the past week and track post-mortem action items to closure.
- Chaos Engineering: Scheduled "Game Days" to test resilience (e.g., simulate DB failure); a minimal fault-injection sketch follows.
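Fault injection for a Game Day can start very small. Below is a minimal sketch, assuming a hypothetical `get_db_connection` wrapper around the real connection factory; the failure rate, exception type, and environment flag are illustrative, and this should only ever be enabled in staging.

```python
import os
import random


class InjectedDBFailure(ConnectionError):
    """Deliberate fault raised during Game Day exercises."""


# Guard rails: chaos is off unless explicitly enabled via the environment.
CHAOS_ENABLED = os.getenv("CHAOS_ENABLED") == "1"
FAILURE_RATE = 0.1  # Illustrative: fail 10% of connection attempts.


def get_db_connection():
    """Hypothetical wrapper around the real connection factory."""
    if CHAOS_ENABLED and random.random() < FAILURE_RATE:
        raise InjectedDBFailure("Game Day: simulated database outage")
    return object()  # Stand-in for the real DB connection.
```

Running a Game Day with this switch flipped verifies that retries, timeouts, and the alert tiers in Section 2.2 actually fire the way the runbooks claim.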