
Ops & Knowledge Overview

Document Purpose: This document outlines the operational procedures, knowledge management systems, and monitoring infrastructure that adhere to Site Reliability Engineering (SRE) best practices, enabling the team to maintain high availability and recover quickly from failures.

Executive Summary

Operational excellence at Cazo is built on three pillars: Observable Systems, Automated Recovery, and Shared Knowledge. We treat operations as a software problem, automating repetitive tasks so engineers can focus on reliability.

Incident Response Workflow

We follow a structured incident response process to minimize downtime during critical outages (e.g., a Booking System failure). The state diagram below traces an incident from detection through post-mortem.

```mermaid
stateDiagram-v2
    [*] --> Detected: Alert Triggered (e.g. Booking API 500s)
    Detected --> Acknowledged: On-Call Engineer Acks
    Detected --> Escalated: No Ack in 15m

    Acknowledged --> Investigating: Diagnosis Started
    Escalated --> Investigating: Senior Eng Paged

    Investigating --> Mitigating: Root Cause Found
    Investigating --> Escalated: >30m Resolution Time

    Mitigating --> Resolved: Fix/Rollback Deployed
    Mitigating --> Rollback: Fix Failed
    Rollback --> Mitigating: Retry Alternative

    Resolved --> PostMortem: Required for Sev-1
    PostMortem --> [*]: RCA & Action Items Filed

    note right of Detected
        Sev-1: <15m Response (System Down)
        Sev-2: <4h Response (Feature Broken)
        Sev-3: Next Business Day (Minor Bug)
    end note
```
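
The transition rules and the 15-minute acknowledgement window can also be enforced in tooling. Below is a minimal, illustrative sketch in Python (the `IncidentState` enum, transition table, and helper functions are hypothetical, not our actual incident tooling):

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class IncidentState(Enum):
    DETECTED = "detected"
    ACKNOWLEDGED = "acknowledged"
    ESCALATED = "escalated"
    INVESTIGATING = "investigating"
    MITIGATING = "mitigating"
    ROLLBACK = "rollback"
    RESOLVED = "resolved"
    POST_MORTEM = "post_mortem"


# Allowed transitions, mirroring the state diagram above.
TRANSITIONS = {
    IncidentState.DETECTED: {IncidentState.ACKNOWLEDGED, IncidentState.ESCALATED},
    IncidentState.ACKNOWLEDGED: {IncidentState.INVESTIGATING},
    IncidentState.ESCALATED: {IncidentState.INVESTIGATING},
    IncidentState.INVESTIGATING: {IncidentState.MITIGATING, IncidentState.ESCALATED},
    IncidentState.MITIGATING: {IncidentState.RESOLVED, IncidentState.ROLLBACK},
    IncidentState.ROLLBACK: {IncidentState.MITIGATING},
    IncidentState.RESOLVED: {IncidentState.POST_MORTEM},
    IncidentState.POST_MORTEM: set(),
}

ACK_TIMEOUT = timedelta(minutes=15)  # Sev-1: escalate if no ack within 15 minutes


def next_state(current: IncidentState, target: IncidentState) -> IncidentState:
    """Validate a transition against the diagram; raise if it is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.value} -> {target.value}")
    return target


def should_escalate(detected_at: datetime, acked: bool) -> bool:
    """Escalate a Sev-1 incident that has not been acknowledged within 15 minutes."""
    return not acked and datetime.now(timezone.utc) - detected_at > ACK_TIMEOUT
```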

1. Operational Capabilities by Use Case

| Category | Operational Capability | Description |
| --- | --- | --- |
| Core Operations | Tenant Isolation | Ensuring one salon's data never leaks to another (Row-level security). |
| | Smart Rostering | Auto-scaling compute based on "Friday/Saturday" peak booking hours. |
| Security Ops | PII Redaction | Auto-masking customer phones/emails in logs (GDPR/DPDP compliance). |
| | Audit Trails | Immutable logs of every staff action (who canceled the booking?). |
| Quality Ops | Feedback Loop | Auto-flagging low-star ratings for "Manager Review". |
| | Bot Monitoring | Tracking "Handover Rate" (when AI fails to understand user). |
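
To make the PII Redaction capability concrete, the sketch below masks e-mail addresses and phone numbers before a log record is emitted. The regexes and the `PiiRedactionFilter` class are illustrative assumptions; the production rules cover more formats.

```python
import logging
import re

# Illustrative patterns; production rules may differ (international formats, etc.).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{8,}\d")


class PiiRedactionFilter(logging.Filter):
    """Mask customer e-mails and phone numbers before the record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        message = EMAIL_RE.sub("[EMAIL_REDACTED]", message)
        message = PHONE_RE.sub("[PHONE_REDACTED]", message)
        record.msg, record.args = message, None
        return True


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cazo.booking")
logger.addFilter(PiiRedactionFilter())

logger.info("Reminder sent to jane@example.com at +91 98765 43210")
# -> Reminder sent to [EMAIL_REDACTED] at [PHONE_REDACTED]
```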

2. Monitoring & Observability

2.1 Dashboard Overview

We use a unified Grafana/Datadog dashboard to visualize system health.

  • Business Metrics: Real-time Booking Value ($), Active Sessions.
  • System Metrics: API Latency (p95), Database CPU, Error Rate (%) (see the computation sketch after this list).
  • AI Metrics: Token Usage, Latency per Turn, Fallback Rate.
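
As a concrete illustration of two of the System Metrics, the sketch below computes p95 latency and an error-rate percentage from raw samples. The sample data and helper names are hypothetical; in practice Grafana/Datadog compute these over rolling windows.

```python
import statistics


def p95_latency_ms(latencies_ms: list[float]) -> float:
    """95th-percentile latency over a window of request samples."""
    # statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
    return statistics.quantiles(latencies_ms, n=100)[94]


def error_rate_pct(status_codes: list[int]) -> float:
    """Percentage of requests that returned a 5xx status."""
    errors = sum(1 for code in status_codes if code >= 500)
    return 100.0 * errors / len(status_codes) if status_codes else 0.0


# Hypothetical sample: latencies (ms) and status codes from the Booking API.
latencies = [120, 135, 150, 180, 210, 240, 900, 130, 145, 160]
statuses = [200, 200, 500, 200, 200, 201, 200, 200, 503, 200]
print(f"p95 latency: {p95_latency_ms(latencies):.0f} ms")
print(f"error rate: {error_rate_pct(statuses):.1f} %")
```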

2.2 Alert Tiers

| Severity | Definition | Alert Channel | Response SLA |
| --- | --- | --- | --- |
| SEV-1 (Critical) | Core Booking/Payment flow is down. | PagerDuty (Phone Call) | 15 Mins |
| SEV-2 (High) | Non-critical feature (e.g., Reporting) broken. | Slack (#ops-alerts) | 4 Hours |
| SEV-3 (Low) | Minor UI glitch or non-blocking bug. | Jira Ticket | 24 Hours |
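
A minimal sketch of how these tiers might drive alert routing is shown below; the channel identifiers and the `route_alert` helper are hypothetical, since actual routing lives in the PagerDuty and Slack integrations.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class AlertPolicy:
    channel: str
    response_sla: timedelta


# Mirrors the Alert Tiers table above.
ALERT_POLICIES = {
    "SEV-1": AlertPolicy(channel="pagerduty:phone", response_sla=timedelta(minutes=15)),
    "SEV-2": AlertPolicy(channel="slack:#ops-alerts", response_sla=timedelta(hours=4)),
    "SEV-3": AlertPolicy(channel="jira:ticket", response_sla=timedelta(hours=24)),
}


def route_alert(severity: str, summary: str) -> str:
    """Return a human-readable routing decision for an incoming alert."""
    policy = ALERT_POLICIES[severity]
    return f"[{severity}] '{summary}' -> {policy.channel} (respond within {policy.response_sla})"


print(route_alert("SEV-1", "Booking API returning 500s"))
```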

3. On-Call Responsibility Matrix

| Role | Responsibility | Escalation Path |
| --- | --- | --- |
| Primary On-Call | Triage alerts, 15-min response for Sev-1. | -> Secondary On-Call |
| Secondary On-Call | Deep dive complex system failures. | -> CTO / Vendor Support |
| Support Lead | Handle non-technical user escalations. | -> Primary On-Call (if bug confirmed) |
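
The escalation paths can also be expressed as a simple lookup, sketched below (the role keys mirror the matrix; the structure itself is illustrative, not our paging configuration):

```python
# Escalation chain mirroring the matrix above; None means end of the chain.
ESCALATION_PATH = {
    "support-lead": "primary-oncall",      # only once a bug is confirmed
    "primary-oncall": "secondary-oncall",
    "secondary-oncall": "cto-or-vendor",
    "cto-or-vendor": None,
}


def escalate(role: str) -> str | None:
    """Return who to page next when the given role cannot resolve the incident."""
    return ESCALATION_PATH.get(role)
```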

4. Knowledge Management Repository

| Category | Location | Content / Purpose |
| --- | --- | --- |
| Engineering | Internal Notion / Confluence | ADRs, API Specs, Env Configs, Runbooks. |
| User Help Center | Public Docs / Intercom | "How-to" guides, Video tutorials, FAQ. |

5. Continuous Improvement

  • Weekly Ops Review: Review of all SEV-1/2 incidents.
  • Chaos Engineering: Scheduled "Game Days" to test resilience (e.g., simulate DB failure); see the fault-injection sketch below.
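
For Game Days, a fault can be injected at the application layer rather than by taking real infrastructure down. The sketch below (a hypothetical `flaky_db` helper, not an existing internal tool) simulates intermittent database failures so recovery paths can be exercised in a test environment:

```python
import random
from contextlib import contextmanager


class SimulatedDbOutage(Exception):
    """Raised instead of performing the real query during a Game Day drill."""


@contextmanager
def flaky_db(failure_rate: float = 0.3, seed: int | None = None):
    """Yield a query function that fails randomly, simulating a DB outage."""
    rng = random.Random(seed)

    def query(sql: str):
        if rng.random() < failure_rate:
            raise SimulatedDbOutage(f"injected failure for: {sql}")
        return []  # stand-in for real result rows

    yield query


# Drill: verify the booking flow degrades gracefully when ~30% of queries fail.
with flaky_db(failure_rate=0.3, seed=42) as query:
    for attempt in range(5):
        try:
            query("SELECT * FROM bookings WHERE status = 'pending'")
            print(f"attempt {attempt}: ok")
        except SimulatedDbOutage as exc:
            print(f"attempt {attempt}: degraded path triggered ({exc})")
```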