Compute & Inference Infrastructure

[CLD-003] Provision EC2 Inference Cluster (Ollama)

Description

Sets up the self-hosted inference nodes running Ollama/vLLM to serve Llama 3 models.

Acceptance Criteria

  • [ ] Instance type: g5.xlarge (NVIDIA A10G, 24GB VRAM) or g4dn.xlarge (T4) for lighter loads.
  • [ ] AMI: NVIDIA Deep Learning AMI (Ubuntu 22.04).
  • [ ] Ollama installed and running as a systemd service.
  • [ ] Model Pulled: llama3:8b-instruct-q5_1 (Quantized for speed/memory balance).
  • [ ] Health check endpoint (/api/health) returns 200 OK (see the smoke-test sketch after this list).
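
A minimal smoke-test sketch for the last two criteria, assuming the node exposes Ollama on its default port 11434. The /api/health route is the one named in this ticket (stock Ollama also serves /api/tags, used here to confirm the model pull); the host and model tag are illustrative placeholders.

```python
"""Smoke test for a freshly provisioned Ollama node (CLD-003)."""
import sys
import requests

OLLAMA_URL = "http://localhost:11434"        # assumption: default Ollama bind address/port
EXPECTED_MODEL = "llama3:8b-instruct-q5_1"   # quantized tag from the acceptance criteria


def health_ok() -> bool:
    """The health endpoint named in this ticket must answer 200 OK."""
    resp = requests.get(f"{OLLAMA_URL}/api/health", timeout=5)
    return resp.status_code == 200


def model_pulled() -> bool:
    """/api/tags lists locally pulled models; the Llama 3 tag must be among them."""
    resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
    resp.raise_for_status()
    names = [m.get("name", "") for m in resp.json().get("models", [])]
    return EXPECTED_MODEL in names


if __name__ == "__main__":
    ok = health_ok() and model_pulled()
    print("PASS" if ok else "FAIL")
    sys.exit(0 if ok else 1)
```

Running this from the instance user data or a CI job makes the criteria checkable on every re-provision.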

[CLD-004] Auto-Scaling for Inference

Description

Configures an Auto Scaling group (ASG) to add GPU nodes when the inference queue backs up and latency starts to spike.

Acceptance Criteria

  • [ ] Metric: InferenceQueueDepth (custom CloudWatch metric; see the boto3 sketch after this list).
  • [ ] Scale Out: If Queue > 50, Add +1 Node (Max 5).
  • [ ] Scale In: If Queue < 5 for 15 mins, Remove Node (Min 1).
  • [ ] Warm Pool configured to reduce boot time (pre-baked AMI).
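
A boto3 sketch of how the criteria above could be wired together. The ASG name ollama-inference-asg, the Inference metric namespace, the 5-minute cooldowns, and the 1-minute evaluation window on scale-out are assumptions for illustration; the thresholds, node counts, and warm-pool requirement come from the criteria.

```python
"""Sketch of the CLD-004 scaling wiring with boto3 (names are illustrative)."""
import boto3

ASG_NAME = "ollama-inference-asg"   # assumption: actual ASG name for the GPU nodes
NAMESPACE = "Inference"             # assumption: namespace for the custom metric

cloudwatch = boto3.client("cloudwatch")
autoscaling = boto3.client("autoscaling")


def publish_queue_depth(depth: int) -> None:
    """Emit the custom InferenceQueueDepth metric that both alarms watch."""
    cloudwatch.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=[{
            "MetricName": "InferenceQueueDepth",
            "Dimensions": [{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
            "Value": depth,
            "Unit": "Count",
        }],
    )


def wire_scaling() -> None:
    """Set ASG bounds, create scale-out/in policies, their alarms, and the warm pool."""
    # Min 1 / Max 5 GPU nodes, per the acceptance criteria.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=ASG_NAME, MinSize=1, MaxSize=5,
    )

    scale_out = autoscaling.put_scaling_policy(
        AutoScalingGroupName=ASG_NAME,
        PolicyName="inference-scale-out",
        PolicyType="SimpleScaling",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=1,        # add one node per breach
        Cooldown=300,               # assumption: 5-minute cooldown
    )
    scale_in = autoscaling.put_scaling_policy(
        AutoScalingGroupName=ASG_NAME,
        PolicyName="inference-scale-in",
        PolicyType="SimpleScaling",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=-1,       # remove one node when the queue drains
        Cooldown=300,
    )

    # Scale out: queue depth above 50 (1-minute evaluation window is an assumption).
    cloudwatch.put_metric_alarm(
        AlarmName="inference-queue-high",
        Namespace=NAMESPACE,
        MetricName="InferenceQueueDepth",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=1,
        Threshold=50,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[scale_out["PolicyARN"]],
    )
    # Scale in: queue depth below 5 for 15 consecutive 1-minute periods.
    cloudwatch.put_metric_alarm(
        AlarmName="inference-queue-low",
        Namespace=NAMESPACE,
        MetricName="InferenceQueueDepth",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=15,
        Threshold=5,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=[scale_in["PolicyARN"]],
    )

    # Warm pool of stopped, pre-baked instances to cut boot time on scale-out.
    autoscaling.put_warm_pool(
        AutoScalingGroupName=ASG_NAME,
        PoolState="Stopped",
        MinSize=1,
    )
```

Simple scaling with two CloudWatch alarms keeps the behaviour easy to reason about; a step-scaling or target-tracking policy on the same metric would also satisfy these criteria.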