Compute & Inference Infrastructure

[CLD-003] Provision EC2 Inference Cluster (Ollama)

Description

Sets up the self-hosted inference nodes running Ollama/vLLM to serve Llama 3 models.

Acceptance Criteria

  • [ ] Instance type: g5.xlarge (NVIDIA A10G, 24GB VRAM) or g4dn.xlarge (T4) for lighter loads.
  • [ ] AMI: NVIDIA Deep Learning AMI (Ubuntu 22.04).
  • [ ] Ollama installed and running as a systemd service.
  • [ ] Model Pulled: llama3:8b-instruct-q5_1 (Quantized for speed/memory balance).
  • [ ] Health check endpoint (/api/health) returns 200 OK (see the smoke-test sketch after this list).
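
A minimal smoke-test sketch for the last two criteria, assuming the node exposes Ollama on its default port 11434. The /api/health route is the one named in this ticket (stock Ollama also serves /api/tags, used here to confirm the model pull); the host and model tag are illustrative placeholders.

```python
"""Smoke test for a freshly provisioned Ollama node (CLD-003)."""
import sys
import requests

OLLAMA_URL = "http://localhost:11434"        # assumption: default Ollama bind address/port
EXPECTED_MODEL = "llama3:8b-instruct-q5_1"   # quantized tag from the acceptance criteria


def health_ok() -> bool:
    """The health endpoint named in this ticket must answer 200 OK."""
    resp = requests.get(f"{OLLAMA_URL}/api/health", timeout=5)
    return resp.status_code == 200


def model_pulled() -> bool:
    """/api/tags lists locally pulled models; the Llama 3 tag must be among them."""
    resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
    resp.raise_for_status()
    names = [m.get("name", "") for m in resp.json().get("models", [])]
    return EXPECTED_MODEL in names


if __name__ == "__main__":
    ok = health_ok() and model_pulled()
    print("PASS" if ok else "FAIL")
    sys.exit(0 if ok else 1)
```

Running this from the instance user data or a CI job makes the criteria checkable on every re-provision.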

[CLD-004] Auto-Scaling for Inference

Description

Configures an Auto Scaling group (ASG) to add GPU nodes when the inference queue backs up and latency starts to spike.

Acceptance Criteria

  • [ ] Metric: InferenceQueueDepth (custom CloudWatch metric; see the boto3 sketch after this list).
  • [ ] Scale Out: If Queue > 50, Add +1 Node (Max 5).
  • [ ] Scale In: If Queue < 5 for 15 mins, Remove Node (Min 1).
  • [ ] Warm Pool configured to reduce boot time (pre-baked AMI).
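
A boto3 sketch of how the criteria above could be wired together. The ASG name ollama-inference-asg, the Inference metric namespace, the 5-minute cooldowns, and the 1-minute evaluation window on scale-out are assumptions for illustration; the thresholds, node counts, and warm-pool requirement come from the criteria.

```python
"""Sketch of the CLD-004 scaling wiring with boto3 (names are illustrative)."""
import boto3

ASG_NAME = "ollama-inference-asg"   # assumption: actual ASG name for the GPU nodes
NAMESPACE = "Inference"             # assumption: namespace for the custom metric

cloudwatch = boto3.client("cloudwatch")
autoscaling = boto3.client("autoscaling")


def publish_queue_depth(depth: int) -> None:
    """Emit the custom InferenceQueueDepth metric that both alarms watch."""
    cloudwatch.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=[{
            "MetricName": "InferenceQueueDepth",
            "Dimensions": [{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
            "Value": depth,
            "Unit": "Count",
        }],
    )


def wire_scaling() -> None:
    """Set ASG bounds, create scale-out/in policies, their alarms, and the warm pool."""
    # Min 1 / Max 5 GPU nodes, per the acceptance criteria.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=ASG_NAME, MinSize=1, MaxSize=5,
    )

    scale_out = autoscaling.put_scaling_policy(
        AutoScalingGroupName=ASG_NAME,
        PolicyName="inference-scale-out",
        PolicyType="SimpleScaling",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=1,        # add one node per breach
        Cooldown=300,               # assumption: 5-minute cooldown
    )
    scale_in = autoscaling.put_scaling_policy(
        AutoScalingGroupName=ASG_NAME,
        PolicyName="inference-scale-in",
        PolicyType="SimpleScaling",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=-1,       # remove one node when the queue drains
        Cooldown=300,
    )

    # Scale out: queue depth above 50 (1-minute evaluation window is an assumption).
    cloudwatch.put_metric_alarm(
        AlarmName="inference-queue-high",
        Namespace=NAMESPACE,
        MetricName="InferenceQueueDepth",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=1,
        Threshold=50,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[scale_out["PolicyARN"]],
    )
    # Scale in: queue depth below 5 for 15 consecutive 1-minute periods.
    cloudwatch.put_metric_alarm(
        AlarmName="inference-queue-low",
        Namespace=NAMESPACE,
        MetricName="InferenceQueueDepth",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=15,
        Threshold=5,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=[scale_in["PolicyARN"]],
    )

    # Warm pool of stopped, pre-baked instances to cut boot time on scale-out.
    autoscaling.put_warm_pool(
        AutoScalingGroupName=ASG_NAME,
        PoolState="Stopped",
        MinSize=1,
    )
```

Simple scaling with two CloudWatch alarms keeps the behaviour easy to reason about; a step-scaling or target-tracking policy on the same metric would also satisfy these criteria.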