Compute & Inference Infrastructure
[CLD-003] Provision EC2 Inference Cluster (Ollama)
Description
Sets up the self-hosted inference nodes running Ollama/vLLM to serve Llama 3 models.
Acceptance Criteria
- [ ] Instance type: `g5.xlarge` (NVIDIA A10G, 24 GB VRAM), or `g4dn.xlarge` (T4) for lighter loads.
- [ ] AMI: NVIDIA Deep Learning AMI (Ubuntu 22.04).
- [ ] Ollama installed and running as a systemd service.
- [ ] Model pulled: `llama3:8b-instruct-q5_1` (quantized for a speed/memory balance).
- [ ] Health check endpoint (`/api/health`) returns 200 OK (see the verification sketch after this list).
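A minimal verification sketch for these criteria, assuming the node is reachable on Ollama's default port (11434). The `/api/health` path is taken verbatim from the criterion above (it may be exposed by a wrapper or proxy rather than Ollama itself); Ollama's own `/api/tags` endpoint is used to confirm the model tag was pulled. `OLLAMA_URL` and the exit-code convention are assumptions for illustration.

```python
import sys
import requests

OLLAMA_URL = "http://localhost:11434"      # assumed: default Ollama port on the inference node
MODEL_TAG = "llama3:8b-instruct-q5_1"      # model tag from the acceptance criteria

def check_health() -> bool:
    """Health check endpoint from the acceptance criteria must return 200 OK."""
    resp = requests.get(f"{OLLAMA_URL}/api/health", timeout=5)
    return resp.status_code == 200

def check_model_present() -> bool:
    """/api/tags lists locally pulled models; the target tag must be among them."""
    resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
    resp.raise_for_status()
    models = [m["name"] for m in resp.json().get("models", [])]
    return MODEL_TAG in models

if __name__ == "__main__":
    ok = check_health() and check_model_present()
    print("PASS" if ok else "FAIL")
    sys.exit(0 if ok else 1)
```

A script like this can run as the final step of the provisioning pipeline, so a node only joins the cluster once both checks pass.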
[CLD-004] Auto-Scaling for Inference
Description
Configures Auto-Scaling Groups (ASG) to add GPU nodes when inference latency spikes.
Acceptance Criteria
- [ ] Metric: `InferenceQueueDepth` (custom CloudWatch metric; see the sketch after this list).
- [ ] Scale out: if queue depth > 50, add +1 node (max 5).
- [ ] Scale in: if queue depth < 5 for 15 minutes, remove a node (min 1).
- [ ] Warm Pool configured to reduce boot time (pre-baked AMI).
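A sketch of the custom-metric side of this setup using boto3, under stated assumptions: the CloudWatch namespace, dimension, ASG name, and alarm name are placeholders for this project, and the scaling-policy ARN is left elided. It publishes `InferenceQueueDepth` and defines the scale-out alarm matching the "> 50" criterion; the scale-in alarm and the ASG policies themselves would be configured analogously.

```python
import boto3

# Assumed names: namespace, dimension value, ASG name, and alarm name are project placeholders.
NAMESPACE = "Inference"
METRIC = "InferenceQueueDepth"
ASG_NAME = "ollama-inference-asg"

cloudwatch = boto3.client("cloudwatch")

def publish_queue_depth(depth: int) -> None:
    """Push the current inference queue depth as the custom metric the ASG scales on."""
    cloudwatch.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=[{
            "MetricName": METRIC,
            "Dimensions": [{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
            "Value": float(depth),
            "Unit": "Count",
        }],
    )

def create_scale_out_alarm() -> None:
    """Alarm mirroring the criterion: queue depth > 50 triggers the +1 node scaling policy."""
    cloudwatch.put_metric_alarm(
        AlarmName="inference-queue-scale-out",
        Namespace=NAMESPACE,
        MetricName=METRIC,
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=1,
        Threshold=50,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:autoscaling:..."],  # elided: ARN of the scale-out policy
    )
```

How queue depth is measured (e.g., in-flight requests tracked by the serving layer) is an implementation choice; whatever the source, publishing it on a fixed interval from each node keeps the alarm evaluation responsive to real load.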