# Multi-GPU Heterogeneous Strategy
- Status: accepted
- Date: 2025-12-01
- Deciders: Billy Davies
- Technical Story: GPU allocation strategy for AI workloads
## Context and Problem Statement
The homelab has diverse GPU hardware:
- AMD Strix Halo (64GB unified memory) - khelben
- NVIDIA RTX 2070 (8GB VRAM) - elminster
- AMD Radeon 680M (12GB VRAM) - drizzt
- Intel Arc (integrated) - danilo
Different AI workloads have different requirements. How do we allocate GPUs effectively?
## Decision Drivers
- Maximize utilization of all GPUs
- Match workloads to appropriate hardware
- Support concurrent inference services
- Enable fractional GPU sharing where appropriate
- Minimize cross-vendor complexity
## Considered Options
- Single GPU vendor only
- All workloads on largest GPU
- Workload-specific GPU allocation
- Dynamic GPU scheduling (MIG/fractional)
## Decision Outcome
Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on requirements.
### Allocation Strategy
| Workload | GPU | Node | Rationale |
|---|---|---|---|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares with Whisper |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |
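As an illustration of how a non-dedicated workload from this table gets pinned, a minimal pod-spec sketch for the embeddings service (container name and image layout assumed; since only khelben carries a taint, a plain `nodeSelector` is sufficient here):

```yaml
# Sketch: pin BGE embeddings to drizzt's AMD Radeon 680M (names assumed)
spec:
  nodeSelector:
    kubernetes.io/hostname: drizzt
  containers:
    - name: bge-embeddings
      resources:
        limits:
          amd.com/gpu: 1   # resource exposed by the AMD GPU device plugin
```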
### Positive Consequences
- Each workload gets optimal hardware
- No GPU memory contention for LLM
- NVIDIA services can share via time-slicing
- Cost-effective use of varied hardware
- Clear ownership and debugging
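The Whisper/XTTS sharing noted above relies on the NVIDIA device plugin's time-slicing feature, configured via a ConfigMap. A minimal sketch of that config (the replica count of 2 is an assumption matching the two services):

```yaml
# Sketch: NVIDIA device plugin time-slicing config. Advertises the RTX 2070
# as two schedulable nvidia.com/gpu resources so Whisper and XTTS can
# run concurrently on elminster. Note: time-slicing shares compute, not
# memory - the two services still contend for the 8GB of VRAM.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2
```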
### Negative Consequences
- More complex scheduling (node taints/tolerations)
- Less flexibility for workload migration
- Must maintain multiple GPU driver stacks
- Some GPUs underutilized at times
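The multiple driver stacks are maintained at the OS layer via Talos schematics (see ADR-0002). As a hedged sketch, an Image Factory schematic could list per-vendor system extensions; the exact extension names below vary by Talos release and are assumptions, not a record of this cluster's configuration:

```yaml
# Sketch: Talos Image Factory schematic (extension names assumed)
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nonfree-kmod-nvidia-production   # NVIDIA kernel module
      - siderolabs/nvidia-container-toolkit-production
      - siderolabs/amdgpu-firmware                  # AMD GPU firmware blobs
```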
## Implementation

### Node Taints

```yaml
# khelben - dedicated vLLM node
nodeTaints:
  dedicated: "vllm:NoSchedule"
```
### Pod Tolerations and Node Affinity

```yaml
# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]
```
### Resource Limits

```yaml
# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1
```

```yaml
# AMD GPU (drizzt, khelben)
resources:
  limits:
    amd.com/gpu: 1
```
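The allocation table also places the BGE reranker on danilo's Intel Arc. Assuming the Intel GPU device plugin is deployed there, the analogous request would use its `i915` resource name:

```yaml
# Intel GPU (danilo) - resource name exposed by the Intel device plugin
# (plugin deployment assumed; not shown in this ADR)
resources:
  limits:
    gpu.intel.com/i915: 1
```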
## Pros and Cons of the Options
### Single GPU vendor only
- Good, because simpler driver management
- Good, because consistent tooling
- Bad, because wastes existing hardware
- Bad, because higher cost for new hardware
### All workloads on largest GPU
- Good, because simple scheduling
- Good, because unified memory benefits
- Bad, because memory contention
- Bad, because single point of failure
- Bad, because wastes other GPUs
### Workload-specific allocation (chosen)
- Good, because optimal hardware matching
- Good, because uses all available GPUs
- Good, because clear resource boundaries
- Good, because parallel inference
- Bad, because more complex configuration
- Bad, because multiple driver stacks
### Dynamic GPU scheduling
- Good, because flexible
- Good, because maximizes utilization
- Bad, because complex to implement
- Bad, because MIG not available on consumer GPUs
- Bad, because cross-vendor scheduling immature
## Links
- Volcano Scheduler
- AMD GPU Device Plugin
- NVIDIA Device Plugin
- Related: ADR-0002 - GPU drivers via Talos schematics