homelab-design/decisions/0005-multi-gpu-strategy.md
Billy D. 832cda34bd feat: add comprehensive architecture documentation
- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories
and kubectl cluster-info dump data.
2026-02-01 14:30:05 -05:00


# Multi-GPU Heterogeneous Strategy
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: GPU allocation strategy for AI workloads
## Context and Problem Statement
The homelab has diverse GPU hardware:
- AMD Strix Halo (64GB unified memory) - khelben
- NVIDIA RTX 2070 (8GB VRAM) - elminster
- AMD Radeon 680M (12GB VRAM) - drizzt
- Intel Arc (integrated) - danilo
Different AI workloads have different requirements. How do we allocate GPUs effectively?
## Decision Drivers
* Maximize utilization of all GPUs
* Match workloads to appropriate hardware
* Support concurrent inference services
* Enable fractional GPU sharing where appropriate
* Minimize cross-vendor complexity
## Considered Options
* Single GPU vendor only
* All workloads on largest GPU
* Workload-specific GPU allocation
* Dynamic GPU scheduling (MIG/fractional)
## Decision Outcome
Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on requirements.
### Allocation Strategy
| Workload | GPU | Node | Rationale |
|----------|-----|------|-----------|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares the GPU with Whisper via time-slicing |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |
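The table above can also be mirrored in node labels so manifests can select hardware declaratively rather than by hostname. A sketch for one node; the label keys are illustrative, not taken from the cluster:

```yaml
# Hypothetical labels on khelben (key names are assumptions, not from the source)
apiVersion: v1
kind: Node
metadata:
  name: khelben
  labels:
    gpu.homelab/vendor: amd
    gpu.homelab/memory-gb: "64"
```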
### Positive Consequences
* Each workload gets optimal hardware
* No GPU memory contention for LLM
* NVIDIA services can share via time-slicing
* Cost-effective use of varied hardware
* Clear ownership and debugging
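The time-slicing mentioned above is supported by the NVIDIA device plugin's sharing configuration. A sketch; the replica count of 2 (one slot each for Whisper and XTTS) is an assumption, not confirmed by the source:

```yaml
# NVIDIA device plugin config for elminster: advertise the RTX 2070
# as two schedulable nvidia.com/gpu resources so Whisper and XTTS can share it
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2
```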
### Negative Consequences
* More complex scheduling (node taints/tolerations)
* Less flexibility for workload migration
* Must maintain multiple GPU driver stacks
* Some GPUs underutilized at times
## Implementation
### Node Taints
```yaml
# khelben - dedicated vLLM node
nodeTaints:
  dedicated: "vllm:NoSchedule"
```
### Pod Tolerations and Node Affinity
```yaml
# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]
```
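The rule these fields encode can be sketched in a few lines: a pod fits a node only if every `NoSchedule` taint on the node is matched by some toleration on the pod. This is an illustrative model of the matching logic, not the real kube-scheduler implementation:

```python
# Illustrative sketch of taint/toleration matching (not the kube-scheduler code).
from dataclasses import dataclass

@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str

@dataclass(frozen=True)
class Toleration:
    key: str
    operator: str  # "Equal" or "Exists"
    value: str
    effect: str    # empty string tolerates any effect

def tolerates(tol: Toleration, taint: Taint) -> bool:
    """True if a single toleration matches a single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        return tol.key in ("", taint.key)
    return tol.key == taint.key and tol.value == taint.value

def schedulable(tolerations: list[Toleration], taints: list[Taint]) -> bool:
    """A pod fits a node if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(t, taint) for t in tolerations)
        for taint in taints
        if taint.effect == "NoSchedule"
    )

vllm_tol = Toleration("dedicated", "Equal", "vllm", "NoSchedule")
khelben = [Taint("dedicated", "vllm", "NoSchedule")]

print(schedulable([vllm_tol], khelben))  # vLLM pod lands on khelben -> True
print(schedulable([], khelben))          # untolerated pod is repelled -> False
```

This is why the taint alone keeps khelben clear of general workloads, while the affinity rule is still needed to pull vLLM onto khelben specifically.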
### Resource Limits
```yaml
# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1

# AMD GPU (drizzt, khelben)
resources:
  limits:
    amd.com/gpu: 1
```
## Pros and Cons of the Options
### Single GPU vendor only
* Good, because simpler driver management
* Good, because consistent tooling
* Bad, because wastes existing hardware
* Bad, because higher cost for new hardware
### All workloads on largest GPU
* Good, because simple scheduling
* Good, because unified memory benefits
* Bad, because memory contention
* Bad, because single point of failure
* Bad, because wastes other GPUs
### Workload-specific GPU allocation (chosen)
* Good, because optimal hardware matching
* Good, because uses all available GPUs
* Good, because clear resource boundaries
* Good, because parallel inference
* Bad, because more complex configuration
* Bad, because multiple driver stacks
### Dynamic GPU scheduling
* Good, because flexible
* Good, because maximizes utilization
* Bad, because complex to implement
* Bad, because MIG not available on consumer GPUs
* Bad, because cross-vendor scheduling immature
## Links
* [Volcano Scheduler](https://volcano.sh)
* [AMD GPU Device Plugin](https://github.com/ROCm/k8s-device-plugin)
* [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
* Related: [ADR-0002](0002-use-talos-linux.md) - GPU drivers via Talos schematics