- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories and kubectl cluster-info dump data.
# Multi-GPU Heterogeneous Strategy

* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: GPU allocation strategy for AI workloads

## Context and Problem Statement

The homelab has diverse GPU hardware:

- AMD Strix Halo (64GB unified memory) - khelben
- NVIDIA RTX 2070 (8GB VRAM) - elminster
- AMD Radeon 680M (12GB VRAM) - drizzt
- Intel Arc (integrated) - danilo

Different AI workloads have different requirements. How do we allocate GPUs effectively?

## Decision Drivers

* Maximize utilization of all GPUs
* Match workloads to appropriate hardware
* Support concurrent inference services
* Enable fractional GPU sharing where appropriate
* Minimize cross-vendor complexity

## Considered Options

* Single GPU vendor only
* All workloads on largest GPU
* Workload-specific GPU allocation
* Dynamic GPU scheduling (MIG/fractional)

## Decision Outcome

Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on its requirements.

### Allocation Strategy

| Workload | GPU | Node | Rationale |
|----------|-----|------|-----------|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares with Whisper |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |

### Positive Consequences

* Each workload gets optimal hardware
* No GPU memory contention for LLM
* NVIDIA services can share via time-slicing
* Cost-effective use of varied hardware
* Clear ownership and debugging

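The NVIDIA time-slicing mentioned above is configured through the NVIDIA device plugin. A minimal sketch of its config file, assuming two replicas so that Whisper and XTTS can each claim a `nvidia.com/gpu` slot on elminster:

```yaml
# NVIDIA device plugin config (sketch): advertises the single RTX 2070
# as 2 schedulable nvidia.com/gpu resources via time-slicing
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2
```

Note that time-slicing shares the GPU without memory isolation, so both services must together stay within the card's 8GB VRAM.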
### Negative Consequences

* More complex scheduling (node taints/tolerations)
* Less flexibility for workload migration
* Must maintain multiple GPU driver stacks
* Some GPUs underutilized at times

## Implementation
### Node Taints

```yaml
# khelben - dedicated vLLM node
nodeTaints:
  dedicated: "vllm:NoSchedule"
```

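Since the nodes run Talos Linux (ADR-0002), the taint would be applied declaratively through machine configuration rather than with `kubectl taint`. A sketch of a machine config patch, assuming Talos's `machine.nodeTaints` field:

```yaml
# Talos machine config patch for khelben (sketch)
machine:
  nodeTaints:
    dedicated: "vllm:NoSchedule"
```
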
### Pod Tolerations and Node Affinity

```yaml
# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]
```

### Resource Limits

```yaml
# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1

# AMD GPU (drizzt, khelben)
resources:
  limits:
    amd.com/gpu: 1
```

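The allocation table also assigns the BGE Reranker to the Intel Arc on danilo. Assuming the Intel GPU device plugin is deployed there, that pod would request the plugin's `gpu.intel.com/i915` resource:

```yaml
# Intel GPU (danilo) - sketch, assumes the Intel GPU device plugin's
# gpu.intel.com/i915 resource name
resources:
  limits:
    gpu.intel.com/i915: 1
```
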
## Pros and Cons of the Options

### Single GPU vendor only

* Good, because simpler driver management
* Good, because consistent tooling
* Bad, because wastes existing hardware
* Bad, because higher cost for new hardware

### All workloads on largest GPU

* Good, because simple scheduling
* Good, because unified memory benefits
* Bad, because memory contention
* Bad, because single point of failure
* Bad, because wastes other GPUs

### Workload-specific allocation (chosen)

* Good, because optimal hardware matching
* Good, because uses all available GPUs
* Good, because clear resource boundaries
* Good, because parallel inference
* Bad, because more complex configuration
* Bad, because multiple driver stacks

### Dynamic GPU scheduling

* Good, because flexible
* Good, because maximizes utilization
* Bad, because complex to implement
* Bad, because MIG not available on consumer GPUs
* Bad, because cross-vendor scheduling immature

## Links

* [Volcano Scheduler](https://volcano.sh)
* [AMD GPU Device Plugin](https://github.com/ROCm/k8s-device-plugin)
* [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
* Related: [ADR-0002](0002-use-talos-linux.md) - GPU drivers via Talos schematics