feat: add comprehensive architecture documentation
- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of the homelab-k8s2 and llm-workflows repositories and `kubectl cluster-info dump` data.
decisions/0005-multi-gpu-strategy.md (new file, 145 lines)
@@ -0,0 +1,145 @@
# Multi-GPU Heterogeneous Strategy

* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: GPU allocation strategy for AI workloads

## Context and Problem Statement

The homelab has diverse GPU hardware:

- AMD Strix Halo (64GB unified memory) - khelben
- NVIDIA RTX 2070 (8GB VRAM) - elminster
- AMD Radeon 680M (12GB VRAM) - drizzt
- Intel Arc (integrated) - danilo

Different AI workloads have different requirements. How do we allocate GPUs effectively?
## Decision Drivers

* Maximize utilization of all GPUs
* Match workloads to appropriate hardware
* Support concurrent inference services
* Enable fractional GPU sharing where appropriate
* Minimize cross-vendor complexity

## Considered Options

* Single GPU vendor only
* All workloads on largest GPU
* Workload-specific GPU allocation
* Dynamic GPU scheduling (MIG/fractional)
## Decision Outcome

Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on its requirements.

### Allocation Strategy

| Workload | GPU | Node | Rationale |
|----------|-----|------|-----------|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA-optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares with Whisper |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |
### Positive Consequences

* Each workload gets optimal hardware
* No GPU memory contention for the LLM
* NVIDIA services can share via time-slicing
* Cost-effective use of varied hardware
* Clear ownership and debugging

### Negative Consequences

* More complex scheduling (node taints/tolerations)
* Less flexibility for workload migration
* Must maintain multiple GPU driver stacks
* Some GPUs underutilized at times
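The NVIDIA sharing noted under the positive consequences relies on device-plugin time-slicing, which advertises one physical GPU as several schedulable `nvidia.com/gpu` resources. A minimal sketch of that configuration, assuming the upstream NVIDIA k8s-device-plugin with config-file support (the ConfigMap name and namespace are illustrative):

```yaml
# Illustrative ConfigMap consumed by the NVIDIA device plugin on elminster.
# With replicas: 2, the single RTX 2070 is advertised as two nvidia.com/gpu
# resources, letting Whisper and XTTS be scheduled side by side.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config  # name is an assumption
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 2
```

Note that time-slicing provides no memory isolation: both replicas share the 8GB of VRAM, so the services must fit together.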
## Implementation

### Node Taints

```yaml
# khelben - dedicated vLLM node
nodeTaints:
  dedicated: "vllm:NoSchedule"
```
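For reference, the shorthand above corresponds to a standard Kubernetes taint on the Node object, equivalent to this sketch:

```yaml
# Equivalent taint as it appears in the Node spec
apiVersion: v1
kind: Node
metadata:
  name: khelben
spec:
  taints:
    - key: dedicated
      value: vllm
      effect: NoSchedule
```

With this in place, only pods carrying a matching toleration can be scheduled onto khelben.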
### Pod Tolerations and Node Affinity

```yaml
# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]
```
### Resource Limits

```yaml
# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1

# AMD GPU (drizzt, khelben)
resources:
  limits:
    amd.com/gpu: 1
```
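Putting the pieces together, one row of the allocation table translates into a manifest along these lines. This is a sketch only: the Deployment name, labels, and image are placeholders, not taken from the actual repositories.

```yaml
# Illustrative: pinning the BGE embedding service to drizzt's Radeon 680M.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bge-embeddings  # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bge-embeddings
  template:
    metadata:
      labels:
        app: bge-embeddings
    spec:
      # Pin to the node whose GPU matches the workload (no taint on drizzt,
      # so a nodeSelector suffices; no toleration needed).
      nodeSelector:
        kubernetes.io/hostname: drizzt
      containers:
        - name: bge-embeddings
          image: ghcr.io/example/bge-embeddings:latest  # placeholder image
          resources:
            limits:
              amd.com/gpu: 1
```

Extended resources such as `amd.com/gpu` and `nvidia.com/gpu` are integer-valued and cannot be requested fractionally; sharing happens only via mechanisms like the time-slicing described above.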
## Pros and Cons of the Options

### Single GPU vendor only

* Good, because simpler driver management
* Good, because consistent tooling
* Bad, because wastes existing hardware
* Bad, because higher cost for new hardware

### All workloads on largest GPU

* Good, because simple scheduling
* Good, because unified memory benefits
* Bad, because memory contention
* Bad, because single point of failure
* Bad, because wastes other GPUs
### Workload-specific allocation (chosen)

* Good, because optimal hardware matching
* Good, because uses all available GPUs
* Good, because clear resource boundaries
* Good, because parallel inference
* Bad, because more complex configuration
* Bad, because multiple driver stacks

### Dynamic GPU scheduling

* Good, because flexible
* Good, because maximizes utilization
* Bad, because complex to implement
* Bad, because MIG not available on consumer GPUs
* Bad, because cross-vendor scheduling is immature
## Links

* [Volcano Scheduler](https://volcano.sh)
* [AMD GPU Device Plugin](https://github.com/ROCm/k8s-device-plugin)
* [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
* Related: [ADR-0002](0002-use-talos-linux.md) - GPU drivers via Talos schematics