# Multi-GPU Heterogeneous Strategy

* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: GPU allocation strategy for AI workloads

## Context and Problem Statement

The homelab has diverse GPU hardware:

- AMD Strix Halo (64GB unified memory) - khelben
- NVIDIA RTX 2070 (8GB VRAM) - elminster
- AMD Radeon 680M (12GB VRAM) - drizzt
- Intel Arc (integrated) - danilo

Different AI workloads have different requirements. How do we allocate GPUs effectively?

## Decision Drivers

* Maximize utilization of all GPUs
* Match workloads to appropriate hardware
* Support concurrent inference services
* Enable fractional GPU sharing where appropriate
* Minimize cross-vendor complexity

## Considered Options

* Single GPU vendor only
* All workloads on largest GPU
* Workload-specific GPU allocation
* Dynamic GPU scheduling (MIG/fractional)

## Decision Outcome

Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on its requirements.

### Allocation Strategy

| Workload | GPU | Node | Rationale |
|----------|-----|------|-----------|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares with Whisper |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |

### Positive Consequences

* Each workload gets optimal hardware
* No GPU memory contention for the LLM
* NVIDIA services can share via time-slicing
* Cost-effective use of varied hardware
* Clear ownership and debugging

### Negative Consequences

* More complex scheduling (node taints/tolerations)
* Less flexibility for workload migration
* Must maintain multiple GPU driver stacks
* Some GPUs underutilized at times

## Implementation

### Node Taints

```yaml
# khelben - dedicated vLLM node
nodeTaints:
  dedicated: "vllm:NoSchedule"
```

### Pod Tolerations and Node Affinity

```yaml
# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]
```

### Resource Limits

```yaml
# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1

# AMD GPU (drizzt, khelben)
resources:
  limits:
    amd.com/gpu: 1
```
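The Resource Limits snippets above cover the NVIDIA and AMD resource names but not the Intel Arc path for the reranker on danilo. A minimal sketch of what that request might look like, assuming the Intel GPU device plugin for Kubernetes is deployed there (the `gpu.intel.com/i915` resource name is that plugin's convention, not something defined in this ADR):

```yaml
# Intel GPU (danilo) - assumes the Intel GPU device plugin,
# which exposes i915-driven GPUs as gpu.intel.com/i915
resources:
  limits:
    gpu.intel.com/i915: 1
```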
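### NVIDIA Time-Slicing (elminster)

Whisper and XTTS share the single RTX 2070. One way to make that sharing explicit to the scheduler is the NVIDIA device plugin's time-slicing feature, which advertises multiple replicas of `nvidia.com/gpu` on the node. A minimal sketch, assuming the plugin is pointed at a config like the one below (the ConfigMap name, namespace, and data key are placeholders; the exact wiring depends on how the plugin is deployed):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config  # placeholder name
  namespace: kube-system             # placeholder namespace
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            # Advertise two slices so Whisper and XTTS can each
            # request nvidia.com/gpu: 1 on the same card
            replicas: 2
```

Time-slicing shares compute but provides no memory isolation: both pods still contend for the card's 8GB of VRAM, so model sizes must be budgeted accordingly.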
## Pros and Cons of the Options

### Single GPU vendor only

* Good, because simpler driver management
* Good, because consistent tooling
* Bad, because wastes existing hardware
* Bad, because higher cost for new hardware

### All workloads on largest GPU

* Good, because simple scheduling
* Good, because unified memory benefits
* Bad, because memory contention
* Bad, because single point of failure
* Bad, because wastes other GPUs

### Workload-specific allocation (chosen)

* Good, because optimal hardware matching
* Good, because uses all available GPUs
* Good, because clear resource boundaries
* Good, because parallel inference
* Bad, because more complex configuration
* Bad, because multiple driver stacks

### Dynamic GPU scheduling

* Good, because flexible
* Good, because maximizes utilization
* Bad, because complex to implement
* Bad, because MIG not available on consumer GPUs
* Bad, because cross-vendor scheduling immature

## Links

* [Volcano Scheduler](https://volcano.sh)
* [AMD GPU Device Plugin](https://github.com/ROCm/k8s-device-plugin)
* [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
* Related: [ADR-0002](0002-use-talos-linux.md) - GPU drivers via Talos schematics