Multi-GPU Heterogeneous Strategy

  • Status: accepted
  • Date: 2025-12-01
  • Deciders: Billy Davies
  • Technical Story: GPU allocation strategy for AI workloads

Context and Problem Statement

The homelab has diverse GPU hardware:

  • AMD Strix Halo (64GB unified memory) - khelben
  • NVIDIA RTX 2070 (8GB VRAM) - elminster
  • AMD Radeon 680M (12GB VRAM) - drizzt
  • Intel Arc (integrated) - danilo

Workloads range from large-model LLM inference to lightweight reranking, each with different memory, compute, and framework requirements. How do we allocate these heterogeneous GPUs effectively?

Decision Drivers

  • Maximize utilization of all GPUs
  • Match workloads to appropriate hardware
  • Support concurrent inference services
  • Enable fractional GPU sharing where appropriate
  • Minimize cross-vendor complexity

Considered Options

  • Single GPU vendor only
  • All workloads on largest GPU
  • Workload-specific GPU allocation
  • Dynamic GPU scheduling (MIG/fractional)

Decision Outcome

Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on requirements.

Allocation Strategy

| Workload | GPU | Node | Rationale |
|---|---|---|---|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares GPU with Whisper |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |

Positive Consequences

  • Each workload gets optimal hardware
  • No GPU memory contention for LLM
  • NVIDIA services can share via time-slicing
  • Cost-effective use of varied hardware
  • Clear ownership and debugging
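
Sharing the RTX 2070 between Whisper and XTTS would be enabled through the NVIDIA device plugin's time-slicing configuration. A minimal sketch of that ConfigMap payload (the `replicas` count is an assumption for illustration, not the deployed value):

```yaml
# NVIDIA k8s-device-plugin config enabling time-slicing on elminster.
# Each physical GPU is advertised as 2 schedulable nvidia.com/gpu
# resources, letting Whisper and XTTS each request one.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2  # assumed: one slice each for Whisper and XTTS
```

Note that time-slicing provides no memory isolation: both pods see the full 8GB of VRAM, so their combined footprint must stay under the card's capacity.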

Negative Consequences

  • More complex scheduling (node taints/tolerations)
  • Less flexibility for workload migration
  • Must maintain multiple GPU driver stacks
  • Some GPUs underutilized at times

Implementation

Node Taints

# khelben - dedicated vLLM node (Talos machine config fragment)
nodeTaints:
  dedicated: "vllm:NoSchedule"

Pod Tolerations and Node Affinity

# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]

Resource Limits

# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1

# AMD GPU (drizzt, khelben)  
resources:
  limits:
    amd.com/gpu: 1
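
The BGE Reranker on danilo would request the Intel GPU under the resource name exposed by the Intel device plugin. A sketch, assuming the plugin's default `gpu.intel.com/i915` resource:

```yaml
# Intel GPU (danilo) - resource name advertised by the Intel
# GPU device plugin for i915-driven hardware such as Arc iGPUs
resources:
  limits:
    gpu.intel.com/i915: 1
```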

Pros and Cons of the Options

Single GPU vendor only

  • Good, because simpler driver management
  • Good, because consistent tooling
  • Bad, because wastes existing hardware
  • Bad, because higher cost for new hardware

All workloads on largest GPU

  • Good, because simple scheduling
  • Good, because unified memory benefits
  • Bad, because memory contention
  • Bad, because single point of failure
  • Bad, because wastes other GPUs

Workload-specific allocation (chosen)

  • Good, because optimal hardware matching
  • Good, because uses all available GPUs
  • Good, because clear resource boundaries
  • Good, because parallel inference
  • Bad, because more complex configuration
  • Bad, because multiple driver stacks

Dynamic GPU scheduling

  • Good, because flexible
  • Good, because maximizes utilization
  • Bad, because complex to implement
  • Bad, because MIG not available on consumer GPUs
  • Bad, because cross-vendor scheduling immature