# Multi-GPU Heterogeneous Strategy

* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: GPU allocation strategy for AI workloads

## Context and Problem Statement

The homelab has diverse GPU hardware:

- AMD Strix Halo (64GB unified memory) - khelben
- NVIDIA RTX 2070 (8GB VRAM) - elminster
- AMD Radeon 680M (12GB VRAM) - drizzt
- Intel Arc (integrated) - danilo

Different AI workloads have different requirements. How do we allocate GPUs effectively?

## Decision Drivers

* Maximize utilization of all GPUs
* Match workloads to appropriate hardware
* Support concurrent inference services
* Enable fractional GPU sharing where appropriate
* Minimize cross-vendor complexity

## Considered Options

* Single GPU vendor only
* All workloads on largest GPU
* Workload-specific GPU allocation
* Dynamic GPU scheduling (MIG/fractional)

## Decision Outcome

Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on its requirements.

### Allocation Strategy

| Workload | GPU | Node | Rationale |
|----------|-----|------|-----------|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares with Whisper |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |

### Positive Consequences

* Each workload gets optimal hardware
* No GPU memory contention for the LLM
* NVIDIA services can share via time-slicing
* Cost-effective use of varied hardware
* Clear ownership and debugging

### Negative Consequences

* More complex scheduling (node taints/tolerations)
* Less flexibility for workload migration
* Must maintain multiple GPU driver stacks
* Some GPUs underutilized at times

## Implementation

### Node Taints

```yaml
# khelben - dedicated vLLM node
nodeTaints:
  dedicated: "vllm:NoSchedule"
```

### Pod Tolerations and Node Affinity

```yaml
# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]
```

### Resource Limits

```yaml
# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1

# AMD GPU (drizzt, khelben)
resources:
  limits:
    amd.com/gpu: 1
```
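The Resource Limits snippets above cover the NVIDIA and AMD resource names but not the Intel Arc path for the reranker on danilo. A minimal sketch of what that request might look like, assuming the Intel GPU device plugin for Kubernetes is deployed there (the `gpu.intel.com/i915` resource name is that plugin's convention, not something defined in this ADR):

```yaml
# Intel GPU (danilo) - assumes the Intel GPU device plugin,
# which exposes i915-driven GPUs as gpu.intel.com/i915
resources:
  limits:
    gpu.intel.com/i915: 1
```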
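### NVIDIA Time-Slicing (elminster)

Whisper and XTTS share the single RTX 2070. One way to make that sharing explicit to the scheduler is the NVIDIA device plugin's time-slicing feature, which advertises multiple replicas of `nvidia.com/gpu` on the node. A minimal sketch, assuming the plugin is pointed at a config like the one below (the ConfigMap name, namespace, and data key are placeholders; the exact wiring depends on how the plugin is deployed):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config  # placeholder name
  namespace: kube-system             # placeholder namespace
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            # Advertise two slices so Whisper and XTTS can each
            # request nvidia.com/gpu: 1 on the same card
            replicas: 2
```

Time-slicing shares compute but provides no memory isolation: both pods still contend for the card's 8GB of VRAM, so model sizes must be budgeted accordingly.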
## Pros and Cons of the Options

### Single GPU vendor only

* Good, because simpler driver management
* Good, because consistent tooling
* Bad, because wastes existing hardware
* Bad, because higher cost for new hardware

### All workloads on largest GPU

* Good, because simple scheduling
* Good, because unified memory benefits
* Bad, because memory contention
* Bad, because single point of failure
* Bad, because wastes other GPUs

### Workload-specific allocation (chosen)

* Good, because optimal hardware matching
* Good, because uses all available GPUs
* Good, because clear resource boundaries
* Good, because parallel inference
* Bad, because more complex configuration
* Bad, because multiple driver stacks

### Dynamic GPU scheduling

* Good, because flexible
* Good, because maximizes utilization
* Bad, because complex to implement
* Bad, because MIG not available on consumer GPUs
* Bad, because cross-vendor scheduling immature

## Links

* [Volcano Scheduler](https://volcano.sh)
* [AMD GPU Device Plugin](https://github.com/ROCm/k8s-device-plugin)
* [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
* Related: [ADR-0002](0002-use-talos-linux.md) - GPU drivers via Talos schematics