Multi-GPU Heterogeneous Strategy

  • Status: accepted
  • Date: 2025-12-01
  • Deciders: Billy Davies
  • Technical Story: GPU allocation strategy for AI workloads

Context and Problem Statement

The homelab has diverse GPU hardware:

  • AMD Strix Halo (64GB unified memory) - khelben
  • NVIDIA RTX 2070 (8GB VRAM) - elminster
  • AMD Radeon 680M (12GB VRAM) - drizzt
  • Intel Arc (integrated) - danilo

Workloads range from large-model LLM inference to lightweight reranking, each with different memory, compute, and framework requirements. How do we allocate these heterogeneous GPUs effectively?

Decision Drivers

  • Maximize utilization of all GPUs
  • Match workloads to appropriate hardware
  • Support concurrent inference services
  • Enable fractional GPU sharing where appropriate
  • Minimize cross-vendor complexity

Considered Options

  • Single GPU vendor only
  • All workloads on largest GPU
  • Workload-specific GPU allocation
  • Dynamic GPU scheduling (MIG/fractional)

Decision Outcome

Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on requirements.

Allocation Strategy

| Workload | GPU | Node | Rationale |
|---|---|---|---|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares GPU with Whisper |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |

Positive Consequences

  • Each workload gets optimal hardware
  • No GPU memory contention for LLM
  • NVIDIA services can share via time-slicing
  • Cost-effective use of varied hardware
  • Clear ownership and debugging
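
Sharing the RTX 2070 between Whisper and XTTS would be enabled through the NVIDIA device plugin's time-slicing configuration. A minimal sketch of that ConfigMap payload (the `replicas` count is an assumption for illustration, not the deployed value):

```yaml
# NVIDIA k8s-device-plugin config enabling time-slicing on elminster.
# Each physical GPU is advertised as 2 schedulable nvidia.com/gpu
# resources, letting Whisper and XTTS each request one.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2  # assumed: one slice each for Whisper and XTTS
```

Note that time-slicing provides no memory isolation: both pods see the full 8GB of VRAM, so their combined footprint must stay under the card's capacity.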

Negative Consequences

  • More complex scheduling (node taints/tolerations)
  • Less flexibility for workload migration
  • Must maintain multiple GPU driver stacks
  • Some GPUs underutilized at times

Implementation

Node Taints

# khelben - dedicated vLLM node (Talos machine config fragment)
nodeTaints:
  dedicated: "vllm:NoSchedule"

Pod Tolerations and Node Affinity

# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]

Resource Limits

# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1

# AMD GPU (drizzt, khelben)  
resources:
  limits:
    amd.com/gpu: 1
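
The BGE Reranker on danilo would request the Intel GPU under the resource name exposed by the Intel device plugin. A sketch, assuming the plugin's default `gpu.intel.com/i915` resource:

```yaml
# Intel GPU (danilo) - resource name advertised by the Intel
# GPU device plugin for i915-driven hardware such as Arc iGPUs
resources:
  limits:
    gpu.intel.com/i915: 1
```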

Pros and Cons of the Options

Single GPU vendor only

  • Good, because simpler driver management
  • Good, because consistent tooling
  • Bad, because wastes existing hardware
  • Bad, because higher cost for new hardware

All workloads on largest GPU

  • Good, because simple scheduling
  • Good, because unified memory benefits
  • Bad, because memory contention
  • Bad, because single point of failure
  • Bad, because wastes other GPUs

Workload-specific allocation (chosen)

  • Good, because optimal hardware matching
  • Good, because uses all available GPUs
  • Good, because clear resource boundaries
  • Good, because parallel inference
  • Bad, because more complex configuration
  • Bad, because multiple driver stacks

Dynamic GPU scheduling

  • Good, because flexible
  • Good, because maximizes utilization
  • Bad, because complex to implement
  • Bad, because MIG not available on consumer GPUs
  • Bad, because cross-vendor scheduling immature