# Volcano Batch Scheduling Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Optimize scheduling for batch ML and analytics workloads
## Context and Problem Statement

The homelab runs diverse workloads including:

- AI/ML training jobs (batch, GPU-intensive)
- Spark/Flink analytics jobs (batch, CPU/memory-intensive)
- KubeRay cluster with multiple GPU workers
- Long-running inference services

The default Kubernetes scheduler (kube-scheduler) is optimized for microservices, not batch workloads. It lacks:

- Gang scheduling (all-or-nothing pod placement)
- Fair-share queuing across teams/projects
- Preemption policies for priority workloads
- Resource reservation for batch jobs

How do we optimize scheduling for batch and ML workloads?
## Decision Drivers

* Gang scheduling for distributed ML training
* Fair-share resource allocation
* Priority-based preemption
* Integration with Kubeflow and Spark
* GPU-aware scheduling
* Queue management for multi-tenant scenarios
## Considered Options

1. **Volcano Scheduler**
2. **Apache YuniKorn**
3. **Kubernetes default scheduler with Priority Classes**
4. **Kueue (Kubernetes Batch Workload Queueing)**
## Decision Outcome

Chosen option: **Option 1 - Volcano Scheduler**

Volcano is a CNCF project designed for batch, HPC, and ML workloads. It provides gang scheduling and queue management, and it integrates natively with Spark, Flink, and ML frameworks.
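For reference, Volcano ships as a Helm chart from the upstream project; a minimal install sketch (chart version pinning omitted):

```shell
# Install Volcano (admission webhook, controllers, scheduler) into its
# own namespace from the upstream chart repository.
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano \
  --namespace volcano-system \
  --create-namespace
```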
### Positive Consequences

* Gang scheduling prevents partial deployments
* Queue-based fair-share resource management
* Native Spark and Flink integration
* Preemption for high-priority jobs
* CNCF project with active community
* Coexists with default scheduler
### Negative Consequences

* Additional scheduler components (admission, controller, scheduler)
* Learning curve for queue configuration
* Workloads must opt in via scheduler name
## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          Volcano System                         │
│                    (volcano-system namespace)                   │
│                                                                 │
│  ┌─────────────────┐  ┌───────────────────┐  ┌───────────────┐  │
│  │   Admission     │  │   Controllers     │  │   Scheduler   │  │
│  │   Webhook       │  │   (Job lifecycle) │  │   (Placement) │  │
│  └─────────────────┘  └───────────────────┘  └───────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                             Queues                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ ml-training  │ analytics   │ inference   │ default        │  │
│  │ weight: 40   │ weight: 30  │ weight: 20  │ weight: 10     │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                            Workloads                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │  Spark Jobs  │  │  Flink Jobs  │  │  ML Training (KFP)   │   │
│  │ (analytics)  │  │ (analytics)  │  │  (ml-training)       │   │
│  └──────────────┘  └──────────────┘  └──────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
```
## Configuration

### Queue Definition
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  weight: 40
  reclaimable: true
  guarantee:
    resource:
      cpu: "4"
      memory: "16Gi"
  capability:
    cpu: "32"
    memory: "128Gi"
    nvidia.com/gpu: "2"
```
### Spark Integration

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: analytics-job
spec:
  batchScheduler: volcano
  batchSchedulerOptions:
    queue: analytics
    priorityClassName: normal
  driver:
    schedulerName: volcano
  executor:
    schedulerName: volcano
    instances: 4
```
### Gang Scheduling for ML Training

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  schedulerName: volcano
  minAvailable: 4        # Gang: all 4 pods or none
  queue: ml-training
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              resources:
                limits:
                  nvidia.com/gpu: 1
```
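The `minAvailable` contract can be illustrated with a toy all-or-nothing placement check. This is a sketch only; `gang_admit` is a hypothetical helper, not Volcano's algorithm:

```python
# Toy illustration of gang (all-or-nothing) admission. This is NOT
# Volcano's implementation -- just the minAvailable contract: either
# every pod in the gang can be placed at once, or none are.

def gang_admit(free_gpus_per_node, gpus_per_pod, min_available):
    """Return a node index per pod, or None if the whole gang cannot fit."""
    free = list(free_gpus_per_node)   # work on a copy; commit only on success
    placement = []
    for _ in range(min_available):
        # First-fit: any node with enough free GPUs for one more pod.
        node = next((i for i, g in enumerate(free) if g >= gpus_per_pod), None)
        if node is None:
            return None               # partial fit => place nothing
        free[node] -= gpus_per_pod
        placement.append(node)
    return placement

print(gang_admit([2, 2], 1, 4))  # gang of 4 fits: [0, 0, 1, 1]
print(gang_admit([2, 1], 1, 4))  # gang of 4 cannot fit: None
```

Without this check, a default-scheduled 4-worker job could bind 3 pods and deadlock waiting for the 4th while holding GPUs.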
## Queue Structure

| Queue | Weight | Use Case | Guarantee | Preemptable |
|-------|--------|----------|-----------|-------------|
| `ml-training` | 40 | Kubeflow jobs, RayJobs | 4 CPU, 16Gi | No |
| `analytics` | 30 | Spark/Flink batch jobs | 2 CPU, 8Gi | Yes |
| `inference` | 20 | Batch inference jobs | 2 CPU, 8Gi | No |
| `default` | 10 | Miscellaneous batch | None | Yes |
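Under full contention, the weights above split capacity roughly proportionally; a minimal sketch of that arithmetic (illustrative only, not Volcano's proportion plugin, which also honors guarantees and capabilities):

```python
# Illustrative proportional split of contended capacity by queue weight.
# Volcano's proportion plugin also enforces guarantee/capability bounds;
# under full contention the weights behave roughly like this.

def fair_share(weights, capacity):
    """Split `capacity` across queues in proportion to their weights."""
    total = sum(weights.values())
    return {queue: capacity * w / total for queue, w in weights.items()}

weights = {"ml-training": 40, "analytics": 30, "inference": 20, "default": 10}
print(fair_share(weights, 32))  # e.g. 32 CPUs under full contention
# -> {'ml-training': 12.8, 'analytics': 9.6, 'inference': 6.4, 'default': 3.2}
```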
## Scheduler Selection

Workloads opt in to Volcano by setting:

```yaml
spec:
  schedulerName: volcano
```

Long-running services (inference endpoints, databases) continue to use the default scheduler for stability.
## Preemption Policy

Volcano consumes standard Kubernetes PriorityClasses (`value` and `preemptionPolicy` are top-level fields, not under `spec`):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "High priority ML training jobs"
```
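A batch job opts into this class via `priorityClassName`; a sketch (job name and image are illustrative):

```yaml
# Sketch: a Volcano Job running at high priority (names are illustrative)
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: urgent-training
spec:
  schedulerName: volcano
  queue: ml-training
  priorityClassName: high-priority   # may preempt lower-priority jobs
  minAvailable: 1
  tasks:
    - name: worker
      replicas: 1
      template:
        spec:
          containers:
            - name: trainer
              image: my-trainer:latest
```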
## Monitoring

| Metric | Description |
|--------|-------------|
| `volcano_queue_allocated_*` | Resources currently allocated per queue |
| `volcano_queue_pending_*` | Pending resource requests per queue |
| `volcano_job_status` | Job lifecycle states |
| `volcano_scheduler_throughput` | Scheduling decisions per second |
## Links

* [Volcano Documentation](https://volcano.sh/docs/)
* [Gang Scheduling](https://volcano.sh/docs/gang_scheduling/)
* [Spark on Volcano](https://volcano.sh/docs/spark/)
* Related: [ADR-0009](0009-dual-workflow-engines.md) - Dual Workflow Engines
* Related: [ADR-0033](0033-data-analytics-platform.md) - Data Analytics Platform