# Volcano Batch Scheduling Strategy
* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Optimize scheduling for batch ML and analytics workloads
## Context and Problem Statement
The homelab runs diverse workloads including:
- AI/ML training jobs (batch, GPU-intensive)
- Spark/Flink analytics jobs (batch, CPU/memory-intensive)
- KubeRay cluster with multiple GPU workers
- Long-running inference services
The default Kubernetes scheduler (kube-scheduler) is optimized for microservices, not batch workloads. It lacks:
- Gang scheduling (all-or-nothing pod placement)
- Fair-share queuing across teams/projects
- Preemption policies for priority workloads
- Resource reservation for batch jobs
How do we optimize scheduling for batch and ML workloads?
## Decision Drivers
* Gang scheduling for distributed ML training
* Fair-share resource allocation
* Priority-based preemption
* Integration with Kubeflow and Spark
* GPU-aware scheduling
* Queue management for multi-tenant scenarios
## Considered Options
1. **Volcano Scheduler**
2. **Apache YuniKorn**
3. **Kubernetes default scheduler with Priority Classes**
4. **Kueue (Kubernetes Batch Workload Queueing)**
## Decision Outcome
Chosen option: **Option 1 - Volcano Scheduler**
Volcano is a CNCF project designed for batch, HPC, and ML workloads. It provides gang scheduling, queue management, and integrates natively with Spark, Flink, and ML frameworks.
### Positive Consequences
* Gang scheduling prevents partial deployments
* Queue-based fair-share resource management
* Native Spark and Flink integration
* Preemption for high-priority jobs
* CNCF project with active community
* Coexists with default scheduler
### Negative Consequences
* Additional scheduler components (admission, controller, scheduler)
* Learning curve for queue configuration
* Workloads must opt in explicitly via `schedulerName`
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ Volcano System │
│ (volcano-system namespace) │
│ │
│ ┌─────────────────┐ ┌───────────────────┐ ┌───────────────┐ │
│ │ Admission │ │ Controllers │ │ Scheduler │ │
│ │ Webhook │ │ (Job lifecycle) │ │ (Placement) │ │
│ └─────────────────┘ └───────────────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Queues │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ ml-training │ analytics │ inference │ default │ │
│ │ weight: 40 │ weight: 30 │ weight: 20 │ weight: 10│ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Workloads │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Spark Jobs │ │ Flink Jobs │ │ ML Training (KFP) │ │
│ │ (analytics) │ │ (analytics) │ │ (ml-training) │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
## Configuration
### Queue Definition
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  weight: 40
  reclaimable: true
  guarantee:
    resource:
      cpu: "4"
      memory: "16Gi"
  capability:          # capability is a flat ResourceList (no nested resource: key)
    cpu: "32"
    memory: "128Gi"
    nvidia.com/gpu: "2"
```
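Queue weights only take effect when the scheduler runs the `proportion` plugin, and gang semantics require the `gang` plugin. A minimal sketch of the scheduler configuration that enables both (the ConfigMap name and plugin set mirror Volcano's Helm-chart defaults; adjust to the installed release):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap   # default name in the Volcano Helm chart
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority      # honor PriorityClass ordering
      - name: gang          # all-or-nothing placement via minAvailable
      - name: conformance
    - plugins:
      - name: drf           # dominant resource fairness
      - name: predicates
      - name: proportion    # weight-based fair share across queues
      - name: nodeorder
      - name: binpack
```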
### Spark Integration
```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: analytics-job
spec:
  batchScheduler: volcano
  batchSchedulerOptions:
    queue: analytics
    priorityClassName: normal
  driver:
    schedulerName: volcano
  executor:
    schedulerName: volcano
    instances: 4
```
### Gang Scheduling for ML Training
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  schedulerName: volcano
  minAvailable: 4        # Gang: all 4 pods or none
  queue: ml-training
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              image: training-image:latest   # placeholder; the original manifest omitted the image
              resources:
                limits:
                  nvidia.com/gpu: 1
```
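Workloads that are not Volcano `Job`s (for example KubeRay worker pods) can still get gang semantics through a `PodGroup`. A sketch, with a hypothetical group name `ray-workers`:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: ray-workers        # hypothetical name for illustration
spec:
  minMember: 3             # scheduler holds pods until 3 can start together
  queue: ml-training
```

Member pods then join the group via the `scheduling.k8s.io/group-name: ray-workers` annotation and set `schedulerName: volcano` in their pod spec.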
## Queue Structure
| Queue | Weight | Use Case | Guarantee | Preemptable |
|-------|--------|----------|-----------|-------------|
| `ml-training` | 40 | Kubeflow jobs, RayJobs | 4 CPU, 16Gi | No |
| `analytics` | 30 | Spark/Flink batch jobs | 2 CPU, 8Gi | Yes |
| `inference` | 20 | Batch inference jobs | 2 CPU, 8Gi | No |
| `default` | 10 | Miscellaneous batch | None | Yes |
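As an example of how a table row maps to a manifest, the `analytics` queue (weight 30, preemptable, 2 CPU / 8Gi guarantee) would look roughly like:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: analytics
spec:
  weight: 30
  reclaimable: true      # "Preemptable: Yes" - resources can be reclaimed by other queues
  guarantee:
    resource:
      cpu: "2"
      memory: "8Gi"
```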
## Scheduler Selection
Workloads use Volcano by setting:
```yaml
spec:
  schedulerName: volcano
```
Long-running services (inference endpoints, databases) continue using the default scheduler for stability.
## Preemption Policy
```yaml
# PriorityClass is a core Kubernetes resource (scheduling.k8s.io),
# with value/preemptionPolicy/description at the top level, not under spec.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "High priority ML training jobs"
```
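A Volcano `Job` opts into this class via `priorityClassName`; when the queue is full, the scheduler can then evict lower-priority preemptable workloads to make room. A sketch (the job name is hypothetical; tasks omitted for brevity):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: urgent-training              # hypothetical job name
spec:
  schedulerName: volcano
  queue: ml-training
  priorityClassName: high-priority   # references the PriorityClass defined above
  minAvailable: 2
  # tasks: ...
```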
## Monitoring
| Metric | Description |
|--------|-------------|
| `volcano_queue_allocated_*` | Resources currently allocated per queue |
| `volcano_queue_pending_*` | Pending resource requests per queue |
| `volcano_job_status` | Job lifecycle states |
| `volcano_scheduler_throughput` | Scheduling decisions per second |
## Links
* [Volcano Documentation](https://volcano.sh/docs/)
* [Gang Scheduling](https://volcano.sh/docs/gang_scheduling/)
* [Spark on Volcano](https://volcano.sh/docs/spark/)
* Related: [ADR-0009](0009-dual-workflow-engines.md) - Dual Workflow Engines
* Related: [ADR-0033](0033-data-analytics-platform.md) - Data Analytics Platform