Volcano Batch Scheduling Strategy
- Status: accepted
- Date: 2026-02-05
- Deciders: Billy
- Technical Story: Optimize scheduling for batch ML and analytics workloads
Context and Problem Statement
The homelab runs diverse workloads including:
- AI/ML training jobs (batch, GPU-intensive)
- Spark/Flink analytics jobs (batch, CPU/memory-intensive)
- KubeRay cluster with multiple GPU workers
- Long-running inference services
The default Kubernetes scheduler (kube-scheduler) is optimized for microservices, not batch workloads. It lacks:
- Gang scheduling (all-or-nothing pod placement)
- Fair-share queuing across teams/projects
- Preemption policies for priority workloads
- Resource reservation for batch jobs
How do we optimize scheduling for batch and ML workloads?
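The gang-scheduling gap is easy to reproduce on paper: if two 4-pod training jobs race for 4 GPUs and pods are placed one at a time, each job can end up holding 2 GPUs while neither can start training. A toy model of that deadlock versus all-or-nothing placement (illustrative only, not Volcano or kube-scheduler internals):

```python
# Illustrative model: two 4-pod jobs compete for 4 GPUs.
# A per-pod scheduler (like kube-scheduler) can interleave placements,
# leaving both jobs partially scheduled and permanently stuck.

def schedule_per_pod(jobs, gpus):
    """Place pods one at a time, round-robin across jobs."""
    placed = {j: 0 for j in jobs}
    free = gpus
    progress = True
    while free > 0 and progress:
        progress = False
        for j, need in jobs.items():
            if placed[j] < need and free > 0:
                placed[j] += 1
                free -= 1
                progress = True
    return placed

def schedule_gang(jobs, gpus):
    """All-or-nothing: a job is placed only if all of its pods fit."""
    placed = {}
    free = gpus
    for j, need in jobs.items():
        if need <= free:
            placed[j] = need
            free -= need
        else:
            placed[j] = 0  # stays queued, holds nothing
    return placed

jobs = {"job-a": 4, "job-b": 4}
print(schedule_per_pod(jobs, gpus=4))  # {'job-a': 2, 'job-b': 2} -> deadlock
print(schedule_gang(jobs, gpus=4))     # {'job-a': 4, 'job-b': 0} -> job-a runs
```

With per-pod placement both jobs hold half their GPUs forever; with gang semantics one job runs to completion and the other waits with zero resources held.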
Decision Drivers
- Gang scheduling for distributed ML training
- Fair-share resource allocation
- Priority-based preemption
- Integration with Kubeflow and Spark
- GPU-aware scheduling
- Queue management for multi-tenant scenarios
Considered Options
- Volcano Scheduler
- Apache YuniKorn
- Kubernetes default scheduler with Priority Classes
- Kueue (Kubernetes Batch Workload Queueing)
Decision Outcome
Chosen option: Option 1 - Volcano Scheduler
Volcano is a CNCF project designed for batch, HPC, and ML workloads. It provides gang scheduling, queue management, and integrates natively with Spark, Flink, and ML frameworks.
Positive Consequences
- Gang scheduling prevents partial deployments
- Queue-based fair-share resource management
- Native Spark and Flink integration
- Preemption for high-priority jobs
- CNCF project with active community
- Coexists with default scheduler
Negative Consequences
- Additional scheduler components (admission, controller, scheduler)
- Learning curve for queue configuration
- Workloads must opt-in via scheduler name
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Volcano System │
│ (volcano-system namespace) │
│ │
│ ┌─────────────────┐ ┌───────────────────┐ ┌───────────────┐ │
│ │ Admission │ │ Controllers │ │ Scheduler │ │
│ │ Webhook │ │ (Job lifecycle) │ │ (Placement) │ │
│ └─────────────────┘ └───────────────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Queues │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ ml-training │ analytics │ inference │ default │ │
│ │ weight: 40 │ weight: 30 │ weight: 20 │ weight: 10│ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Workloads │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Spark Jobs │ │ Flink Jobs │ │ ML Training (KFP) │ │
│ │ (analytics) │ │ (analytics) │ │ (ml-training) │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Configuration
Queue Definition
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  weight: 40
  reclaimable: true
  guarantee:
    resource:
      cpu: "4"
      memory: "16Gi"
  capability:
    cpu: "32"
    memory: "128Gi"
    nvidia.com/gpu: "2"
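The other queues follow the same shape. As a sketch, an analytics queue matching the weight and guarantee values in the queue table below; the capability ceiling here is an assumed placeholder, not a value from this ADR:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: analytics
spec:
  weight: 30
  reclaimable: true        # analytics jobs are preemptable per the queue table
  guarantee:
    resource:
      cpu: "2"
      memory: "8Gi"
  capability:              # assumed ceiling; size to the cluster
    cpu: "16"
    memory: "64Gi"
```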
Spark Integration
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: analytics-job
spec:
  batchScheduler: volcano
  batchSchedulerOptions:
    queue: analytics
    priorityClassName: normal
  driver:
    schedulerName: volcano
  executor:
    schedulerName: volcano
    instances: 4
Gang Scheduling for ML Training
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  schedulerName: volcano
  minAvailable: 4  # Gang: all 4 pods or none
  queue: ml-training
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              resources:
                limits:
                  nvidia.com/gpu: 1
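Conceptually, `minAvailable` turns admission into a single all-or-nothing check: the scheduler binds the job's pods only when at least `minAvailable` of them can be placed simultaneously. A toy first-fit model of that check (illustrative, not Volcano internals):

```python
def gang_admissible(free_gpus_per_node, pods_needed, gpus_per_pod=1):
    """Return True only if pods_needed pods, each wanting gpus_per_pod,
    can all be placed at once on the given nodes (first-fit)."""
    free = list(free_gpus_per_node)  # copy; don't mutate caller state
    placed = 0
    for _ in range(pods_needed):
        for i, gpus in enumerate(free):
            if gpus >= gpus_per_pod:
                free[i] -= gpus_per_pod
                placed += 1
                break
    return placed >= pods_needed

# 4 workers x 1 GPU against nodes with [2, 1, 1] free GPUs -> all fit
print(gang_admissible([2, 1, 1], pods_needed=4))  # True
# Same job against [2, 1] free -> only 3 pods fit, the whole gang waits
print(gang_admissible([2, 1], pods_needed=4))     # False
```

In the rejected case no pod is bound at all, so the job's GPUs stay free for other queues instead of being held by a partial deployment.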
Queue Structure
| Queue | Weight | Use Case | Guarantee | Preemptable |
|---|---|---|---|---|
| ml-training | 40 | Kubeflow jobs, RayJobs | 4 CPU, 16Gi | No |
| analytics | 30 | Spark/Flink batch jobs | 2 CPU, 8Gi | Yes |
| inference | 20 | Batch inference jobs | 2 CPU, 8Gi | No |
| default | 10 | Miscellaneous batch | None | Yes |
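Under contention, weight-based fair share divides cluster capacity in proportion to queue weight, so with weights 40/30/20/10 the ml-training queue is entitled to 40% of contested resources. A quick sketch of the proportional split (simplified; Volcano's real reclaim logic also honors guarantees and capabilities):

```python
def fair_share(weights, capacity):
    """Split capacity proportionally to queue weights."""
    total = sum(weights.values())
    return {q: capacity * w / total for q, w in weights.items()}

weights = {"ml-training": 40, "analytics": 30, "inference": 20, "default": 10}
print(fair_share(weights, capacity=100))
# ml-training is entitled to 40 of 100 contested CPUs, analytics to 30, etc.
```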
Scheduler Selection
Workloads use Volcano by setting:
spec:
  schedulerName: volcano
Long-running services (inference endpoints, databases) continue using the default scheduler for stability.
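As a concrete contrast, a batch pod opts in while a long-running service simply omits the field (sketch; the names and images are placeholders, not workloads from this cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  schedulerName: volcano           # opted in to Volcano
  containers:
    - name: worker
      image: batch-worker:latest   # placeholder image
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-svc
spec:
  # no schedulerName: handled by the default kube-scheduler
  containers:
    - name: server
      image: inference:latest      # placeholder image
```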
Preemption Policy
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "High priority ML training jobs"
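A Volcano Job then references the class by name; when the queue is full, it may preempt lower-priority jobs to satisfy its gang. A sketch (the job name and image are placeholders):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: urgent-training
spec:
  schedulerName: volcano
  queue: ml-training
  priorityClassName: high-priority  # may preempt lower-priority jobs
  minAvailable: 2
  tasks:
    - name: worker
      replicas: 2
      template:
        spec:
          containers:
            - name: trainer
              image: trainer:latest  # placeholder image
```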
Monitoring
| Metric | Description |
|---|---|
| volcano_queue_allocated_* | Resources currently allocated per queue |
| volcano_queue_pending_* | Pending resource requests per queue |
| volcano_job_status | Job lifecycle states |
| volcano_scheduler_throughput | Scheduling decisions per second |
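These metrics can back a simple backlog alert. A sketch assuming the prometheus-operator `PrometheusRule` CRD is installed; the exact metric suffix is an assumed instance of the `volcano_queue_pending_*` family above and should be verified against your Volcano version's `/metrics` endpoint:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: volcano-queue-backlog
spec:
  groups:
    - name: volcano
      rules:
        - alert: VolcanoQueueBacklog
          # assumed metric name; check /metrics for the real suffix
          expr: volcano_queue_pending_cpu > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "A Volcano queue has had pending CPU requests for 15m"
```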
Links
- Volcano Documentation
- Gang Scheduling
- Spark on Volcano
- Related: ADR-0009 - Dual Workflow Engines
- Related: ADR-0033 - Data Analytics Platform