# Volcano Batch Scheduling Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Optimize scheduling for batch ML and analytics workloads

## Context and Problem Statement

The homelab runs diverse workloads, including:

- AI/ML training jobs (batch, GPU-intensive)
- Spark/Flink analytics jobs (batch, CPU/memory-intensive)
- KubeRay cluster with multiple GPU workers
- Long-running inference services

The default Kubernetes scheduler (kube-scheduler) is optimized for long-running microservices, not batch workloads. It lacks:

- Gang scheduling (all-or-nothing pod placement)
- Fair-share queuing across teams/projects
- Preemption policies for priority workloads
- Resource reservation for batch jobs

How do we optimize scheduling for batch and ML workloads?

## Decision Drivers

* Gang scheduling for distributed ML training
* Fair-share resource allocation
* Priority-based preemption
* Integration with Kubeflow and Spark
* GPU-aware scheduling
* Queue management for multi-tenant scenarios

## Considered Options

1. **Volcano Scheduler**
2. **Apache YuniKorn**
3. **Kubernetes default scheduler with Priority Classes**
4. **Kueue (Kubernetes Batch Workload Queueing)**

## Decision Outcome

Chosen option: **Option 1 - Volcano Scheduler**

Volcano is a CNCF project designed for batch, HPC, and ML workloads. It provides gang scheduling, queue management, and integrates natively with Spark, Flink, and ML frameworks.
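Under the hood, Volcano expresses gang scheduling through `PodGroup` objects: member pods are bound only once `minMember` of them can be placed together. A minimal sketch of the primitive (the name `demo-training`, namespace, and resource figures are illustrative, not part of this deployment):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: demo-training      # hypothetical example name
  namespace: ml
spec:
  minMember: 4             # all four pods must be schedulable before any start
  queue: ml-training       # queue this group draws resources from
  minResources:            # aggregate resources required before admission
    cpu: "8"
    memory: "32Gi"
```

Volcano's Job controller creates PodGroups automatically for `batch.volcano.sh` Jobs, so explicit PodGroups are mainly needed when gang-scheduling plain pods or third-party workloads.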
### Positive Consequences

* Gang scheduling prevents partial deployments
* Queue-based fair-share resource management
* Native Spark and Flink integration
* Preemption for high-priority jobs
* CNCF project with active community
* Coexists with default scheduler

### Negative Consequences

* Additional scheduler components (admission webhook, controllers, scheduler)
* Learning curve for queue configuration
* Workloads must opt in via scheduler name

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         Volcano System                          │
│                    (volcano-system namespace)                   │
│                                                                 │
│  ┌─────────────────┐  ┌───────────────────┐  ┌───────────────┐  │
│  │    Admission    │  │    Controllers    │  │   Scheduler   │  │
│  │     Webhook     │  │  (Job lifecycle)  │  │  (Placement)  │  │
│  └─────────────────┘  └───────────────────┘  └───────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                             Queues                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ ml-training  │  analytics  │  inference   │  default      │  │
│  │ weight: 40   │  weight: 30 │  weight: 20  │  weight: 10   │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                            Workloads                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │  Spark Jobs  │  │  Flink Jobs  │  │  ML Training (KFP)   │   │
│  │ (analytics)  │  │ (analytics)  │  │  (ml-training)       │   │
│  └──────────────┘  └──────────────┘  └──────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
```

## Configuration

### Queue Definition

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  weight: 40
  reclaimable: true
  guarantee:
    resource:
      cpu: "4"
      memory: "16Gi"
  capability:            # capability is a ResourceList directly, not nested under "resource"
    cpu: "32"
    memory: "128Gi"
    nvidia.com/gpu: "2"
```

### Spark Integration

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: analytics-job
spec:
  batchScheduler: volcano
  batchSchedulerOptions:
    queue: analytics
    priorityClassName: normal
  driver:
    schedulerName: volcano
  executor:
    schedulerName: volcano
    instances: 4
```

### Gang Scheduling for ML Training

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  schedulerName: volcano
  minAvailable: 4  # Gang: all 4 pods or none
  queue: ml-training
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              resources:
                limits:
                  nvidia.com/gpu: 1
```

## Queue Structure

| Queue | Weight | Use Case | Guarantee | Preemptible |
|-------|--------|----------|-----------|-------------|
| `ml-training` | 40 | Kubeflow jobs, RayJobs | 4 CPU, 16Gi | No |
| `analytics` | 30 | Spark/Flink batch jobs | 2 CPU, 8Gi | Yes |
| `inference` | 20 | Batch inference jobs | 2 CPU, 8Gi | No |
| `default` | 10 | Miscellaneous batch | None | Yes |

## Scheduler Selection

Workloads opt in to Volcano by setting:

```yaml
spec:
  schedulerName: volcano
```

Long-running services (inference endpoints, databases) continue using the default scheduler for stability.
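For workloads that are not Volcano Jobs or SparkApplications, opting in is just a pod-template change. A sketch with a hypothetical batch-inference Deployment (the name, image, and resource requests are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-inference    # hypothetical example name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: batch-inference
  template:
    metadata:
      labels:
        app: batch-inference
    spec:
      schedulerName: volcano   # opt in to Volcano placement
      containers:
        - name: worker
          image: example/inference:latest  # placeholder image
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
```

Because the opt-in is per pod spec, a misconfigured queue or scheduler outage affects only batch workloads; everything else keeps flowing through kube-scheduler.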
## Preemption Policy

Priority is expressed with a standard Kubernetes `PriorityClass` (note that `value`, `preemptionPolicy`, and `description` are top-level fields, not nested under `spec`):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "High priority ML training jobs"
```

## Monitoring

| Metric | Description |
|--------|-------------|
| `volcano_queue_allocated_*` | Resources currently allocated per queue |
| `volcano_queue_pending_*` | Pending resource requests per queue |
| `volcano_job_status` | Job lifecycle states |
| `volcano_scheduler_throughput` | Scheduling decisions per second |

## Links

* [Volcano Documentation](https://volcano.sh/docs/)
* [Gang Scheduling](https://volcano.sh/docs/gang_scheduling/)
* [Spark on Volcano](https://volcano.sh/docs/spark/)
* Related: [ADR-0009](0009-dual-workflow-engines.md) - Dual Workflow Engines
* Related: [ADR-0033](0033-data-analytics-platform.md) - Data Analytics Platform