homelab-design/decisions/0034-volcano-batch-scheduling.md
Volcano Batch Scheduling Strategy

  • Status: accepted
  • Date: 2026-02-05
  • Deciders: Billy
  • Technical Story: Optimize scheduling for batch ML and analytics workloads

Context and Problem Statement

The homelab runs diverse workloads including:

  • AI/ML training jobs (batch, GPU-intensive)
  • Spark/Flink analytics jobs (batch, CPU/memory-intensive)
  • KubeRay cluster with multiple GPU workers
  • Long-running inference services

The default Kubernetes scheduler (kube-scheduler) is optimized for microservices, not batch workloads. It lacks:

  • Gang scheduling (all-or-nothing pod placement)
  • Fair-share queuing across teams/projects
  • Preemption policies for priority workloads
  • Resource reservation for batch jobs

How do we optimize scheduling for batch and ML workloads?

Decision Drivers

  • Gang scheduling for distributed ML training
  • Fair-share resource allocation
  • Priority-based preemption
  • Integration with Kubeflow and Spark
  • GPU-aware scheduling
  • Queue management for multi-tenant scenarios

Considered Options

  1. Volcano Scheduler
  2. Apache YuniKorn
  3. Kubernetes default scheduler with Priority Classes
  4. Kueue (Kubernetes Batch Workload Queueing)

Decision Outcome

Chosen option: Option 1 - Volcano Scheduler

Volcano is a CNCF project designed for batch, HPC, and ML workloads. It provides gang scheduling, queue management, and integrates natively with Spark, Flink, and ML frameworks.

Positive Consequences

  • Gang scheduling prevents partial deployments
  • Queue-based fair-share resource management
  • Native Spark and Flink integration
  • Preemption for high-priority jobs
  • CNCF project with active community
  • Coexists with default scheduler

Negative Consequences

  • Additional scheduler components (admission, controller, scheduler)
  • Learning curve for queue configuration
  • Workloads must opt-in via scheduler name

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Volcano System                            │
│                     (volcano-system namespace)                   │
│                                                                  │
│  ┌─────────────────┐  ┌───────────────────┐  ┌───────────────┐  │
│  │   Admission     │  │   Controllers     │  │   Scheduler   │  │
│  │   Webhook       │  │   (Job lifecycle) │  │   (Placement) │  │
│  └─────────────────┘  └───────────────────┘  └───────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                         Queues                                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  ml-training    │  analytics    │  inference   │  default  │  │
│  │  weight: 40     │  weight: 30   │  weight: 20  │  weight: 10│ │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Workloads                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │ Spark Jobs   │  │ Flink Jobs   │  │ ML Training (KFP)    │   │
│  │ (analytics)  │  │ (analytics)  │  │ (ml-training)        │   │
│  └──────────────┘  └──────────────┘  └──────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Configuration

Queue Definition

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  weight: 40
  reclaimable: false  # ml-training jobs are not preempted (see queue structure table)
  guarantee:
    resource:
      cpu: "4"
      memory: "16Gi"
  capability:
    resource:
      cpu: "32"
      memory: "128Gi"
      nvidia.com/gpu: "2"
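The remaining queues follow the same shape. For example, the analytics queue carries a weight of 30 with a smaller guarantee and is reclaimable, matching the queue structure table (values here are illustrative; adjust to the cluster's capacity):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: analytics
spec:
  weight: 30
  reclaimable: true  # analytics jobs may be preempted by higher-priority work
  guarantee:
    resource:
      cpu: "2"
      memory: "8Gi"
```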

Spark Integration

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: analytics-job
spec:
  batchScheduler: volcano
  batchSchedulerOptions:
    queue: analytics
    priorityClassName: normal
  driver:
    schedulerName: volcano
  executor:
    schedulerName: volcano
    instances: 4

Gang Scheduling for ML Training

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  schedulerName: volcano
  minAvailable: 4  # Gang: all 4 pods or none
  queue: ml-training
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              image: pytorch/pytorch:latest  # placeholder training image
              resources:
                limits:
                  nvidia.com/gpu: 1

Queue Structure

| Queue       | Weight | Use Case               | Guarantee   | Preemptible |
|-------------|--------|------------------------|-------------|-------------|
| ml-training | 40     | Kubeflow jobs, RayJobs | 4 CPU, 16Gi | No          |
| analytics   | 30     | Spark/Flink batch jobs | 2 CPU, 8Gi  | Yes         |
| inference   | 20     | Batch inference jobs   | 2 CPU, 8Gi  | No          |
| default     | 10     | Miscellaneous batch    | None        | Yes         |
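KubeRay can also hand scheduling to Volcano, which is how the GPU worker cluster would land in the ml-training queue. A sketch, assuming the KubeRay operator is installed with its batch-scheduler integration enabled (the label keys follow KubeRay's Volcano integration; verify against the installed operator version):

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: gpu-workers
  labels:
    ray.io/scheduler-name: volcano      # delegate placement to Volcano
    volcano.sh/queue-name: ml-training  # target queue
spec:
  # head and worker group specs elided for brevity
  headGroupSpec: {}
```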

Scheduler Selection

Workloads use Volcano by setting:

spec:
  schedulerName: volcano

Long-running services (inference endpoints, databases) continue using the default scheduler for stability.
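For workloads that are plain Pods rather than Volcano Jobs, opting in takes a PodGroup plus an annotation linking the pod to it. A minimal sketch (the annotation key follows Volcano's pod-group convention; names here are illustrative):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: batch-group
spec:
  minMember: 1      # gang size of one for a standalone pod
  queue: default
---
apiVersion: v1
kind: Pod
metadata:
  name: batch-pod
  annotations:
    scheduling.k8s.io/group-name: batch-group  # join the PodGroup above
spec:
  schedulerName: volcano
  containers:
    - name: task
      image: busybox
      command: ["sh", "-c", "echo done"]
```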

Preemption Policy

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "High priority ML training jobs"
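Jobs reference the class through the standard `priorityClassName` field. A high-priority training job that may preempt lower-priority work could look like this (image and names are placeholders):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: urgent-training
spec:
  schedulerName: volcano
  queue: ml-training
  priorityClassName: high-priority  # allows preempting lower-priority jobs
  minAvailable: 2
  tasks:
    - name: worker
      replicas: 2
      template:
        spec:
          containers:
            - name: trainer
              image: pytorch/pytorch:latest  # placeholder training image
              resources:
                limits:
                  nvidia.com/gpu: 1
```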

Monitoring

| Metric                       | Description                              |
|------------------------------|------------------------------------------|
| volcano_queue_allocated_*    | Resources currently allocated per queue  |
| volcano_queue_pending_*      | Pending resource requests per queue      |
| volcano_job_status           | Job lifecycle states                     |
| volcano_scheduler_throughput | Scheduling decisions per second          |
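If these metrics are scraped by Prometheus, a queue-saturation alert can be sketched as a PrometheusRule. The concrete series name `volcano_queue_allocated_milli_cpu` and the `queue_name` label are assumptions; check the scraped series for the installed Volcano version:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: volcano-queues
spec:
  groups:
    - name: volcano
      rules:
        - alert: MLTrainingQueueSaturated
          # assumed metric/label names; verify against your Volcano version
          expr: volcano_queue_allocated_milli_cpu{queue_name="ml-training"} > 28000
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "ml-training queue near its 32-CPU capability"
```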