Volcano Batch Scheduling Strategy
- Status: accepted
- Date: 2026-02-05
- Deciders: Billy
- Technical Story: Optimize scheduling for batch ML and analytics workloads
Context and Problem Statement
The homelab runs diverse workloads including:
- AI/ML training jobs (batch, GPU-intensive)
- Spark/Flink analytics jobs (batch, CPU/memory-intensive)
- KubeRay cluster with multiple GPU workers
- Long-running inference services
The default Kubernetes scheduler (kube-scheduler) is optimized for microservices, not batch workloads. It lacks:
- Gang scheduling (all-or-nothing pod placement)
- Fair-share queuing across teams/projects
- Preemption policies for priority workloads
- Resource reservation for batch jobs
How do we optimize scheduling for batch and ML workloads?
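The gang-scheduling gap is easy to reproduce on paper: if two 4-pod training jobs race for 4 GPUs and pods are placed one at a time, each job can end up holding 2 GPUs while neither can start training. A toy model of that deadlock versus all-or-nothing placement (illustrative only, not Volcano or kube-scheduler internals):

```python
# Illustrative model: two 4-pod jobs compete for 4 GPUs.
# A per-pod scheduler (like kube-scheduler) can interleave placements,
# leaving both jobs partially scheduled and permanently stuck.

def schedule_per_pod(jobs, gpus):
    """Place pods one at a time, round-robin across jobs."""
    placed = {j: 0 for j in jobs}
    free = gpus
    progress = True
    while free > 0 and progress:
        progress = False
        for j, need in jobs.items():
            if placed[j] < need and free > 0:
                placed[j] += 1
                free -= 1
                progress = True
    return placed

def schedule_gang(jobs, gpus):
    """All-or-nothing: a job is placed only if all of its pods fit."""
    placed = {}
    free = gpus
    for j, need in jobs.items():
        if need <= free:
            placed[j] = need
            free -= need
        else:
            placed[j] = 0  # stays queued, holds nothing
    return placed

jobs = {"job-a": 4, "job-b": 4}
print(schedule_per_pod(jobs, gpus=4))  # {'job-a': 2, 'job-b': 2} -> deadlock
print(schedule_gang(jobs, gpus=4))     # {'job-a': 4, 'job-b': 0} -> job-a runs
```

With per-pod placement both jobs hold half their GPUs forever; with gang semantics one job runs to completion and the other waits with zero resources held.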
Decision Drivers
- Gang scheduling for distributed ML training
- Fair-share resource allocation
- Priority-based preemption
- Integration with Kubeflow and Spark
- GPU-aware scheduling
- Queue management for multi-tenant scenarios
Considered Options
- Volcano Scheduler
- Apache YuniKorn
- Kubernetes default scheduler with Priority Classes
- Kueue (Kubernetes Batch Workload Queueing)
Decision Outcome
Chosen option: Option 1 - Volcano Scheduler
Volcano is a CNCF project designed for batch, HPC, and ML workloads. It provides gang scheduling, queue management, and integrates natively with Spark, Flink, and ML frameworks.
Positive Consequences
- Gang scheduling prevents partial deployments
- Queue-based fair-share resource management
- Native Spark and Flink integration
- Preemption for high-priority jobs
- CNCF project with active community
- Coexists with default scheduler
Negative Consequences
- Additional scheduler components (admission, controller, scheduler)
- Learning curve for queue configuration
- Workloads must opt-in via scheduler name
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Volcano System │
│ (volcano-system namespace) │
│ │
│ ┌─────────────────┐ ┌───────────────────┐ ┌───────────────┐ │
│ │ Admission │ │ Controllers │ │ Scheduler │ │
│ │ Webhook │ │ (Job lifecycle) │ │ (Placement) │ │
│ └─────────────────┘ └───────────────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Queues │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ ml-training │ analytics │ inference │ default │ │
│ │ weight: 40 │ weight: 30 │ weight: 20 │ weight: 10│ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Workloads │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Spark Jobs │ │ Flink Jobs │ │ ML Training (KFP) │ │
│ │ (analytics) │ │ (analytics) │ │ (ml-training) │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Configuration
Queue Definition
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  weight: 40
  reclaimable: true
  guarantee:
    resource:
      cpu: "4"
      memory: "16Gi"
  capability:
    cpu: "32"
    memory: "128Gi"
    nvidia.com/gpu: "2"
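The other queues follow the same shape. As a sketch, an analytics queue matching the weight and guarantee values in the queue table below; the capability ceiling here is an assumed placeholder, not a value from this ADR:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: analytics
spec:
  weight: 30
  reclaimable: true        # analytics jobs are preemptable per the queue table
  guarantee:
    resource:
      cpu: "2"
      memory: "8Gi"
  capability:              # assumed ceiling; size to the cluster
    cpu: "16"
    memory: "64Gi"
```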
Spark Integration
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: analytics-job
spec:
  batchScheduler: volcano
  batchSchedulerOptions:
    queue: analytics
    priorityClassName: normal
  driver:
    schedulerName: volcano
  executor:
    schedulerName: volcano
    instances: 4
Gang Scheduling for ML Training
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  schedulerName: volcano
  minAvailable: 4  # Gang: all 4 pods or none
  queue: ml-training
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              resources:
                limits:
                  nvidia.com/gpu: 1
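Conceptually, `minAvailable` turns admission into a single all-or-nothing check: the scheduler binds the job's pods only when at least `minAvailable` of them can be placed simultaneously. A toy first-fit model of that check (illustrative, not Volcano internals):

```python
def gang_admissible(free_gpus_per_node, pods_needed, gpus_per_pod=1):
    """Return True only if pods_needed pods, each wanting gpus_per_pod,
    can all be placed at once on the given nodes (first-fit)."""
    free = list(free_gpus_per_node)  # copy; don't mutate caller state
    placed = 0
    for _ in range(pods_needed):
        for i, gpus in enumerate(free):
            if gpus >= gpus_per_pod:
                free[i] -= gpus_per_pod
                placed += 1
                break
    return placed >= pods_needed

# 4 workers x 1 GPU against nodes with [2, 1, 1] free GPUs -> all fit
print(gang_admissible([2, 1, 1], pods_needed=4))  # True
# Same job against [2, 1] free -> only 3 pods fit, the whole gang waits
print(gang_admissible([2, 1], pods_needed=4))     # False
```

In the rejected case no pod is bound at all, so the job's GPUs stay free for other queues instead of being held by a partial deployment.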
Queue Structure
| Queue | Weight | Use Case | Guarantee | Preemptable |
|---|---|---|---|---|
| ml-training | 40 | Kubeflow jobs, RayJobs | 4 CPU, 16Gi | No |
| analytics | 30 | Spark/Flink batch jobs | 2 CPU, 8Gi | Yes |
| inference | 20 | Batch inference jobs | 2 CPU, 8Gi | No |
| default | 10 | Miscellaneous batch | None | Yes |
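Under contention, weight-based fair share divides cluster capacity in proportion to queue weight, so with weights 40/30/20/10 the ml-training queue is entitled to 40% of contested resources. A quick sketch of the proportional split (simplified; Volcano's real reclaim logic also honors guarantees and capabilities):

```python
def fair_share(weights, capacity):
    """Split capacity proportionally to queue weights."""
    total = sum(weights.values())
    return {q: capacity * w / total for q, w in weights.items()}

weights = {"ml-training": 40, "analytics": 30, "inference": 20, "default": 10}
print(fair_share(weights, capacity=100))
# ml-training is entitled to 40 of 100 contested CPUs, analytics to 30, etc.
```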
Scheduler Selection
Workloads use Volcano by setting:
spec:
  schedulerName: volcano
Long-running services (inference endpoints, databases) continue using the default scheduler for stability.
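As a concrete contrast, a batch pod opts in while a long-running service simply omits the field (sketch; the names and images are placeholders, not workloads from this cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  schedulerName: volcano           # opted in to Volcano
  containers:
    - name: worker
      image: batch-worker:latest   # placeholder image
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-svc
spec:
  # no schedulerName: handled by the default kube-scheduler
  containers:
    - name: server
      image: inference:latest      # placeholder image
```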
Preemption Policy
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "High priority ML training jobs"
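A Volcano Job then references the class by name; when the queue is full, it may preempt lower-priority jobs to satisfy its gang. A sketch (the job name and image are placeholders):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: urgent-training
spec:
  schedulerName: volcano
  queue: ml-training
  priorityClassName: high-priority  # may preempt lower-priority jobs
  minAvailable: 2
  tasks:
    - name: worker
      replicas: 2
      template:
        spec:
          containers:
            - name: trainer
              image: trainer:latest  # placeholder image
```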
Monitoring
| Metric | Description |
|---|---|
| volcano_queue_allocated_* | Resources currently allocated per queue |
| volcano_queue_pending_* | Pending resource requests per queue |
| volcano_job_status | Job lifecycle states |
| volcano_scheduler_throughput | Scheduling decisions per second |
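These metrics can back a simple backlog alert. A sketch assuming the prometheus-operator `PrometheusRule` CRD is installed; the exact metric suffix is an assumed instance of the `volcano_queue_pending_*` family above and should be verified against your Volcano version's `/metrics` endpoint:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: volcano-queue-backlog
spec:
  groups:
    - name: volcano
      rules:
        - alert: VolcanoQueueBacklog
          # assumed metric name; check /metrics for the real suffix
          expr: volcano_queue_pending_cpu > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "A Volcano queue has had pending CPU requests for 15m"
```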
Links
- Volcano Documentation
- Gang Scheduling
- Spark on Volcano
- Related: ADR-0009 - Dual Workflow Engines
- Related: ADR-0033 - Data Analytics Platform