# Volcano Batch Scheduling Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Optimize scheduling for batch ML and analytics workloads

## Context and Problem Statement

The homelab runs diverse workloads including:

- AI/ML training jobs (batch, GPU-intensive)
- Spark/Flink analytics jobs (batch, CPU/memory-intensive)
- KubeRay cluster with multiple GPU workers
- Long-running inference services

The default Kubernetes scheduler (kube-scheduler) is optimized for microservices, not batch workloads. It lacks:

- Gang scheduling (all-or-nothing pod placement)
- Fair-share queuing across teams/projects
- Preemption policies for priority workloads
- Resource reservation for batch jobs

How do we optimize scheduling for batch and ML workloads?

## Decision Drivers

* Gang scheduling for distributed ML training
* Fair-share resource allocation
* Priority-based preemption
* Integration with Kubeflow and Spark
* GPU-aware scheduling
* Queue management for multi-tenant scenarios

## Considered Options

1. **Volcano Scheduler**
2. **Apache YuniKorn**
3. **Kubernetes default scheduler with Priority Classes**
4. **Kueue (Kubernetes Batch Workload Queueing)**

## Decision Outcome

Chosen option: **Option 1 - Volcano Scheduler**

Volcano is a CNCF project designed for batch, HPC, and ML workloads. It provides gang scheduling and queue management, and it integrates natively with Spark, Flink, and ML frameworks.

### Positive Consequences

* Gang scheduling prevents partial deployments
* Queue-based fair-share resource management
* Native Spark and Flink integration
* Preemption for high-priority jobs
* CNCF project with an active community
* Coexists with the default scheduler

### Negative Consequences

* Additional scheduler components (admission webhook, controllers, scheduler)
* Learning curve for queue configuration
* Workloads must opt in via the `schedulerName` field

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         Volcano System                          │
│                    (volcano-system namespace)                   │
│                                                                 │
│  ┌─────────────────┐  ┌───────────────────┐  ┌───────────────┐  │
│  │   Admission     │  │   Controllers     │  │   Scheduler   │  │
│  │   Webhook       │  │  (Job lifecycle)  │  │  (Placement)  │  │
│  └─────────────────┘  └───────────────────┘  └───────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                             Queues                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ ml-training  │ analytics   │ inference   │ default        │  │
│  │ weight: 40   │ weight: 30  │ weight: 20  │ weight: 10     │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                            Workloads                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │  Spark Jobs  │  │  Flink Jobs  │  │  ML Training (KFP)   │   │
│  │ (analytics)  │  │ (analytics)  │  │   (ml-training)      │   │
│  └──────────────┘  └──────────────┘  └──────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
```

## Configuration

### Queue Definition

Note that in the Volcano Queue API, `guarantee` nests its resources under `resource`, while `capability` is a resource list directly:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  weight: 40
  reclaimable: true
  guarantee:
    resource:
      cpu: "4"
      memory: "16Gi"
  capability:
    cpu: "32"
    memory: "128Gi"
    nvidia.com/gpu: "2"
```

### Spark Integration

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: analytics-job
spec:
  batchScheduler: volcano
  batchSchedulerOptions:
    queue: analytics
    priorityClassName: normal
  driver:
    schedulerName: volcano
  executor:
    schedulerName: volcano
    instances: 4
```

### Gang Scheduling for ML Training

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  schedulerName: volcano
  minAvailable: 4   # Gang: all 4 pods or none
  queue: ml-training
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              resources:
                limits:
                  nvidia.com/gpu: 1
```

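Gang scheduling is not limited to Volcano `Job` resources. For pods created by other controllers (for example, KubeRay workers), a standalone `PodGroup` can express the same all-or-nothing constraint. A minimal sketch; the name `ray-workers` is illustrative, not from this homelab's manifests:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: ray-workers      # illustrative name
spec:
  minMember: 4           # schedule only when all 4 member pods can be placed
  queue: ml-training
```

Member pods then set `schedulerName: volcano` and join the group via the `scheduling.k8s.io/group-name: ray-workers` annotation.
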
## Queue Structure

| Queue | Weight | Use Case | Guarantee | Preemptible |
|-------|--------|----------|-----------|-------------|
| `ml-training` | 40 | Kubeflow jobs, RayJobs | 4 CPU, 16Gi | No |
| `analytics` | 30 | Spark/Flink batch jobs | 2 CPU, 8Gi | Yes |
| `inference` | 20 | Batch inference jobs | 2 CPU, 8Gi | No |
| `default` | 10 | Miscellaneous batch | None | Yes |

## Scheduler Selection

Workloads use Volcano by setting:

```yaml
spec:
  schedulerName: volcano
```

Long-running services (inference endpoints, databases) continue using the default scheduler for stability.
## Preemption Policy

Volcano honors standard Kubernetes `PriorityClass` objects (`scheduling.k8s.io/v1`); `value`, `preemptionPolicy`, and `description` are top-level fields, not nested under `spec`:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "High priority ML training jobs"
```

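Preemption only takes effect if the `preempt` action is enabled in the Volcano scheduler configuration. A sketch of the scheduler ConfigMap, assuming the default object name and namespace from the Volcano install (not taken from this homelab's manifests):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap   # default name in the Volcano install
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, preempt, backfill"   # preempt added to the defaults
    tiers:
    - plugins:
      - name: priority      # honors PriorityClass values
      - name: gang          # enforces minAvailable
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion    # weight-based fair share across queues
      - name: nodeorder
      - name: binpack
```
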
## Monitoring

| Metric | Description |
|--------|-------------|
| `volcano_queue_allocated_*` | Resources currently allocated per queue |
| `volcano_queue_pending_*` | Pending resource requests per queue |
| `volcano_job_status` | Job lifecycle states |
| `volcano_scheduler_throughput` | Scheduling decisions per second |

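These metrics can drive alerting. A sketch of a Prometheus alert rule; the concrete metric name `volcano_queue_pending_milli_cpu` is an assumed member of the `volcano_queue_pending_*` family, and the threshold and duration are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: volcano-queue-alerts
  namespace: volcano-system
spec:
  groups:
    - name: volcano.queues
      rules:
        - alert: VolcanoQueueBacklog
          # Assumed concrete name from the volcano_queue_pending_* family
          expr: volcano_queue_pending_milli_cpu{queue="ml-training"} > 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "ml-training queue has had pending CPU requests for 30m"
```
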
## Links

* [Volcano Documentation](https://volcano.sh/docs/)
* [Gang Scheduling](https://volcano.sh/docs/gang_scheduling/)
* [Spark on Volcano](https://volcano.sh/docs/spark/)
* Related: [ADR-0009](0009-dual-workflow-engines.md) - Dual Workflow Engines
* Related: [ADR-0033](0033-data-analytics-platform.md) - Data Analytics Platform