# Volcano Batch Scheduling Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Optimize scheduling for batch ML and analytics workloads

## Context and Problem Statement

The homelab runs diverse workloads including:

- AI/ML training jobs (batch, GPU-intensive)
- Spark/Flink analytics jobs (batch, CPU/memory-intensive)
- KubeRay cluster with multiple GPU workers
- Long-running inference services

The default Kubernetes scheduler (kube-scheduler) is optimized for microservices, not batch workloads. It lacks:

- Gang scheduling (all-or-nothing pod placement)
- Fair-share queuing across teams/projects
- Preemption policies for priority workloads
- Resource reservation for batch jobs

How do we optimize scheduling for batch and ML workloads?

## Decision Drivers

* Gang scheduling for distributed ML training
* Fair-share resource allocation
* Priority-based preemption
* Integration with Kubeflow and Spark
* GPU-aware scheduling
* Queue management for multi-tenant scenarios

## Considered Options

1. **Volcano Scheduler**
2. **Apache YuniKorn**
3. **Kubernetes default scheduler with Priority Classes**
4. **Kueue (Kubernetes Batch Workload Queueing)**

## Decision Outcome

Chosen option: **Option 1 - Volcano Scheduler**

Volcano is a CNCF project designed for batch, HPC, and ML workloads. It provides gang scheduling and queue management, and it integrates natively with Spark, Flink, and ML frameworks.

### Positive Consequences

* Gang scheduling prevents partial deployments
* Queue-based fair-share resource management
* Native Spark and Flink integration
* Preemption for high-priority jobs
* CNCF project with an active community
* Coexists with the default scheduler

### Negative Consequences

* Additional scheduler components (admission webhook, controllers, scheduler)
* Learning curve for queue configuration
* Workloads must opt in via the `schedulerName` field

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         Volcano System                          │
│                    (volcano-system namespace)                   │
│                                                                 │
│  ┌─────────────────┐  ┌───────────────────┐  ┌───────────────┐  │
│  │   Admission     │  │   Controllers     │  │   Scheduler   │  │
│  │   Webhook       │  │  (Job lifecycle)  │  │  (Placement)  │  │
│  └─────────────────┘  └───────────────────┘  └───────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                             Queues                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ ml-training  │ analytics   │ inference   │ default        │  │
│  │ weight: 40   │ weight: 30  │ weight: 20  │ weight: 10     │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                            Workloads                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │  Spark Jobs  │  │  Flink Jobs  │  │  ML Training (KFP)   │   │
│  │ (analytics)  │  │ (analytics)  │  │   (ml-training)      │   │
│  └──────────────┘  └──────────────┘  └──────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
```

## Configuration

### Queue Definition

Note that in the Volcano Queue API, `guarantee` nests its resources under `resource`, while `capability` is a resource list directly:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  weight: 40
  reclaimable: true
  guarantee:
    resource:
      cpu: "4"
      memory: "16Gi"
  capability:
    cpu: "32"
    memory: "128Gi"
    nvidia.com/gpu: "2"
```

### Spark Integration

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: analytics-job
spec:
  batchScheduler: volcano
  batchSchedulerOptions:
    queue: analytics
    priorityClassName: normal
  driver:
    schedulerName: volcano
  executor:
    schedulerName: volcano
    instances: 4
```

### Gang Scheduling for ML Training

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  schedulerName: volcano
  minAvailable: 4   # Gang: all 4 pods or none
  queue: ml-training
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              resources:
                limits:
                  nvidia.com/gpu: 1
```

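Gang scheduling is not limited to Volcano `Job` resources. For pods created by other controllers (for example, KubeRay workers), a standalone `PodGroup` can express the same all-or-nothing constraint. A minimal sketch; the name `ray-workers` is illustrative, not from this homelab's manifests:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: ray-workers      # illustrative name
spec:
  minMember: 4           # schedule only when all 4 member pods can be placed
  queue: ml-training
```

Member pods then set `schedulerName: volcano` and join the group via the `scheduling.k8s.io/group-name: ray-workers` annotation.
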
## Queue Structure

| Queue | Weight | Use Case | Guarantee | Preemptible |
|-------|--------|----------|-----------|-------------|
| `ml-training` | 40 | Kubeflow jobs, RayJobs | 4 CPU, 16Gi | No |
| `analytics` | 30 | Spark/Flink batch jobs | 2 CPU, 8Gi | Yes |
| `inference` | 20 | Batch inference jobs | 2 CPU, 8Gi | No |
| `default` | 10 | Miscellaneous batch | None | Yes |

## Scheduler Selection

Workloads use Volcano by setting:

```yaml
spec:
  schedulerName: volcano
```

Long-running services (inference endpoints, databases) continue using the default scheduler for stability.
## Preemption Policy

Volcano honors standard Kubernetes `PriorityClass` objects (`scheduling.k8s.io/v1`); `value`, `preemptionPolicy`, and `description` are top-level fields, not nested under `spec`:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "High priority ML training jobs"
```

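Preemption only takes effect if the `preempt` action is enabled in the Volcano scheduler configuration. A sketch of the scheduler ConfigMap, assuming the default object name and namespace from the Volcano install (not taken from this homelab's manifests):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap   # default name in the Volcano install
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, preempt, backfill"   # preempt added to the defaults
    tiers:
    - plugins:
      - name: priority      # honors PriorityClass values
      - name: gang          # enforces minAvailable
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion    # weight-based fair share across queues
      - name: nodeorder
      - name: binpack
```
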
## Monitoring

| Metric | Description |
|--------|-------------|
| `volcano_queue_allocated_*` | Resources currently allocated per queue |
| `volcano_queue_pending_*` | Pending resource requests per queue |
| `volcano_job_status` | Job lifecycle states |
| `volcano_scheduler_throughput` | Scheduling decisions per second |

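These metrics can drive alerting. A sketch of a Prometheus alert rule; the concrete metric name `volcano_queue_pending_milli_cpu` is an assumed member of the `volcano_queue_pending_*` family, and the threshold and duration are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: volcano-queue-alerts
  namespace: volcano-system
spec:
  groups:
    - name: volcano.queues
      rules:
        - alert: VolcanoQueueBacklog
          # Assumed concrete name from the volcano_queue_pending_* family
          expr: volcano_queue_pending_milli_cpu{queue="ml-training"} > 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "ml-training queue has had pending CPU requests for 30m"
```
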
## Links

* [Volcano Documentation](https://volcano.sh/docs/)
* [Gang Scheduling](https://volcano.sh/docs/gang_scheduling/)
* [Spark on Volcano](https://volcano.sh/docs/spark/)
* Related: [ADR-0009](0009-dual-workflow-engines.md) - Dual Workflow Engines
* Related: [ADR-0033](0033-data-analytics-platform.md) - Data Analytics Platform