# Volcano Batch Scheduling Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Optimize scheduling for batch ML and analytics workloads
## Context and Problem Statement

The homelab runs diverse workloads including:

- AI/ML training jobs (batch, GPU-intensive)
- Spark/Flink analytics jobs (batch, CPU/memory-intensive)
- KubeRay cluster with multiple GPU workers
- Long-running inference services

The default Kubernetes scheduler (kube-scheduler) is optimized for microservices, not batch workloads. It lacks:

- Gang scheduling (all-or-nothing pod placement)
- Fair-share queuing across teams/projects
- Preemption policies for priority workloads
- Resource reservation for batch jobs

How do we optimize scheduling for batch and ML workloads?
## Decision Drivers

* Gang scheduling for distributed ML training
* Fair-share resource allocation
* Priority-based preemption
* Integration with Kubeflow and Spark
* GPU-aware scheduling
* Queue management for multi-tenant scenarios
## Considered Options

1. **Volcano Scheduler**
2. **Apache YuniKorn**
3. **Kubernetes default scheduler with Priority Classes**
4. **Kueue (Kubernetes Batch Workload Queueing)**
## Decision Outcome

Chosen option: **Option 1 - Volcano Scheduler**

Volcano is a CNCF project designed for batch, HPC, and ML workloads. It provides gang scheduling and queue management, and it integrates natively with Spark, Flink, and ML frameworks.
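For reference, Volcano ships as a Helm chart from the upstream project; a minimal install sketch (chart version pinning omitted):

```shell
# Install Volcano (admission webhook, controllers, scheduler) into its
# own namespace from the upstream chart repository.
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano \
  --namespace volcano-system \
  --create-namespace
```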
### Positive Consequences

* Gang scheduling prevents partial deployments
* Queue-based fair-share resource management
* Native Spark and Flink integration
* Preemption for high-priority jobs
* CNCF project with active community
* Coexists with default scheduler
### Negative Consequences

* Additional scheduler components (admission, controller, scheduler)
* Learning curve for queue configuration
* Workloads must opt in via scheduler name
## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          Volcano System                         │
│                    (volcano-system namespace)                   │
│                                                                 │
│  ┌─────────────────┐  ┌───────────────────┐  ┌───────────────┐  │
│  │   Admission     │  │   Controllers     │  │   Scheduler   │  │
│  │   Webhook       │  │   (Job lifecycle) │  │   (Placement) │  │
│  └─────────────────┘  └───────────────────┘  └───────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                             Queues                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ ml-training  │ analytics   │ inference   │ default        │  │
│  │ weight: 40   │ weight: 30  │ weight: 20  │ weight: 10     │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                            Workloads                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │  Spark Jobs  │  │  Flink Jobs  │  │  ML Training (KFP)   │   │
│  │ (analytics)  │  │ (analytics)  │  │  (ml-training)       │   │
│  └──────────────┘  └──────────────┘  └──────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
```
## Configuration

### Queue Definition
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  weight: 40
  reclaimable: true
  guarantee:
    resource:
      cpu: "4"
      memory: "16Gi"
  capability:
    cpu: "32"
    memory: "128Gi"
    nvidia.com/gpu: "2"
```
### Spark Integration

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: analytics-job
spec:
  batchScheduler: volcano
  batchSchedulerOptions:
    queue: analytics
    priorityClassName: normal
  driver:
    schedulerName: volcano
  executor:
    schedulerName: volcano
    instances: 4
```
### Gang Scheduling for ML Training

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  schedulerName: volcano
  minAvailable: 4        # Gang: all 4 pods or none
  queue: ml-training
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              resources:
                limits:
                  nvidia.com/gpu: 1
```
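The `minAvailable` contract can be illustrated with a toy all-or-nothing placement check. This is a sketch only; `gang_admit` is a hypothetical helper, not Volcano's algorithm:

```python
# Toy illustration of gang (all-or-nothing) admission. This is NOT
# Volcano's implementation -- just the minAvailable contract: either
# every pod in the gang can be placed at once, or none are.

def gang_admit(free_gpus_per_node, gpus_per_pod, min_available):
    """Return a node index per pod, or None if the whole gang cannot fit."""
    free = list(free_gpus_per_node)   # work on a copy; commit only on success
    placement = []
    for _ in range(min_available):
        # First-fit: any node with enough free GPUs for one more pod.
        node = next((i for i, g in enumerate(free) if g >= gpus_per_pod), None)
        if node is None:
            return None               # partial fit => place nothing
        free[node] -= gpus_per_pod
        placement.append(node)
    return placement

print(gang_admit([2, 2], 1, 4))  # gang of 4 fits: [0, 0, 1, 1]
print(gang_admit([2, 1], 1, 4))  # gang of 4 cannot fit: None
```

Without this check, a default-scheduled 4-worker job could bind 3 pods and deadlock waiting for the 4th while holding GPUs.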
## Queue Structure

| Queue | Weight | Use Case | Guarantee | Preemptable |
|-------|--------|----------|-----------|-------------|
| `ml-training` | 40 | Kubeflow jobs, RayJobs | 4 CPU, 16Gi | No |
| `analytics` | 30 | Spark/Flink batch jobs | 2 CPU, 8Gi | Yes |
| `inference` | 20 | Batch inference jobs | 2 CPU, 8Gi | No |
| `default` | 10 | Miscellaneous batch | None | Yes |
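Under full contention, the weights above split capacity roughly proportionally; a minimal sketch of that arithmetic (illustrative only, not Volcano's proportion plugin, which also honors guarantees and capabilities):

```python
# Illustrative proportional split of contended capacity by queue weight.
# Volcano's proportion plugin also enforces guarantee/capability bounds;
# under full contention the weights behave roughly like this.

def fair_share(weights, capacity):
    """Split `capacity` across queues in proportion to their weights."""
    total = sum(weights.values())
    return {queue: capacity * w / total for queue, w in weights.items()}

weights = {"ml-training": 40, "analytics": 30, "inference": 20, "default": 10}
print(fair_share(weights, 32))  # e.g. 32 CPUs under full contention
# -> {'ml-training': 12.8, 'analytics': 9.6, 'inference': 6.4, 'default': 3.2}
```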
## Scheduler Selection

Workloads opt in to Volcano by setting:

```yaml
spec:
  schedulerName: volcano
```

Long-running services (inference endpoints, databases) continue to use the default scheduler for stability.
## Preemption Policy

Volcano consumes standard Kubernetes PriorityClasses (`value` and `preemptionPolicy` are top-level fields, not under `spec`):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "High priority ML training jobs"
```
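A batch job opts into this class via `priorityClassName`; a sketch (job name and image are illustrative):

```yaml
# Sketch: a Volcano Job running at high priority (names are illustrative)
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: urgent-training
spec:
  schedulerName: volcano
  queue: ml-training
  priorityClassName: high-priority   # may preempt lower-priority jobs
  minAvailable: 1
  tasks:
    - name: worker
      replicas: 1
      template:
        spec:
          containers:
            - name: trainer
              image: my-trainer:latest
```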
## Monitoring

| Metric | Description |
|--------|-------------|
| `volcano_queue_allocated_*` | Resources currently allocated per queue |
| `volcano_queue_pending_*` | Pending resource requests per queue |
| `volcano_job_status` | Job lifecycle states |
| `volcano_scheduler_throughput` | Scheduling decisions per second |
## Links

* [Volcano Documentation](https://volcano.sh/docs/)
* [Gang Scheduling](https://volcano.sh/docs/gang_scheduling/)
* [Spark on Volcano](https://volcano.sh/docs/spark/)
* Related: [ADR-0009](0009-dual-workflow-engines.md) - Dual Workflow Engines
* Related: [ADR-0033](0033-data-analytics-platform.md) - Data Analytics Platform