# ADR-0023: Valkey for ML Inference Caching
## Status
Accepted
## Context
The AI/ML platform requires caching infrastructure for multiple use cases:
1. **KV-Cache Offloading**: vLLM can offload key-value cache tensors to external storage, reducing GPU memory pressure and enabling longer context windows
2. **Embedding Cache**: Frequently requested embeddings can be cached to avoid redundant GPU computation
3. **Session State**: Conversation history and intermediate results for multi-turn interactions (see the sketch after this list)
4. **Ray Object Store Spillover**: Large Ray objects can spill to external storage when memory is constrained
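For the session-state case, a minimal sketch of what a consumer could look like against a shared Valkey instance. This is illustrative only: the `session:` prefix matches the key convention proposed later in this ADR, and the 30-minute sliding TTL is an assumed default, not existing code.

```python
"""Session-state sketch: conversation history as a Valkey list."""
import json
import redis  # Valkey speaks the Redis protocol, so redis-py works unchanged

r = redis.Redis(host="valkey.ai-ml.svc.cluster.local", port=6379, decode_responses=True)
SESSION_TTL = 30 * 60  # seconds; assumed default, refreshed on every append

def append_turn(session_id: str, role: str, content: str) -> None:
    key = f"session:{session_id}"
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.expire(key, SESSION_TTL)  # sliding expiry: idle sessions age out

def history(session_id: str) -> list:
    return [json.loads(m) for m in r.lrange(f"session:{session_id}", 0, -1)]
```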
Previously, two separate Valkey instances existed:
- `valkey` - General-purpose with 10Gi persistent storage
- `mlcache` - ML-optimized ephemeral cache with 4GB memory limit and LRU eviction
Analysis revealed that `mlcache` had **zero consumers** in the codebase - no services were actually connecting to it.
## Decision
### Consolidate to Single Valkey Instance
Remove `mlcache` and use the existing `valkey` instance for all caching needs. When vLLM KV-cache offloading is implemented in the RayService deployment, configure it to use the existing Valkey instance.
### Valkey Configuration
The existing `valkey` instance, reachable at `valkey.ai-ml.svc.cluster.local:6379`, is configured as follows:
| Setting | Value | Rationale |
|---------|-------|-----------|
| Persistence | 10Gi Longhorn PVC | Survives restarts; avoids cold-cache warm-up |
| Memory | 512Mi request, 2Gi limit | Sufficient for current workloads |
| Auth | Disabled | Internal cluster-only access |
| Metrics | Prometheus ServiceMonitor | Observability |
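As a quick sanity check from any pod in the cluster, the standard redis-py client works against Valkey unchanged. This is a hedged sketch, assuming the `redis` package is installed in the calling image; no credentials are needed since auth is disabled for internal access.

```python
"""Smoke test for the shared Valkey instance from inside the cluster."""
import redis

VALKEY_URL = "redis://valkey.ai-ml.svc.cluster.local:6379"

r = redis.Redis.from_url(VALKEY_URL, decode_responses=True)
r.ping()  # raises redis.ConnectionError if the service is unreachable

mem = r.info("memory")
print("used memory:", mem["used_memory_human"])
print("maxmemory:", mem["maxmemory_human"])
print("eviction policy:", mem["maxmemory_policy"])
```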
### Future: vLLM KV-Cache Integration
When implementing LMCache or similar KV-cache offloading for vLLM:
```python
# In ray_serve/serve_llm.py
# Illustrative sketch: the exact kv_cache_config shape will depend on the
# LMCache / vLLM connector API that is ultimately adopted.
from vllm import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    engine_args,  # AsyncEngineArgs built elsewhere in this module
    kv_cache_config={
        "type": "redis",  # Valkey speaks the Redis protocol
        "url": "redis://valkey.ai-ml.svc.cluster.local:6379",
        "prefix": "vllm:kv:",
        "ttl": 3600,  # 1 hour cache lifetime
    },
)
```
If memory pressure becomes an issue, scale Valkey resources:
```yaml
resources:
  limits:
    memory: "8Gi"  # Increase for larger KV-cache
extraArgs:
  - --maxmemory
  - 6gb            # kept below the container limit to leave headroom for Valkey overhead
  - --maxmemory-policy
  - allkeys-lru
```
### Key Prefix Convention
To avoid collisions when multiple services share Valkey:
| Service | Prefix | Example Key |
|---------|--------|-------------|
| vLLM KV-Cache | `vllm:kv:` | `vllm:kv:layer0:tok123` |
| Embeddings Cache | `emb:` | `emb:sha256:abc123` |
| Ray State | `ray:` | `ray:actor:xyz` |
| Session State | `session:` | `session:user:123` |
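As a concrete example of the `emb:` convention, here is a hedged sketch of an embedding-cache helper. The `embed` callable is a placeholder for whatever model endpoint actually computes the vector, and the 24-hour TTL is an assumption rather than a measured value.

```python
"""Embedding cache following the emb:sha256:<digest> key convention."""
import hashlib
import json
import redis

r = redis.Redis(host="valkey.ai-ml.svc.cluster.local", port=6379, decode_responses=True)
TTL_SECONDS = 24 * 3600  # assumed default

def cached_embedding(text: str, embed) -> list:
    # Key matches the table above: emb:sha256:<digest of the input text>
    key = f"emb:sha256:{hashlib.sha256(text.encode()).hexdigest()}"
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    vector = embed(text)  # GPU call happens only on a cache miss
    r.set(key, json.dumps(vector), ex=TTL_SECONDS)
    return vector
```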
## Consequences
### Positive
- **Reduced complexity**: One cache instance instead of two
- **Resource efficiency**: No unused `mlcache` instance reserving a 4GB memory allocation
- **Operational simplicity**: Single point of monitoring and maintenance
- **Cost savings**: One less PVC, pod, and service to manage
### Negative
- **Shared resource contention**: All workloads share the same cache
- **Single point of failure**: Cache unavailability affects all consumers
### Mitigations
- **Namespace isolation via prefixes**: Prevents key collisions
- **LRU eviction**: Automatic cleanup when memory is constrained
- **Persistent storage**: Cache survives pod restarts
- **Monitoring**: Prometheus metrics for memory usage alerts
## References
- [vLLM Distributed KV-Cache](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
- [LMCache Project](https://github.com/LMCache/LMCache)
- [Valkey Documentation](https://valkey.io/docs/)
- [Ray External Storage](https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html)