# ADR-0023: Valkey for ML Inference Caching

## Status

Accepted

## Context

The AI/ML platform requires caching infrastructure for multiple use cases:

1. **KV-Cache Offloading**: vLLM can offload key-value cache tensors to external storage, reducing GPU memory pressure and enabling longer context windows
2. **Embedding Cache**: Frequently requested embeddings can be cached to avoid redundant GPU computation
3. **Session State**: Conversation history and intermediate results for multi-turn interactions
4. **Ray Object Store Spillover**: Large Ray objects can spill to external storage when memory is constrained

Previously, two separate Valkey instances existed:

- `valkey` - General-purpose with 10Gi persistent storage
- `mlcache` - ML-optimized ephemeral cache with a 4GB memory limit and LRU eviction

Analysis revealed that `mlcache` had **zero consumers** in the codebase: no services were actually connecting to it.

## Decision

### Consolidate to Single Valkey Instance

Remove `mlcache` and use the existing `valkey` instance for all caching needs. When vLLM KV-cache offloading is implemented in the RayService deployment, configure it to use the existing Valkey instance.

### Valkey Configuration

The current `valkey` instance at `valkey.ai-ml.svc.cluster.local:6379`:

| Setting | Value | Rationale |
|---------|-------|-----------|
| Persistence | 10Gi Longhorn PVC | Survives restarts; keeps the cache warm |
| Memory | 512Mi request, 2Gi limit | Sufficient for current workloads |
| Auth | Disabled | Internal, cluster-only access |
| Metrics | Prometheus ServiceMonitor | Observability |

### Future: vLLM KV-Cache Integration

When implementing LMCache or similar KV-cache offloading for vLLM:

```python
# In ray_serve/serve_llm.py
# NOTE: illustrative sketch only -- the exact configuration surface
# (kv_cache_config and its keys) depends on the offloading backend and
# vLLM version; verify against the backend's docs before wiring this up.
from vllm import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    engine_args,
    kv_cache_config={
        "type": "redis",
        "url": "redis://valkey.ai-ml.svc.cluster.local:6379",
        "prefix": "vllm:kv:",
        "ttl": 3600,  # 1 hour cache lifetime
    },
)
```

If memory pressure becomes an issue, scale Valkey resources:

```yaml
resources:
  limits:
    memory: "8Gi"  # Increase for larger KV-cache
extraArgs:
  - --maxmemory
  - 6gb
  - --maxmemory-policy
  - allkeys-lru
```
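
The `6gb` maxmemory against the `8Gi` container limit above deliberately leaves headroom for allocator fragmentation and client buffers. A quick arithmetic sanity check of that split (the ~75-80% rule of thumb is an assumption, not stated in this ADR):

```python
GIB = 1024 ** 3  # bytes per GiB; Valkey's "6gb" suffix is also binary units

pod_limit = 8 * GIB   # container memory limit from the YAML above
maxmemory = 6 * GIB   # Valkey's --maxmemory setting

# Assumed rule of thumb: keep maxmemory at or below ~75-80% of the pod
# limit so LRU eviction kicks in before the kubelet OOM-kills the pod.
headroom = (pod_limit - maxmemory) / pod_limit
print(f"headroom: {headroom:.0%}")  # → headroom: 25%
```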

### Key Prefixes Convention

To avoid collisions when multiple services share Valkey:

| Service | Prefix | Example Key |
|---------|--------|-------------|
| vLLM KV-Cache | `vllm:kv:` | `vllm:kv:layer0:tok123` |
| Embeddings Cache | `emb:` | `emb:sha256:abc123` |
| Ray State | `ray:` | `ray:actor:xyz` |
| Session State | `session:` | `session:user:123` |

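
A minimal sketch of a shared key-builder enforcing the convention above. The prefixes and example keys come from the table; the `make_key` and `embedding_key` helpers themselves are illustrative, not an existing API:

```python
import hashlib

# Prefixes from the convention table; shared by all services.
PREFIXES = {
    "vllm": "vllm:kv:",
    "embeddings": "emb:",
    "ray": "ray:",
    "session": "session:",
}

def make_key(service: str, *parts: str) -> str:
    """Build a namespaced cache key, e.g. make_key('session', 'user', '123')."""
    return PREFIXES[service] + ":".join(parts)

def embedding_key(text: str) -> str:
    """Content-addressed embedding cache key (emb:sha256:<digest>)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"emb:sha256:{digest}"

print(make_key("session", "user", "123"))  # → session:user:123
```

Centralizing key construction in one helper keeps the prefix table and the code from drifting apart as new consumers are added.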
## Consequences

### Positive

- **Reduced complexity**: One cache instance instead of two
- **Resource efficiency**: No unused `mlcache` consuming a 4GB memory allocation
- **Operational simplicity**: Single point of monitoring and maintenance
- **Cost savings**: One less PVC, pod, and service to manage

### Negative

- **Shared resource contention**: All workloads share the same cache
- **Single point of failure**: Cache unavailability affects all consumers

### Mitigations

- **Namespace isolation via prefixes**: Prevents key collisions
- **LRU eviction**: Automatic cleanup when memory is constrained
- **Persistent storage**: Cache survives pod restarts
- **Monitoring**: Prometheus metrics for memory-usage alerts

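
The monitoring mitigation can be backed by a concrete alert. A sketch of a PrometheusRule, assuming the instance's metrics come from a standard `redis_exporter` sidecar (the `redis_memory_used_bytes` / `redis_memory_max_bytes` metric names are that exporter's conventions, not confirmed by this ADR):

```yaml
# Sketch: alert before LRU pressure turns into mass eviction.
groups:
  - name: valkey
    rules:
      - alert: ValkeyMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Valkey memory above 85% of maxmemory"
```
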
## References

- [vLLM Distributed KV-Cache](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
- [LMCache Project](https://github.com/LMCache/LMCache)
- [Valkey Documentation](https://valkey.io/docs/)
- [Ray External Storage](https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html)