docs(adr): ADR-0023 Valkey for ML inference caching
Document decision to consolidate mlcache into single valkey instance. Includes future guidance for vLLM KV-cache offloading integration.

docs/adr/ADR-0023-valkey-ml-caching.md
# ADR-0023: Valkey for ML Inference Caching

## Status

Accepted

## Context

The AI/ML platform requires caching infrastructure for multiple use cases:

1. **KV-Cache Offloading**: vLLM can offload key-value cache tensors to external storage, reducing GPU memory pressure and enabling longer context windows
2. **Embedding Cache**: Frequently requested embeddings can be cached to avoid redundant GPU computation
3. **Session State**: Conversation history and intermediate results for multi-turn interactions (a minimal sketch follows this list)
4. **Ray Object Store Spillover**: Large Ray objects can spill to external storage when memory is constrained
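
A minimal sketch of the session-state pattern (use case 3), assuming `redis-py` as the client (Valkey speaks the Redis protocol); the helper names and one-hour TTL are illustrative, not an existing service:

```python
# Hypothetical helpers for conversation state, using the "session:" key
# prefix convention adopted later in this ADR.
import json

import redis

client = redis.Redis.from_url("redis://valkey.ai-ml.svc.cluster.local:6379")

def append_turn(session_id: str, role: str, content: str, ttl_s: int = 3600) -> None:
    """Store one conversation turn and refresh the session's TTL."""
    key = f"session:user:{session_id}"
    client.rpush(key, json.dumps({"role": role, "content": content}))
    client.expire(key, ttl_s)  # idle sessions age out automatically

def load_history(session_id: str) -> list[dict]:
    """Return all turns for a session (empty list once the key expires)."""
    return [json.loads(t) for t in client.lrange(f"session:user:{session_id}", 0, -1)]
```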

Previously, two separate Valkey instances existed:

- `valkey` - General-purpose with 10Gi persistent storage
- `mlcache` - ML-optimized ephemeral cache with a 4GB memory limit and LRU eviction

Analysis revealed that `mlcache` had **zero consumers** in the codebase - no services were actually connecting to it.

## Decision

### Consolidate to Single Valkey Instance

Remove `mlcache` and use the existing `valkey` instance for all caching needs. When vLLM KV-cache offloading is implemented in the RayService deployment, configure it to use this same instance.

### Valkey Configuration

The current `valkey` instance at `valkey.ai-ml.svc.cluster.local:6379` is configured as follows:

| Setting | Value | Rationale |
|---------|-------|-----------|
| Persistence | 10Gi Longhorn PVC | Survives restarts; avoids cold-cache warm-up after redeploys |
| Memory | 512Mi request, 2Gi limit | Sufficient for current workloads |
| Auth | Disabled | Internal cluster-only access |
| Metrics | Prometheus ServiceMonitor | Observability |
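
Because auth is disabled for in-cluster traffic, clients connect with a bare URL. A quick connectivity check from any pod (again assuming `redis-py`):

```python
# In-cluster smoke test; no password in the URL because auth is disabled.
import redis

client = redis.Redis.from_url("redis://valkey.ai-ml.svc.cluster.local:6379")
assert client.ping()  # raises ConnectionError if the service is unreachable
print(client.info("memory")["used_memory_human"])  # current memory footprint
```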

### Future: vLLM KV-Cache Integration

When implementing LMCache or similar KV-cache offloading for vLLM, point the cache backend at the existing Valkey instance. The exact wiring depends on the installed vLLM and LMCache versions; the following is a sketch assuming the `KVTransferConfig` connector API, with the model name, config path, and chunk size as placeholders:

```python
# In ray_serve/serve_llm.py
# Illustrative only - parameter names track the installed vLLM/LMCache versions.
import os

from vllm import AsyncEngineArgs, AsyncLLMEngine
from vllm.config import KVTransferConfig

# LMCache is configured out of band; its config file points the remote
# backend at the shared Valkey instance (Valkey speaks the Redis protocol).
os.environ["LMCACHE_CONFIG_FILE"] = "/etc/lmcache/lmcache.yaml"
# lmcache.yaml (assumed contents):
#   chunk_size: 256
#   remote_url: "redis://valkey.ai-ml.svc.cluster.local:6379"

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",  # this replica both saves and loads KV blocks
    ),
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

If memory pressure becomes an issue, scale Valkey resources:

```yaml
resources:
  limits:
    memory: "8Gi"  # Increase for larger KV-cache
extraArgs:
  - --maxmemory
  - 6gb
  - --maxmemory-policy
  - allkeys-lru
```
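
The `--maxmemory` value sits deliberately below the container limit: Valkey needs headroom above its eviction threshold for client buffers and allocator fragmentation, otherwise the kubelet can OOM-kill the pod before LRU eviction kicks in.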

### Key Prefix Convention

To avoid collisions when multiple services share Valkey:

| Service | Prefix | Example Key |
|---------|--------|-------------|
| vLLM KV-Cache | `vllm:kv:` | `vllm:kv:layer0:tok123` |
| Embeddings Cache | `emb:` | `emb:sha256:abc123` |
| Ray State | `ray:` | `ray:actor:xyz` |
| Session State | `session:` | `session:user:123` |
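
As an illustration of the convention, a hypothetical read-through embedding cache keyed by content hash (use case 2); `compute_embedding` is a stand-in for the real model call:

```python
# Read-through cache for embeddings, following the "emb:sha256:" convention.
import hashlib
import json

import redis

client = redis.Redis.from_url("redis://valkey.ai-ml.svc.cluster.local:6379")

def compute_embedding(text: str) -> list[float]:
    # Placeholder for the actual GPU-backed model call.
    raise NotImplementedError

def cached_embedding(text: str, ttl_s: int = 86400) -> list[float]:
    key = "emb:sha256:" + hashlib.sha256(text.encode()).hexdigest()
    hit = client.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: skip the GPU entirely
    vector = compute_embedding(text)
    client.setex(key, ttl_s, json.dumps(vector))  # cache miss: compute and store
    return vector
```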

## Consequences

### Positive

- **Reduced complexity**: One cache instance instead of two
- **Resource efficiency**: No unused mlcache consuming a 4GB memory allocation
- **Operational simplicity**: Single point of monitoring and maintenance
- **Cost savings**: One less PVC, pod, and service to manage

### Negative

- **Shared resource contention**: All workloads share the same cache
- **Single point of failure**: Cache unavailability affects all consumers

### Mitigations

- **Namespace isolation via prefixes**: Prevents key collisions
- **LRU eviction**: Automatic cleanup when memory is constrained
- **Persistent storage**: Cache survives pod restarts
- **Monitoring**: Prometheus metrics for memory usage alerts

## References

- [vLLM Distributed KV-Cache](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
- [LMCache Project](https://github.com/LMCache/LMCache)
- [Valkey Documentation](https://valkey.io/docs/)
- [Ray External Storage](https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html)