diff --git a/docs/adr/ADR-0023-valkey-ml-caching.md b/docs/adr/ADR-0023-valkey-ml-caching.md
new file mode 100644
index 0000000..a7e5bb0
--- /dev/null
+++ b/docs/adr/ADR-0023-valkey-ml-caching.md
@@ -0,0 +1,108 @@
+# ADR-0023: Valkey for ML Inference Caching
+
+## Status
+
+Accepted
+
+## Context
+
+The AI/ML platform requires caching infrastructure for multiple use cases:
+
+1. **KV-Cache Offloading**: vLLM can offload key-value cache tensors to external storage, reducing GPU memory pressure and enabling longer context windows
+2. **Embedding Cache**: Frequently requested embeddings can be cached to avoid redundant GPU computation
+3. **Session State**: Conversation history and intermediate results for multi-turn interactions
+4. **Ray Object Store Spillover**: Large Ray objects can spill to external storage when memory is constrained
+
+Previously, two separate Valkey instances existed:
+
+- `valkey` - General-purpose, with 10Gi persistent storage
+- `mlcache` - ML-optimized ephemeral cache with a 4GB memory limit and LRU eviction
+
+Analysis revealed that `mlcache` had **zero consumers** in the codebase - no services were actually connecting to it.
+
+## Decision
+
+### Consolidate to Single Valkey Instance
+
+Remove `mlcache` and use the existing `valkey` instance for all caching needs. When vLLM KV-cache offloading is implemented in the RayService deployment, configure it to use the existing Valkey instance.
+
+### Valkey Configuration
+
+The existing `valkey` instance, reachable at `valkey.ai-ml.svc.cluster.local:6379`, is configured as follows:
+
+| Setting | Value | Rationale |
+|---------|-------|-----------|
+| Persistence | 10Gi Longhorn PVC | Cache survives restarts, avoiding cold-start warm-up |
+| Memory | 512Mi request, 2Gi limit | Sufficient for current workloads |
+| Auth | Disabled | Internal cluster-only access |
+| Metrics | Prometheus ServiceMonitor | Observability |
+
+### Future: vLLM KV-Cache Integration
+
+When implementing LMCache or similar KV-cache offloading for vLLM, the wiring would look roughly like this (the exact configuration surface depends on the offloading backend chosen):
+
+```python
+# In ray_serve/serve_llm.py
+from vllm import AsyncLLMEngine
+
+engine = AsyncLLMEngine.from_engine_args(
+    engine_args,
+    kv_cache_config={
+        "type": "redis",
+        "url": "redis://valkey.ai-ml.svc.cluster.local:6379",
+        "prefix": "vllm:kv:",
+        "ttl": 3600,  # 1 hour cache lifetime
+    }
+)
+```
+
+If memory pressure becomes an issue, scale the Valkey resources:
+
+```yaml
+resources:
+  limits:
+    memory: "8Gi"  # Increase for larger KV-cache
+extraArgs:
+  - --maxmemory
+  - 6gb
+  - --maxmemory-policy
+  - allkeys-lru
+```
+
+### Key Prefixes Convention
+
+To avoid key collisions when multiple services share a single Valkey instance:
+
+| Service | Prefix | Example Key |
+|---------|--------|-------------|
+| vLLM KV-Cache | `vllm:kv:` | `vllm:kv:layer0:tok123` |
+| Embeddings Cache | `emb:` | `emb:sha256:abc123` |
+| Ray State | `ray:` | `ray:actor:xyz` |
+| Session State | `session:` | `session:user:123` |
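+
+The prefix convention in practice: below is a minimal sketch of an embedding-cache lookup against the shared instance. The helper names, TTL, and JSON serialization are illustrative assumptions rather than an existing service API; Valkey speaks the Redis protocol, so the standard `redis` Python client is assumed to work unchanged.
+
+```python
+# Hypothetical embedding-cache helper; names, TTL, and serialization are illustrative only.
+import hashlib
+import json
+
+import redis
+
+VALKEY_URL = "redis://valkey.ai-ml.svc.cluster.local:6379"
+EMBEDDING_PREFIX = "emb:"          # per the key prefix convention above
+EMBEDDING_TTL_SECONDS = 24 * 3600  # assumed: expire cached embeddings after a day
+
+client = redis.Redis.from_url(VALKEY_URL)
+
+
+def embedding_key(text: str) -> str:
+    """Build a collision-safe key under the emb: namespace."""
+    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
+    return f"{EMBEDDING_PREFIX}sha256:{digest}"
+
+
+def get_or_compute_embedding(text: str, compute_fn) -> list[float]:
+    """Return a cached embedding, computing and caching it on a miss."""
+    key = embedding_key(text)
+    cached = client.get(key)
+    if cached is not None:
+        return json.loads(cached)
+    vector = compute_fn(text)  # e.g. a call to the GPU-backed embedding service
+    client.set(key, json.dumps(vector), ex=EMBEDDING_TTL_SECONDS)
+    return vector
+```
+
+Because every consumer owns a distinct prefix, cleanup can also be scoped per namespace (for example, `SCAN` with a `MATCH emb:*` pattern) without touching other services' keys.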
+
+## Consequences
+
+### Positive
+
+- **Reduced complexity**: One cache instance instead of two
+- **Resource efficiency**: No unused `mlcache` holding a 4GB memory allocation
+- **Operational simplicity**: A single point of monitoring and maintenance
+- **Cost savings**: One fewer PVC, pod, and service to manage
+
+### Negative
+
+- **Shared resource contention**: All workloads share the same cache
+- **Single point of failure**: Cache unavailability affects all consumers
+
+### Mitigations
+
+- **Namespace isolation via prefixes**: Prevents key collisions
+- **LRU eviction**: Automatic cleanup when memory is constrained
+- **Persistent storage**: Cache survives pod restarts
+- **Monitoring**: Prometheus metrics for memory usage alerts
+
+## References
+
+- [vLLM Distributed KV-Cache](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
+- [LMCache Project](https://github.com/LMCache/LMCache)
+- [Valkey Documentation](https://valkey.io/docs/)
+- [Ray External Storage](https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html)