docs(adr): ADR-0023 Valkey for ML inference caching
Document decision to consolidate mlcache into single valkey instance. Includes future guidance for vLLM KV-cache offloading integration.

docs/adr/ADR-0023-valkey-ml-caching.md
# ADR-0023: Valkey for ML Inference Caching

## Status

Accepted

## Context

The AI/ML platform requires caching infrastructure for multiple use cases:

1. **KV-Cache Offloading**: vLLM can offload key-value cache tensors to external storage, reducing GPU memory pressure and enabling longer context windows
2. **Embedding Cache**: Frequently requested embeddings can be cached to avoid redundant GPU computation
3. **Session State**: Conversation history and intermediate results for multi-turn interactions (a minimal sketch follows this list)
4. **Ray Object Store Spillover**: Large Ray objects can spill to external storage when memory is constrained
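
A minimal sketch of the session-state pattern (use case 3), assuming `redis-py` as the client (Valkey speaks the Redis protocol); the helper names and one-hour TTL are illustrative, not an existing service:

```python
# Hypothetical helpers for conversation state, using the "session:" key
# prefix convention adopted later in this ADR.
import json

import redis

client = redis.Redis.from_url("redis://valkey.ai-ml.svc.cluster.local:6379")

def append_turn(session_id: str, role: str, content: str, ttl_s: int = 3600) -> None:
    """Store one conversation turn and refresh the session's TTL."""
    key = f"session:user:{session_id}"
    client.rpush(key, json.dumps({"role": role, "content": content}))
    client.expire(key, ttl_s)  # idle sessions age out automatically

def load_history(session_id: str) -> list[dict]:
    """Return all turns for a session (empty list once the key expires)."""
    return [json.loads(t) for t in client.lrange(f"session:user:{session_id}", 0, -1)]
```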

Previously, two separate Valkey instances existed:

- `valkey` - General-purpose with 10Gi persistent storage
- `mlcache` - ML-optimized ephemeral cache with a 4GB memory limit and LRU eviction

Analysis revealed that `mlcache` had **zero consumers** in the codebase - no services were actually connecting to it.

## Decision

### Consolidate to Single Valkey Instance

Remove `mlcache` and use the existing `valkey` instance for all caching needs. When vLLM KV-cache offloading is implemented in the RayService deployment, configure it to use this same instance.

### Valkey Configuration

The current `valkey` instance at `valkey.ai-ml.svc.cluster.local:6379` is configured as follows:

| Setting | Value | Rationale |
|---------|-------|-----------|
| Persistence | 10Gi Longhorn PVC | Survives restarts; avoids cold-cache warm-up after redeploys |
| Memory | 512Mi request, 2Gi limit | Sufficient for current workloads |
| Auth | Disabled | Internal cluster-only access |
| Metrics | Prometheus ServiceMonitor | Observability |
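
Because auth is disabled for in-cluster traffic, clients connect with a bare URL. A quick connectivity check from any pod (again assuming `redis-py`):

```python
# In-cluster smoke test; no password in the URL because auth is disabled.
import redis

client = redis.Redis.from_url("redis://valkey.ai-ml.svc.cluster.local:6379")
assert client.ping()  # raises ConnectionError if the service is unreachable
print(client.info("memory")["used_memory_human"])  # current memory footprint
```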

### Future: vLLM KV-Cache Integration

When implementing LMCache or similar KV-cache offloading for vLLM, point the cache backend at the existing Valkey instance. The exact wiring depends on the installed vLLM and LMCache versions; the following is a sketch assuming the `KVTransferConfig` connector API, with the model name, config path, and chunk size as placeholders:

```python
# In ray_serve/serve_llm.py
# Illustrative only - parameter names track the installed vLLM/LMCache versions.
import os

from vllm import AsyncEngineArgs, AsyncLLMEngine
from vllm.config import KVTransferConfig

# LMCache is configured out of band; its config file points the remote
# backend at the shared Valkey instance (Valkey speaks the Redis protocol).
os.environ["LMCACHE_CONFIG_FILE"] = "/etc/lmcache/lmcache.yaml"
# lmcache.yaml (assumed contents):
#   chunk_size: 256
#   remote_url: "redis://valkey.ai-ml.svc.cluster.local:6379"

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",  # this replica both saves and loads KV blocks
    ),
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

If memory pressure becomes an issue, scale Valkey resources:

```yaml
resources:
  limits:
    memory: "8Gi"  # Increase for larger KV-cache
extraArgs:
  - --maxmemory
  - 6gb
  - --maxmemory-policy
  - allkeys-lru
```
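
The `--maxmemory` value sits deliberately below the container limit: Valkey needs headroom above its eviction threshold for client buffers and allocator fragmentation, otherwise the kubelet can OOM-kill the pod before LRU eviction kicks in.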

### Key Prefix Convention

To avoid collisions when multiple services share Valkey:

| Service | Prefix | Example Key |
|---------|--------|-------------|
| vLLM KV-Cache | `vllm:kv:` | `vllm:kv:layer0:tok123` |
| Embeddings Cache | `emb:` | `emb:sha256:abc123` |
| Ray State | `ray:` | `ray:actor:xyz` |
| Session State | `session:` | `session:user:123` |
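
As an illustration of the convention, a hypothetical read-through embedding cache keyed by content hash (use case 2); `compute_embedding` is a stand-in for the real model call:

```python
# Read-through cache for embeddings, following the "emb:sha256:" convention.
import hashlib
import json

import redis

client = redis.Redis.from_url("redis://valkey.ai-ml.svc.cluster.local:6379")

def compute_embedding(text: str) -> list[float]:
    # Placeholder for the actual GPU-backed model call.
    raise NotImplementedError

def cached_embedding(text: str, ttl_s: int = 86400) -> list[float]:
    key = "emb:sha256:" + hashlib.sha256(text.encode()).hexdigest()
    hit = client.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: skip the GPU entirely
    vector = compute_embedding(text)
    client.setex(key, ttl_s, json.dumps(vector))  # cache miss: compute and store
    return vector
```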

## Consequences

### Positive

- **Reduced complexity**: One cache instance instead of two
- **Resource efficiency**: No unused mlcache consuming a 4GB memory allocation
- **Operational simplicity**: Single point of monitoring and maintenance
- **Cost savings**: One less PVC, pod, and service to manage

### Negative

- **Shared resource contention**: All workloads share the same cache
- **Single point of failure**: Cache unavailability affects all consumers

### Mitigations

- **Namespace isolation via prefixes**: Prevents key collisions
- **LRU eviction**: Automatic cleanup when memory is constrained
- **Persistent storage**: Cache survives pod restarts
- **Monitoring**: Prometheus metrics for memory usage alerts

## References

- [vLLM Distributed KV-Cache](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
- [LMCache Project](https://github.com/LMCache/LMCache)
- [Valkey Documentation](https://valkey.io/docs/)
- [Ray External Storage](https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html)