# Valkey for ML Inference Caching

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Consolidate and configure Valkey for ML caching use cases

## Context

The AI/ML platform requires caching infrastructure for multiple use cases:

1. **KV-Cache Offloading**: vLLM can offload key-value cache tensors to external storage, reducing GPU memory pressure and enabling longer context windows
2. **Embedding Cache**: Frequently requested embeddings can be cached to avoid redundant GPU computation
3. **Session State**: Conversation history and intermediate results for multi-turn interactions
4. **Ray Object Store Spillover**: Large Ray objects can spill to external storage when memory is constrained

Previously, two separate Valkey instances existed:

- `valkey` - General-purpose with 10Gi persistent storage
- `mlcache` - ML-optimized ephemeral cache with a 4GB memory limit and LRU eviction

Analysis revealed that `mlcache` had **zero consumers** in the codebase: no services were actually connecting to it.

## Decision

### Consolidate to a Single Valkey Instance

Remove `mlcache` and use the existing `valkey` instance for all caching needs. When vLLM KV-cache offloading is implemented in the RayService deployment, configure it to use the existing Valkey instance.
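As a concrete illustration of what the consolidation looks like from a consumer, the embedding-cache use case can follow a cache-aside pattern against the shared instance. The sketch below is an assumption, not existing code: the `cached_embedding` helper, the key scheme, and the injected `compute_fn` are illustrative, and `client` stands in for a Redis-compatible client (e.g. `redis.Redis` pointed at `valkey.ai-ml.svc.cluster.local:6379`).

```python
import hashlib
import json

# Prefix and TTL follow the conventions described in this ADR.
EMBED_PREFIX = "emb:"
TTL_SECONDS = 3600


def embedding_key(text: str) -> str:
    """Content-addressed key: emb:sha256:<hex digest of the input text>."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{EMBED_PREFIX}sha256:{digest}"


def cached_embedding(client, text, compute_fn):
    """Cache-aside lookup: return the cached vector, or compute and store it.

    `client` is any object with Redis-style get/set (hypothetical here);
    `compute_fn` stands in for the GPU embedding call we want to avoid
    repeating for hot inputs.
    """
    key = embedding_key(text)
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    vector = compute_fn(text)
    client.set(key, json.dumps(vector), ex=TTL_SECONDS)
    return vector
```

Because the client is injected, the same helper works unchanged against Valkey in-cluster or a fake in unit tests, and the SHA-256 key means identical inputs always hit the same cache entry.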
### Valkey Configuration

The current `valkey` instance at `valkey.ai-ml.svc.cluster.local:6379`:

| Setting | Value | Rationale |
|---------|-------|-----------|
| Persistence | 10Gi Longhorn PVC | Survive restarts, cache warm-up |
| Memory | 512Mi request, 2Gi limit | Sufficient for current workloads |
| Auth | Disabled | Internal cluster-only access |
| Metrics | Prometheus ServiceMonitor | Observability |

### Future: vLLM KV-Cache Integration

When implementing LMCache or similar KV-cache offloading for vLLM:

```python
# In ray_serve/serve_llm.py
# NOTE: illustrative sketch; the exact parameter names depend on the
# KV-cache connector integration eventually adopted.
from vllm import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    engine_args,
    kv_cache_config={
        "type": "redis",
        "url": "redis://valkey.ai-ml.svc.cluster.local:6379",
        "prefix": "vllm:kv:",
        "ttl": 3600,  # 1 hour cache lifetime
    },
)
```

If memory pressure becomes an issue, scale Valkey resources:

```yaml
resources:
  limits:
    memory: "8Gi"  # Increase for larger KV-cache
extraArgs:
  - --maxmemory
  - 6gb
  - --maxmemory-policy
  - allkeys-lru
```

### Key Prefix Convention

To avoid collisions when multiple services share Valkey:

| Service | Prefix | Example Key |
|---------|--------|-------------|
| vLLM KV-Cache | `vllm:kv:` | `vllm:kv:layer0:tok123` |
| Embeddings Cache | `emb:` | `emb:sha256:abc123` |
| Ray State | `ray:` | `ray:actor:xyz` |
| Session State | `session:` | `session:user:123` |

## Consequences

### Positive

- **Reduced complexity**: One cache instance instead of two
- **Resource efficiency**: No unused mlcache consuming a 4GB memory allocation
- **Operational simplicity**: Single point of monitoring and maintenance
- **Cost savings**: One less PVC, pod, and service to manage

### Negative

- **Shared resource contention**: All workloads share the same cache
- **Single point of failure**: Cache unavailability affects all consumers

### Mitigations

- **Namespace isolation via prefixes**: Prevents key collisions
- **LRU eviction**: Automatic cleanup when memory is constrained
- **Persistent storage**: Cache survives
pod restarts
- **Monitoring**: Prometheus metrics for memory usage alerts

## References

- [vLLM Distributed KV-Cache](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
- [LMCache Project](https://github.com/LMCache/LMCache)
- [Valkey Documentation](https://valkey.io/docs/)
- [Ray External Storage](https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html)
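The key-prefix convention that mitigates collisions can also be enforced in client code rather than by convention alone. The following is a minimal sketch under that assumption; the `make_key` helper and the `KEY_PREFIXES` registry are hypothetical names, not existing utilities:

```python
# Registered prefixes, mirroring this ADR's key-prefix table.
KEY_PREFIXES = {
    "vllm": "vllm:kv:",
    "embeddings": "emb:",
    "ray": "ray:",
    "session": "session:",
}


def make_key(service: str, *parts: str) -> str:
    """Build a namespaced Valkey key, e.g. make_key("session", "user", "123").

    Raises for unregistered services so a new consumer cannot silently
    write into the shared keyspace without claiming a prefix first.
    """
    try:
        prefix = KEY_PREFIXES[service]
    except KeyError:
        raise ValueError(f"unknown service {service!r}; register a prefix first")
    if not parts or not all(parts):
        raise ValueError("key parts must be non-empty")
    return prefix + ":".join(parts)
```

Routing all key construction through one helper turns the prefix table from documentation into an enforced contract, which matters once several teams share the single instance.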