# ADR-0023: Valkey for ML Inference Caching
## Status

Accepted

## Context
The AI/ML platform requires caching infrastructure for multiple use cases:
- KV-Cache Offloading: vLLM can offload key-value cache tensors to external storage, reducing GPU memory pressure and enabling longer context windows
- Embedding Cache: Frequently requested embeddings can be cached to avoid redundant GPU computation
- Session State: Conversation history and intermediate results for multi-turn interactions
- Ray Object Store Spillover: Large Ray objects can spill to external storage when memory is constrained
Previously, two separate Valkey instances existed:

- `valkey` - general-purpose, with 10Gi persistent storage
- `mlcache` - ML-optimized ephemeral cache with a 4GB memory limit and LRU eviction

Analysis revealed that `mlcache` had zero consumers in the codebase: no services were actually connecting to it.
## Decision

### Consolidate to a Single Valkey Instance

Remove `mlcache` and use the existing `valkey` instance for all caching needs. When vLLM KV-cache offloading is implemented in the RayService deployment, configure it to use the existing Valkey instance.
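For services adopting the shared instance, a minimal client sketch (redis-py is assumed as the client library, since Valkey speaks the Redis protocol; the helper names below are hypothetical, not an existing module):

```python
# Hypothetical connection helper for the shared Valkey instance.
# The service DNS name and port come from this ADR; redis-py is an
# assumed client choice.
VALKEY_HOST = "valkey.ai-ml.svc.cluster.local"
VALKEY_PORT = 6379

def valkey_url(host: str = VALKEY_HOST, port: int = VALKEY_PORT) -> str:
    """Build a redis:// URL for the cluster-internal Valkey service."""
    return f"redis://{host}:{port}"

def get_client():
    """Create a client lazily so importing this module needs no live server."""
    import redis  # pip install redis
    return redis.Redis.from_url(valkey_url(), decode_responses=True)
```

Keeping the connection behind a factory like `get_client()` makes it easy to swap the URL per environment without touching call sites.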
### Valkey Configuration

The current `valkey` instance is reachable at `valkey.ai-ml.svc.cluster.local:6379`:
| Setting | Value | Rationale |
|---|---|---|
| Persistence | 10Gi Longhorn PVC | Survives restarts; preserves warm cache |
| Memory | 512Mi request, 2Gi limit | Sufficient for current workloads |
| Auth | Disabled | Internal cluster-only access |
| Metrics | Prometheus ServiceMonitor | Observability |
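Expressed as Helm-style values, the table above corresponds roughly to the following sketch (hypothetical; the exact key names depend on the chart used to deploy Valkey):

```yaml
# Hypothetical Helm values sketch; key names depend on the actual chart.
auth:
  enabled: false            # internal cluster-only access
persistence:
  enabled: true
  size: 10Gi                # Longhorn PVC
  storageClass: longhorn
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "2Gi"
metrics:
  enabled: true
  serviceMonitor:
    enabled: true           # Prometheus ServiceMonitor
```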
### Future: vLLM KV-Cache Integration

When implementing LMCache or a similar KV-cache offloading backend for vLLM:
```python
# In ray_serve/serve_llm.py
# Note: kv_cache_config below is illustrative; check the exact offloading
# configuration API of the vLLM/LMCache release you deploy.
from vllm import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    engine_args,
    kv_cache_config={
        "type": "redis",
        "url": "redis://valkey.ai-ml.svc.cluster.local:6379",
        "prefix": "vllm:kv:",
        "ttl": 3600,  # 1 hour cache lifetime
    },
)
```
If memory pressure becomes an issue, scale Valkey resources:
```yaml
resources:
  limits:
    memory: "8Gi"        # Increase for larger KV-cache
extraArgs:
  - --maxmemory
  - 6gb                  # Below the container limit, leaving headroom
  - --maxmemory-policy
  - allkeys-lru
```
### Key Prefixes Convention

To avoid collisions when multiple services share Valkey:

| Service | Prefix | Example Key |
|---|---|---|
| vLLM KV-Cache | `vllm:kv:` | `vllm:kv:layer0:tok123` |
| Embeddings Cache | `emb:` | `emb:sha256:abc123` |
| Ray State | `ray:` | `ray:actor:xyz` |
| Session State | `session:` | `session:user:123` |
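The prefix convention can be captured in a small helper; the function and constant names below are illustrative, not an existing module:

```python
import hashlib

# Hypothetical helpers encoding the key-prefix convention above.
PREFIXES = {
    "vllm": "vllm:kv:",
    "embeddings": "emb:",
    "ray": "ray:",
    "session": "session:",
}

def make_key(service: str, *parts: str) -> str:
    """Join a registered service prefix with colon-separated key parts."""
    return PREFIXES[service] + ":".join(parts)

def embedding_key(text: str) -> str:
    """Content-addressed embedding cache key: emb:sha256:<hex digest>."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return make_key("embeddings", "sha256", digest)
```

For example, `make_key("session", "user", "123")` yields `session:user:123`, matching the table; content-addressing embedding keys by SHA-256 means identical inputs always hit the same cache entry.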
## Consequences

### Positive
- Reduced complexity: One cache instance instead of two
- Resource efficiency: No unused `mlcache` holding a 4GB memory allocation
- Operational simplicity: Single point of monitoring and maintenance
- Cost savings: One less PVC, pod, and service to manage
### Negative
- Shared resource contention: All workloads share the same cache
- Single point of failure: Cache unavailability affects all consumers
### Mitigations
- Namespace isolation via prefixes: Prevents key collisions
- LRU eviction: Automatic cleanup when memory is constrained
- Persistent storage: Cache survives pod restarts
- Monitoring: Prometheus metrics for memory usage alerts
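As a sketch of the monitoring mitigation (assuming a redis_exporter-compatible endpoint exporting `redis_memory_used_bytes` and `redis_memory_max_bytes`; rule and alert names are hypothetical), a memory-pressure alert could look like:

```yaml
# Hypothetical PrometheusRule fragment; metric names assume redis_exporter.
groups:
  - name: valkey
    rules:
      - alert: ValkeyMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Valkey memory usage above 90% of maxmemory"
```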