# ADR-0023: Valkey for ML Inference Caching

## Status

Accepted

## Context

The AI/ML platform requires caching infrastructure for multiple use cases:

1. **KV-Cache Offloading**: vLLM can offload key-value cache tensors to external storage, reducing GPU memory pressure and enabling longer context windows
2. **Embedding Cache**: Frequently requested embeddings can be cached to avoid redundant GPU computation
3. **Session State**: Conversation history and intermediate results for multi-turn interactions
4. **Ray Object Store Spillover**: Large Ray objects can spill to external storage when memory is constrained

Previously, two separate Valkey instances existed:

- `valkey` - General-purpose with 10Gi persistent storage
- `mlcache` - ML-optimized ephemeral cache with a 4GB memory limit and LRU eviction

Analysis revealed that `mlcache` had **zero consumers** in the codebase: no services were actually connecting to it.

## Decision

### Consolidate to Single Valkey Instance

Remove `mlcache` and use the existing `valkey` instance for all caching needs. When vLLM KV-cache offloading is implemented in the RayService deployment, configure it to use the existing Valkey instance.

### Valkey Configuration

The current `valkey` instance at `valkey.ai-ml.svc.cluster.local:6379`:

| Setting | Value | Rationale |
|---------|-------|-----------|
| Persistence | 10Gi Longhorn PVC | Survives restarts; keeps the cache warm |
| Memory | 512Mi request, 2Gi limit | Sufficient for current workloads |
| Auth | Disabled | Internal, cluster-only access |
| Metrics | Prometheus ServiceMonitor | Observability |

### Future: vLLM KV-Cache Integration

When implementing LMCache or similar KV-cache offloading for vLLM:

```python
# In ray_serve/serve_llm.py
# NOTE: illustrative sketch only -- the exact configuration surface
# (kv_cache_config and its keys) depends on the offloading backend and
# vLLM version; verify against the backend's docs before wiring this up.
from vllm import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    engine_args,
    kv_cache_config={
        "type": "redis",
        "url": "redis://valkey.ai-ml.svc.cluster.local:6379",
        "prefix": "vllm:kv:",
        "ttl": 3600,  # 1 hour cache lifetime
    },
)
```

If memory pressure becomes an issue, scale Valkey resources:

```yaml
resources:
  limits:
    memory: "8Gi"  # Increase for larger KV-cache
extraArgs:
  - --maxmemory
  - 6gb
  - --maxmemory-policy
  - allkeys-lru
```
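
The `6gb` maxmemory against the `8Gi` container limit above deliberately leaves headroom for allocator fragmentation and client buffers. A quick arithmetic sanity check of that split (the ~75-80% rule of thumb is an assumption, not stated in this ADR):

```python
GIB = 1024 ** 3  # bytes per GiB; Valkey's "6gb" suffix is also binary units

pod_limit = 8 * GIB   # container memory limit from the YAML above
maxmemory = 6 * GIB   # Valkey's --maxmemory setting

# Assumed rule of thumb: keep maxmemory at or below ~75-80% of the pod
# limit so LRU eviction kicks in before the kubelet OOM-kills the pod.
headroom = (pod_limit - maxmemory) / pod_limit
print(f"headroom: {headroom:.0%}")  # → headroom: 25%
```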

### Key Prefixes Convention

To avoid collisions when multiple services share Valkey:

| Service | Prefix | Example Key |
|---------|--------|-------------|
| vLLM KV-Cache | `vllm:kv:` | `vllm:kv:layer0:tok123` |
| Embeddings Cache | `emb:` | `emb:sha256:abc123` |
| Ray State | `ray:` | `ray:actor:xyz` |
| Session State | `session:` | `session:user:123` |

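
A minimal sketch of a shared key-builder enforcing the convention above. The prefixes and example keys come from the table; the `make_key` and `embedding_key` helpers themselves are illustrative, not an existing API:

```python
import hashlib

# Prefixes from the convention table; shared by all services.
PREFIXES = {
    "vllm": "vllm:kv:",
    "embeddings": "emb:",
    "ray": "ray:",
    "session": "session:",
}

def make_key(service: str, *parts: str) -> str:
    """Build a namespaced cache key, e.g. make_key('session', 'user', '123')."""
    return PREFIXES[service] + ":".join(parts)

def embedding_key(text: str) -> str:
    """Content-addressed embedding cache key (emb:sha256:<digest>)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"emb:sha256:{digest}"

print(make_key("session", "user", "123"))  # → session:user:123
```

Centralizing key construction in one helper keeps the prefix table and the code from drifting apart as new consumers are added.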
## Consequences

### Positive

- **Reduced complexity**: One cache instance instead of two
- **Resource efficiency**: No unused `mlcache` consuming a 4GB memory allocation
- **Operational simplicity**: Single point of monitoring and maintenance
- **Cost savings**: One less PVC, pod, and service to manage

### Negative

- **Shared resource contention**: All workloads share the same cache
- **Single point of failure**: Cache unavailability affects all consumers

### Mitigations

- **Namespace isolation via prefixes**: Prevents key collisions
- **LRU eviction**: Automatic cleanup when memory is constrained
- **Persistent storage**: Cache survives pod restarts
- **Monitoring**: Prometheus metrics for memory-usage alerts

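
The monitoring mitigation can be backed by a concrete alert. A sketch of a PrometheusRule, assuming the instance's metrics come from a standard `redis_exporter` sidecar (the `redis_memory_used_bytes` / `redis_memory_max_bytes` metric names are that exporter's conventions, not confirmed by this ADR):

```yaml
# Sketch: alert before LRU pressure turns into mass eviction.
groups:
  - name: valkey
    rules:
      - alert: ValkeyMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Valkey memory above 85% of maxmemory"
```
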
## References

- [vLLM Distributed KV-Cache](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
- [LMCache Project](https://github.com/LMCache/LMCache)
- [Valkey Documentation](https://valkey.io/docs/)
- [Ray External Storage](https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html)