Valkey for ML Inference Caching

  • Status: accepted
  • Date: 2026-02-04
  • Deciders: Billy
  • Technical Story: Consolidate and configure Valkey for ML caching use cases

Context

The AI/ML platform requires caching infrastructure for multiple use cases:

  1. KV-Cache Offloading: vLLM can offload key-value cache tensors to external storage, reducing GPU memory pressure and enabling longer context windows
  2. Embedding Cache: Frequently requested embeddings can be cached to avoid redundant GPU computation
  3. Session State: Conversation history and intermediate results for multi-turn interactions
  4. Ray Object Store Spillover: Large Ray objects can spill to external storage when memory is constrained
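The embedding cache (use case 2) is a standard cache-aside lookup keyed by a content hash. A minimal sketch, with a plain dict standing in for the Valkey client and a caller-supplied compute function in place of the real model — names here are illustrative, not an existing module:

```python
import hashlib

def embedding_key(text: str) -> str:
    # Content-addressed key: identical inputs hit the same cache entry.
    return "emb:sha256:" + hashlib.sha256(text.encode()).hexdigest()

def get_embedding(cache: dict, text: str, compute):
    """Cache-aside: return the cached vector if present, else compute and store."""
    key = embedding_key(text)
    if key in cache:
        return cache[key]   # cache hit: no GPU work
    vec = compute(text)     # cache miss: run the model
    cache[key] = vec        # a real client would SET with a TTL
    return vec

# Demo with an in-memory dict; in the cluster this would be a Valkey client.
cache = {}
calls = []
vec1 = get_embedding(cache, "hello", lambda t: calls.append(t) or [0.1, 0.2])
vec2 = get_embedding(cache, "hello", lambda t: calls.append(t) or [0.9, 0.9])
# The second call is served from cache, so compute runs exactly once.
```

The same pattern generalizes to session state; only the key prefix and TTL change.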

Previously, two separate Valkey instances existed:

  • valkey - General-purpose with 10Gi persistent storage
  • mlcache - ML-optimized ephemeral cache with 4GB memory limit and LRU eviction

Analysis revealed that mlcache had zero consumers in the codebase: no services were actually connecting to it.

Decision

Consolidate to Single Valkey Instance

Remove mlcache and use the existing valkey instance for all caching needs. When vLLM KV-cache offloading is implemented in the RayService deployment, configure it to use the existing Valkey instance.

Valkey Configuration

The current valkey instance at valkey.ai-ml.svc.cluster.local:6379:

| Setting | Value | Rationale |
|---|---|---|
| Persistence | 10Gi Longhorn PVC | Survives restarts; preserves cache warm-up |
| Memory | 512Mi request, 2Gi limit | Sufficient for current workloads |
| Auth | Disabled | Internal cluster-only access |
| Metrics | Prometheus ServiceMonitor | Observability |

Future: vLLM KV-Cache Integration

When implementing LMCache or similar KV-cache offloading for vLLM:

# In ray_serve/serve_llm.py
# Illustrative sketch only: the exact offloading config surface depends on
# the LMCache/vLLM version adopted; treat kv_cache_config's shape as a draft.
from vllm import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    engine_args,
    kv_cache_config={
        "type": "redis",  # Valkey speaks the Redis protocol
        "url": "redis://valkey.ai-ml.svc.cluster.local:6379",
        "prefix": "vllm:kv:",  # matches the key-prefix convention below
        "ttl": 3600,  # seconds; 1 hour cache lifetime
    },
)

If memory pressure becomes an issue, scale Valkey resources:

resources:
  limits:
    memory: "8Gi"  # Increase for larger KV-cache
extraArgs:
  - --maxmemory
  - 6gb            # keep below the container limit to leave headroom
  - --maxmemory-policy
  - allkeys-lru    # evict least-recently-used keys across the whole keyspace
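The gap between maxmemory and the container limit is deliberate: Valkey/Redis config units with a `b` suffix are binary (1gb = 1024³ bytes), the same unit as the Kubernetes `8Gi` limit, so the numbers compare directly. A quick arithmetic check of the headroom left for allocator overhead, fragmentation, and persistence forks:

```python
# Valkey/Redis config units: "6gb" = 6 * 1024**3 bytes (binary),
# matching the binary Kubernetes "Gi" suffix.
GIB = 1024 ** 3
container_limit = 8 * GIB  # memory limit from the resources block
maxmemory = 6 * GIB        # --maxmemory 6gb
headroom = container_limit - maxmemory

print(headroom // GIB)  # 2 GiB of headroom before the pod risks an OOM kill
```

If maxmemory were set at or above the container limit, the kernel OOM killer could terminate the pod before Valkey's own eviction policy ever ran.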

Key Prefixes Convention

To avoid collisions when multiple services share Valkey:

| Service | Prefix | Example Key |
|---|---|---|
| vLLM KV-Cache | vllm:kv: | vllm:kv:layer0:tok123 |
| Embeddings Cache | emb: | emb:sha256:abc123 |
| Ray State | ray: | ray:actor:xyz |
| Session State | session: | session:user:123 |
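To make the convention enforceable in application code rather than by discipline alone, a small helper can build keys from the table above (a sketch; the registry names are illustrative, not an existing shared library):

```python
# Prefixes from the convention table; extend this registry as services are added.
PREFIXES = {
    "vllm_kv": "vllm:kv:",
    "embeddings": "emb:",
    "ray": "ray:",
    "session": "session:",
}

def make_key(service: str, *parts: str) -> str:
    """Build a namespaced Valkey key, e.g. make_key('session', 'user', '123')."""
    try:
        prefix = PREFIXES[service]
    except KeyError:
        raise ValueError(f"unknown service {service!r}; register a prefix first")
    return prefix + ":".join(parts)

print(make_key("session", "user", "123"))       # session:user:123
print(make_key("vllm_kv", "layer0", "tok123"))  # vllm:kv:layer0:tok123
```

Rejecting unknown services keeps new workloads from silently writing unprefixed keys into the shared instance.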

Consequences

Positive

  • Reduced complexity: One cache instance instead of two
  • Resource efficiency: the unused mlcache no longer reserves a 4GB memory allocation
  • Operational simplicity: Single point of monitoring and maintenance
  • Cost savings: One less PVC, pod, and service to manage

Negative

  • Shared resource contention: All workloads share the same cache
  • Single point of failure: Cache unavailability affects all consumers

Mitigations

  • Namespace isolation via prefixes: Prevents key collisions
  • LRU eviction: Automatic cleanup when memory is constrained
  • Persistent storage: Cache survives pod restarts
  • Monitoring: Prometheus metrics for memory usage alerts
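The allkeys-lru mitigation evicts the least-recently-used key once maxmemory is hit. Its behavior can be sketched with a bounded OrderedDict — a toy model bounded by entry count rather than bytes, not the server's actual allocator-aware eviction:

```python
from collections import OrderedDict

class TinyLRU:
    """Toy model of allkeys-lru: bounded by entry count, not bytes."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)        # newest = most recently used
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)        # a read refreshes recency
        return self.data[key]

lru = TinyLRU(capacity=2)
lru.set("emb:a", 1)
lru.set("emb:b", 2)
lru.get("emb:a")     # touch "emb:a", so "emb:b" is now the oldest entry
lru.set("emb:c", 3)  # over capacity: "emb:b" is evicted
```

This is why a hot KV-cache and a cold, rarely-read session key can coexist safely: under pressure the cold key goes first.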
