# ADR-0023: Valkey for ML Inference Caching
## Status

Accepted

## Context
The AI/ML platform requires caching infrastructure for multiple use cases:
- KV-Cache Offloading: vLLM can offload key-value cache tensors to external storage, reducing GPU memory pressure and enabling longer context windows
- Embedding Cache: Frequently requested embeddings can be cached to avoid redundant GPU computation
- Session State: Conversation history and intermediate results for multi-turn interactions
- Ray Object Store Spillover: Large Ray objects can spill to external storage when memory is constrained
Previously, two separate Valkey instances existed:

- `valkey` - general-purpose, with 10Gi persistent storage
- `mlcache` - ML-optimized ephemeral cache with a 4GB memory limit and LRU eviction

Analysis revealed that `mlcache` had zero consumers in the codebase: no services were actually connecting to it.
## Decision

### Consolidate to a Single Valkey Instance

Remove `mlcache` and use the existing `valkey` instance for all caching needs. When vLLM KV-cache offloading is implemented in the RayService deployment, configure it to use the existing Valkey instance.
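For services adopting the shared instance, a minimal client sketch (redis-py is assumed as the client library, since Valkey speaks the Redis protocol; the helper names below are hypothetical, not an existing module):

```python
# Hypothetical connection helper for the shared Valkey instance.
# The service DNS name and port come from this ADR; redis-py is an
# assumed client choice.
VALKEY_HOST = "valkey.ai-ml.svc.cluster.local"
VALKEY_PORT = 6379

def valkey_url(host: str = VALKEY_HOST, port: int = VALKEY_PORT) -> str:
    """Build a redis:// URL for the cluster-internal Valkey service."""
    return f"redis://{host}:{port}"

def get_client():
    """Create a client lazily so importing this module needs no live server."""
    import redis  # pip install redis
    return redis.Redis.from_url(valkey_url(), decode_responses=True)
```

Keeping the connection behind a factory like `get_client()` makes it easy to swap the URL per environment without touching call sites.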
### Valkey Configuration

The current `valkey` instance is reachable at `valkey.ai-ml.svc.cluster.local:6379`:
| Setting | Value | Rationale |
|---|---|---|
| Persistence | 10Gi Longhorn PVC | Survives restarts; preserves warm cache |
| Memory | 512Mi request, 2Gi limit | Sufficient for current workloads |
| Auth | Disabled | Internal cluster-only access |
| Metrics | Prometheus ServiceMonitor | Observability |
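Expressed as Helm-style values, the table above corresponds roughly to the following sketch (hypothetical; the exact key names depend on the chart used to deploy Valkey):

```yaml
# Hypothetical Helm values sketch; key names depend on the actual chart.
auth:
  enabled: false            # internal cluster-only access
persistence:
  enabled: true
  size: 10Gi                # Longhorn PVC
  storageClass: longhorn
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "2Gi"
metrics:
  enabled: true
  serviceMonitor:
    enabled: true           # Prometheus ServiceMonitor
```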
### Future: vLLM KV-Cache Integration

When implementing LMCache or a similar KV-cache offloading backend for vLLM:
```python
# In ray_serve/serve_llm.py
# Note: kv_cache_config below is illustrative; check the exact offloading
# configuration API of the vLLM/LMCache release you deploy.
from vllm import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    engine_args,
    kv_cache_config={
        "type": "redis",
        "url": "redis://valkey.ai-ml.svc.cluster.local:6379",
        "prefix": "vllm:kv:",
        "ttl": 3600,  # 1 hour cache lifetime
    },
)
```
If memory pressure becomes an issue, scale Valkey resources:
```yaml
resources:
  limits:
    memory: "8Gi"        # Increase for larger KV-cache
extraArgs:
  - --maxmemory
  - 6gb                  # Below the container limit, leaving headroom
  - --maxmemory-policy
  - allkeys-lru
```
### Key Prefixes Convention

To avoid collisions when multiple services share Valkey:

| Service | Prefix | Example Key |
|---|---|---|
| vLLM KV-Cache | `vllm:kv:` | `vllm:kv:layer0:tok123` |
| Embeddings Cache | `emb:` | `emb:sha256:abc123` |
| Ray State | `ray:` | `ray:actor:xyz` |
| Session State | `session:` | `session:user:123` |
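The prefix convention can be captured in a small helper; the function and constant names below are illustrative, not an existing module:

```python
import hashlib

# Hypothetical helpers encoding the key-prefix convention above.
PREFIXES = {
    "vllm": "vllm:kv:",
    "embeddings": "emb:",
    "ray": "ray:",
    "session": "session:",
}

def make_key(service: str, *parts: str) -> str:
    """Join a registered service prefix with colon-separated key parts."""
    return PREFIXES[service] + ":".join(parts)

def embedding_key(text: str) -> str:
    """Content-addressed embedding cache key: emb:sha256:<hex digest>."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return make_key("embeddings", "sha256", digest)
```

For example, `make_key("session", "user", "123")` yields `session:user:123`, matching the table; content-addressing embedding keys by SHA-256 means identical inputs always hit the same cache entry.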
## Consequences

### Positive
- Reduced complexity: One cache instance instead of two
- Resource efficiency: No unused `mlcache` holding a 4GB memory allocation
- Operational simplicity: Single point of monitoring and maintenance
- Cost savings: One less PVC, pod, and service to manage
### Negative
- Shared resource contention: All workloads share the same cache
- Single point of failure: Cache unavailability affects all consumers
### Mitigations
- Namespace isolation via prefixes: Prevents key collisions
- LRU eviction: Automatic cleanup when memory is constrained
- Persistent storage: Cache survives pod restarts
- Monitoring: Prometheus metrics for memory usage alerts
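As a sketch of the monitoring mitigation (assuming a redis_exporter-compatible endpoint exporting `redis_memory_used_bytes` and `redis_memory_max_bytes`; rule and alert names are hypothetical), a memory-pressure alert could look like:

```yaml
# Hypothetical PrometheusRule fragment; metric names assume redis_exporter.
groups:
  - name: valkey
    rules:
      - alert: ValkeyMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Valkey memory usage above 90% of maxmemory"
```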