Valkey for ML Inference Caching

  • Status: accepted
  • Date: 2026-02-04
  • Deciders: Billy
  • Technical Story: Consolidate and configure Valkey for ML caching use cases

Context

The AI/ML platform requires caching infrastructure for multiple use cases:

  1. KV-Cache Offloading: vLLM can offload key-value cache tensors to external storage, reducing GPU memory pressure and enabling longer context windows
  2. Embedding Cache: Frequently requested embeddings can be cached to avoid redundant GPU computation
  3. Session State: Conversation history and intermediate results for multi-turn interactions
  4. Ray Object Store Spillover: Large Ray objects can spill to external storage when memory is constrained
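The embedding cache (use case 2) is a standard cache-aside lookup keyed by a content hash. A minimal sketch, with a plain dict standing in for the Valkey client and a caller-supplied compute function in place of the real model — names here are illustrative, not an existing module:

```python
import hashlib

def embedding_key(text: str) -> str:
    # Content-addressed key: identical inputs hit the same cache entry.
    return "emb:sha256:" + hashlib.sha256(text.encode()).hexdigest()

def get_embedding(cache: dict, text: str, compute):
    """Cache-aside: return the cached vector if present, else compute and store."""
    key = embedding_key(text)
    if key in cache:
        return cache[key]   # cache hit: no GPU work
    vec = compute(text)     # cache miss: run the model
    cache[key] = vec        # a real client would SET with a TTL
    return vec

# Demo with an in-memory dict; in the cluster this would be a Valkey client.
cache = {}
calls = []
vec1 = get_embedding(cache, "hello", lambda t: calls.append(t) or [0.1, 0.2])
vec2 = get_embedding(cache, "hello", lambda t: calls.append(t) or [0.9, 0.9])
# The second call is served from cache, so compute runs exactly once.
```

The same pattern generalizes to session state; only the key prefix and TTL change.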

Previously, two separate Valkey instances existed:

  • valkey - General-purpose with 10Gi persistent storage
  • mlcache - ML-optimized ephemeral cache with 4GB memory limit and LRU eviction

Analysis revealed that mlcache had zero consumers in the codebase: no services were actually connecting to it.

Decision

Consolidate to Single Valkey Instance

Remove mlcache and use the existing valkey instance for all caching needs. When vLLM KV-cache offloading is implemented in the RayService deployment, configure it to use the existing Valkey instance.

Valkey Configuration

The current valkey instance at valkey.ai-ml.svc.cluster.local:6379:

| Setting | Value | Rationale |
|---|---|---|
| Persistence | 10Gi Longhorn PVC | Survives restarts; preserves cache warm-up |
| Memory | 512Mi request, 2Gi limit | Sufficient for current workloads |
| Auth | Disabled | Internal cluster-only access |
| Metrics | Prometheus ServiceMonitor | Observability |

Future: vLLM KV-Cache Integration

When implementing LMCache or similar KV-cache offloading for vLLM:

# In ray_serve/serve_llm.py
# Illustrative sketch only: the exact offloading config surface depends on
# the LMCache/vLLM version adopted; treat kv_cache_config's shape as a draft.
from vllm import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    engine_args,
    kv_cache_config={
        "type": "redis",  # Valkey speaks the Redis protocol
        "url": "redis://valkey.ai-ml.svc.cluster.local:6379",
        "prefix": "vllm:kv:",  # matches the key-prefix convention below
        "ttl": 3600,  # seconds; 1 hour cache lifetime
    },
)

If memory pressure becomes an issue, scale Valkey resources:

resources:
  limits:
    memory: "8Gi"  # Increase for larger KV-cache
extraArgs:
  - --maxmemory
  - 6gb            # keep below the container limit to leave headroom
  - --maxmemory-policy
  - allkeys-lru    # evict least-recently-used keys across the whole keyspace
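The gap between maxmemory and the container limit is deliberate: Valkey/Redis config units with a `b` suffix are binary (1gb = 1024³ bytes), the same unit as the Kubernetes `8Gi` limit, so the numbers compare directly. A quick arithmetic check of the headroom left for allocator overhead, fragmentation, and persistence forks:

```python
# Valkey/Redis config units: "6gb" = 6 * 1024**3 bytes (binary),
# matching the binary Kubernetes "Gi" suffix.
GIB = 1024 ** 3
container_limit = 8 * GIB  # memory limit from the resources block
maxmemory = 6 * GIB        # --maxmemory 6gb
headroom = container_limit - maxmemory

print(headroom // GIB)  # 2 GiB of headroom before the pod risks an OOM kill
```

If maxmemory were set at or above the container limit, the kernel OOM killer could terminate the pod before Valkey's own eviction policy ever ran.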

Key Prefixes Convention

To avoid collisions when multiple services share Valkey:

| Service | Prefix | Example Key |
|---|---|---|
| vLLM KV-Cache | vllm:kv: | vllm:kv:layer0:tok123 |
| Embeddings Cache | emb: | emb:sha256:abc123 |
| Ray State | ray: | ray:actor:xyz |
| Session State | session: | session:user:123 |
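To make the convention enforceable in application code rather than by discipline alone, a small helper can build keys from the table above (a sketch; the registry names are illustrative, not an existing shared library):

```python
# Prefixes from the convention table; extend this registry as services are added.
PREFIXES = {
    "vllm_kv": "vllm:kv:",
    "embeddings": "emb:",
    "ray": "ray:",
    "session": "session:",
}

def make_key(service: str, *parts: str) -> str:
    """Build a namespaced Valkey key, e.g. make_key('session', 'user', '123')."""
    try:
        prefix = PREFIXES[service]
    except KeyError:
        raise ValueError(f"unknown service {service!r}; register a prefix first")
    return prefix + ":".join(parts)

print(make_key("session", "user", "123"))       # session:user:123
print(make_key("vllm_kv", "layer0", "tok123"))  # vllm:kv:layer0:tok123
```

Rejecting unknown services keeps new workloads from silently writing unprefixed keys into the shared instance.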

Consequences

Positive

  • Reduced complexity: One cache instance instead of two
  • Resource efficiency: the unused mlcache no longer reserves a 4GB memory allocation
  • Operational simplicity: Single point of monitoring and maintenance
  • Cost savings: One less PVC, pod, and service to manage

Negative

  • Shared resource contention: All workloads share the same cache
  • Single point of failure: Cache unavailability affects all consumers

Mitigations

  • Namespace isolation via prefixes: Prevents key collisions
  • LRU eviction: Automatic cleanup when memory is constrained
  • Persistent storage: Cache survives pod restarts
  • Monitoring: Prometheus metrics for memory usage alerts
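The allkeys-lru mitigation evicts the least-recently-used key once maxmemory is hit. Its behavior can be sketched with a bounded OrderedDict — a toy model bounded by entry count rather than bytes, not the server's actual allocator-aware eviction:

```python
from collections import OrderedDict

class TinyLRU:
    """Toy model of allkeys-lru: bounded by entry count, not bytes."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)        # newest = most recently used
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)        # a read refreshes recency
        return self.data[key]

lru = TinyLRU(capacity=2)
lru.set("emb:a", 1)
lru.set("emb:b", 2)
lru.get("emb:a")     # touch "emb:a", so "emb:b" is now the oldest entry
lru.set("emb:c", 3)  # over capacity: "emb:b" is evicted
```

This is why a hot KV-cache and a cold, rarely-read session key can coexist safely: under pressure the cold key goes first.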
