chore: Consolidate ADRs into decisions/ directory
- Added ADR-0016: Affine email verification strategy
- Moved ADRs 0019-0024 from docs/adr/ to decisions/
- Renamed to consistent format (removed ADR- prefix)
decisions/0016-affine-email-verification-strategy.md (new file, 150 lines)
@@ -0,0 +1,150 @@
# Affine Email Verification Strategy for Authentik OIDC

* Status: proposed
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Affine requires email verification for users, but Authentik is not configured with SMTP for email delivery

## Context and Problem Statement

Affine (self-hosted note-taking/collaboration tool) requires users to have verified email addresses. When users authenticate via Authentik OIDC, Affine checks the `email_verified` claim. Currently, Authentik has no SMTP configuration, so it cannot send verification emails, causing new users to be blocked or have limited functionality in Affine.

How can we satisfy Affine's email verification requirement without adding significant infrastructure complexity to the homelab?

## Decision Drivers

* Minimize external dependencies and ongoing costs
* Keep the solution self-contained within the homelab
* Avoid breaking changes on Affine upgrades
* Maintain security - don't completely bypass verification for untrusted users
* Simple to implement and maintain

## Considered Options

1. **Override `email_verified` claim in Authentik** - Configure Authentik to always return `email_verified: true` for trusted users
2. **Deploy local SMTP server (Mailpit)** - Run a lightweight mail capture server in-cluster
3. **Configure Affine to skip verification for OIDC users** - Use Affine's configuration to trust OIDC-provided emails

## Decision Outcome

Chosen option: **Option 1 (Override `email_verified` claim)** as the primary solution, with Option 3 as a fallback if Affine supports it.

This approach requires zero additional infrastructure, works immediately, and is appropriate for a homelab where all users are trusted (family/personal use). Option 2 (Mailpit) is documented for future reference if actual email delivery becomes needed for other applications.

### Positive Consequences

* No additional services to deploy or maintain
* Works immediately with the existing Authentik setup
* No external dependencies or costs
* Can be easily reverted if requirements change

### Negative Consequences

* Bypasses "real" email verification - relies on trust
* If Affine is ever exposed to untrusted users, this would need revisiting
* Other applications expecting real email verification would need similar workarounds

## Pros and Cons of the Options

### Option 1: Override `email_verified` Claim in Authentik

Configure an Authentik property mapping to always return `email_verified: true` in the OIDC token for the Affine application.

**Implementation:**

1. In Authentik Admin → Customization → Property Mappings
2. Create a new "Scope Mapping" for `email_verified`
3. Set expression: `return True`
4. Assign to the Affine OIDC provider

* Good, because zero infrastructure is required
* Good, because it is an immediate solution
* Good, because it is appropriate for trusted homelab users
* Bad, because it is not "real" verification
* Bad, because per-application configuration is needed

### Option 2: Deploy Local SMTP Server (Mailpit)

Deploy Mailpit (or MailHog) as a lightweight SMTP server in the cluster that captures all emails for viewing via a web UI.

**Implementation:**

```yaml
# Example Mailpit deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mailpit
  namespace: productivity
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mailpit
  template:
    metadata:
      labels:
        app: mailpit  # must match the selector above
    spec:
      containers:
        - name: mailpit
          image: axllent/mailpit:latest
          ports:
            - containerPort: 1025 # SMTP
            - containerPort: 8025 # Web UI
```

Then configure Authentik SMTP settings:

- Host: `mailpit.productivity.svc.cluster.local`
- Port: `1025`
- TLS: disabled (internal traffic)
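
For that hostname to resolve, Mailpit also needs a ClusterIP Service in front of the Deployment. A minimal sketch, assuming the `app: mailpit` label and `productivity` namespace used above:

```yaml
# Sketch: Service exposing Mailpit's SMTP and web UI ports in-cluster
apiVersion: v1
kind: Service
metadata:
  name: mailpit
  namespace: productivity
spec:
  selector:
    app: mailpit
  ports:
    - name: smtp
      port: 1025
      targetPort: 1025
    - name: http
      port: 8025
      targetPort: 8025
```

With this in place, `mailpit.productivity.svc.cluster.local:1025` is reachable from Authentik's pods.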

* Good, because it provides an actual email flow for testing
* Good, because it is useful for other apps needing email (password reset, notifications)
* Good, because emails are viewable via a web UI
* Bad, because emails don't actually leave the cluster
* Bad, because it is another service to maintain
* Bad, because it requires Authentik reconfiguration

### Option 3: Configure Affine to Skip Verification for OIDC Users

If Affine supports it, configure the application to trust email addresses from OIDC providers without requiring separate verification.

**Potential Configuration (needs verification):**

```yaml
# In affine-config ConfigMap
AFFINE_AUTH_OIDC_EMAIL_VERIFIED: "true"
# or similar environment variable
```

* Good, because no Authentik changes are needed
* Good, because it is scoped to Affine only
* Bad, because it may not be supported by Affine
* Bad, because it could break on Affine upgrades
* Bad, because it requires research into Affine's configuration options

## Implementation Notes

### For Option 1 (Recommended)

1. Access Authentik admin at `https://auth.daviestechlabs.io/if/admin/`
2. Navigate to Customization → Property Mappings
3. Create a new Scope Mapping:
   - Name: `Affine Email Verified Override`
   - Scope name: `email`
   - Expression:

     ```python
     return {
         "email": request.user.email,
         "email_verified": True,
     }
     ```

4. Edit the Affine OIDC Provider → Advanced Settings → Scope Mappings
5. Replace the default email mapping with the new override

### Future Considerations

If the homelab expands to include external users or applications requiring real email delivery:

- Revisit Option 2 (Mailpit) for development/testing
- Consider an external SMTP service (SendGrid free tier, AWS SES) for production email

## References

* [Authentik Property Mappings Documentation](https://docs.goauthentik.io/docs/property-mappings)
* [Affine Self-Hosting Documentation](https://docs.affine.pro/docs/self-host-affine)
* [Mailpit GitHub](https://github.com/axllent/mailpit)
decisions/0019-handler-deployment-strategy.md (new file, 365 lines)
@@ -0,0 +1,365 @@
# ADR-0019: Python Module Deployment Strategy

## Status

Accepted

## Date

2026-02-02

## Context

We have Python modules for AI/ML workflows that need to run on our unified GPU cluster:

| Repo | Purpose | Needs GPU? |
|------|---------|------------|
| `handler-base` | Shared library (NATS, clients, telemetry) | No |
| `chat-handler` | Text chat → RAG → LLM pipeline | No (calls GPU endpoints) |
| `voice-assistant` | Audio → STT → RAG → LLM → TTS pipeline | No (calls GPU endpoints) |
| `pipeline-bridge` | Kubeflow ↔ NATS integration | No |
| `kuberay-images/ray-serve/` | Inference deployments (Whisper, TTS, LLM, etc.) | **Yes** |

### Current Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                           PLATFORM LAYERS                           │
├─────────────────────────────────────────────────────────────────────┤
│  Kubeflow Pipelines  │  KServe (visibility)  │  MLflow (registry)  │
│   [Orchestration]    │  [InferenceServices]  │   [Models/Metrics]  │
├─────────────────────────────────────────────────────────────────────┤
│                             RAY CLUSTER                             │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Ray Serve Applications (GPU inference)                         │ │
│ │  ├─ /llm        → VLLMDeployment (khelben, 0.95 GPU)           │ │
│ │  ├─ /whisper    → WhisperDeployment (elminster, 0.5 GPU)       │ │
│ │  ├─ /tts        → TTSDeployment (elminster, 0.5 GPU)           │ │
│ │  ├─ /embeddings → EmbeddingsDeployment (drizzt, 0.8 GPU)       │ │
│ │  └─ /reranker   → RerankerDeployment (danilo, 0.8 GPU)         │ │
│ ├────────────────────────────────────────────────────────────────┤ │
│ │ Ray Serve Applications (CPU orchestration) ← WHERE HANDLERS GO │ │
│ │  ├─ /chat  → ChatHandler (head node, 0 GPU)                    │ │
│ │  └─ /voice → VoiceHandler (head node, 0 GPU)                   │ │
│ └────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│  RayJob (batch/training)  │  NATS (events)  │  Milvus (vectors)    │
└─────────────────────────────────────────────────────────────────────┘
```

The key insight is that **handlers ARE Ray Serve applications** - they just don't need GPUs.
They should run inside the Ray cluster to:

1. Use Ray's internal calling (faster than HTTP)
2. Share observability (Ray Dashboard)
3. Leverage Ray's scheduling for resource management

## Decision

**Deploy handlers as Ray Serve applications inside the Ray cluster**, using `runtime_env`
to install Python packages from Gitea's package registry at deployment time.

### Why Ray Serve (not standalone containers)?

1. **Unified Platform**: Everything runs in Ray - inference AND orchestration
2. **Internal Calls**: Handlers can call inference deployments via Ray handles (no HTTP)
3. **Resource Sharing**: The Ray head node has spare CPU/memory for orchestration
4. **Single Observability**: The Ray Dashboard shows all applications
5. **Simpler Ops**: One RayService to manage, not multiple Deployments

### Why runtime_env with pip (not baked into images)?

1. **Faster Iteration**: Change handler code → push to PyPI → redeploy RayService
2. **Decoupled Releases**: Handlers update independently of worker images
3. **Smaller Images**: Worker images only need inference dependencies
4. **MLflow Integration**: Can version handlers as MLflow models if needed

## Implementation Plan

### Phase 1: Publish Packages to Gitea PyPI

Each handler repo publishes to Gitea's built-in package registry on release:

```yaml
# .gitea/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
    tags: ['v*']
  pull_request:
    branches: [main]

jobs:
  lint:
    # ... existing lint job

  test:
    # ... existing test job

  publish:
    runs-on: ubuntu-latest
    needs: [lint, test]
    if: startsWith(github.ref, 'refs/tags/v')
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Build package
        run: uv build

      - name: Publish to Gitea PyPI
        env:
          UV_PUBLISH_URL: https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi
          UV_PUBLISH_TOKEN: ${{ secrets.GITEA_TOKEN }}
        run: uv publish
```

### Phase 2: Update RayService with Handler Applications

Add handler applications to the existing RayService:

```yaml
# rayservice.yaml additions
spec:
  serveConfigV2: |
    applications:
      # ... existing GPU inference applications ...

      # ============================================
      # HANDLERS (CPU - runs on head node)
      # ============================================

      # Chat Handler - RAG + LLM pipeline
      - name: chat-handler
        route_prefix: /chat
        import_path: chat_handler:app
        runtime_env:
          pip:
            - handler-base>=0.1.0
            - chat-handler>=0.1.0
          pip_find_links:
            - https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
            OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.monitoring.svc.cluster.local:4317"
        deployments:
          - name: ChatDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 0.5
              num_gpus: 0  # No GPU needed
            max_ongoing_requests: 50

      # Voice Assistant - STT → RAG → LLM → TTS pipeline
      - name: voice-assistant
        route_prefix: /voice
        import_path: voice_assistant:app
        runtime_env:
          pip:
            - handler-base>=0.1.0
            - voice-assistant>=0.1.0
          pip_find_links:
            - https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
        deployments:
          - name: VoiceDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 1
              num_gpus: 0
            max_ongoing_requests: 20
```

### Phase 3: Refactor Handlers for Ray Serve

Convert handlers from standalone NATS subscribers to Ray Serve deployments that can
also optionally subscribe to NATS:

```python
# chat_handler.py (refactored)
from ray import serve

from handler_base import Settings
from handler_base.clients import EmbeddingsClient, LLMClient, RerankerClient, MilvusClient


@serve.deployment(
    name="ChatDeployment",
    num_replicas=2,
    ray_actor_options={"num_cpus": 0.5, "num_gpus": 0},
)
class ChatHandler:
    def __init__(self):
        self.settings = Settings()

        # Initialize clients - these can use Ray handles for internal calls
        self.embeddings = EmbeddingsClient()
        self.llm = LLMClient()
        self.reranker = RerankerClient()
        self.milvus = MilvusClient()

    async def __call__(self, request) -> dict:
        """Handle HTTP requests (from Gradio, etc.)"""
        data = await request.json()
        return await self.process_chat(data)

    async def process_chat(self, data: dict) -> dict:
        """Core chat logic - called by HTTP or NATS"""
        query = data["query"]

        # 1. Generate embeddings
        embedding = await self.embeddings.embed(query)

        # 2. Vector search
        results = await self.milvus.search(embedding, top_k=10)

        # 3. Rerank
        reranked = await self.reranker.rerank(query, results)

        # 4. Generate response
        response = await self.llm.generate(query, context=reranked[:5])

        return {
            "response": response,
            "sources": reranked[:5],
        }


# Ray Serve app binding
app = ChatHandler.bind()
```

### Phase 4: Use Ray Handles for Internal Calls (Optional Optimization)

Update handler-base clients to use Ray handles when running inside Ray:

```python
# handler_base/clients/embeddings.py
import httpx
import ray
from ray import serve


class EmbeddingsClient:
    def __init__(self, url: str = None):
        self.url = url
        self._handle = None

        # If running inside Ray, get a handle to the embeddings deployment
        if ray.is_initialized():
            try:
                self._handle = serve.get_deployment_handle(
                    "EmbeddingsDeployment",
                    app_name="embeddings",
                )
            except Exception:
                pass  # Fall back to HTTP

    async def embed(self, text: str) -> list[float]:
        if self._handle:
            # Fast internal Ray call
            return await self._handle.embed.remote(text)
        else:
            # HTTP fallback for external callers
            async with httpx.AsyncClient() as client:
                resp = await client.post(f"{self.url}/v1/embeddings", json={"input": text})
                return resp.json()["data"][0]["embedding"]
```

### Phase 5: NATS Bridge (Optional)

If you still want NATS integration, add a separate NATS bridge that forwards to Ray Serve:

```python
# pipeline_bridge.py - runs as a Ray actor, subscribes to NATS
import nats
import ray
from ray import serve


@ray.remote
class NATSBridge:
    def __init__(self):
        self.nc = None
        self.chat_handle = serve.get_deployment_handle("ChatDeployment", "chat-handler")
        self.voice_handle = serve.get_deployment_handle("VoiceDeployment", "voice-assistant")

    async def start(self):
        self.nc = await nats.connect("nats://nats.ai-ml.svc.cluster.local:4222")

        await self.nc.subscribe("ai.chat.request", cb=self.handle_chat)
        await self.nc.subscribe("voice.request", cb=self.handle_voice)

    async def handle_chat(self, msg):
        result = await self.chat_handle.process_chat.remote(msg.data)
        if msg.reply:
            await self.nc.publish(msg.reply, result)
```

## CI/CD Flow

```
┌────────────────────────────────────────────────────────────────────┐
│ Developer pushes to handler repo                                   │
├────────────────────────────────────────────────────────────────────┤
│ 1. Gitea Actions: lint → test                                      │
│ 2. On tag: build wheel → publish to Gitea PyPI                     │
├────────────────────────────────────────────────────────────────────┤
│ 3. Update RayService version in homelab-k8s2                       │
│    (bump handler-base>=0.2.0 in runtime_env)                       │
├────────────────────────────────────────────────────────────────────┤
│ 4. Flux detects change → applies RayService                        │
│ 5. Ray downloads new packages → restarts deployments               │
└────────────────────────────────────────────────────────────────────┘
```

## Alternatives Considered

### Standalone Container Deployments

Run handlers as separate Kubernetes Deployments outside Ray.

**Rejected because:**

- Duplicates infrastructure (separate scaling, health checks, etc.)
- HTTP overhead for every inference call
- Separate observability stack
- Against the "Ray as unified compute" philosophy

### Bake Handlers into Worker Images

Pre-install handler code in ray-worker images.

**Rejected because:**

- Couples handler releases to image rebuilds
- Slower iteration cycle
- Larger images

## Consequences

### Positive

- Single platform: Everything runs in Ray
- Fast internal calls via Ray handles
- Unified observability in the Ray Dashboard
- Clean abstraction layers: Kubeflow → KServe → Ray → GPU
- Handlers scale with Ray's autoscaler

### Negative

- Handlers share Ray head node resources
- Need to manage Gitea PyPI authentication for runtime_env
- Slightly more complex RayService configuration
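
On the authentication point: one possible approach, sketched below, is to embed a read-scoped token directly in the package index URL used by `runtime_env`. The `ci-bot` username and token are assumed placeholders, and whether credentials belong in `pip_find_links` or in a `PIP_INDEX_URL` environment variable would need verification against how Ray resolves packages in this setup:

```yaml
# Sketch only - credential names and URL scheme are assumptions, not from this ADR
runtime_env:
  pip:
    - handler-base>=0.1.0
  pip_find_links:
    # basic-auth credentials embedded in the registry URL (read-only token)
    - https://ci-bot:GITEA_READ_TOKEN@git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
```

The trade-off is that the token then appears in the RayService spec, so it should be a narrowly scoped, read-only token.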

### Neutral

- MLflow can track handler "models" if we want versioned deployments
- Kubeflow can trigger handler updates via pipelines

## References

- [ray-kserve-integration.md](../../homelab-k8s2/docs/ray-kserve-integration.md)
- [Ray Serve runtime_env docs](https://docs.ray.io/en/latest/serve/production-guide/config.html)
- [Gitea Package Registry](https://docs.gitea.io/en-us/packages/pypi/)
- [ADR-0012: Ray Cluster Architecture](ADR-0012-ray-cluster-unified.md)
decisions/0020-internal-registry-for-cicd.md (new file, 133 lines)
@@ -0,0 +1,133 @@
# ADR-0020: Internal Registry URLs for CI/CD

## Status

Accepted

## Date

2026-02-02

## Context

| Factor | Details |
|--------|---------|
| Problem | Cloudflare proxying limits uploads to 100MB per request |
| Impact | Docker images (20GB+) and large packages fail to push |
| Current Setup | Gitea at `git.daviestechlabs.io` behind Cloudflare |
| Internal Access | `registry.lab.daviestechlabs.io` bypasses Cloudflare |

Our Gitea instance is accessible via two URLs:

- **External**: `git.daviestechlabs.io` - proxied through Cloudflare (DDoS protection, caching)
- **Internal**: `registry.lab.daviestechlabs.io` - direct access from the cluster network

Cloudflare's free tier enforces a 100MB upload limit per request. This blocks:

- Docker image pushes (multi-GB layers)
- Large Python package uploads
- Any artifact exceeding 100MB

## Decision

**Use internal registry URLs for all CI/CD artifact uploads.**

CI/CD workflows running on Gitea Actions runners (which are inside the cluster) should use `registry.lab.daviestechlabs.io` for:

- Docker image pushes
- PyPI package uploads
- Any large artifact uploads

External URLs remain for:

- Git operations (clone, push, pull)
- Package downloads (pip install, docker pull)
- Human access via browser

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                            INTERNET                             │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │   Cloudflare    │
                        │  (100MB limit)  │
                        └────────┬────────┘
                                 │
                                 ▼
                 ┌──────────────────────────────┐
                 │    git.daviestechlabs.io     │
                 │      (external access)       │
                 └──────────────────────────────┘
                                 │
                                 │ same Gitea instance
                                 │
                 ┌──────────────────────────────┐
                 │ registry.lab.daviestechlabs  │
                 │    (internal, no limits)     │
                 └──────────────────────────────┘
                                 ▲
                                 │ direct upload
                                 │
                 ┌──────────────────────────────┐
                 │     Gitea Actions Runner     │
                 │         (in-cluster)         │
                 └──────────────────────────────┘
```

## Consequences

### Positive

- **No upload size limits** for CI/CD artifacts
- **Faster uploads** (no Cloudflare proxy overhead)
- **Lower latency** for in-cluster operations
- **Cost savings** (reduced Cloudflare bandwidth)

### Negative

- **Two URLs to maintain** in workflow configurations
- **Runners must be in-cluster** (external runners cannot be used for uploads)
- **Split-horizon DNS** required if accessing from outside

### Neutral

- External users can still pull packages/images via the Cloudflare URL
- Git operations continue through the external URL (small payloads)

## Implementation

### Docker Registry Login

```yaml
- name: Login to Internal Registry
  uses: docker/login-action@v3
  with:
    registry: registry.lab.daviestechlabs.io
    username: ${{ secrets.REGISTRY_USER }}
    password: ${{ secrets.REGISTRY_TOKEN }}
```
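
After logging in, a push step might look like the following. This is a sketch: the image path and action version are assumptions, not from this ADR:

```yaml
# Sketch: build and push through the internal registry (image name illustrative)
- name: Build and Push Image
  uses: docker/build-push-action@v6
  with:
    push: true
    tags: registry.lab.daviestechlabs.io/daviestechlabs/example-image:${{ github.sha }}
```

Because the runner is in-cluster, multi-GB layers upload directly without hitting the Cloudflare limit.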

### PyPI Upload

```yaml
- name: Publish to Gitea PyPI
  run: |
    twine upload \
      --repository-url https://registry.lab.daviestechlabs.io/api/packages/daviestechlabs/pypi \
      dist/*
```

### Environment Variable Pattern

For consistency across workflows:

```yaml
env:
  REGISTRY_EXTERNAL: git.daviestechlabs.io
  REGISTRY_INTERNAL: registry.lab.daviestechlabs.io
```

## Related

- [ADR-0019: Handler Deployment Strategy](0019-handler-deployment-strategy.md) - uses PyPI publishing
- [Cloudflare upload limits](https://developers.cloudflare.com/workers/platform/limits/)
decisions/0021-notification-architecture.md (new file, 131 lines)
@@ -0,0 +1,131 @@
# ADR-0021: Notification Architecture

## Status

Accepted

## Context

The homelab infrastructure generates notifications from multiple sources:

1. **CI/CD pipelines** (Gitea Actions) - build success/failure
2. **Alertmanager** - Prometheus alerts for critical/warning conditions
3. **Gatus** - service health monitoring
4. **Flux** - GitOps reconciliation events
5. **Service readiness** - notifications when deployments complete successfully

Currently, ntfy serves as the primary notification hub, but there are several issues:

- **Topic inconsistency**: CI workflows were posting to `builds` while documentation (ADR-0015) specified `gitea-ci`
- **No Alertmanager integration**: Critical Prometheus alerts had no delivery mechanism
- **No service readiness notifications**: No visibility when services come online after deployment

## Decision

### 1. ntfy as the Notification Hub

ntfy will serve as the central notification aggregation point. All internal services publish to ntfy topics via the internal Kubernetes service URL:

```
http://ntfy-svc.observability.svc.cluster.local/<topic>
```

This keeps ntfy auth-protected externally while allowing internal services to publish freely.

### 2. Standardized Topics

| Topic | Source | Description |
|-------|--------|-------------|
| `gitea-ci` | Gitea Actions | CI/CD build notifications |
| `alertmanager-alerts` | Alertmanager | Prometheus critical/warning alerts |
| `gatus` | Gatus | Service health status changes |
| `flux` | Flux | GitOps reconciliation events |
| `deployments` | Flux/Argo | Service deployment completions |

### 3. Alertmanager Integration

Alertmanager is configured to forward alerts to ntfy using the built-in `tpl=alertmanager` template:

```yaml
receivers:
  - name: ntfy-critical
    webhookConfigs:
      - url: "http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts?tpl=alertmanager&priority=urgent&tags=rotating_light"
        sendResolved: true
  - name: ntfy-warning
    webhookConfigs:
      - url: "http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts?tpl=alertmanager&priority=high&tags=warning"
        sendResolved: true
```

Routes direct alerts based on severity:

- `severity=critical` → `ntfy-critical` receiver
- `severity=warning` → `ntfy-warning` receiver
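
That routing could be expressed roughly as follows. This sketch uses the camelCase AlertmanagerConfig-CRD style implied by the `webhookConfigs`/`sendResolved` fields above; the exact route shape is an assumption to verify against the deployed Alertmanager:

```yaml
# Sketch: severity-based routing to the receivers defined above
route:
  receiver: ntfy-warning  # default receiver
  routes:
    - matchers:
        - name: severity
          value: critical
      receiver: ntfy-critical
    - matchers:
        - name: severity
          value: warning
      receiver: ntfy-warning
```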

### 4. Service Readiness Notifications

To provide visibility when services are fully operational after deployment:

**Option A: Flux Notification Controller**
Configure Flux's notification-controller to send alerts when Kustomizations/HelmReleases succeed:

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: ntfy-deployments
spec:
  type: generic-hmac # or generic
  address: http://ntfy-svc.observability.svc.cluster.local/deployments
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: deployment-success
spec:
  providerRef:
    name: ntfy-deployments
  eventSeverity: info
  eventSources:
    - kind: Kustomization
      name: '*'
    - kind: HelmRelease
      name: '*'
  inclusionList:
    - ".*succeeded.*"
```

**Option B: Argo Workflows Post-Deploy Hook**
For Argo-managed deployments, add a notification step at workflow completion.

**Recommendation**: Use the Flux Notification Controller (Option A), as it is already part of the GitOps stack and provides native integration.

## Consequences

### Positive

- **Single source of truth**: All notifications flow through ntfy
- **Auth protection maintained**: External ntfy access requires Authentik auth
- **Deployment visibility**: Know when services are ready without watching logs
- **Consistent topic naming**: All sources follow documented conventions

### Negative

- **Configuration overhead**: Each notification source requires explicit configuration

### Neutral

- Topic naming must be documented and followed consistently
- Future Discord integration is addressed in ADR-0022

## Implementation Checklist

- [x] Standardize CI notifications to the `gitea-ci` topic
- [x] Configure Alertmanager → ntfy for critical/warning alerts
- [ ] Configure Flux notification-controller for deployment notifications
- [ ] Add `deployments` topic subscription to the ntfy app

## Related

- ADR-0015: CI Notifications and Semantic Versioning
- ADR-0022: ntfy-Discord Bridge Service
decisions/0022-ntfy-discord-bridge.md (new file, 302 lines)
@@ -0,0 +1,302 @@
# ADR-0022: ntfy-Discord Bridge Service

## Status

Accepted

## Context

Per ADR-0021, ntfy serves as the central notification hub for the homelab. However, Discord is used for team collaboration and visibility, requiring notifications to be forwarded there as well.

ntfy does not natively support the Discord webhook format. Discord expects a specific JSON structure with embeds, while ntfy uses its own message format. A bridge service is needed to:

1. Subscribe to ntfy topics
2. Transform messages to Discord embed format
3. Forward to Discord webhooks

## Decision

### Architecture

A dedicated Go microservice (`ntfy-discord`) will bridge ntfy to Discord:

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────┐
│ CI/Alertmanager │────▶│       ntfy       │────▶│  ntfy App   │
│ Gatus/Flux      │     │  (notification   │     │  (mobile)   │
└─────────────────┘     │       hub)       │     └─────────────┘
                        └────────┬─────────┘
                                 │ SSE/JSON stream
                                 ▼
                        ┌──────────────────┐     ┌─────────────┐
                        │   ntfy-discord   │────▶│   Discord   │
                        │       (Go)       │     │   Webhook   │
                        └──────────────────┘     └─────────────┘
```

### Service Design

**Repository**: `ntfy-discord`

**Technology Stack**:

- Go 1.22+
- `fsnotify` for hot reload of secrets/config
- Standard library `net/http` for SSE subscription
- `slog` for structured logging
- Scratch/distroless base image (~10MB final image)

**Why Go over Python**:

- **Smaller images**: ~10MB vs ~150MB+ for Python
- **Cloud native**: Single static binary, no runtime dependencies
- **Memory efficient**: Lower RSS, ideal for an always-on bridge
- **Concurrency**: Goroutines for SSE handling and webhook delivery
- **Compile-time safety**: Catch errors before deployment

**Core Features**:

1. **SSE Subscription**: Connect to ntfy's JSON stream endpoint for real-time messages
2. **Automatic Reconnection**: Exponential backoff on connection failures
3. **Message Transformation**: Convert ntfy format to Discord embed format
4. **Priority Mapping**: Map ntfy priorities to Discord embed colors
5. **Topic Routing**: Configure which topics go to which Discord channels/webhooks
6. **Hot Reload**: Watch mounted secrets/configmaps with fsnotify, reload without restart
7. **Health Endpoint**: `/health` and `/ready` for Kubernetes probes
8. **Metrics**: Prometheus metrics at `/metrics`

### Hot Reload Implementation

Kubernetes mounts secrets as symlinked files that update atomically. The bridge uses `fsnotify` to watch for changes:

```go
// watchSecrets watches the mounted secret directory and reloads
// configuration whenever Kubernetes rotates the files.
func (b *Bridge) watchSecrets(ctx context.Context, secretPath string) {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		slog.Error("failed to create watcher", "error", err)
		return
	}
	defer watcher.Close()

	if err := watcher.Add(secretPath); err != nil {
		slog.Error("failed to watch secret path", "error", err)
		return
	}

	for {
		select {
		case event := <-watcher.Events:
			if event.Has(fsnotify.Write) || event.Has(fsnotify.Create) {
				slog.Info("secret changed, reloading config")
				b.reloadConfig(secretPath)
			}
		case <-ctx.Done():
			return
		}
	}
}
```

This allows ExternalSecrets to rotate the Discord webhook URL without pod restarts.

### Configuration

Configuration via environment variables and mounted secrets:

```yaml
# Environment variables (ConfigMap)
NTFY_URL: "http://ntfy.observability.svc.cluster.local"
NTFY_TOPICS: "gitea-ci,alertmanager-alerts,flux-deployments,gatus"
LOG_LEVEL: "info"
METRICS_ENABLED: "true"

# Mounted secret (hot-reloadable)
/secrets/discord-webhook-url   # Single webhook for all topics
# OR for topic routing:
/secrets/topic-webhooks.yaml   # YAML mapping topics to webhooks
```

Topic routing file (optional):
```yaml
gitea-ci: "https://discord.com/api/webhooks/xxx/ci"
alertmanager-alerts: "https://discord.com/api/webhooks/xxx/alerts"
flux-deployments: "https://discord.com/api/webhooks/xxx/deploys"
default: "https://discord.com/api/webhooks/xxx/general"
```
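
Topic routing then reduces to a small lookup with a `default` fallback. A minimal sketch (the `resolveWebhook` helper is illustrative, not from the actual repo):

```go
package main

import "fmt"

// resolveWebhook picks the Discord webhook for a topic, falling back to
// the "default" entry when the topic has no explicit mapping.
// The routes map mirrors the parsed topic-webhooks.yaml file.
func resolveWebhook(routes map[string]string, topic string) (string, bool) {
	if url, ok := routes[topic]; ok {
		return url, true
	}
	url, ok := routes["default"]
	return url, ok
}

func main() {
	routes := map[string]string{
		"gitea-ci": "https://discord.com/api/webhooks/xxx/ci",
		"default":  "https://discord.com/api/webhooks/xxx/general",
	}
	url, _ := resolveWebhook(routes, "gatus") // unmapped topic
	fmt.Println(url)                          // falls back to the default webhook
}
```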

### Message Transformation

ntfy message:
```json
{
  "id": "abc123",
  "topic": "gitea-ci",
  "title": "Build succeeded",
  "message": "ray-serve-apps published to PyPI",
  "priority": 3,
  "tags": ["package", "white_check_mark"],
  "time": 1770050091
}
```

Discord embed:
```json
{
  "embeds": [{
    "title": "✅ Build succeeded",
    "description": "ray-serve-apps published to PyPI",
    "color": 3066993,
    "fields": [
      {"name": "Topic", "value": "gitea-ci", "inline": true}
    ],
    "timestamp": "2026-02-02T11:34:51Z",
    "footer": {"text": "ntfy"}
  }]
}
```

**Priority → Color Mapping**:

| Priority | Name | Discord Color |
|----------|------|---------------|
| 5 | Max/Urgent | 🔴 Red (15158332) |
| 4 | High | 🟠 Orange (15105570) |
| 3 | Default | 🟢 Green (3066993) |
| 2 | Low | ⚪ Gray (9807270) |
| 1 | Min | ⚪ Light Gray (12370112) |

**Tag → Emoji Mapping**:

Common ntfy tags are converted to Discord-friendly emojis in the title:
- `white_check_mark` / `heavy_check_mark` → ✅
- `x` / `skull` → ❌
- `warning` → ⚠️
- `rotating_light` → 🚨
- `rocket` → 🚀
- `package` → 📦
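
A minimal sketch of how these mapping tables might look in Go. The names (`NtfyMessage`, `embedTitle`) are illustrative, and this sketch simply uses the first matching tag in the message, whereas the embed example above shows the check-mark emoji winning:

```go
package main

import (
	"fmt"
	"time"
)

// NtfyMessage is the subset of ntfy's JSON stream format the bridge uses.
type NtfyMessage struct {
	Title    string
	Message  string
	Topic    string
	Priority int
	Tags     []string
	Time     int64 // Unix seconds
}

// priorityColor maps ntfy priorities (1-5) to Discord embed colors.
var priorityColor = map[int]int{
	5: 15158332, // red
	4: 15105570, // orange
	3: 3066993,  // green (default)
	2: 9807270,  // gray
	1: 12370112, // light gray
}

// tagEmoji maps common ntfy tags to emoji prepended to the title.
var tagEmoji = map[string]string{
	"white_check_mark": "✅", "heavy_check_mark": "✅",
	"x": "❌", "skull": "❌",
	"warning": "⚠️", "rotating_light": "🚨",
	"rocket": "🚀", "package": "📦",
}

// embedTitle prefixes the title with the emoji of the first matching tag.
func embedTitle(m NtfyMessage) string {
	for _, t := range m.Tags {
		if e, ok := tagEmoji[t]; ok {
			return e + " " + m.Title
		}
	}
	return m.Title
}

func main() {
	m := NtfyMessage{Title: "Build succeeded", Topic: "gitea-ci",
		Priority: 3, Tags: []string{"package", "white_check_mark"},
		Time: 1770050091}
	fmt.Println(embedTitle(m))             // 📦 Build succeeded
	fmt.Println(priorityColor[m.Priority]) // 3066993
	// ntfy's Unix seconds become the embed's RFC 3339 timestamp:
	fmt.Println(time.Unix(m.Time, 0).UTC().Format(time.RFC3339))
}
```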

### Kubernetes Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ntfy-discord
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ntfy-discord
  template:
    metadata:
      labels:
        app: ntfy-discord
    spec:
      containers:
        - name: bridge
          image: gitea-http.gitea.svc.cluster.local:3000/daviestechlabs/ntfy-discord:latest
          env:
            - name: NTFY_URL
              value: "http://ntfy.observability.svc.cluster.local"
            - name: NTFY_TOPICS
              value: "gitea-ci,alertmanager-alerts,flux-deployments"
            - name: SECRETS_PATH
              value: "/secrets"
          ports:
            - containerPort: 8080
              name: http
          volumeMounts:
            - name: discord-secrets
              mountPath: /secrets
              readOnly: true
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            periodSeconds: 10
          resources:
            limits:
              cpu: 50m
              memory: 32Mi
            requests:
              cpu: 5m
              memory: 16Mi
      volumes:
        - name: discord-secrets
          secret:
            secretName: discord-webhook-secret
```

### Secret Management

The Discord webhook URL is stored in Vault at `kv/data/discord`:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: discord-webhook-secret
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault
    kind: ClusterSecretStore
  target:
    name: discord-webhook-secret
  data:
    - secretKey: webhook-url
      remoteRef:
        key: kv/data/discord
        property: webhook_url
```

When ExternalSecrets refreshes and updates the secret, the bridge detects the file change and reloads without a restart.

### Error Handling

1. **Connection Loss**: Exponential backoff (1s, 2s, 4s, ... max 60s)
2. **Discord Rate Limits**: Respect the `Retry-After` header; queue messages
3. **Invalid Messages**: Log and skip; don't crash
4. **Webhook Errors**: Log the error, continue processing other messages
5. **Config Reload Errors**: Log the error, keep using the previous config

## Consequences

### Positive

- **Tiny footprint**: ~10MB image, 16Mi memory request
- **Hot reload**: Secrets update without pod restarts
- **Robust**: Proper reconnection and error handling
- **Observable**: Structured logging, Prometheus metrics, health endpoints
- **Fast startup**: <100ms cold start
- **Cloud native**: Static binary, distroless image

### Negative

- **Go learning curve**: Different patterns than Python services
- **Operational overhead**: Another service to maintain
- **Latency**: Adds ~50-100ms to notification delivery

### Neutral

- The webhook URL must be maintained in Vault
- Service logs should be monitored for errors

## Implementation Checklist

- [x] Create `ntfy-discord` repository
- [ ] Implement core bridge logic
- [ ] Add SSE client with reconnection
- [ ] Implement message transformation
- [ ] Add fsnotify hot reload for secrets
- [ ] Add health/ready/metrics endpoints
- [ ] Write unit tests
- [ ] Create multi-stage Dockerfile (scratch base)
- [ ] Set up CI/CD pipeline (Gitea Actions)
- [ ] Add ExternalSecret for Discord webhook
- [ ] Create Kubernetes manifests
- [ ] Deploy to observability namespace
- [ ] Verify notifications flowing to Discord

## Related

- ADR-0021: Notification Architecture
- ADR-0015: CI Notifications and Semantic Versioning

decisions/0023-valkey-ml-caching.md

# ADR-0023: Valkey for ML Inference Caching

## Status

Accepted

## Context

The AI/ML platform requires caching infrastructure for multiple use cases:

1. **KV-Cache Offloading**: vLLM can offload key-value cache tensors to external storage, reducing GPU memory pressure and enabling longer context windows
2. **Embedding Cache**: Frequently requested embeddings can be cached to avoid redundant GPU computation
3. **Session State**: Conversation history and intermediate results for multi-turn interactions
4. **Ray Object Store Spillover**: Large Ray objects can spill to external storage when memory is constrained

Previously, two separate Valkey instances existed:
- `valkey` - General-purpose with 10Gi persistent storage
- `mlcache` - ML-optimized ephemeral cache with a 4GB memory limit and LRU eviction

Analysis revealed that `mlcache` had **zero consumers** in the codebase - no services were actually connecting to it.

## Decision

### Consolidate to a Single Valkey Instance

Remove `mlcache` and use the existing `valkey` instance for all caching needs. When vLLM KV-cache offloading is implemented in the RayService deployment, configure it to use the existing Valkey instance.

### Valkey Configuration

The current `valkey` instance at `valkey.ai-ml.svc.cluster.local:6379`:

| Setting | Value | Rationale |
|---------|-------|-----------|
| Persistence | 10Gi Longhorn PVC | Survive restarts, cache warm-up |
| Memory | 512Mi request, 2Gi limit | Sufficient for current workloads |
| Auth | Disabled | Internal cluster-only access |
| Metrics | Prometheus ServiceMonitor | Observability |

### Future: vLLM KV-Cache Integration

When implementing LMCache or a similar KV-cache offloading backend for vLLM, the configuration would look roughly like this (an illustrative sketch; the exact API depends on the chosen integration):

```python
# In ray_serve/serve_llm.py
# NOTE: illustrative sketch - the kv_cache_config shape is a placeholder
# for whatever the chosen offloading integration exposes.
from vllm import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    engine_args,
    kv_cache_config={
        "type": "redis",
        "url": "redis://valkey.ai-ml.svc.cluster.local:6379",
        "prefix": "vllm:kv:",
        "ttl": 3600,  # 1 hour cache lifetime
    },
)
```

If memory pressure becomes an issue, scale Valkey resources:

```yaml
resources:
  limits:
    memory: "8Gi"  # Increase for larger KV-cache
extraArgs:
  - --maxmemory
  - 6gb
  - --maxmemory-policy
  - allkeys-lru
```

### Key Prefix Convention

To avoid collisions when multiple services share Valkey:

| Service | Prefix | Example Key |
|---------|--------|-------------|
| vLLM KV-Cache | `vllm:kv:` | `vllm:kv:layer0:tok123` |
| Embeddings Cache | `emb:` | `emb:sha256:abc123` |
| Ray State | `ray:` | `ray:actor:xyz` |
| Session State | `session:` | `session:user:123` |
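
The convention is language-agnostic; a tiny helper can enforce it in each consumer. A sketch in Go, with an illustrative `cacheKey` name:

```go
package main

import (
	"fmt"
	"strings"
)

// cacheKey builds a namespaced Valkey key from a service prefix and parts,
// following the prefix convention above.
func cacheKey(prefix string, parts ...string) string {
	return prefix + strings.Join(parts, ":")
}

func main() {
	fmt.Println(cacheKey("vllm:kv:", "layer0", "tok123")) // vllm:kv:layer0:tok123
	fmt.Println(cacheKey("emb:", "sha256", "abc123"))     // emb:sha256:abc123
}
```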

## Consequences

### Positive

- **Reduced complexity**: One cache instance instead of two
- **Resource efficiency**: No unused mlcache consuming a 4GB memory allocation
- **Operational simplicity**: A single point of monitoring and maintenance
- **Cost savings**: One less PVC, pod, and service to manage

### Negative

- **Shared resource contention**: All workloads share the same cache
- **Single point of failure**: Cache unavailability affects all consumers

### Mitigations

- **Namespace isolation via prefixes**: Prevents key collisions
- **LRU eviction**: Automatic cleanup when memory is constrained
- **Persistent storage**: Cache survives pod restarts
- **Monitoring**: Prometheus metrics for memory usage alerts

## References

- [vLLM Distributed KV-Cache](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
- [LMCache Project](https://github.com/LMCache/LMCache)
- [Valkey Documentation](https://valkey.io/docs/)
- [Ray External Storage](https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html)

decisions/0024-ray-repository-structure.md

# ADR-0024: Ray Repository Structure

## Status

Accepted

## Date

2026-02-03

## Context

| Factor | Details |
|--------|---------|
| Problem | Need to document the Ray-specific repository structure |
| Impact | Clarity on where Ray components live post-migration |
| Current State | kuberay-images standalone; ray-serve needs extraction |
| Goal | Clean separation with independent release cycles |

### Historical Context

`llm-workflows` was the original monolithic repository containing all ML/AI infrastructure code. It has been **archived** after being fully decomposed into focused, independent repositories:

| Repository | Purpose |
|------------|---------|
| `ai-apps` | Gradio applications (STT, TTS, embeddings UIs) |
| `ai-pipelines` | Kubeflow pipeline definitions |
| `ai-services` | Core ML service implementations |
| `chat-handler` | Chat orchestration and routing |
| `handler-base` | Base handler framework |
| `pipeline-bridge` | Bridge between pipelines and services |
| `stt-module` | Speech-to-text service |
| `tts-module` | Text-to-speech service |
| `voice-assistant` | Voice assistant integration |
| `gradio-ui` | Shared Gradio UI components |
| `kuberay-images` | GPU-specific Ray worker base images |
| `ntfy-discord` | Notification bridge |
| `spark-analytics-jobs` | Spark batch analytics |
| `flink-analytics-jobs` | Flink streaming analytics |

### Remaining Ray Component

The `ray-serve` code still needs a dedicated repository for Ray Serve model inference services.

| Component | Current Location | Purpose |
|-----------|------------------|---------|
| kuberay-images | `kuberay-images/` (standalone) | Docker images for Ray workers (NVIDIA, AMD, Intel) |
| ray-serve | `llm-workflows/ray-serve/` (archived repo) | Ray Serve inference services |
| llm-workflows | `llm-workflows/` (archived) | Pipelines, handlers, STT/TTS, embeddings |

### Problems with the Previous Structure

1. **Tight coupling**: ray-serve changes required llm-workflows repo access
2. **CI/CD complexity**: Building ray-serve images triggered unrelated workflow steps
3. **Version management**: ray-serve deployments couldn't be versioned independently
4. **Team access**: Contributors to ray-serve needed access to the entire llm-workflows repo
5. **Build times**: Changes to unrelated code could trigger ray-serve rebuilds

## Decision

**Establish two dedicated Ray repositories with distinct purposes:**

| Repository | Type | Contents | Release Cycle |
|------------|------|----------|---------------|
| `kuberay-images` | Docker images | Ray worker base images (GPU-specific) | On dependency updates |
| `ray-serve` | PyPI package | Ray Serve application code | Per model/feature update |

### Key Design: Dynamic Code Loading

Ray Serve applications are deployed as **PyPI packages**, not baked into Docker images. This enables:

- **Decoupling**: Update model-serving logic without rebuilding containers
- **Runtime flexibility**: The Ray cluster pulls code via `pip install` at runtime
- **Faster iteration**: Code changes don't require image rebuilds or pod restarts
- **Version pinning**: Kubernetes manifests specify package versions independently
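
As a sketch, a RayService manifest in `homelab-k8s2` might pin the package through Ray's `runtime_env` pip support (the application name, import path, and extra index URL below are illustrative assumptions, not the actual manifest):

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llm-serve        # illustrative
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ray_serve.serve_apps:app   # hypothetical entry point
        runtime_env:
          pip:
            - --extra-index-url=https://registry.lab/pypi/simple  # internal registry (assumed URL)
            - ray-serve==1.0.0   # version pinned here, not in the image
```

Bumping the pinned version is then a one-line Git change reconciled by Flux, with no image rebuild.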

### Repository Structure

```
kuberay-images/                    # Docker images - GPU runtime environments
├── Dockerfile.ray-worker-nvidia
├── Dockerfile.ray-worker-rdna2
├── Dockerfile.ray-worker-strixhalo
├── Dockerfile.ray-worker-intel
├── Makefile
└── .gitea/workflows/
    └── build-push.yaml            # Builds & pushes to container registry

ray-serve/                         # PyPI package - application code
├── src/
│   └── ray_serve/
│       ├── __init__.py
│       ├── model_configs.py
│       └── serve_apps.py
├── pyproject.toml
├── README.md
└── .gitea/workflows/
    └── publish-ray-serve.yaml     # Publishes to PyPI registry
```

**Note**: Kubernetes deployment manifests live in `homelab-k8s2`, not in either Ray repo. This maintains separation between:
- **Infrastructure** (kuberay-images) - How to run Ray workers
- **Application** (ray-serve) - What code to run
- **Orchestration** (homelab-k8s2) - Where and when to deploy

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                         RAY INFRASTRUCTURE                          │
└─────────────────────────────────────────────────────────────────────┘
                                   │
               ┌───────────────────┴───────────────────┐
               │                                       │
               ▼                                       ▼
       ┌───────────────┐                       ┌───────────────┐
       │ kuberay-images│                       │ ray-serve     │
       │               │                       │               │
       │ Base worker   │                       │ PyPI package  │
       │ Docker images │                       │ Ray Serve     │
       │               │                       │ application   │
       │ NVIDIA/AMD/   │                       │               │
       │ Intel GPUs    │                       │ Model configs │
       └───────────────┘                       └───────────────┘
               │                                       │
               ▼                                       ▼
       ┌───────────────┐                       ┌───────────────┐
       │ Container     │                       │ PyPI          │
       │ Registry      │                       │ Registry      │
       │ registry.lab/ │                       │ registry.lab/ │
       │ kuberay/*     │                       │ pypi/ray-serve│
       └───────────────┘                       └───────────────┘
               │                                       │
               └───────────────────┬───────────────────┘
                                   │
                                   ▼
                         ┌───────────────────┐
                         │   Ray Cluster     │
                         │                   │
                         │ 1. Pull container │
                         │ 2. pip install    │
                         │    ray-serve      │
                         │ 3. Run serve app  │
                         └───────────────────┘
```

## Consequences

### Positive

- **Dynamic updates**: Deploy new model-serving code without rebuilding images
- **Independent releases**: Containers and application code versioned separately
- **Faster iteration**: A PyPI publish takes seconds vs minutes for Docker builds
- **Clear separation**: Infrastructure (images) vs application (code) vs orchestration (k8s)
- **Runtime flexibility**: The same container can run different ray-serve versions

### Negative

- **Runtime dependencies**: Pod startup requires `pip install` (cached in practice)
- **Version coordination**: Must track compatible versions between kuberay-images and ray-serve

### Migration Steps

1. ✅ `kuberay-images` already exists as a standalone repo
2. ✅ `llm-workflows` archived - all components extracted to dedicated repos
3. [ ] Create the `ray-serve` repo on Gitea
4. [ ] Move `.gitea/workflows/publish-ray-serve.yaml` to the new repo
5. [ ] Set up pyproject.toml for PyPI publishing
6. [ ] Update RayService manifests to `pip install ray-serve==X.Y.Z`
7. [ ] Verify the Ray cluster pulls the package correctly at runtime

## Version Compatibility Matrix

| kuberay-images | ray-serve | Notes |
|----------------|-----------|-------|
| 1.0.0 | 1.0.0 | Initial structure |

## References

- [ADR-0020: Internal Registry for CI/CD](./0020-internal-registry-for-cicd.md)
- [KubeRay Documentation](https://ray-project.github.io/kuberay/)
- [Ray Serve Documentation](https://docs.ray.io/en/latest/serve/index.html)