Update to match everything in my homelab.

This commit is contained in:
2026-02-05 16:13:53 -05:00
parent f8787379c5
commit 80fb911e22
30 changed files with 3107 additions and 7 deletions


@@ -4,11 +4,32 @@ This directory contains additional architecture diagrams beyond the main C4 diag
## Available Diagrams
| File | Description | Related ADR |
|------|-------------|-------------|
| [gpu-allocation.mmd](gpu-allocation.mmd) | GPU workload distribution | ADR-0005 |
| [data-flow-chat.mmd](data-flow-chat.mmd) | Chat request data flow | ADR-0003 |
| [data-flow-voice.mmd](data-flow-voice.mmd) | Voice request data flow | ADR-0003 |
| [gitops-flux.mmd](gitops-flux.mmd) | GitOps reconciliation loop | ADR-0006 |
| [dual-workflow-engines.mmd](dual-workflow-engines.mmd) | Argo vs Kubeflow decision flow | ADR-0009 |
| [kuberay-unified-backend.mmd](kuberay-unified-backend.mmd) | RayService endpoints and GPU allocation | ADR-0011 |
| [secrets-management.mmd](secrets-management.mmd) | SOPS bootstrap vs Vault runtime | ADR-0017 |
| [security-policy-enforcement.mmd](security-policy-enforcement.mmd) | Gatekeeper admission + Trivy scanning | ADR-0018 |
| [handler-deployment.mmd](handler-deployment.mmd) | Ray cluster platform layers | ADR-0019 |
| [internal-registry.mmd](internal-registry.mmd) | Internal vs external registry paths | ADR-0020 |
| [notification-architecture.mmd](notification-architecture.mmd) | ntfy hub with sources and consumers | ADR-0021 |
| [ntfy-discord-bridge.mmd](ntfy-discord-bridge.mmd) | ntfy to Discord message flow | ADR-0022 |
| [ray-repository-structure.mmd](ray-repository-structure.mmd) | Ray package build and loading | ADR-0024 |
| [observability-stack.mmd](observability-stack.mmd) | Prometheus + ClickStack telemetry flow | ADR-0025 |
| [storage-strategy.mmd](storage-strategy.mmd) | Longhorn + NFS dual-tier storage | ADR-0026 |
| [database-strategy.mmd](database-strategy.mmd) | CloudNativePG cluster management | ADR-0027 |
| [authentik-sso.mmd](authentik-sso.mmd) | Authentik authentication flow | ADR-0028 |
| [user-registration-workflow.mmd](user-registration-workflow.mmd) | User registration and approval | ADR-0029 |
| [velero-backup.mmd](velero-backup.mmd) | Velero backup and restore flow | ADR-0032 |
| [analytics-lakehouse.mmd](analytics-lakehouse.mmd) | Data analytics lakehouse architecture | ADR-0033 |
| [volcano-scheduling.mmd](volcano-scheduling.mmd) | Volcano batch scheduler and queues | ADR-0034 |
| [cluster-topology.mmd](cluster-topology.mmd) | Node topology (x86/ARM64/GPU) | ADR-0035 |
| [renovate-workflow.mmd](renovate-workflow.mmd) | Renovate dependency update cycle | ADR-0036 |
| [node-naming.mmd](node-naming.mmd) | D&D-themed node naming conventions | ADR-0037 |

## Rendering Diagrams


@@ -0,0 +1,85 @@
%% Data Analytics Lakehouse Architecture
%% Related: ADR-0033
flowchart TB
subgraph Ingestion["Data Ingestion"]
Kafka["Kafka<br/>Event Streams"]
APIs["REST APIs<br/>Batch Loads"]
Files["File Drops<br/>S3/NFS"]
end
subgraph Processing["Processing Layer"]
subgraph Batch["Batch Processing"]
Spark["Apache Spark<br/>spark-operator"]
end
subgraph Stream["Stream Processing"]
Flink["Apache Flink<br/>flink-operator"]
end
subgraph Realtime["Real-time"]
RisingWave["RisingWave<br/>Streaming SQL"]
end
end
subgraph Catalog["Lakehouse Catalog"]
Nessie["Nessie<br/>Git-like Versioning"]
Iceberg["Apache Iceberg<br/>Table Format"]
end
subgraph Storage["Storage Layer"]
S3["S3 (MinIO)<br/>Object Storage"]
Parquet["Parquet Files<br/>Columnar Format"]
end
subgraph Query["Query Layer"]
Trino["Trino<br/>Distributed SQL"]
end
subgraph Serve["Serving Layer"]
Grafana["Grafana<br/>Dashboards"]
Jupyter["JupyterHub<br/>Notebooks"]
Apps["Applications<br/>REST APIs"]
end
subgraph Metadata["Metadata Store"]
PostgreSQL["CloudNativePG<br/>analytics-db"]
end
Kafka --> Flink
Kafka --> RisingWave
APIs --> Spark
Files --> Spark
Spark --> Nessie
Flink --> Nessie
RisingWave --> Nessie
Nessie --> Iceberg
Iceberg --> S3
S3 --> Parquet
Nessie --> PostgreSQL
Trino --> Nessie
Trino --> Iceberg
Trino --> Grafana
Trino --> Jupyter
Trino --> Apps
classDef ingest fill:#4a5568,stroke:#718096,color:#fff
classDef batch fill:#3182ce,stroke:#2b6cb0,color:#fff
classDef stream fill:#38a169,stroke:#2f855a,color:#fff
classDef catalog fill:#d69e2e,stroke:#b7791f,color:#fff
classDef storage fill:#718096,stroke:#4a5568,color:#fff
classDef query fill:#805ad5,stroke:#6b46c1,color:#fff
classDef serve fill:#e53e3e,stroke:#c53030,color:#fff
classDef meta fill:#319795,stroke:#2c7a7b,color:#fff
class Kafka,APIs,Files ingest
class Spark batch
class Flink,RisingWave stream
class Nessie,Iceberg catalog
class S3,Parquet storage
class Trino query
class Grafana,Jupyter,Apps serve
class PostgreSQL meta
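The ingestion edges above amount to a routing table: streams fan out to the stream and real-time engines, while batch loads and file drops go to Spark. A minimal sketch in Python (engine names follow the diagram; this is illustrative, not actual pipeline config):

```python
# Ingestion source -> processing engines, per the diagram edges.
ROUTES = {
    "kafka": ["flink", "risingwave"],  # event streams -> stream/real-time
    "apis": ["spark"],                 # batch loads -> Spark
    "files": ["spark"],                # file drops -> Spark
}

def engines_for(source: str) -> list[str]:
    """Return the processing engines that consume a given ingestion source."""
    try:
        return ROUTES[source.lower()]
    except KeyError:
        raise ValueError(f"unknown ingestion source: {source}")
```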


@@ -0,0 +1,84 @@
```mermaid
%% Authentik SSO Strategy (ADR-0028)
%% Flowchart showing authentication flow stages
flowchart TB
subgraph user["👤 User"]
browser["Browser"]
end
subgraph ingress["🌐 Ingress"]
traefik["Envoy Gateway"]
end
subgraph apps["📦 Applications"]
direction LR
oidc_app["OIDC Apps<br/>Gitea, Grafana,<br/>ArgoCD, Affine"]
proxy_app["Proxy Apps<br/>MLflow, Kubeflow"]
end
subgraph authentik["🔐 Authentik"]
direction TB
subgraph components["Components"]
server["Server<br/>(API)"]
worker["Worker<br/>(Tasks)"]
outpost["Outpost<br/>(Proxy Auth)"]
end
subgraph flow["Authentication Flow"]
direction LR
f1["1⃣ Login<br/>Stage"]
f2["2⃣ Username<br/>Identification"]
f3["3⃣ Password<br/>Validation"]
f4["4⃣ MFA<br/>Challenge"]
f5["5⃣ Session<br/>Created"]
end
subgraph providers["Providers"]
oidc_prov["OIDC Provider"]
proxy_prov["Proxy Provider"]
end
end
subgraph storage["💾 Storage"]
redis["Redis<br/>(Cache)"]
postgres["PostgreSQL<br/>(CNPG)"]
end
%% User flow
browser --> traefik
traefik --> apps
%% OIDC flow
oidc_app -->|"Redirect to auth"| server
server --> flow
f1 --> f2 --> f3 --> f4 --> f5
flow --> oidc_prov
oidc_prov -->|"JWT token"| oidc_app
%% Proxy flow
proxy_app -->|"Forward auth"| outpost
outpost --> server
server --> flow
proxy_prov --> outpost
%% Storage
server --> redis
server --> postgres
classDef user fill:#3498db,color:white
classDef ingress fill:#f39c12,color:black
classDef app fill:#27ae60,color:white
classDef authentik fill:#9b59b6,color:white
classDef storage fill:#e74c3c,color:white
classDef flow fill:#1abc9c,color:white
class browser user
class traefik ingress
class oidc_app,proxy_app app
class server,worker,outpost,oidc_prov,proxy_prov authentik
class redis,postgres storage
class f1,f2,f3,f4,f5 flow
```
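The split between OIDC apps and proxy apps above can be read as a simple dispatch rule. A hypothetical sketch (the app lists come from the diagram; real routing is configured in Authentik providers, not code):

```python
# Apps with native OIDC support vs. apps fronted by the outpost proxy,
# as grouped in the diagram.
OIDC_APPS = {"gitea", "grafana", "argocd", "affine"}
PROXY_APPS = {"mlflow", "kubeflow"}

def auth_provider(app: str) -> str:
    """Return which Authentik provider type handles a given app."""
    app = app.lower()
    if app in OIDC_APPS:
        return "oidc"
    if app in PROXY_APPS:
        return "proxy"
    raise ValueError(f"unknown app: {app}")
```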


@@ -0,0 +1,66 @@
%% Cluster Node Topology
%% Related: ADR-0035, ADR-0011, ADR-0037
flowchart TB
subgraph Cluster["Homelab Kubernetes Cluster (14 nodes)"]
subgraph ControlPlane["👑 Control Plane (Companions of the Hall)"]
Bruenor["bruenor<br/>Intel N100"]
Catti["catti<br/>Intel N100"]
Storm["storm<br/>Intel N100"]
end
subgraph GPUNodes["🧙 Wizards (GPU Workers)"]
Khelben["khelben<br/>Radeon 8060S 64GB<br/>🎮 Primary AI"]
Elminster["elminster<br/>RTX 2070 8GB<br/>🎮 CUDA"]
Drizzt["drizzt<br/>Radeon 680M<br/>🎮 ROCm"]
Danilo["danilo<br/>Intel Arc A770<br/>🎮 Intel"]
Regis["regis<br/>NVIDIA GPU<br/>🎮 CUDA"]
end
subgraph CPUNodes["⚔️ Fighters (CPU Workers)"]
Wulfgar["wulfgar<br/>Intel x86_64"]
end
subgraph ARMWorkers["🗡️ Rogues (ARM64 Raspberry Pi)"]
Durnan["durnan<br/>Pi 4 8GB"]
Elaith["elaith<br/>Pi 4 8GB"]
Jarlaxle["jarlaxle<br/>Pi 4 8GB"]
Mirt["mirt<br/>Pi 4 8GB"]
Volo["volo<br/>Pi 4 8GB"]
end
end
subgraph Workloads["Workload Placement"]
AIInference["AI Inference<br/>→ Khelben"]
MLTraining["ML Training<br/>→ GPU Nodes"]
EdgeServices["Lightweight Services<br/>→ ARM64"]
General["General Workloads<br/>→ CPU + ARM64"]
end
subgraph Storage["Storage Affinity"]
Longhorn["Longhorn<br/>x86_64 only"]
NFS["NFS<br/>All nodes"]
end
AIInference -.-> Khelben
MLTraining -.-> GPUNodes
EdgeServices -.-> ARMWorkers
General -.-> CPUNodes
General -.-> ARMWorkers
Longhorn -.->|Excluded| ARMWorkers
NFS --> Cluster
classDef control fill:#2563eb,stroke:#1d4ed8,color:#fff
classDef gpu fill:#7c3aed,stroke:#5b21b6,color:#fff
classDef cpu fill:#dc2626,stroke:#b91c1c,color:#fff
classDef arm fill:#059669,stroke:#047857,color:#fff
classDef workload fill:#9f7aea,stroke:#805ad5,color:#fff
classDef storage fill:#ed8936,stroke:#dd6b20,color:#fff
class Bruenor,Catti,Storm control
class Khelben,Elminster,Drizzt,Danilo,Regis gpu
class Wulfgar cpu
class Durnan,Elaith,Jarlaxle,Mirt,Volo arm
class AIInference,MLTraining,EdgeServices,General workload
class Longhorn,NFS storage
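The dotted placement edges above map workload classes to node groups. In the cluster this is expressed with nodeSelectors and affinity rules; a pure-Python sketch of the same mapping (node names from the diagram, purely illustrative):

```python
# Workload class -> eligible nodes, per the dotted edges in the diagram.
PLACEMENT = {
    "ai-inference": ["khelben"],
    "ml-training": ["khelben", "elminster", "drizzt", "danilo", "regis"],
    "edge": ["durnan", "elaith", "jarlaxle", "mirt", "volo"],
    "general": ["wulfgar", "durnan", "elaith", "jarlaxle", "mirt", "volo"],
}

def eligible_nodes(workload: str) -> list[str]:
    """Return the nodes a workload class may schedule onto."""
    return PLACEMENT.get(workload, [])
```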


@@ -0,0 +1,96 @@
```mermaid
%% Database Strategy with CloudNativePG (ADR-0027)
%% C4 Component diagram showing CNPG operator and clusters
flowchart TB
subgraph operator["🎛️ CNPG Operator"]
cnpg["CloudNativePG<br/>Controller<br/>(cnpg-system)"]
end
subgraph clusters["📊 PostgreSQL Clusters"]
direction LR
subgraph gitea_pg["gitea-pg"]
direction TB
g_primary["🔵 Primary"]
g_replica1["⚪ Replica"]
g_replica2["⚪ Replica"]
g_bouncer["🔗 PgBouncer"]
end
subgraph authentik_db["authentik-db"]
direction TB
a_primary["🔵 Primary"]
a_replica1["⚪ Replica"]
a_replica2["⚪ Replica"]
a_bouncer["🔗 PgBouncer"]
end
subgraph companions_db["companions-db"]
direction TB
c_primary["🔵 Primary"]
c_replica1["⚪ Replica"]
c_replica2["⚪ Replica"]
c_bouncer["🔗 PgBouncer"]
end
subgraph mlflow_db["mlflow-db"]
direction TB
m_primary["🔵 Primary"]
end
end
subgraph storage["💾 Storage"]
longhorn["Longhorn PVCs<br/>(NVMe/SSD)"]
s3["S3 Backups<br/>(barman)"]
end
subgraph services["🔌 Service Discovery"]
direction TB
rw["-rw (read-write)"]
ro["-ro (read-only)"]
pooler["-pooler-rw<br/>(PgBouncer)"]
end
subgraph apps["📦 Applications"]
gitea["Gitea"]
authentik["Authentik"]
companions["Companions"]
mlflow["MLflow"]
end
%% Operator manages clusters
cnpg -->|"Manages"| clusters
%% Storage connections
clusters --> longhorn
clusters -->|"WAL archiving"| s3
%% Service routing
g_bouncer --> rw
a_bouncer --> rw
c_bouncer --> rw
g_replica1 --> ro
g_replica2 --> ro
%% App connections
gitea -->|"gitea-pg-pooler-rw"| g_bouncer
authentik -->|"authentik-db-pooler-rw"| a_bouncer
companions -->|"companions-db-pooler-rw"| c_bouncer
mlflow -->|"mlflow-db-rw"| m_primary
classDef operator fill:#e74c3c,color:white
classDef primary fill:#3498db,color:white
classDef replica fill:#95a5a6,color:white
classDef bouncer fill:#9b59b6,color:white
classDef storage fill:#27ae60,color:white
classDef app fill:#f39c12,color:black
class cnpg operator
class g_primary,a_primary,c_primary,m_primary primary
class g_replica1,g_replica2,a_replica1,a_replica2,c_replica1,c_replica2 replica
class g_bouncer,a_bouncer,c_bouncer bouncer
class longhorn,s3 storage
class gitea,authentik,companions,mlflow app
```
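The service-discovery suffixes above (`-rw`, `-ro`, `-pooler-rw`) follow CNPG's naming convention, so the DNS name an app should target is mechanical. A small sketch, assuming the suffix rules shown in the diagram:

```python
def cnpg_service(cluster: str, pooled: bool = True, read_only: bool = False) -> str:
    """Build the CNPG service name an app targets, per the suffixes above."""
    if read_only:
        return f"{cluster}-ro"            # read-only replicas
    if pooled:
        return f"{cluster}-pooler-rw"     # writes via PgBouncer
    return f"{cluster}-rw"                # direct to primary
```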


@@ -0,0 +1,73 @@
```mermaid
%% Dual Workflow Engine Strategy (ADR-0009)
%% Flowchart showing Argo vs Kubeflow decision and integration
flowchart TB
subgraph trigger["🎯 Workflow Triggers"]
nats["NATS Event"]
api["API Call"]
schedule["Cron Schedule"]
end
subgraph decision["❓ Which Engine?"]
question{{"Workflow Type?"}}
end
subgraph kubeflow["🔬 Kubeflow Pipelines"]
direction TB
kfp_train["ML Training<br/>✅ Component caching"]
kfp_eval["Model Evaluation<br/>✅ Metric tracking"]
kfp_exp["Experiment Comparison<br/>✅ MLflow integration"]
end
subgraph argo["⚡ Argo Workflows"]
direction TB
argo_dag["Complex DAG<br/>✅ Advanced control flow"]
argo_batch["Batch Processing<br/>✅ Parallelization"]
argo_ingest["Document Ingestion<br/>✅ Simple steps"]
end
subgraph hybrid["🔗 Hybrid Pattern"]
direction TB
argo_orch["Argo Orchestrates"]
kfp_step["KFP via API"]
argo_orch --> kfp_step
end
subgraph integration["📡 Integration Layer"]
direction TB
events["Argo Events<br/>EventSource + Sensor"]
end
%% Flow from triggers
nats --> events
api --> decision
schedule --> events
events --> decision
%% Decision branches
question -->|"ML training<br/>with caching"| kubeflow
question -->|"Complex DAG<br/>batch jobs"| argo
question -->|"ML + complex<br/>orchestration"| hybrid
%% Kubeflow use cases
kfp_train --> kfp_eval
kfp_eval --> kfp_exp
%% Argo use cases
argo_dag --> argo_batch
argo_batch --> argo_ingest
classDef trigger fill:#f39c12,color:black
classDef kubeflow fill:#4a90d9,color:white
classDef argo fill:#ef6c00,color:white
classDef hybrid fill:#8e44ad,color:white
classDef integration fill:#27ae60,color:white
class nats,api,schedule trigger
class kfp_train,kfp_eval,kfp_exp kubeflow
class argo_dag,argo_batch,argo_ingest argo
class argo_orch,kfp_step hybrid
class events integration
```
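The decision node above reduces to two questions: does the workflow need ML component caching, and does it need complex DAG orchestration? A hedged sketch of that branch logic (illustrative only; the real decision lives in ADR-0009):

```python
def pick_engine(needs_ml_caching: bool, complex_dag: bool) -> str:
    """Choose a workflow engine per the decision branches in the diagram."""
    if needs_ml_caching and complex_dag:
        return "hybrid"      # Argo orchestrates, invokes KFP via API
    if needs_ml_caching:
        return "kubeflow"    # component caching, metric tracking
    if complex_dag:
        return "argo"        # advanced control flow, parallel batch
    return "argo"            # default to the general-purpose engine
```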

diagrams/gitops-flux.mmd

@@ -0,0 +1,57 @@
```mermaid
%% GitOps Reconciliation Loop (ADR-0006)
%% Flowchart showing Flux CD GitOps workflow
flowchart TB
subgraph git["📂 Git Repositories"]
direction TB
homelab["homelab-k8s2<br/>(cluster config)"]
apps["Application Repos<br/>(argo, kubeflow, etc.)"]
end
subgraph flux["⚙️ Flux Controllers"]
direction TB
source["Source Controller<br/>📥 Fetches repos"]
kustomize["Kustomize Controller<br/>🔧 Applies manifests"]
helm["Helm Controller<br/>📦 Manages charts"]
notification["Notification Controller<br/>📢 Alerts"]
end
subgraph k8s["☸️ Kubernetes Cluster"]
direction TB
secrets["🔐 SOPS Secrets<br/>(Age decrypted)"]
resources["📋 Deployed Resources<br/>(Pods, Services, etc.)"]
drift["🔄 Drift Detection"]
end
subgraph notify["📱 Notifications"]
ntfy["ntfy<br/>(push alerts)"]
end
%% GitOps flow
homelab -->|"GitRepository CR"| source
apps -->|"GitRepository CR"| source
source -->|"Fetches every 5m"| kustomize
source -->|"Fetches charts"| helm
kustomize -->|"Decrypts with Age"| secrets
kustomize -->|"kubectl apply"| resources
helm -->|"helm upgrade"| resources
resources -->|"Actual state"| drift
drift -->|"Compares to Git"| kustomize
drift -->|"Auto-corrects"| resources
notification -->|"Success/failure"| ntfy
classDef repo fill:#f5a623,color:black
classDef controller fill:#4a90d9,color:white
classDef cluster fill:#50c878,color:white
classDef alert fill:#9b59b6,color:white
class homelab,apps repo
class source,kustomize,helm,notification controller
class secrets,resources,drift cluster
class ntfy alert
```
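The drift-detection loop above is, at its core, a diff between Git's desired state and the cluster's actual state. A minimal pure-Python sketch of that comparison (keys and values are illustrative; Flux operates on full manifests, not flat dicts):

```python
def reconcile(desired: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Compare desired (Git) vs actual (cluster) state and compute actions."""
    apply = [k for k, v in desired.items() if actual.get(k) != v]   # missing or drifted
    prune = [k for k in actual if k not in desired]                 # removed from Git
    return {"apply": sorted(apply), "prune": sorted(prune)}
```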


@@ -0,0 +1,67 @@
```mermaid
%% Handler Deployment Strategy (ADR-0019)
%% C4 Component diagram showing platform layers with Ray cluster
flowchart TB
subgraph platform["🏗️ Platform Layer"]
direction LR
kubeflow["📊 Kubeflow<br/>Pipelines"]
kserve["🎯 KServe<br/>(visibility)"]
mlflow["📈 MLflow<br/>(registry)"]
end
subgraph ray["⚡ Ray Cluster"]
direction TB
subgraph gpu_apps["🎮 GPU Inference (Workers)"]
direction LR
llm["/llm<br/>vLLM<br/>🟢 khelben 0.95 GPU"]
whisper["/whisper<br/>Whisper<br/>🟡 elminster 0.5 GPU"]
tts["/tts<br/>XTTS<br/>🟡 elminster 0.5 GPU"]
embeddings["/embeddings<br/>BGE<br/>🔴 drizzt 0.8 GPU"]
reranker["/reranker<br/>BGE<br/>🔵 danilo 0.8 GPU"]
end
subgraph cpu_apps["🖥️ CPU Handlers (Head Node)"]
direction LR
chat["/chat<br/>ChatHandler<br/>0 GPU"]
voice["/voice<br/>VoiceHandler<br/>0 GPU"]
end
end
subgraph support["🔧 Supporting Services"]
direction LR
nats["📨 NATS<br/>(events)"]
milvus["🔍 Milvus<br/>(vectors)"]
valkey["💾 Valkey<br/>(cache)"]
end
subgraph pypi["📦 Package Registry"]
gitea_pypi["Gitea PyPI<br/>• handler-base<br/>• chat-handler<br/>• voice-assistant"]
end
%% Connections
kubeflow --> ray
kserve --> ray
mlflow --> ray
cpu_apps -->|"Ray internal calls"| gpu_apps
cpu_apps --> nats
cpu_apps --> milvus
cpu_apps --> valkey
gitea_pypi -->|"pip install<br/>runtime_env"| cpu_apps
classDef platform fill:#9b59b6,color:white
classDef gpu fill:#e74c3c,color:white
classDef cpu fill:#3498db,color:white
classDef support fill:#27ae60,color:white
classDef registry fill:#f39c12,color:black
class kubeflow,kserve,mlflow platform
class llm,whisper,tts,embeddings,reranker gpu
class chat,voice cpu
class nats,milvus,valkey support
class gitea_pypi registry
```


@@ -0,0 +1,53 @@
```mermaid
%% Internal Registry for CI/CD (ADR-0020)
%% Flowchart showing dual-path for external vs internal access
flowchart TB
subgraph external["🌐 External Access"]
internet["Internet"]
cloudflare["☁️ Cloudflare<br/>⚠️ 100MB upload limit"]
external_url["git.daviestechlabs.io"]
end
subgraph internal["🏠 Internal Access"]
internal_url["registry.lab.daviestechlabs.io<br/>✅ No upload limits"]
end
subgraph gitea["📦 Gitea Instance"]
direction TB
git_server["Git Server"]
docker_registry["Docker Registry"]
pypi_registry["PyPI Registry"]
end
subgraph runners["🏃 CI/CD Runners"]
gitea_runner["Gitea Actions Runner<br/>(in-cluster)"]
end
subgraph operations["📋 Operations"]
small_ops["Small Operations<br/>• git clone/push<br/>• pip install<br/>• docker pull"]
large_ops["Large Uploads<br/>• docker push (20GB+)<br/>• pypi upload"]
end
%% External path (limited)
internet --> cloudflare
cloudflare -->|"100MB limit"| external_url
external_url --> gitea
small_ops --> cloudflare
%% Internal path (unlimited)
gitea_runner -->|"Direct"| internal_url
internal_url --> gitea
large_ops --> internal_url
classDef external fill:#e74c3c,color:white
classDef internal fill:#27ae60,color:white
classDef gitea fill:#f39c12,color:black
classDef runner fill:#3498db,color:white
class internet,cloudflare,external_url external
class internal_url internal
class git_server,docker_registry,pypi_registry gitea
class gitea_runner runner
```
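The dual-path rule above is simple to state: in-cluster runners and anything over Cloudflare's 100MB per-request cap must use the direct internal host. A sketch of that selection (hostnames from the diagram; purely illustrative):

```python
CLOUDFLARE_LIMIT = 100 * 1024 * 1024  # 100 MB per-request cap noted above

def registry_host(payload_bytes: int, in_cluster: bool) -> str:
    """Pick the endpoint per the dual-path rule in the diagram."""
    if in_cluster or payload_bytes > CLOUDFLARE_LIMIT:
        return "registry.lab.daviestechlabs.io"  # direct, no upload limit
    return "git.daviestechlabs.io"               # via Cloudflare, small ops only
```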


@@ -0,0 +1,77 @@
```mermaid
%% KubeRay Unified GPU Backend (ADR-0011)
%% C4 Component diagram showing RayService endpoints and GPU allocation
flowchart TB
subgraph clients["🔌 Clients"]
chat["Chat Handler"]
voice["Voice Handler"]
end
subgraph rayservice["⚡ KubeRay RayService"]
endpoint["ai-inference-serve-svc:8000"]
subgraph deployments["Ray Serve Deployments"]
direction TB
subgraph strixhalo["🟢 khelben (Strix Halo 64GB)"]
llm["/llm<br/>vLLM 70B<br/>0.95 GPU"]
end
subgraph rtx2070["🟡 elminster (RTX 2070 8GB)"]
whisper["/whisper<br/>Whisper v3<br/>0.5 GPU"]
tts["/tts<br/>XTTS<br/>0.5 GPU"]
end
subgraph radeon680m["🔴 drizzt (Radeon 680M 12GB)"]
embeddings["/embeddings<br/>BGE-Large<br/>0.8 GPU"]
end
subgraph intelarc["🔵 danilo (Intel Arc)"]
reranker["/reranker<br/>BGE-Reranker<br/>0.8 GPU"]
end
end
end
subgraph kserve["🎯 KServe Compatibility Layer"]
direction TB
svc1["whisper-predictor.ai-ml"]
svc2["tts-predictor.ai-ml"]
svc3["llm-predictor.ai-ml"]
svc4["embeddings-predictor.ai-ml"]
svc5["reranker-predictor.ai-ml"]
end
%% Client connections
chat --> endpoint
voice --> endpoint
%% Path routing
endpoint --> llm
endpoint --> whisper
endpoint --> tts
endpoint --> embeddings
endpoint --> reranker
%% KServe aliases
svc1 -->|"ExternalName"| endpoint
svc2 -->|"ExternalName"| endpoint
svc3 -->|"ExternalName"| endpoint
svc4 -->|"ExternalName"| endpoint
svc5 -->|"ExternalName"| endpoint
classDef client fill:#3498db,color:white
classDef endpoint fill:#9b59b6,color:white
classDef amd fill:#ED1C24,color:white
classDef nvidia fill:#76B900,color:white
classDef intel fill:#0071C5,color:white
classDef kserve fill:#f39c12,color:black
class chat,voice client
class endpoint endpoint
class llm,embeddings amd
class whisper,tts nvidia
class reranker intel
class svc1,svc2,svc3,svc4,svc5 kserve
```
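A useful sanity check on the fractional GPU figures above: the requests on any one node must not exceed one full GPU. A small validation sketch using the allocations as labelled in the diagram:

```python
# Fractional GPU requests per node, as labelled in the diagram.
ALLOCATIONS = {
    "khelben": {"/llm": 0.95},
    "elminster": {"/whisper": 0.5, "/tts": 0.5},
    "drizzt": {"/embeddings": 0.8},
    "danilo": {"/reranker": 0.8},
}

def overcommitted(allocations: dict[str, dict[str, float]]) -> list[str]:
    """Return nodes whose fractional GPU requests exceed one full GPU."""
    return [node for node, apps in allocations.items() if sum(apps.values()) > 1.0]
```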

diagrams/node-naming.mmd

@@ -0,0 +1,64 @@
%% Node Naming Conventions - D&D Theme
%% Related: ADR-0037
flowchart TB
subgraph Cluster["Homelab Kubernetes Cluster (14 nodes)"]
subgraph ControlPlane["👑 Control Plane (Companions of the Hall)"]
Bruenor["bruenor<br/>Intel N100<br/><i>Dwarf King</i>"]
Catti["catti<br/>Intel N100<br/><i>Catti-brie</i>"]
Storm["storm<br/>Intel N100<br/><i>Storm Silverhand</i>"]
end
subgraph Wizards["🧙 Wizards (GPU Spellcasters)"]
Khelben["khelben<br/>Radeon 8060S 64GB<br/><i>The Blackstaff</i>"]
Elminster["elminster<br/>RTX 2070 8GB<br/><i>Sage of Shadowdale</i>"]
Drizzt["drizzt<br/>Radeon 680M<br/><i>Ranger-Mage</i>"]
Danilo["danilo<br/>Intel Arc A770<br/><i>Bard-Wizard</i>"]
Regis["regis<br/>NVIDIA GPU<br/><i>Halfling Spellthief</i>"]
end
subgraph Rogues["🗡️ Rogues (ARM64 Edge Nodes)"]
Durnan["durnan<br/>Pi 4 8GB<br/><i>Yawning Portal</i>"]
Elaith["elaith<br/>Pi 4 8GB<br/><i>The Serpent</i>"]
Jarlaxle["jarlaxle<br/>Pi 4 8GB<br/><i>Bregan D'aerthe</i>"]
Mirt["mirt<br/>Pi 4 8GB<br/><i>Old Wolf</i>"]
Volo["volo<br/>Pi 4 8GB<br/><i>Famous Author</i>"]
end
subgraph Fighters["⚔️ Fighters (x86 CPU Workers)"]
Wulfgar["wulfgar<br/>Intel x86_64<br/><i>Barbarian of Icewind Dale</i>"]
end
end
subgraph Infrastructure["🏰 Locations (Off-Cluster Infrastructure)"]
Candlekeep["📚 candlekeep<br/>Synology NAS<br/>nfs-default<br/><i>Library Fortress</i>"]
Neverwinter["❄️ neverwinter<br/>TrueNAS Scale (SSD)<br/>nfs-fast<br/><i>Jewel of the North</i>"]
Waterdeep["🏙️ waterdeep<br/>Mac Mini<br/>Dev Workstation<br/><i>City of Splendors</i>"]
end
subgraph Workloads["Workload Routing"]
AI["AI/ML Inference"] --> Wizards
Edge["Edge Services"] --> Rogues
Compute["General Compute"] --> Fighters
Storage["Storage I/O"] --> Infrastructure
end
ControlPlane -.->|"etcd"| ControlPlane
Wizards -.->|"Fast Storage"| Neverwinter
Wizards -.->|"Backups"| Candlekeep
Rogues -.->|"NFS Mounts"| Candlekeep
Fighters -.->|"NFS Mounts"| Candlekeep
classDef control fill:#2563eb,stroke:#1d4ed8,color:#fff
classDef wizard fill:#7c3aed,stroke:#5b21b6,color:#fff
classDef rogue fill:#059669,stroke:#047857,color:#fff
classDef fighter fill:#dc2626,stroke:#b91c1c,color:#fff
classDef location fill:#d97706,stroke:#b45309,color:#fff
classDef workload fill:#4b5563,stroke:#374151,color:#fff
class Bruenor,Catti,Storm control
class Khelben,Elminster,Drizzt,Danilo,Regis wizard
class Durnan,Elaith,Jarlaxle,Mirt,Volo rogue
class Wulfgar fighter
class Candlekeep,Neverwinter,Waterdeep location
class AI,Edge,Compute,Storage workload


@@ -0,0 +1,63 @@
```mermaid
%% Notification Architecture (ADR-0021)
%% C4 Component diagram showing notification sources and hub
flowchart LR
subgraph sources["📤 Notification Sources"]
direction TB
ci["🔧 Gitea Actions<br/>CI/CD builds"]
alertmanager["🔔 Alertmanager<br/>Prometheus alerts"]
gatus["❤️ Gatus<br/>Health monitoring"]
flux["🔄 Flux<br/>GitOps events"]
end
subgraph hub["📡 Central Hub"]
ntfy["📢 ntfy<br/>Notification Server"]
end
subgraph topics["🏷️ Topics"]
direction TB
t_ci["gitea-ci"]
t_alerts["alertmanager-alerts"]
t_gatus["gatus"]
t_flux["flux"]
t_deploy["deployments"]
end
subgraph consumers["📱 Consumers"]
direction TB
mobile["📱 ntfy App<br/>(iOS/Android)"]
bridge["🌉 ntfy-discord<br/>Bridge"]
discord["💬 Discord<br/>Webhooks"]
end
%% Source to hub
ci -->|"POST"| ntfy
alertmanager -->|"webhook"| ntfy
gatus -->|"webhook"| ntfy
flux -->|"notification-controller"| ntfy
%% Hub to topics
ntfy --> topics
%% Topics to consumers
t_ci --> mobile
t_alerts --> mobile
t_gatus --> mobile
t_flux --> mobile
t_deploy --> mobile
topics --> bridge
bridge --> discord
classDef source fill:#3498db,color:white
classDef hub fill:#e74c3c,color:white
classDef topic fill:#9b59b6,color:white
classDef consumer fill:#27ae60,color:white
class ci,alertmanager,gatus,flux source
class ntfy hub
class t_ci,t_alerts,t_gatus,t_flux,t_deploy topic
class mobile,bridge,discord consumer
```


@@ -0,0 +1,45 @@
```mermaid
%% ntfy-Discord Bridge (ADR-0022)
%% Sequence diagram showing message flow and transformation
sequenceDiagram
autonumber
participant S as Notification Source<br/>(CI/Alertmanager)
participant N as ntfy<br/>Notification Hub
participant B as ntfy-discord<br/>Go Bridge
participant D as Discord<br/>Webhook
Note over S,N: Events published to ntfy topics
S->>N: POST /gitea-ci<br/>{title, message, priority}
Note over N,B: SSE subscription for real-time
N-->>B: SSE JSON stream<br/>{topic, message, priority, tags}
Note over B: Message transformation
rect rgb(240, 240, 240)
B->>B: Map priority to embed color<br/>urgent=red, high=orange<br/>default=blue, low=gray
B->>B: Format as Discord embed<br/>{embeds: [{title, description, color}]}
end
B->>D: POST webhook URL<br/>Discord embed format
Note over B: Hot-reload support
rect rgb(230, 245, 230)
B->>B: fsnotify watches secrets
B->>B: Reload config without restart
end
Note over B,D: Retry with exponential backoff
alt Webhook fails
B-->>B: Retry (2s, 4s, 8s...)
B->>D: Retry POST
end
D-->>D: Display in channel
```
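The bridge itself is Go, but its two transformation steps — priority-to-color mapping and exponential backoff — can be sketched in Python (hex colors and the embed shape are illustrative, not the bridge's exact output):

```python
# Priority -> Discord embed color, per the mapping in the sequence diagram.
PRIORITY_COLORS = {
    "urgent": 0xE74C3C,   # red
    "high": 0xE67E22,     # orange
    "default": 0x3498DB,  # blue
    "low": 0x95A5A6,      # gray
}

def to_embed(title: str, message: str, priority: str = "default") -> dict:
    """Shape an ntfy message as a Discord webhook embed payload."""
    color = PRIORITY_COLORS.get(priority, PRIORITY_COLORS["default"])
    return {"embeds": [{"title": title, "description": message, "color": color}]}

def backoff_delays(attempts: int, base: float = 2.0) -> list[float]:
    """Exponential retry schedule (2s, 4s, 8s, ...) as noted above."""
    return [base * (2 ** i) for i in range(attempts)]
```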


@@ -0,0 +1,72 @@
```mermaid
%% Observability Stack Architecture (ADR-0025)
%% C4 Component diagram showing telemetry flow
flowchart TB
subgraph apps["📦 Applications"]
direction LR
go["Go Apps<br/>(OTEL SDK)"]
python["Python Apps<br/>(OTEL SDK)"]
node["Node.js Apps<br/>(OTEL SDK)"]
java["Java Apps<br/>(OTEL SDK)"]
end
subgraph collection["📡 Telemetry Collection"]
otel["OpenTelemetry<br/>Collector<br/>━━━━━━━━<br/>OTLP gRPC :4317<br/>OTLP HTTP :4318"]
end
subgraph storage["💾 Storage Layer"]
direction LR
subgraph metrics_store["Metrics"]
prometheus["📊 Prometheus<br/>14d retention<br/>50GB"]
end
subgraph logs_traces["Logs & Traces"]
clickstack["📋 ClickStack<br/>(ClickHouse)"]
end
end
subgraph visualization["📈 Visualization"]
grafana["🎨 Grafana<br/>Dashboards<br/>& Exploration"]
end
subgraph alerting["🔔 Alerting Pipeline"]
alertmanager["⚠️ Alertmanager"]
ntfy["📱 ntfy<br/>(Push)"]
discord["💬 Discord"]
end
%% App to collector
go -->|"OTLP"| otel
python -->|"OTLP"| otel
node -->|"OTLP"| otel
java -->|"OTLP"| otel
%% Collector to storage
otel -->|"Metrics"| prometheus
otel -->|"Logs"| clickstack
otel -->|"Traces"| clickstack
%% Storage to visualization
prometheus --> grafana
clickstack --> grafana
%% Alerting flow
prometheus -->|"PrometheusRules"| alertmanager
alertmanager --> ntfy
ntfy --> discord
classDef app fill:#3498db,color:white
classDef otel fill:#e74c3c,color:white
classDef storage fill:#27ae60,color:white
classDef viz fill:#9b59b6,color:white
classDef alert fill:#f39c12,color:black
class go,python,node,java app
class otel otel
class prometheus,clickstack storage
class grafana viz
class alertmanager,ntfy,discord alert
```
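Apps pick a collector endpoint by protocol: OTLP gRPC on 4317, OTLP HTTP on 4318, as labelled in the collector node above. A trivial helper sketch (the hostname is whatever the collector Service resolves to; illustrative only):

```python
def otlp_endpoint(host: str, protocol: str = "grpc") -> str:
    """Build the collector endpoint per the ports shown in the diagram."""
    ports = {"grpc": 4317, "http": 4318}  # standard OTLP ports
    return f"{host}:{ports[protocol]}"
```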


@@ -0,0 +1,66 @@
```mermaid
%% Ray Repository Structure (ADR-0024)
%% Flowchart showing build and dynamic loading flow
flowchart TB
subgraph repos["📁 Repositories"]
direction LR
kuberay["kuberay-images<br/>🐳 Docker images<br/>(infrequent updates)"]
rayserve["ray-serve<br/>📦 PyPI package<br/>(frequent updates)"]
end
subgraph ci["🔧 CI/CD Pipelines"]
direction LR
build_images["Build Docker<br/>nvidia, rdna2,<br/>strixhalo, intel"]
build_pypi["Build wheel<br/>uv build"]
end
subgraph registries["📦 Registries"]
direction LR
container_reg["🐳 Container Registry<br/>registry.lab.daviestechlabs.io"]
pypi_reg["📦 PyPI Registry<br/>git.daviestechlabs.io/pypi"]
end
subgraph ray["⚡ Ray Cluster"]
direction TB
head["🧠 Head Node"]
workers["🖥️ Worker Nodes<br/>(GPU-specific)"]
subgraph runtime["🔄 Runtime Loading"]
pull_image["docker pull<br/>ray-worker-*"]
pip_install["pip install ray-serve<br/>runtime_env"]
end
serve_apps["Ray Serve Apps<br/>/llm, /whisper, etc."]
end
subgraph k8s["☸️ Kubernetes"]
manifests["RayService CR<br/>(homelab-k8s2)"]
end
%% Build flows
kuberay --> build_images
rayserve --> build_pypi
build_images --> container_reg
build_pypi --> pypi_reg
%% Deployment flow
manifests --> ray
container_reg --> pull_image
pull_image --> workers
pypi_reg --> pip_install
pip_install --> serve_apps
classDef repo fill:#3498db,color:white
classDef ci fill:#f39c12,color:black
classDef registry fill:#9b59b6,color:white
classDef ray fill:#27ae60,color:white
classDef k8s fill:#e74c3c,color:white
class kuberay,rayserve repo
class build_images,build_pypi ci
class container_reg,pypi_reg registry
class head,workers,pull_image,pip_install,serve_apps ray
class manifests k8s
```


@@ -0,0 +1,86 @@
%% Renovate Dependency Update Workflow
%% Related: ADR-0036
flowchart TB
subgraph Schedule["Schedule"]
Cron["CronJob<br/>Every 8 hours"]
end
subgraph Renovate["Renovate (ci-cd namespace)"]
Job["Renovate Job"]
subgraph Scan["Repository Scan"]
Discover["Autodiscover<br/>Gitea Repos"]
Parse["Parse Dependencies<br/>40+ managers"]
Compare["Compare Versions<br/>Check registries"]
end
end
subgraph Registries["Version Sources"]
DockerHub["Docker Hub"]
GHCR["GHCR"]
PyPI["PyPI"]
GoProxy["Go Proxy"]
Helm["Helm Repos"]
end
subgraph Gitea["Gitea Repositories"]
subgraph Repos["Scanned Repos"]
K8s["homelab-k8s2"]
Handler["chat-handler"]
KubeRay["kuberay-images"]
More["...20+ repos"]
end
subgraph PRs["Generated PRs"]
Grouped["Grouped PR<br/>all-non-major"]
Security["Security PR<br/>CVE fixes"]
Major["Major PR<br/>breaking changes"]
end
Dashboard["Dependency Dashboard<br/>Issue #1"]
end
subgraph Merge["Merge Strategy"]
AutoMerge["Auto-merge<br/>patch + minor"]
Review["Manual Review<br/>major updates"]
end
Cron --> Job
Job --> Discover
Discover --> Parse
Parse --> Compare
Compare --> DockerHub
Compare --> GHCR
Compare --> PyPI
Compare --> GoProxy
Compare --> Helm
Discover --> K8s
Discover --> Handler
Discover --> KubeRay
Discover --> More
Compare --> Grouped
Compare --> Security
Compare --> Major
Job --> Dashboard
Grouped --> AutoMerge
Security --> AutoMerge
Major --> Review
classDef schedule fill:#4a5568,stroke:#718096,color:#fff
classDef renovate fill:#667eea,stroke:#5a67d8,color:#fff
classDef registry fill:#ed8936,stroke:#dd6b20,color:#fff
classDef repo fill:#38a169,stroke:#2f855a,color:#fff
classDef pr fill:#9f7aea,stroke:#805ad5,color:#fff
classDef merge fill:#e53e3e,stroke:#c53030,color:#fff
class Cron schedule
class Job,Discover,Parse,Compare renovate
class DockerHub,GHCR,PyPI,GoProxy,Helm registry
class K8s,Handler,KubeRay,More repo
class Grouped,Security,Major,Dashboard pr
class AutoMerge,Review merge
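The merge strategy above splits on semver: patch and minor bumps auto-merge, major bumps wait for manual review. A minimal classifier sketch (assumes plain `X.Y.Z` version strings; real Renovate rules are richer):

```python
def classify_update(current: str, new: str) -> str:
    """Classify a version bump per the merge strategy in the diagram."""
    c_major = int(current.split(".")[0])
    n_major = int(new.split(".")[0])
    # Major bumps may carry breaking changes; everything else auto-merges.
    return "manual-review" if n_major != c_major else "auto-merge"
```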


@@ -0,0 +1,51 @@
```mermaid
%% Secrets Management Strategy (ADR-0017)
%% Flowchart showing dual secret paths: SOPS bootstrap vs Vault runtime
flowchart TB
subgraph bootstrap["🚀 Bootstrap Secrets (Git-encrypted)"]
direction TB
sops_files["*.sops.yaml<br/>📄 Encrypted in Git"]
age_key["🔑 Age Key<br/>(backed up externally)"]
sops_dec["SOPS Decryption"]
flux_dec["Flux Controller"]
bs_secrets["🔐 Bootstrap Secrets<br/>• Talos machine secrets<br/>• GitHub deploy key<br/>• Initial Vault unseal"]
end
subgraph runtime["⚙️ Runtime Secrets (Vault-managed)"]
direction TB
vault["🏦 HashiCorp Vault<br/>HA (3 replicas) + Raft"]
eso["External Secrets<br/>Operator"]
app_secrets["🔑 Application Secrets<br/>• Database credentials<br/>• API keys<br/>• OAuth secrets"]
end
subgraph apps["📦 Applications"]
direction TB
pods["Workload Pods"]
end
%% Bootstrap flow
sops_files -->|"Commit to Git"| flux_dec
age_key -->|"Decrypts"| sops_dec
flux_dec --> sops_dec
sops_dec -->|"Creates"| bs_secrets
%% Runtime flow
vault -->|"ExternalSecret CR"| eso
eso -->|"Syncs to"| app_secrets
%% Consumption
bs_secrets -->|"Mounted"| pods
app_secrets -->|"Mounted"| pods
classDef bootstrap fill:#3498db,color:white
classDef vault fill:#27ae60,color:white
classDef secrets fill:#e74c3c,color:white
classDef app fill:#9b59b6,color:white
class sops_files,age_key,sops_dec,flux_dec bootstrap
class vault,eso vault
class bs_secrets,app_secrets secrets
class pods app
```
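The dual-path rule above: the handful of chicken-and-egg secrets needed before Vault exists live SOPS-encrypted in Git; everything else is served from Vault via ESO. A sketch of that split (secret names here are illustrative labels, not the actual resource names):

```python
# Chicken-and-egg secrets that must exist before Vault is up.
BOOTSTRAP = {"talos-machine-secrets", "github-deploy-key", "vault-unseal"}

def secret_backend(name: str) -> str:
    """Return which path serves a secret, per the dual-path rule above."""
    return "sops" if name in BOOTSTRAP else "vault"
```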


@@ -0,0 +1,81 @@
```mermaid
%% Security Policy Enforcement (ADR-0018)
%% Flowchart showing admission control and vulnerability scanning
flowchart TB
subgraph deploy["🚀 Deployment Sources"]
kubectl["kubectl"]
flux["Flux CD"]
end
subgraph admission["🛡️ Admission Control"]
api["Kubernetes<br/>API Server"]
gatekeeper["Gatekeeper (OPA)<br/>⚖️ Policy Validation"]
end
subgraph policies["📋 Policies"]
direction TB
p1["No privileged containers"]
p2["Required labels"]
p3["Resource limits"]
p4["Image registry whitelist"]
end
subgraph enforcement["🎯 Enforcement Modes"]
warn["⚠️ warn<br/>(log only)"]
dryrun["📊 dryrun<br/>(audit)"]
deny["🚫 deny<br/>(block)"]
end
subgraph workloads["☸️ Running Workloads"]
pods["Pods<br/>Deployments<br/>StatefulSets"]
end
subgraph scanning["🔍 Continuous Scanning"]
trivy["Trivy Operator"]
reports["VulnerabilityReports<br/>(CRDs)"]
end
subgraph observability["📈 Observability"]
prometheus["Prometheus<br/>📊 Metrics"]
grafana["Grafana<br/>📉 Dashboards"]
alertmanager["Alertmanager<br/>🔔 Alerts"]
ntfy["ntfy<br/>📱 Notifications"]
end
%% Admission flow
kubectl --> api
flux --> api
    api -->|"Admission<br/>webhook"| gatekeeper
gatekeeper -->|"Evaluates"| policies
policies --> enforcement
warn -->|"Allows"| workloads
dryrun -->|"Allows"| workloads
deny -->|"Blocks"| api
enforcement -->|"Violations"| prometheus
%% Scanning flow
    workloads -->|"Container images"| trivy
trivy -->|"Creates"| reports
reports -->|"Exports"| prometheus
%% Observability flow
prometheus --> grafana
prometheus --> alertmanager
alertmanager --> ntfy
classDef source fill:#f39c12,color:black
classDef admission fill:#3498db,color:white
classDef policy fill:#9b59b6,color:white
classDef workload fill:#27ae60,color:white
classDef scan fill:#e74c3c,color:white
classDef observe fill:#1abc9c,color:white
class kubectl,flux source
class api,gatekeeper admission
class p1,p2,p3,p4,warn,dryrun,deny policy
class pods workload
class trivy,reports scan
class prometheus,grafana,alertmanager,ntfy observe
```
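The three enforcement modes in the diagram map to a single field on a Gatekeeper constraint. A sketch that assumes the gatekeeper-library `K8sRequiredLabels` template is installed; the constraint name and label key are illustrative:

```yaml
# Sketch of a "required labels" constraint in warn mode.
# Assumes the library's K8sRequiredLabels ConstraintTemplate is installed.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: deployments-must-have-owner
spec:
  enforcementAction: warn          # warn | dryrun | deny, as in the diagram
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    labels:
      - key: owner
```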

View File

@@ -0,0 +1,67 @@
```plaintext
%% Tiered Storage Strategy (ADR-0026)
%% Flowchart showing Longhorn + NFS dual-tier storage
flowchart TB
subgraph tier1["🚀 TIER 1: LONGHORN (Fast Distributed Block)"]
direction TB
subgraph nodes["Cluster Nodes"]
direction LR
khelben["🖥️ khelben<br/>/var/mnt/longhorn<br/>NVMe"]
mystra["🖥️ mystra<br/>/var/mnt/longhorn<br/>SSD"]
selune["🖥️ selune<br/>/var/mnt/longhorn<br/>SSD"]
end
longhorn_mgr["⚙️ Longhorn Manager<br/>(Schedules 2-3 replicas)"]
subgraph longhorn_pvcs["Performance Workloads"]
direction LR
pg["🐘 PostgreSQL"]
vault["🔐 Vault"]
prom["📊 Prometheus"]
click["📋 ClickHouse"]
end
end
subgraph tier2["💾 TIER 2: NFS-SLOW (High-Capacity Bulk)"]
direction TB
nas["🗄️ candlekeep.lab.daviestechlabs.io<br/>External NAS<br/>/kubernetes"]
nfs_csi["📂 NFS CSI Driver"]
subgraph nfs_pvcs["Bulk Storage Workloads"]
direction LR
jellyfin["🎬 Jellyfin<br/>(1TB+ media)"]
nextcloud["☁️ Nextcloud"]
immich["📷 Immich"]
kavita["📚 Kavita"]
mlflow["📈 MLflow<br/>Artifacts"]
ray_models["🤖 Ray<br/>Model Weights"]
end
end
%% Tier 1 connections
nodes --> longhorn_mgr
longhorn_mgr --> longhorn_pvcs
%% Tier 2 connections
nas --> nfs_csi
nfs_csi --> nfs_pvcs
classDef tier1_node fill:#3498db,color:white
classDef tier1_mgr fill:#2980b9,color:white
classDef tier1_pvc fill:#1abc9c,color:white
classDef tier2_nas fill:#e74c3c,color:white
classDef tier2_csi fill:#c0392b,color:white
classDef tier2_pvc fill:#f39c12,color:black
class khelben,mystra,selune tier1_node
class longhorn_mgr tier1_mgr
class pg,vault,prom,click tier1_pvc
class nas tier2_nas
class nfs_csi tier2_csi
class jellyfin,nextcloud,immich,kavita,mlflow,ray_models tier2_pvc
```
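Workloads select a tier through the PVC's storage class. A hedged sketch: `longhorn` is Longhorn's default class name, while `nfs-slow` is inferred from the tier label above — verify both against the cluster before relying on them:

```yaml
# Tier 1: fast replicated block storage (e.g. PostgreSQL)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  storageClassName: longhorn       # Longhorn's default class name
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi
---
# Tier 2: high-capacity NFS (e.g. Jellyfin media)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-media
spec:
  storageClassName: nfs-slow       # class name inferred from the tier label
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 2Ti
```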

View File

@@ -0,0 +1,93 @@
```plaintext
%% User Registration and Approval Workflow (ADR-0029)
%% Flowchart showing registration, approval, and access control
flowchart TB
subgraph registration["📝 Registration Flow"]
direction TB
request["👤 User Requests<br/>Account"]
form["📋 Enrollment<br/>Form"]
created["✅ Account<br/>Created"]
pending["⏳ pending-approval<br/>Group"]
end
subgraph approval["✋ Admin Approval"]
direction TB
notify["📧 Admin<br/>Notification"]
review["👁️ Admin<br/>Reviews"]
decision{{"Decision"}}
end
subgraph groups["👥 Group Assignment"]
direction LR
reject["❌ Rejected"]
guests["🎫 homelab-guests<br/>Limited access"]
users["👥 homelab-users<br/>Full access"]
admins["👑 homelab-admins<br/>Admin access"]
end
subgraph access["🔓 Application Access"]
direction TB
subgraph admin_apps["Admin Apps"]
authentik_admin["Authentik Admin"]
gitea["Gitea"]
flux_ui["Flux UI"]
end
subgraph user_apps["User Apps"]
affine["Affine"]
immich["Immich"]
nextcloud["Nextcloud"]
vaultwarden["Vaultwarden"]
end
subgraph guest_apps["Guest Apps"]
kavita["Kavita"]
end
subgraph no_access["No Access"]
profile["Authentik Profile<br/>(only)"]
end
end
%% Registration flow
request --> form
form --> created
created --> pending
pending --> notify
%% Approval flow
notify --> review
review --> decision
decision -->|"Reject"| reject
decision -->|"Basic"| guests
decision -->|"Full"| users
decision -->|"Admin"| admins
%% Access mapping
reject --> profile
guests --> guest_apps
users --> user_apps
users --> guest_apps
admins --> admin_apps
admins --> user_apps
admins --> guest_apps
classDef registration fill:#3498db,color:white
classDef approval fill:#f39c12,color:black
classDef group fill:#9b59b6,color:white
classDef admin fill:#e74c3c,color:white
classDef user fill:#27ae60,color:white
classDef guest fill:#1abc9c,color:white
classDef none fill:#95a5a6,color:white
class request,form,created,pending registration
class notify,review approval
class reject,guests,users,admins group
class authentik_admin,gitea,flux_ui admin
class affine,immich,nextcloud,vaultwarden user
class kavita guest
class profile none
```
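The four group tiers above could be declared as an authentik blueprint. This is a hypothetical sketch of the blueprint format, not the actual configuration — check the authentik blueprint docs before use:

```yaml
# Hypothetical authentik blueprint declaring the group tiers above.
version: 1
metadata:
  name: homelab-group-tiers
entries:
  - model: authentik_core.group
    identifiers:
      name: pending-approval
  - model: authentik_core.group
    identifiers:
      name: homelab-guests
  - model: authentik_core.group
    identifiers:
      name: homelab-users
  - model: authentik_core.group
    identifiers:
      name: homelab-admins
```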

View File

@@ -0,0 +1,60 @@
%% Velero Backup Architecture
%% Related: ADR-0032
flowchart TB
subgraph Schedule["Backup Schedule"]
Nightly["Nightly Backup<br/>2:00 AM"]
Hourly["Hourly Snapshots<br/>Critical Namespaces"]
end
subgraph Velero["Velero (velero namespace)"]
Server["Velero Server"]
NodeAgent["Node Agent<br/>(DaemonSet)"]
end
subgraph Sources["Backup Sources"]
PVs["Persistent Volumes<br/>(Longhorn)"]
Resources["Kubernetes Resources<br/>(Secrets, ConfigMaps)"]
DBs["Database Dumps<br/>(Pre-backup hooks)"]
end
subgraph Targets["Backup Destinations"]
subgraph Primary["Primary: S3"]
MinIO["MinIO<br/>On-premises S3"]
end
subgraph Secondary["Secondary: NFS"]
NAS["Synology NAS<br/>Long-term retention"]
end
end
subgraph Restore["Restore Options"]
Full["Full Cluster Restore"]
Namespace["Namespace Restore"]
Selective["Selective Resource Restore"]
end
Nightly --> Server
Hourly --> Server
Server --> NodeAgent
NodeAgent --> PVs
Server --> Resources
Server --> DBs
Server --> MinIO
MinIO -.->|Replicated| NAS
Server --> Full
Server --> Namespace
Server --> Selective
classDef schedule fill:#4a5568,stroke:#718096,color:#fff
classDef velero fill:#667eea,stroke:#5a67d8,color:#fff
classDef source fill:#48bb78,stroke:#38a169,color:#fff
classDef target fill:#ed8936,stroke:#dd6b20,color:#fff
classDef restore fill:#9f7aea,stroke:#805ad5,color:#fff
class Nightly,Hourly schedule
class Server,NodeAgent velero
class PVs,Resources,DBs source
class MinIO,NAS target
class Full,Namespace,Selective restore
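The nightly job in the diagram corresponds to a Velero `Schedule` resource. A sketch — the TTL and storage-location name are assumptions:

```yaml
# Sketch of the nightly 2:00 AM backup; ttl and storageLocation are assumed.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"            # 2:00 AM, matching the diagram
  template:
    includedNamespaces: ["*"]
    ttl: 720h                      # 30-day retention (assumed)
    storageLocation: default       # the MinIO-backed S3 location (assumed)
```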

View File

@@ -0,0 +1,81 @@
%% Volcano Batch Scheduling Architecture
%% Related: ADR-0034
flowchart TB
subgraph Submissions["Workload Submissions"]
KFP["Kubeflow Pipelines"]
Argo["Argo Workflows"]
Spark["Spark Jobs"]
Ray["Ray Jobs"]
end
subgraph Volcano["Volcano Scheduler"]
Admission["Admission Controller"]
Scheduler["Volcano Scheduler"]
Controller["Job Controller"]
subgraph Plugins["Scheduling Plugins"]
Gang["Gang Scheduling"]
Priority["Priority"]
DRF["Dominant Resource Fairness"]
Binpack["Bin Packing"]
end
end
subgraph Queues["Resource Queues"]
MLQueue["ml-training<br/>weight: 4"]
InferQueue["inference<br/>weight: 3"]
BatchQueue["batch-jobs<br/>weight: 2"]
DefaultQueue["default<br/>weight: 1"]
end
subgraph Resources["Cluster Resources"]
subgraph GPUs["GPU Nodes"]
Khelben["khelben<br/>Strix Halo 64GB"]
Elminster["elminster<br/>RTX 2070"]
Drizzt["drizzt<br/>RDNA2 680M"]
Danilo["danilo<br/>Intel Arc"]
end
subgraph CPU["CPU Nodes"]
Workers["9 x86_64 Workers"]
ARM["5 ARM64 Workers"]
end
end
KFP --> Admission
Argo --> Admission
Spark --> Admission
Ray --> Admission
Admission --> Scheduler
Scheduler --> Controller
Scheduler --> Gang
Scheduler --> Priority
Scheduler --> DRF
Scheduler --> Binpack
Controller --> MLQueue
Controller --> InferQueue
Controller --> BatchQueue
Controller --> DefaultQueue
MLQueue --> GPUs
InferQueue --> GPUs
BatchQueue --> GPUs
BatchQueue --> CPU
DefaultQueue --> CPU
classDef submit fill:#4a5568,stroke:#718096,color:#fff
classDef volcano fill:#667eea,stroke:#5a67d8,color:#fff
classDef plugin fill:#9f7aea,stroke:#805ad5,color:#fff
classDef queue fill:#ed8936,stroke:#dd6b20,color:#fff
classDef gpu fill:#e53e3e,stroke:#c53030,color:#fff
classDef cpu fill:#38a169,stroke:#2f855a,color:#fff
class KFP,Argo,Spark,Ray submit
class Admission,Scheduler,Controller volcano
class Gang,Priority,DRF,Binpack plugin
class MLQueue,InferQueue,BatchQueue,DefaultQueue queue
class Khelben,Elminster,Drizzt,Danilo gpu
class Workers,ARM cpu
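The weighted queues above map to Volcano `Queue` resources. A sketch of the highest-weight queue; the `reclaimable` setting is an assumption, not from the repo:

```yaml
# Sketch of the ml-training queue with weight 4, as in the diagram.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  weight: 4                        # relative share, matching the diagram
  reclaimable: true                # idle resources may be reclaimed (assumed)
```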