feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories and kubectl cluster-info dump data.
271
TECH-STACK.md
Normal file
@@ -0,0 +1,271 @@
# 🛠️ Technology Stack

> **Complete inventory of technologies used in the DaviesTechLabs homelab**

## Platform Layer

### Operating System

| Component | Version | Purpose |
|-----------|---------|---------|
| [Talos Linux](https://talos.dev) | v1.12.1 | Immutable, API-driven Kubernetes OS |
| Kernel | 6.18.2-talos | Linux kernel with GPU drivers |

### Container Orchestration

| Component | Version | Purpose |
|-----------|---------|---------|
| [Kubernetes](https://kubernetes.io) | v1.35.0 | Container orchestration |
| [containerd](https://containerd.io) | 2.1.6 | Container runtime |
| [Cilium](https://cilium.io) | Latest | CNI, network policies, eBPF |

### GitOps

| Component | Version | Purpose |
|-----------|---------|---------|
| [Flux CD](https://fluxcd.io) | v2 | GitOps continuous delivery |
| [SOPS](https://github.com/getsops/sops) | Latest | Secret encryption |
| [Age](https://github.com/FiloSottile/age) | Latest | Encryption key management |

---

## AI/ML Layer

### Inference Engines

| Service | Framework | GPU | Model Type |
|---------|-----------|-----|------------|
| [vLLM](https://vllm.ai) | ROCm | AMD Strix Halo | Large Language Models |
| [faster-whisper](https://github.com/guillaumekln/faster-whisper) | CUDA | NVIDIA RTX 2070 | Speech-to-Text |
| [XTTS](https://github.com/coqui-ai/TTS) | CUDA | NVIDIA RTX 2070 | Text-to-Speech |
| [BGE Embeddings](https://huggingface.co/BAAI/bge-large-en-v1.5) | ROCm | AMD Radeon 680M | Text Embeddings |
| [BGE Reranker](https://huggingface.co/BAAI/bge-reranker-large) | Intel | Intel Arc | Document Reranking |

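The Inference Engines table lists each service against one accelerator, and two services (faster-whisper and XTTS) are listed against the same RTX 2070. As a rough sketch only, the same rows can be held as data and grouped by GPU to make that sharing explicit. The names `INFERENCE_SERVICES`, `services_by_gpu`, and `shared_gpus` are hypothetical and do not come from the homelab repositories.

```python
from collections import defaultdict

# (service, framework, GPU) rows copied from the Inference Engines table.
INFERENCE_SERVICES = [
    ("vLLM", "ROCm", "AMD Strix Halo"),
    ("faster-whisper", "CUDA", "NVIDIA RTX 2070"),
    ("XTTS", "CUDA", "NVIDIA RTX 2070"),
    ("BGE Embeddings", "ROCm", "AMD Radeon 680M"),
    ("BGE Reranker", "Intel", "Intel Arc"),
]


def services_by_gpu(rows):
    """Group service names by the GPU they are listed against."""
    grouped = defaultdict(list)
    for service, _framework, gpu in rows:
        grouped[gpu].append(service)
    return dict(grouped)


def shared_gpus(rows):
    """Return only the GPUs that are listed for more than one service."""
    return {
        gpu: names
        for gpu, names in services_by_gpu(rows).items()
        if len(names) > 1
    }


if __name__ == "__main__":
    print(shared_gpus(INFERENCE_SERVICES))
```

The grouping says nothing about how the card is actually shared at runtime; it only reflects what the table states.
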
### ML Serving

| Component | Version | Purpose |
|-----------|---------|---------|
| [KServe](https://kserve.github.io) | v0.12+ | Model serving framework |
| [Ray Serve](https://ray.io/serve) | 2.53.0 | Unified inference endpoints |

### ML Workflows

| Component | Version | Purpose |
|-----------|---------|---------|
| [Kubeflow Pipelines](https://kubeflow.org) | 2.15.0 | ML pipeline orchestration |
| [Argo Workflows](https://argoproj.github.io/workflows) | v3.7.8 | DAG-based workflows |
| [Argo Events](https://argoproj.github.io/events) | Latest | Event-driven triggers |
| [MLflow](https://mlflow.org) | 3.7.0 | Experiment tracking, model registry |

### GPU Scheduling

| Component | Version | Purpose |
|-----------|---------|---------|
| [Volcano](https://volcano.sh) | Latest | GPU-aware scheduling |
| AMD GPU Device Plugin | v1.4.1 | ROCm GPU allocation |
| NVIDIA Device Plugin | Latest | CUDA GPU allocation |
| [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) | v0.18.2 | Hardware detection |

---

## Data Layer

### Databases

| Component | Version | Purpose |
|-----------|---------|---------|
| [CloudNative-PG](https://cloudnative-pg.io) | 16.11 | PostgreSQL for metadata |
| [Milvus](https://milvus.io) | Latest | Vector database for RAG |
| [ClickHouse](https://clickhouse.com) | Latest | Analytics, access logs |
| [Valkey](https://valkey.io) | Latest | Redis-compatible cache |

### Object Storage

| Component | Version | Purpose |
|-----------|---------|---------|
| [MinIO](https://min.io) | Latest | S3-compatible storage |
| [Longhorn](https://longhorn.io) | v1.10.1 | Distributed block storage |
| NFS CSI Driver | Latest | Shared filesystem |

### Messaging

| Component | Version | Purpose |
|-----------|---------|---------|
| [NATS](https://nats.io) | Latest | Message bus |
| NATS JetStream | Built-in | Persistent streaming |

### Data Processing

| Component | Version | Purpose |
|-----------|---------|---------|
| [Apache Spark](https://spark.apache.org) | Latest | Batch analytics |
| [Apache Flink](https://flink.apache.org) | Latest | Stream processing |
| [Apache Iceberg](https://iceberg.apache.org) | Latest | Table format |
| [Nessie](https://projectnessie.org) | Latest | Data catalog |
| [Trino](https://trino.io) | 479 | SQL query engine |

---

## Application Layer

### Web Frameworks

| Application | Language | Framework | Purpose |
|-------------|----------|-----------|---------|
| Companions | Go | net/http + HTMX | AI chat interface |
| Voice WebApp | Python | Gradio | Voice assistant UI |
| Various handlers | Python | asyncio + nats.py | NATS event handlers |

### Frontend

| Technology | Purpose |
|------------|---------|
| [HTMX](https://htmx.org) | Dynamic HTML updates |
| [Alpine.js](https://alpinejs.dev) | Lightweight reactivity |
| [VRM](https://vrm.dev) | 3D avatar rendering |

---

## Networking Layer

### Ingress

| Component | Version | Purpose |
|-----------|---------|---------|
| [Envoy Gateway](https://gateway.envoyproxy.io) | v1.6.3 | Gateway API implementation |
| [cloudflared](https://developers.cloudflare.com/cloudflare-one/connections/connect-apps) | Latest | Cloudflare tunnel |

### DNS & Certificates

| Component | Version | Purpose |
|-----------|---------|---------|
| [external-dns](https://github.com/kubernetes-sigs/external-dns) | Latest | Automatic DNS management |
| [cert-manager](https://cert-manager.io) | Latest | TLS certificate automation |

### Service Mesh

| Component | Purpose |
|-----------|---------|
| [Spegel](https://github.com/spegel-org/spegel) | P2P container image distribution |

---

## Security Layer

### Identity & Access

| Component | Version | Purpose |
|-----------|---------|---------|
| [Authentik](https://goauthentik.io) | 2025.12.1 | Identity provider, SSO |
| [Vault](https://vaultproject.io) | 1.21.2 | Secret management |
| [External Secrets Operator](https://external-secrets.io) | v1.3.1 | Kubernetes secrets sync |

### Runtime Security

| Component | Version | Purpose |
|-----------|---------|---------|
| [Falco](https://falco.org) | 0.42.1 | Runtime threat detection |
| Cilium Network Policies | Built-in | Network segmentation |

### Backup

| Component | Version | Purpose |
|-----------|---------|---------|
| [Velero](https://velero.io) | v1.17.1 | Cluster backup/restore |

---

## Observability Layer

### Metrics

| Component | Purpose |
|-----------|---------|
| [Prometheus](https://prometheus.io) | Metrics collection |
| [Grafana](https://grafana.com) | Dashboards & visualization |

### Logging

| Component | Version | Purpose |
|-----------|---------|---------|
| [Grafana Alloy](https://grafana.com/oss/alloy) | v1.12.0 | Log collection |
| [Loki](https://grafana.com/oss/loki) | Latest | Log aggregation |

### Tracing

| Component | Purpose |
|-----------|---------|
| [OpenTelemetry Collector](https://opentelemetry.io) | Trace collection |
| Tempo/Jaeger | Trace storage & query |

---

## Development Tools

### Local Development

| Tool | Purpose |
|------|---------|
| [mise](https://mise.jdx.dev) | Tool version management |
| [Task](https://taskfile.dev) | Task runner (Taskfile.yaml) |
| [flux-local](https://github.com/allenporter/flux-local) | Local Flux testing |

### CI/CD

| Tool | Purpose |
|------|---------|
| GitHub Actions | CI/CD pipelines |
| [Renovate](https://renovatebot.com) | Dependency updates |

### Image Building

| Tool | Purpose |
|------|---------|
| Docker | Container builds |
| GHCR | Container registry |

---

## Media & Entertainment

| Component | Version | Purpose |
|-----------|---------|---------|
| [Jellyfin](https://jellyfin.org) | 10.11.5 | Media server |
| [Nextcloud](https://nextcloud.com) | 32.0.5 | File sync & share |
| Prowlarr, Bazarr, etc. | Various | *arr stack |
| [Kasm](https://kasmweb.com) | 1.18.1 | Browser isolation |

---

## Python Dependencies (llm-workflows)

```toml
# Core
nats-py>=2.7.0          # NATS client
msgpack>=1.0.0          # Binary serialization
aiohttp>=3.9.0          # HTTP client

# ML/AI
pymilvus>=2.4.0         # Milvus client
sentence-transformers   # Embeddings
openai>=1.0.0           # vLLM OpenAI API

# Kubeflow
kfp>=2.12.1             # Pipeline SDK
```

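The dependency entries use a `name>=minimum` form with a trailing comment, and one entry (`sentence-transformers`) carries no version bound. As a minimal sketch, assuming the list is available as plain text, the lines can be read into a name-to-minimum mapping with the standard library alone. The helper `parse_minimums` and the constant `REQUIREMENTS` are hypothetical and are not part of llm-workflows. Only the plain `>=` form shown in the dependency list is handled.

```python
import re

# Requirement lines copied from the Python Dependencies block.
REQUIREMENTS = """\
# Core
nats-py>=2.7.0          # NATS client
msgpack>=1.0.0          # Binary serialization
aiohttp>=3.9.0          # HTTP client

# ML/AI
pymilvus>=2.4.0         # Milvus client
sentence-transformers   # Embeddings
openai>=1.0.0           # vLLM OpenAI API

# Kubeflow
kfp>=2.12.1             # Pipeline SDK
"""

# A package name, optionally followed by ">=" and a dotted numeric version.
_SPEC = re.compile(r"^([A-Za-z0-9_.-]+)\s*(?:>=\s*([0-9]+(?:\.[0-9]+)*))?$")


def parse_minimums(text):
    """Map each package name to its minimum version, or None if unbounded."""
    minimums = {}
    for line in text.splitlines():
        spec = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not spec:
            continue  # blank line or comment-only line
        match = _SPEC.match(spec)
        if match is None:
            raise ValueError(f"unsupported requirement line: {line!r}")
        minimums[match.group(1)] = match.group(2)
    return minimums


if __name__ == "__main__":
    for name, minimum in parse_minimums(REQUIREMENTS).items():
        print(name, minimum or "(no minimum)")
```

For anything beyond this simple form, the `packaging` library's requirement and version parsers are the more robust choice.
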
---

## Version Pinning Strategy

| Component Type | Strategy |
|----------------|----------|
| Base images | Pin major.minor |
| Helm charts | Pin exact version |
| Python packages | Pin minimum version |
| System extensions | Pin via Talos schematic |

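The first three strategies in the table differ in how much of a version string is held fixed. The sketch below illustrates that difference, assuming plain dotted numeric versions with an optional leading `v`. The function `matches_pin` and the strategy labels are hypothetical and do not describe tooling used in the homelab. Pinning via a Talos schematic is left out because it is handled by the Talos image definition rather than by version comparison.

```python
def parse_version(version):
    """Turn 'v1.10.1' or '2.7' into a tuple of ints such as (1, 10, 1)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def pad(parts, width):
    """Right-pad a version tuple with zeros so that 2.4 compares equal to 2.4.0."""
    return parts + (0,) * (width - len(parts))


def matches_pin(strategy, pinned, candidate):
    """Check whether `candidate` is acceptable under the given pin strategy."""
    pin, cand = parse_version(pinned), parse_version(candidate)
    width = max(len(pin), len(cand))
    if strategy == "exact":          # Helm charts: every component must match
        return pad(cand, width) == pad(pin, width)
    if strategy == "major.minor":    # Base images: patch releases may float
        return pad(cand, 2)[:2] == pad(pin, 2)[:2]
    if strategy == "minimum":        # Python packages: anything at or above
        return pad(cand, width) >= pad(pin, width)
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    print(matches_pin("major.minor", "1.12", "1.12.1"))
    print(matches_pin("exact", "v1.10.1", "1.10.2"))
    print(matches_pin("minimum", "2.7.0", "2.12.1"))
```

Pre-release and build suffixes are not handled here; a real check would need a full version parser.
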
## Related Documents

- [ARCHITECTURE.md](ARCHITECTURE.md) - How components connect
- [decisions/](decisions/) - Why we chose specific technologies