# 🛠️ Technology Stack
> **Complete inventory of technologies used in the DaviesTechLabs homelab**
## Platform Layer
### Operating System
| Component | Version | Purpose |
|-----------|---------|---------|
| [Talos Linux](https://talos.dev) | v1.12.1 | Immutable, API-driven Kubernetes OS |
| Kernel | 6.18.2-talos | Linux kernel with GPU drivers |
### Container Orchestration
| Component | Version | Purpose |
|-----------|---------|---------|
| [Kubernetes](https://kubernetes.io) | v1.35.0 | Container orchestration |
| [containerd](https://containerd.io) | 2.1.6 | Container runtime |
| [Cilium](https://cilium.io) | Latest | CNI, network policies, eBPF |
### GitOps
| Component | Version | Purpose |
|-----------|---------|---------|
| [Flux CD](https://fluxcd.io) | v2 | GitOps continuous delivery |
| [SOPS](https://github.com/getsops/sops) | Latest | Secret encryption |
| [Age](https://github.com/FiloSottile/age) | Latest | Encryption key management |
---
## AI/ML Layer
### GPU Inference (KubeRay RayService)
All AI inference runs on a unified Ray Serve endpoint with fractional GPU allocation:

| Service | Model | GPU Node | GPU Type | Allocation |
|---------|-------|----------|----------|------------|
| `/llm` | [vLLM](https://vllm.ai) (Llama 3.1 70B) | khelben | AMD Strix Halo 64GB | 0.95 GPU |
| `/whisper` | [faster-whisper](https://github.com/guillaumekln/faster-whisper) v3 | elminster | NVIDIA RTX 2070 8GB | 0.5 GPU |
| `/tts` | [XTTS](https://github.com/coqui-ai/TTS) | elminster | NVIDIA RTX 2070 8GB | 0.5 GPU |
| `/embeddings` | [BGE-Large](https://huggingface.co/BAAI/bge-large-en-v1.5) | drizzt | AMD Radeon 680M 12GB | 0.8 GPU |
| `/reranker` | [BGE-Reranker](https://huggingface.co/BAAI/bge-reranker-large) | danilo | Intel Arc 16GB | 0.8 GPU |

**Endpoint**: `ai-inference-serve-svc.ai-ml.svc.cluster.local:8000/{service}`
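
To make the fractional allocations concrete, the sketch below sums the GPU fraction requested on each node and builds the in-cluster URL for a route. The routes, node names, fractions, and endpoint host come from the table above; the `DEPLOYMENTS` mapping and both helper functions are hypothetical names used only for illustration. It assumes one GPU per node and plain HTTP, neither of which is stated in this document.

```python
from collections import defaultdict

# Routes, nodes, and GPU fractions copied from the table above.
DEPLOYMENTS = {
    "llm": ("khelben", 0.95),
    "whisper": ("elminster", 0.5),
    "tts": ("elminster", 0.5),
    "embeddings": ("drizzt", 0.8),
    "reranker": ("danilo", 0.8),
}

ENDPOINT = "ai-inference-serve-svc.ai-ml.svc.cluster.local:8000"


def gpu_load_per_node(deployments):
    """Sum the requested GPU fraction for each node."""
    load = defaultdict(float)
    for node, fraction in deployments.values():
        load[node] += fraction
    return dict(load)


def service_url(service):
    """Build the in-cluster URL for a known route (the http scheme is an assumption)."""
    if service not in DEPLOYMENTS:
        raise KeyError(f"unknown service: {service}")
    return f"http://{ENDPOINT}/{service}"


if __name__ == "__main__":
    for node, load in sorted(gpu_load_per_node(DEPLOYMENTS).items()):
        # Assumption: each node exposes exactly one GPU, so the budget is 1.0.
        status = "ok" if load <= 1.0 else "over-committed"
        print(f"{node}: {load:.2f} GPU ({status})")
    print(service_url("llm"))
```

Under the one-GPU-per-node assumption, `elminster` is the only fully committed node: `/whisper` and `/tts` each take half of the RTX 2070.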
### ML Serving Stack
| Component | Version | Purpose |
|-----------|---------|---------|
| [KubeRay](https://ray-project.github.io/kuberay/) | 1.4+ | Ray cluster operator |
| [Ray Serve](https://ray.io/serve) | 2.53.0 | Unified inference endpoints |
| [KServe](https://kserve.github.io) | v0.12+ | Abstraction layer (ExternalName aliases) |
### ML Workflows
| Component | Version | Purpose |
|-----------|---------|---------|
| [Kubeflow Pipelines](https://kubeflow.org) | 2.15.0 | ML pipeline orchestration |
| [Argo Workflows](https://argoproj.github.io/workflows) | v3.7.8 | DAG-based workflows |
| [Argo Events](https://argoproj.github.io/events) | Latest | Event-driven triggers |
| [MLflow](https://mlflow.org) | 3.7.0 | Experiment tracking, model registry |
### GPU Scheduling
| Component | Version | Purpose |
|-----------|---------|---------|
| [Volcano](https://volcano.sh) | Latest | GPU-aware scheduling |
| AMD GPU Device Plugin | v1.4.1 | ROCm GPU allocation |
| NVIDIA Device Plugin | Latest | CUDA GPU allocation |
| [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) | v0.18.2 | Hardware detection |
---
## Data Layer
### Databases
| Component | Version | Purpose |
|-----------|---------|---------|
| [CloudNative-PG](https://cloudnative-pg.io) | 16.11 (PostgreSQL) | PostgreSQL operator for metadata databases |
| [Milvus](https://milvus.io) | Latest | Vector database for RAG |
| [ClickHouse](https://clickhouse.com) | Latest | Analytics, access logs |
| [Valkey](https://valkey.io) | Latest | Redis-compatible cache |
### Object Storage
| Component | Version | Purpose |
|-----------|---------|---------|
| [MinIO](https://min.io) | Latest | S3-compatible storage |
| [Longhorn](https://longhorn.io) | v1.10.1 | Distributed block storage |
| NFS CSI Driver | Latest | Shared filesystem |
### Messaging
| Component | Version | Purpose |
|-----------|---------|---------|
| [NATS](https://nats.io) | Latest | Message bus |
| NATS JetStream | Built-in | Persistent streaming |
### Data Processing
| Component | Version | Purpose |
|-----------|---------|---------|
| [Apache Spark](https://spark.apache.org) | Latest | Batch analytics |
| [Apache Flink](https://flink.apache.org) | Latest | Stream processing |
| [Apache Iceberg](https://iceberg.apache.org) | Latest | Table format |
| [Nessie](https://projectnessie.org) | Latest | Data catalog |
| [Trino](https://trino.io) | 479 | SQL query engine |
---
## Application Layer
### Web Frameworks
| Application | Language | Framework | Purpose |
|-------------|----------|-----------|---------|
| Companions | Go | net/http + HTMX | AI chat interface |
| Voice WebApp | Python | Gradio | Voice assistant UI |
| Various handlers | Python | asyncio + nats.py | NATS event handlers |
### Frontend
| Technology | Purpose |
|------------|---------|
| [HTMX](https://htmx.org) | Dynamic HTML updates |
| [Alpine.js](https://alpinejs.dev) | Lightweight reactivity |
| [VRM](https://vrm.dev) | 3D avatar rendering |
---
## Networking Layer
### Ingress
| Component | Version | Purpose |
|-----------|---------|---------|
| [Envoy Gateway](https://gateway.envoyproxy.io) | v1.6.3 | Gateway API implementation |
| [cloudflared](https://developers.cloudflare.com/cloudflare-one/connections/connect-apps) | Latest | Cloudflare tunnel |
### DNS & Certificates
| Component | Version | Purpose |
|-----------|---------|---------|
| [external-dns](https://github.com/kubernetes-sigs/external-dns) | Latest | Automatic DNS management |
| [cert-manager](https://cert-manager.io) | Latest | TLS certificate automation |
### Image Distribution
| Component | Purpose |
|-----------|---------|
| [Spegel](https://github.com/spegel-org/spegel) | P2P container image distribution |
---
## Security Layer
### Identity & Access
| Component | Version | Purpose |
|-----------|---------|---------|
| [Authentik](https://goauthentik.io) | 2025.12.1 | Identity provider, SSO |
| [Vault](https://vaultproject.io) | 1.21.2 | Secret management |
| [External Secrets Operator](https://external-secrets.io) | v1.3.1 | Kubernetes secrets sync |
### Runtime Security
| Component | Version | Purpose |
|-----------|---------|---------|
| [Falco](https://falco.org) | 0.42.1 | Runtime threat detection |
| Cilium Network Policies | Built-in | Network segmentation |
### Backup
| Component | Version | Purpose |
|-----------|---------|---------|
| [Velero](https://velero.io) | v1.17.1 | Cluster backup/restore |
---
## Observability Layer
### Metrics
| Component | Purpose |
|-----------|---------|
| [Prometheus](https://prometheus.io) | Metrics collection |
| [Grafana](https://grafana.com) | Dashboards & visualization |
### Logging
| Component | Version | Purpose |
|-----------|---------|---------|
| [Grafana Alloy](https://grafana.com/oss/alloy) | v1.12.0 | Log collection |
| [Loki](https://grafana.com/oss/loki) | Latest | Log aggregation |
### Tracing
| Component | Purpose |
|-----------|---------|
| [OpenTelemetry Collector](https://opentelemetry.io) | Trace collection |
| Tempo/Jaeger | Trace storage & query |
---
## Development Tools
### Local Development
| Tool | Purpose |
|------|---------|
| [mise](https://mise.jdx.dev) | Tool version management |
| [Task](https://taskfile.dev) | Task runner (Taskfile.yaml) |
| [flux-local](https://github.com/allenporter/flux-local) | Local Flux testing |
### CI/CD
| Tool | Purpose |
|------|---------|
| GitHub Actions | CI/CD pipelines |
| [Renovate](https://renovatebot.com) | Dependency updates |
### Image Building
| Tool | Purpose |
|------|---------|
| Docker | Container builds |
| GHCR | Container registry |
---
## Media & Entertainment
| Component | Version | Purpose |
|-----------|---------|---------|
| [Jellyfin](https://jellyfin.org) | 10.11.5 | Media server |
| [Nextcloud](https://nextcloud.com) | 32.0.5 | File sync & share |
| Prowlarr, Bazarr, etc. | Various | *arr stack |
| [Kasm](https://kasmweb.com) | 1.18.1 | Browser isolation |
---
## Python Dependencies (handler-base)
Core library for all NATS handlers: [handler-base](https://git.daviestechlabs.io/daviestechlabs/handler-base)
```text
# Core
nats-py>=2.7.0 # NATS client
msgpack>=1.0.0 # Binary serialization
httpx>=0.27.0 # HTTP client
# ML/AI
pymilvus>=2.4.0 # Milvus client
openai>=1.0.0 # vLLM OpenAI API
# Observability
opentelemetry-api>=1.20.0
opentelemetry-sdk>=1.20.0
mlflow>=2.10.0 # Experiment tracking
# Kubeflow (kubeflow repo)
kfp>=2.12.1 # Pipeline SDK
```
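
The specifiers above are minimum pins: any release at or above the stated version satisfies them. The sketch below checks that rule using only the standard library. The helpers `parse_version`, `parse_min_pin`, and `satisfies` are hypothetical and handle only plain `name>=X.Y.Z` lines with numeric components; real tooling would more likely rely on the `packaging` library. Versions are compared as integer tuples because a string comparison would wrongly rank `2.9.2` above `2.10.0`.

```python
def parse_version(text):
    """Turn '2.10.0' into (2, 10, 0); only numeric dotted versions are supported."""
    return tuple(int(part) for part in text.split("."))


def parse_min_pin(line):
    """Split 'name>=X.Y.Z  # comment' into (name, minimum_version_tuple)."""
    spec = line.split("#", 1)[0].strip()
    name, _, minimum = spec.partition(">=")
    if not minimum:
        raise ValueError(f"not a minimum pin: {line!r}")
    return name.strip(), parse_version(minimum.strip())


def satisfies(installed, line):
    """Return True if the installed version meets the line's minimum pin."""
    _, minimum = parse_min_pin(line)
    return parse_version(installed) >= minimum


if __name__ == "__main__":
    print(parse_min_pin("mlflow>=2.10.0 # Experiment tracking"))
    print(satisfies("3.7.0", "mlflow>=2.10.0"))   # newer major release
    print(satisfies("2.9.2", "mlflow>=2.10.0"))   # older minor release
```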
---
## Version Pinning Strategy
| Component Type | Strategy |
|----------------|----------|
| Base images | Pin major.minor |
| Helm charts | Pin exact version |
| Python packages | Pin minimum version |
| System extensions | Pin via Talos schematic |
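
The difference between the "major.minor" and "exact" strategies can be sketched as a prefix match on version components. The helper names and the version numbers in the usage lines are hypothetical, and only numeric dotted versions with an optional leading `v` are handled.

```python
def components(version):
    """Split a version such as 'v1.10.1' into integer parts: (1, 10, 1)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def allowed_by_pin(candidate, pin):
    """True if `candidate` starts with every component of `pin`.

    A two-part pin ('1.10') accepts any patch release in that minor line;
    a three-part pin ('1.10.1') accepts only that exact release.
    """
    wanted = components(pin)
    return components(candidate)[: len(wanted)] == wanted


if __name__ == "__main__":
    print(allowed_by_pin("v1.10.3", "1.10"))     # patch bump within the pinned minor
    print(allowed_by_pin("v1.11.0", "1.10"))     # minor bump outside the pin
    print(allowed_by_pin("v1.10.2", "v1.10.1"))  # exact pin rejects other patches
```

A major.minor pin lets patch releases flow in automatically, while an exact pin requires an explicit update for every release.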
## Related Documents
- [ARCHITECTURE.md](ARCHITECTURE.md) - How components connect
- [decisions/](decisions/) - Why we chose specific technologies