feat: add comprehensive architecture documentation
- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories and kubectl cluster-info dump data.
decisions/0000-template.md (new file, +71 lines)
# [short title of solved problem and solution]

* Status: [proposed | rejected | accepted | deprecated | superseded by [ADR-NNNN](NNNN-example.md)]
* Date: YYYY-MM-DD
* Deciders: [list of people involved in decision]
* Technical Story: [description | ticket/issue URL]

## Context and Problem Statement

[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]

## Decision Drivers

* [driver 1, e.g., a force, facing concern, …]
* [driver 2, e.g., a force, facing concern, …]
* … <!-- numbers of drivers can vary -->

## Considered Options

* [option 1]
* [option 2]
* [option 3]
* … <!-- numbers of options can vary -->

## Decision Outcome

Chosen option: "[option N]", because [justification. e.g., only option which meets k.o. criterion decision driver | which resolves force | … | comes out best (see below)].

### Positive Consequences

* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
* …

### Negative Consequences

* [e.g., compromising quality attribute, follow-up decisions required, …]
* …

## Pros and Cons of the Options

### [option 1]

[example | description | pointer to more information | …]

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

### [option 2]

[example | description | pointer to more information | …]

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

### [option 3]

[example | description | pointer to more information | …]

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

## Links

* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
* … <!-- numbers of links can vary -->
decisions/0001-record-architecture-decisions.md (new file, +79 lines)
# Record Architecture Decisions

* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Initial setup of homelab documentation

## Context and Problem Statement

As the homelab infrastructure grows in complexity with AI/ML services, multi-GPU configurations, and event-driven architectures, we need a way to document and communicate significant architectural decisions. Without documentation, the rationale behind choices gets lost, making future changes risky and onboarding difficult.

## Decision Drivers

* Need to preserve context for why decisions were made
* Enable future maintainers (including AI agents) to understand the system
* Provide a structured way to evaluate alternatives
* Support the wiki/design process for iterative improvements

## Considered Options

* Informal documentation in README files
* Wiki pages without structure
* Architecture Decision Records (ADRs)
* No documentation (rely on code)

## Decision Outcome

Chosen option: "Architecture Decision Records (ADRs)", because they provide a structured format that captures context, alternatives, and consequences. They're lightweight, version-controlled, and well-suited for technical decisions.

### Positive Consequences

* Clear historical record of decisions
* Structured format makes decisions searchable
* Forces consideration of alternatives
* Git-versioned alongside code
* AI agents can parse and understand decisions

### Negative Consequences

* Requires discipline to create ADRs
* May accumulate outdated decisions over time
* Additional overhead for simple decisions

## Pros and Cons of the Options

### Informal README documentation

* Good, because low friction
* Good, because close to code
* Bad, because no structure for alternatives
* Bad, because decisions get buried in prose

### Wiki pages

* Good, because easy to edit
* Good, because supports rich formatting
* Bad, because separate from code repository
* Bad, because no enforced structure

### ADRs

* Good, because structured format
* Good, because version controlled
* Good, because captures alternatives considered
* Good, because industry-standard practice
* Bad, because requires creating new files
* Bad, because may seem bureaucratic for small decisions

### No documentation

* Good, because no overhead
* Bad, because context is lost
* Bad, because makes onboarding difficult
* Bad, because risky for future changes

## Links

* Based on [MADR template](https://adr.github.io/madr/)
* [ADR GitHub organization](https://adr.github.io/)
decisions/0002-use-talos-linux.md (new file, +97 lines)
# Use Talos Linux for Kubernetes Nodes

* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Selecting OS for bare-metal Kubernetes cluster

## Context and Problem Statement

We need a reliable, secure operating system for running Kubernetes on bare-metal homelab nodes. The OS should minimize attack surface, be easy to manage at scale, and support our GPU requirements (AMD ROCm, NVIDIA CUDA, Intel).

## Decision Drivers

* Security-first design (immutable, minimal)
* API-driven management (no SSH)
* Support for various GPU drivers
* Kubernetes-native focus
* Community support and updates
* Ease of upgrades

## Considered Options

* Ubuntu Server with kubeadm
* Flatcar Container Linux
* Talos Linux
* k3OS (discontinued)
* Rocky Linux with RKE2

## Decision Outcome

Chosen option: "Talos Linux", because it provides an immutable, API-driven, Kubernetes-focused OS that minimizes attack surface and simplifies operations.

### Positive Consequences

* Immutable root filesystem prevents drift
* No SSH reduces attack vectors
* API-driven management integrates well with GitOps
* Schematic system allows custom kernel modules (GPU drivers)
* Consistent configuration across all nodes
* Automatic updates with minimal disruption

### Negative Consequences

* Learning curve for API-driven management
* Debugging requires different approaches (no SSH)
* Custom extensions require schematic IDs
* Less flexibility for non-Kubernetes workloads
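The schematic system mentioned above works by declaring the extra system extensions a node needs; the Image Factory then builds a matching installer image. A minimal sketch for an NVIDIA node follows — the extension names are illustrative, and the exact list depends on each node's hardware and Talos version:

```yaml
# Illustrative Image Factory schematic (extension names are examples)
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nonfree-kmod-nvidia
      - siderolabs/nvidia-container-toolkit
```

Submitting a schematic like this to factory.talos.dev yields a schematic ID, which is then referenced in the node's installer image URL.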
## Pros and Cons of the Options

### Ubuntu Server with kubeadm

* Good, because familiar
* Good, because extensive package availability
* Good, because easy debugging via SSH
* Bad, because mutable system leads to drift
* Bad, because large attack surface
* Bad, because manual package management

### Flatcar Container Linux

* Good, because immutable
* Good, because auto-updates
* Good, because container-focused
* Bad, because less Kubernetes-specific
* Bad, because smaller community than Talos
* Bad, because GPU driver setup more complex

### Talos Linux

* Good, because purpose-built for Kubernetes
* Good, because immutable and minimal
* Good, because API-driven (no SSH)
* Good, because excellent Kubernetes integration
* Good, because active development and community
* Good, because schematic system for GPU drivers
* Bad, because learning curve
* Bad, because no traditional debugging

### k3OS

* Good, because simple
* Bad, because discontinued

### Rocky Linux with RKE2

* Good, because enterprise-like
* Good, because familiar Linux experience
* Bad, because mutable system
* Bad, because more operational overhead
* Bad, because larger attack surface

## Links

* [Talos Linux](https://talos.dev)
* [Talos Image Factory](https://factory.talos.dev)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU driver integration via schematics
decisions/0003-use-nats-for-messaging.md (new file, +112 lines)
# Use NATS for AI/ML Messaging

* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting message bus for AI service orchestration

## Context and Problem Statement

The AI/ML platform requires a messaging system for:

- Real-time chat message routing
- Voice request/response streaming
- Pipeline triggers and status updates
- Event-driven workflow orchestration

We need a messaging system that handles both ephemeral real-time messages and persistent streams.

## Decision Drivers

* Low latency for real-time chat/voice
* Persistence for audit and replay
* Simple operations for homelab
* Support for request-reply pattern
* Wildcard subscriptions for routing
* Binary message support (audio data)

## Considered Options

* Apache Kafka
* RabbitMQ
* Redis Pub/Sub + Streams
* NATS with JetStream
* Apache Pulsar

## Decision Outcome

Chosen option: "NATS with JetStream", because it provides both fire-and-forget messaging and persistent streams with significantly simpler operations than alternatives.

### Positive Consequences

* Sub-millisecond latency for real-time messages
* JetStream provides persistence when needed
* Simple deployment (single binary)
* Excellent Kubernetes integration
* Request-reply pattern built-in
* Wildcard subscriptions for flexible routing
* Low resource footprint

### Negative Consequences

* Less ecosystem than Kafka
* JetStream less mature than Kafka Streams
* No built-in schema registry
* Smaller community than RabbitMQ

## Pros and Cons of the Options

### Apache Kafka

* Good, because industry standard for streaming
* Good, because rich ecosystem (Kafka Streams, Connect)
* Good, because schema registry
* Good, because excellent for high throughput
* Bad, because operationally complex (ZooKeeper/KRaft)
* Bad, because high resource requirements
* Bad, because overkill for homelab scale
* Bad, because higher latency for real-time messages

### RabbitMQ

* Good, because mature and stable
* Good, because flexible routing
* Good, because solid management UI
* Bad, because AMQP protocol overhead
* Bad, because not designed for streaming
* Bad, because more complex clustering

### Redis Pub/Sub + Streams

* Good, because simple
* Good, because Redis may already be in the stack
* Good, because low latency
* Bad, because pub/sub not persistent
* Bad, because streams API less intuitive
* Bad, because not the primary purpose of Redis

### NATS with JetStream

* Good, because extremely low latency
* Good, because simple operations
* Good, because both pub/sub and persistence
* Good, because request-reply built-in
* Good, because wildcard subscriptions
* Good, because low resource usage
* Good, because excellent Go/Python clients
* Bad, because smaller ecosystem
* Bad, because JetStream newer than Kafka

### Apache Pulsar

* Good, because unified messaging + streaming
* Good, because multi-tenancy
* Good, because geo-replication
* Bad, because complex architecture
* Bad, because high resource requirements
* Bad, because smaller community
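## Implementation Notes

The wildcard routing cited above follows NATS subject semantics: subjects are dot-separated tokens, `*` matches exactly one token, and `>` matches one or more trailing tokens. A pure-Python sketch of those semantics (this is not the client API, and the subject names are hypothetical):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style pattern matches a concrete subject.

    '*' matches exactly one dot-separated token; '>' (which must be the
    last token of a pattern) matches one or more trailing tokens.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":                       # matches the remainder, if any
            return len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)

# A router subscribed to "chat.*.request" sees every user's chat requests:
assert subject_matches("chat.*.request", "chat.user-123.request")
assert not subject_matches("chat.*.request", "chat.user-123.response")
# "voice.>" captures the whole voice hierarchy:
assert subject_matches("voice.>", "voice.user-123.stream.chunk")
```

This is why a single subscriber can route all chat traffic while voice services fan out per-user streams without extra broker configuration.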
## Links

* [NATS.io](https://nats.io)
* [JetStream Documentation](https://docs.nats.io/nats-concepts/jetstream)
* Related: [ADR-0004](0004-use-messagepack-for-nats.md) - Message format
decisions/0004-use-messagepack-for-nats.md (new file, +137 lines)
# Use MessagePack for NATS Messages

* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting serialization format for NATS messages

## Context and Problem Statement

NATS messages in the AI platform carry various payloads:

- Text chat messages (small)
- Voice audio data (potentially large, base64 or binary)
- Streaming response chunks
- Pipeline parameters

We need a serialization format that handles both text and binary efficiently.

## Decision Drivers

* Efficient binary data handling (audio)
* Compact message size
* Fast serialization/deserialization
* Cross-language support (Python, Go)
* Debugging ability
* Schema flexibility

## Considered Options

* JSON
* Protocol Buffers (protobuf)
* MessagePack (msgpack)
* CBOR
* Avro

## Decision Outcome

Chosen option: "MessagePack (msgpack)", because it provides binary efficiency with JSON-like simplicity and schema-less flexibility.

### Positive Consequences

* Native binary support (no base64 overhead for audio)
* 20-50% smaller than JSON for typical messages
* Faster serialization than JSON
* No schema compilation step
* Easy debugging (can pretty-print like JSON)
* Excellent Python and Go libraries

### Negative Consequences

* Less human-readable than JSON when raw
* No built-in schema validation
* Slightly less common than JSON

## Pros and Cons of the Options

### JSON

* Good, because human-readable
* Good, because universal support
* Good, because no setup required
* Bad, because binary data requires base64 (33% overhead)
* Bad, because larger message sizes
* Bad, because slower parsing
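The 33% figure above can be checked with the standard library alone: base64 encodes every 3 input bytes as 4 output characters, so raw bytes embedded in a JSON document grow by a third before any JSON quoting is even counted.

```python
# Verifying the base64 overhead claim: embedding raw bytes in JSON
# forces a base64 detour, inflating the payload by one third.
import base64
import json

audio_bytes = bytes(3000)  # stand-in for a real audio chunk
b64 = base64.b64encode(audio_bytes).decode()
json_payload = json.dumps({"audio": b64})

# base64 maps every 3 bytes to 4 characters: a 4/3 expansion.
assert len(b64) == 4000
assert len(b64) / len(audio_bytes) == 4 / 3
```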
### Protocol Buffers

* Good, because very compact
* Good, because fast
* Good, because schema validation
* Good, because cross-language
* Bad, because requires schema definition
* Bad, because compilation step
* Bad, because less flexible for evolving schemas
* Bad, because overkill for simple messages

### MessagePack

* Good, because binary-efficient
* Good, because JSON-like simplicity
* Good, because no schema required
* Good, because excellent library support
* Good, because can include raw bytes
* Bad, because not human-readable raw
* Bad, because no schema validation

### CBOR

* Good, because binary-efficient
* Good, because IETF standard
* Good, because schema-less
* Bad, because less common libraries
* Bad, because smaller community
* Bad, because similar to msgpack with less adoption

### Avro

* Good, because schema evolution
* Good, because compact
* Good, because schema registry integration
* Bad, because requires schema
* Bad, because more complex setup
* Bad, because Java-centric ecosystem

## Implementation Notes

```python
# Python usage
import msgpack

audio_bytes = b"\x00\x01\x02"  # placeholder for a real audio chunk

# Serialize
data = {
    "user_id": "user-123",
    "audio": audio_bytes,  # Raw bytes, no base64
    "premium": True
}
payload = msgpack.packb(data)

# Deserialize
data = msgpack.unpackb(payload, raw=False)
```

```go
// Go usage
import "github.com/vmihailenco/msgpack/v5"

type Message struct {
    UserID string `msgpack:"user_id"`
    Audio  []byte `msgpack:"audio"`
}
```

## Links

* [MessagePack Specification](https://msgpack.org)
* [msgpack-python](https://github.com/msgpack/msgpack-python)
* Related: [ADR-0003](0003-use-nats-for-messaging.md) - Message bus choice
* See: [BINARY_MESSAGES_AND_JETSTREAM.md](../specs/BINARY_MESSAGES_AND_JETSTREAM.md)
decisions/0005-multi-gpu-strategy.md (new file, +145 lines)
# Multi-GPU Heterogeneous Strategy

* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: GPU allocation strategy for AI workloads

## Context and Problem Statement

The homelab has diverse GPU hardware:

- AMD Strix Halo (64GB unified memory) - khelben
- NVIDIA RTX 2070 (8GB VRAM) - elminster
- AMD Radeon 680M (12GB VRAM) - drizzt
- Intel Arc (integrated) - danilo

Different AI workloads have different requirements. How do we allocate GPUs effectively?

## Decision Drivers

* Maximize utilization of all GPUs
* Match workloads to appropriate hardware
* Support concurrent inference services
* Enable fractional GPU sharing where appropriate
* Minimize cross-vendor complexity

## Considered Options

* Single GPU vendor only
* All workloads on largest GPU
* Workload-specific GPU allocation
* Dynamic GPU scheduling (MIG/fractional)

## Decision Outcome

Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on requirements.

### Allocation Strategy

| Workload | GPU | Node | Rationale |
|----------|-----|------|-----------|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares with Whisper |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |

### Positive Consequences

* Each workload gets optimal hardware
* No GPU memory contention for LLM
* NVIDIA services can share via time-slicing
* Cost-effective use of varied hardware
* Clear ownership and debugging

### Negative Consequences

* More complex scheduling (node taints/tolerations)
* Less flexibility for workload migration
* Must maintain multiple GPU driver stacks
* Some GPUs underutilized at times

## Implementation

### Node Taints

```yaml
# khelben - dedicated vLLM node
nodeTaints:
  dedicated: "vllm:NoSchedule"
```

### Pod Tolerations and Node Affinity

```yaml
# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]
```

### Resource Limits

```yaml
# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1

# AMD GPU (drizzt, khelben)
resources:
  limits:
    amd.com/gpu: 1
```
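The time-slicing noted under Positive Consequences is configured through the NVIDIA device plugin. A sketch of its sharing config for elminster — the replica count is an assumption, chosen so Whisper and XTTS each get a slice of the single RTX 2070:

```yaml
# NVIDIA device plugin time-slicing config (sketch): advertises the one
# physical GPU on elminster as two schedulable nvidia.com/gpu units
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2  # assumed; one slice each for Whisper and XTTS
```

Time-slicing gives no memory isolation, so both services must together fit in the 8GB of VRAM.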
## Pros and Cons of the Options

### Single GPU vendor only

* Good, because simpler driver management
* Good, because consistent tooling
* Bad, because wastes existing hardware
* Bad, because higher cost for new hardware

### All workloads on largest GPU

* Good, because simple scheduling
* Good, because unified memory benefits
* Bad, because memory contention
* Bad, because single point of failure
* Bad, because wastes other GPUs

### Workload-specific allocation (chosen)

* Good, because optimal hardware matching
* Good, because uses all available GPUs
* Good, because clear resource boundaries
* Good, because parallel inference
* Bad, because more complex configuration
* Bad, because multiple driver stacks

### Dynamic GPU scheduling

* Good, because flexible
* Good, because maximizes utilization
* Bad, because complex to implement
* Bad, because MIG not available on consumer GPUs
* Bad, because cross-vendor scheduling immature

## Links

* [Volcano Scheduler](https://volcano.sh)
* [AMD GPU Device Plugin](https://github.com/ROCm/k8s-device-plugin)
* [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
* Related: [ADR-0002](0002-use-talos-linux.md) - GPU drivers via Talos schematics
decisions/0006-gitops-with-flux.md (new file, +140 lines)
# GitOps with Flux CD

* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Implementing GitOps for cluster management

## Context and Problem Statement

Managing a Kubernetes cluster with numerous applications, configurations, and secrets requires a reliable, auditable, and reproducible approach. Manual `kubectl apply` is error-prone and doesn't track state over time.

## Decision Drivers

* Infrastructure as Code (IaC) principles
* Audit trail for all changes
* Self-healing cluster state
* Multi-repository support
* Secret encryption integration
* Active community and maintenance

## Considered Options

* Manual kubectl apply
* ArgoCD
* Flux CD
* Rancher Fleet
* Pulumi/Terraform for Kubernetes

## Decision Outcome

Chosen option: "Flux CD", because it provides a mature GitOps implementation with excellent multi-source support, SOPS integration, and aligns well with the Kubernetes ecosystem.

### Positive Consequences

* Git is single source of truth
* Automatic drift detection and correction
* Native SOPS/Age secret encryption
* Multi-repository support (homelab-k8s2 + llm-workflows)
* Helm and Kustomize native support
* Webhook-free sync (pull-based)

### Negative Consequences

* No built-in UI (use CLI or third-party)
* Learning curve for CRD-based configuration
* Debugging requires understanding Flux controllers

## Configuration

### Repository Structure

```
homelab-k8s2/
├── kubernetes/
│   ├── flux/                  # Flux system config
│   │   ├── config/
│   │   │   ├── cluster.yaml
│   │   │   └── secrets.yaml   # SOPS encrypted
│   │   └── repositories/
│   │       ├── helm/          # HelmRepositories
│   │       └── git/           # GitRepositories
│   └── apps/                  # Application Kustomizations
```

### Multi-Repository Sync

```yaml
# GitRepository for llm-workflows
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: llm-workflows
  namespace: flux-system
spec:
  url: ssh://git@github.com/Billy-Davies-2/llm-workflows
  ref:
    branch: main
  secretRef:
    name: github-deploy-key
```
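A GitRepository only fetches sources; a companion Kustomization tells Flux what to apply from them and keeps it reconciled. A sketch pairing with the repository above — the `path` and `interval` values are assumptions about the repo layout:

```yaml
# Kustomization applying manifests from the llm-workflows source
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: llm-workflows
  namespace: flux-system
spec:
  interval: 10m          # how often to reconcile (assumed)
  sourceRef:
    kind: GitRepository
    name: llm-workflows
  path: ./kubernetes     # assumed location of manifests in the repo
  prune: true            # delete cluster objects removed from Git
```

With `prune: true`, deleting a manifest from Git removes the corresponding object from the cluster, which is what makes Git the single source of truth.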
### SOPS Integration

```yaml
# .sops.yaml
creation_rules:
  - path_regex: .*\.sops\.yaml$
    age: age1...  # Public key
```

## Pros and Cons of the Options

### Manual kubectl apply

* Good, because simple
* Good, because no setup
* Bad, because no audit trail
* Bad, because no drift detection
* Bad, because not reproducible

### ArgoCD

* Good, because great UI
* Good, because app-of-apps pattern
* Good, because large community
* Bad, because heavier resource usage
* Bad, because webhook-dependent sync
* Bad, because SOPS requires plugins

### Flux CD

* Good, because lightweight
* Good, because pull-based (no webhooks)
* Good, because native SOPS support
* Good, because multi-source/multi-tenant
* Good, because Kubernetes-native CRDs
* Bad, because no built-in UI
* Bad, because CRD learning curve

### Rancher Fleet

* Good, because integrated with Rancher
* Good, because multi-cluster
* Bad, because Rancher ecosystem lock-in
* Bad, because smaller community

### Pulumi/Terraform

* Good, because familiar IaC tools
* Good, because drift detection
* Bad, because not Kubernetes-native
* Bad, because requires state management
* Bad, because not continuous reconciliation

## Links

* [Flux CD](https://fluxcd.io)
* [SOPS Integration](https://fluxcd.io/flux/guides/mozilla-sops/)
* [flux-local](https://github.com/allenporter/flux-local) - Local testing
decisions/0007-use-kserve-for-inference.md (new file, +115 lines)
# Use KServe for ML Model Serving

* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting model serving platform for inference services

## Context and Problem Statement

We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.

## Decision Drivers

* Standardized inference protocol (V2)
* Autoscaling based on load
* Traffic splitting for canary deployments
* Integration with Kubeflow ecosystem
* GPU resource management
* Health checks and readiness

## Considered Options

* Raw Kubernetes Deployments + Services
* KServe InferenceService
* Seldon Core
* BentoML
* Ray Serve only

## Decision Outcome

Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.

### Positive Consequences

* Standardized V2 inference protocol
* Automatic scale-to-zero capability
* Canary/blue-green deployments
* Integration with Kubeflow UI
* Transformer/Explainer components
* GPU resource abstraction

### Negative Consequences

* Additional CRDs and operators
* Learning curve for InferenceService spec
* Some overhead for simple deployments
* Knative Serving dependency (optional)

## Pros and Cons of the Options

### Raw Kubernetes Deployments

* Good, because simple
* Good, because full control
* Bad, because no autoscaling logic
* Bad, because manual service mesh
* Bad, because repetitive configuration

### KServe InferenceService

* Good, because standardized API
* Good, because autoscaling
* Good, because traffic management
* Good, because Kubeflow integration
* Bad, because operator complexity
* Bad, because Knative optional dependency

### Seldon Core

* Good, because mature
* Good, because A/B testing
* Good, because explainability
* Bad, because more complex than KServe
* Bad, because heavier resource usage

### BentoML

* Good, because developer-friendly
* Good, because packaging focused
* Bad, because less Kubernetes-native
* Bad, because smaller community

### Ray Serve

* Good, because unified compute
* Good, because Python-native
* Good, because fractional GPU
* Bad, because less standardized API
* Bad, because Ray cluster overhead

## Current Configuration

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
        resources:
          limits:
            nvidia.com/gpu: 1
```
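Clients reach the service above through the V2 inference protocol. A minimal sketch of building a request: the tensor name `input-0` and the payload shape are assumptions about a particular model, while the envelope fields (`inputs`, `name`, `shape`, `datatype`, `data`) and the `/v2/models/{name}/infer` path are fixed by the protocol:

```python
import json

def v2_infer_request(model: str, data: list) -> tuple:
    """Return (path, json_body) for a V2 protocol infer call."""
    body = {
        "inputs": [
            {
                "name": "input-0",          # assumed tensor name
                "shape": [1, len(data)],
                "datatype": "FP32",
                "data": data,
            }
        ]
    }
    return f"/v2/models/{model}/infer", json.dumps(body)

path, body = v2_infer_request("whisper", [0.1, 0.2, 0.3])
assert path == "/v2/models/whisper/infer"
assert json.loads(body)["inputs"][0]["shape"] == [1, 3]
```

Because every InferenceService speaks this same envelope, swapping one backing runtime for another does not require client changes.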
## Links

* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation
decisions/0008-use-milvus-for-vectors.md (new file, +107 lines)
# Use Milvus for Vector Storage

* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting vector database for RAG system

## Context and Problem Statement

The RAG (Retrieval-Augmented Generation) system requires a vector database to store document embeddings and perform similarity search. We need to store millions of embeddings and query them with low latency.

## Decision Drivers

* Query performance (< 100ms for top-k search)
* Scalability to millions of vectors
* Kubernetes-native deployment
* Active development and community
* Support for metadata filtering
* Backup and restore capabilities

## Considered Options

* Milvus
* Pinecone (managed)
* Qdrant
* Weaviate
* pgvector (PostgreSQL extension)
* Chroma

## Decision Outcome

Chosen option: "Milvus", because it provides production-grade vector search with excellent Kubernetes support, scalability, and active development.

### Positive Consequences

* High-performance similarity search
* Horizontal scalability
* Rich filtering and hybrid search
* Helm chart for Kubernetes
* Active LF AI & Data foundation project
* GPU acceleration available

### Negative Consequences

* Complex architecture (multiple components)
* Higher resource usage than simpler alternatives
* Requires object storage (MinIO)
* Learning curve for optimization
|
||||
## Pros and Cons of the Options

### Milvus

* Good, because production-proven at scale
* Good, because rich query API
* Good, because Kubernetes-native
* Good, because hybrid search (vector + scalar)
* Good, because CNCF project
* Bad, because complex architecture
* Bad, because higher resource usage

### Pinecone

* Good, because fully managed
* Good, because simple API
* Good, because reliable
* Bad, because external dependency
* Bad, because cost at scale
* Bad, because data sovereignty concerns

### Qdrant

* Good, because simpler than Milvus
* Good, because Rust performance
* Good, because strong filtering support
* Bad, because smaller community
* Bad, because fewer enterprise features

### Weaviate

* Good, because built-in vectorization
* Good, because GraphQL API
* Good, because modules system
* Bad, because more opinionated
* Bad, because schema requirements

### pgvector

* Good, because familiar PostgreSQL
* Good, because simple deployment
* Good, because ACID transactions
* Bad, because limited scalability
* Bad, because slower for large datasets
* Bad, because no specialized optimizations

### Chroma

* Good, because simple
* Good, because embedded option
* Bad, because not production-ready at scale
* Bad, because limited features

## Links

* [Milvus](https://milvus.io)
* [Milvus Helm Chart](https://github.com/milvus-io/milvus-helm)
* Related: [DOMAIN-MODEL.md](../DOMAIN-MODEL.md) - Chunk/Embedding entities

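Since the cluster is managed through Flux GitOps, the Helm chart above would likely be deployed as a HelmRelease. The sketch below is an assumption, not the actual manifest: the release name, namespace, interval, and values should be checked against the chart's documented options (`cluster.enabled` toggles standalone vs. distributed mode):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: milvus
  namespace: ai-ml
spec:
  interval: 30m
  chart:
    spec:
      chart: milvus
      sourceRef:
        kind: HelmRepository
        name: milvus
        namespace: flux-system
  values:
    cluster:
      enabled: false   # standalone mode; flip to true to scale out
    minio:
      mode: standalone # Milvus requires object storage even in standalone mode
```

Standalone mode keeps the "complex architecture" consequence manageable for a homelab while leaving a path to the distributed topology.
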
124
decisions/0009-dual-workflow-engines.md
Normal file
@@ -0,0 +1,124 @@
# Dual Workflow Engine Strategy (Argo + Kubeflow)

* Status: accepted
* Date: 2026-01-15
* Deciders: Billy Davies
* Technical Story: Selecting workflow orchestration for ML pipelines

## Context and Problem Statement

The AI platform needs workflow orchestration for:

- ML training pipelines with caching
- Document ingestion (batch)
- Complex DAG workflows (training → evaluation → deployment)
- Hybrid scenarios combining both

Should we use one engine or leverage the strengths of multiple?

## Decision Drivers

* ML-specific features (caching, lineage)
* Complex DAG support
* Kubernetes-native execution
* Visibility and debugging
* Community and ecosystem
* Integration with existing tools

## Considered Options

* Kubeflow Pipelines only
* Argo Workflows only
* Both engines with clear use cases
* Airflow on Kubernetes
* Prefect/Dagster

## Decision Outcome

Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.

### Decision Matrix

| Use Case | Engine | Reason |
|----------|--------|--------|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |

### Positive Consequences

* Best tool for each job
* ML pipelines get proper caching
* Complex workflows get better DAG support
* Can integrate via Argo Events
* Gradual migration possible

### Negative Consequences

* Two systems to maintain
* Team needs to learn both
* More complex debugging
* Integration overhead

## Integration Architecture

```
NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        │
                                        └──► Kubeflow Pipeline (via API)

OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                  (WorkflowTemplate)
```

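The `kfp-trigger` step in the second path could be expressed as a WorkflowTemplate along the lines of the sketch below. The namespace, image, KFP endpoint, and parameter name are assumptions, not taken from the actual llm-workflows repository:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: kfp-trigger
  namespace: workflows           # hypothetical namespace
spec:
  templates:
    - name: trigger-pipeline
      inputs:
        parameters:
          - name: pipeline-package   # path to a compiled KFP pipeline YAML
      container:
        image: python:3.12-slim      # assumes the kfp SDK is baked into the image
        command: [python, -c]
        args:
          - |
            import kfp
            # Endpoint of the in-cluster KFP API service (assumed default install)
            client = kfp.Client(host="http://ml-pipeline.kubeflow:8888")
            client.create_run_from_pipeline_package(
                "{{inputs.parameters.pipeline-package}}",
                arguments={},
            )
```

This keeps Argo as the orchestrator while delegating ML steps, which is exactly the "Hybrid ML training" row of the decision matrix.
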
## Pros and Cons of the Options

### Kubeflow Pipelines only

* Good, because ML-focused
* Good, because caching
* Good, because experiment tracking
* Bad, because limited DAG features
* Bad, because less flexible control flow

### Argo Workflows only

* Good, because powerful DAG support
* Good, because flexible
* Good, because great debugging
* Bad, because no ML caching
* Bad, because no experiment tracking

### Both engines (chosen)

* Good, because best of both
* Good, because the appropriate tool for each job
* Good, because the engines can integrate
* Bad, because operational complexity
* Bad, because learning two systems

### Airflow

* Good, because mature
* Good, because large community
* Bad, because Python-centric
* Bad, because not Kubernetes-native
* Bad, because no ML features

### Prefect/Dagster

* Good, because modern design
* Good, because Python-native
* Bad, because less Kubernetes-native
* Bad, because newer and less proven

## Links

* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
* [Argo Workflows](https://argoproj.github.io/workflows/)
* [Argo Events](https://argoproj.github.io/events/)
* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml)

120
decisions/0010-use-envoy-gateway.md
Normal file
@@ -0,0 +1,120 @@
# Use Envoy Gateway for Ingress

* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting ingress controller for cluster

## Context and Problem Statement

We need an ingress solution that supports:

- Gateway API (the modern Kubernetes standard)
- gRPC for ML inference
- WebSocket for real-time chat/voice
- Header-based routing for A/B testing
- TLS termination

## Decision Drivers

* Gateway API support (HTTPRoute, GRPCRoute)
* WebSocket support
* gRPC support
* Performance at edge
* Active development
* Envoy ecosystem familiarity

## Considered Options

* NGINX Ingress Controller
* Traefik
* Envoy Gateway
* Istio Gateway
* Contour

## Decision Outcome

Chosen option: "Envoy Gateway", because it is the reference implementation of the Gateway API and exposes the full Envoy feature set.

### Positive Consequences

* Native Gateway API support
* Full Envoy feature set
* Native WebSocket and gRPC support
* No Istio complexity
* Built on Envoy, a CNCF graduated project
* Easy integration with observability tooling

### Negative Consequences

* Newer than alternatives
* Less documentation than NGINX
* Envoy configuration learning curve

## Pros and Cons of the Options

### NGINX Ingress

* Good, because mature
* Good, because well-documented
* Good, because familiar
* Bad, because limited Gateway API support
* Bad, because some features are gated behind the commercial edition

### Traefik

* Good, because auto-discovery
* Good, because good UI
* Good, because Let's Encrypt integration
* Bad, because Gateway API support is experimental
* Bad, because less gRPC focus

### Envoy Gateway

* Good, because Gateway API native
* Good, because full Envoy feature set
* Good, because extensible
* Good, because native gRPC/WebSocket support
* Bad, because newer project
* Bad, because less community content

### Istio Gateway

* Good, because full mesh features
* Good, because Gateway API support
* Bad, because overkill without a mesh
* Bad, because resource-heavy

### Contour

* Good, because Envoy-based
* Good, because lightweight
* Bad, because Gateway API support still evolving
* Bad, because smaller community

## Configuration Example

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: companions-chat
spec:
  parentRefs:
    - name: eg-gateway
      namespace: network
  hostnames:
    - companions-chat.lab.daviestechlabs.io
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: companions-chat
          port: 8080
```

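Header-based routing for A/B testing, one of the decision drivers above, can be expressed in the same HTTPRoute form. In this sketch the `x-variant` header and the canary Service are hypothetical, not part of the current deployment:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: companions-chat-ab
spec:
  parentRefs:
    - name: eg-gateway
      namespace: network
  hostnames:
    - companions-chat.lab.daviestechlabs.io
  rules:
    - matches:
        - headers:
            - name: x-variant          # hypothetical A/B selector header
              value: canary
      backendRefs:
        - name: companions-chat-canary # hypothetical canary Service
          port: 8080
    - backendRefs:                     # default rule for all other traffic
        - name: companions-chat
          port: 8080
```

Requests carrying `x-variant: canary` hit the canary backend; everything else falls through to the default rule.
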
## Links

* [Envoy Gateway](https://gateway.envoyproxy.io)
* [Gateway API](https://gateway-api.sigs.k8s.io)