feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories and kubectl cluster-info dump data.
2026-02-01 14:30:05 -05:00
parent 4d4f6f464c
commit 832cda34bd
26 changed files with 3805 additions and 2 deletions
--- a/CODING-CONVENTIONS.md
+++ b/CODING-CONVENTIONS.md
@@ -0,0 +1,424 @@
+# 📐 Coding Conventions
+
+> **Patterns, practices, and folder structure conventions for DaviesTechLabs repositories**
+
+## Repository Conventions
+
+### homelab-k8s2 (Infrastructure)
+
+```
+kubernetes/
+├── apps/                    # Application deployments
+│   └── {namespace}/         # One folder per namespace
+│       └── {app}/           # One folder per application
+│           ├── app/         # Kubernetes manifests
+│           │   ├── kustomization.yaml
+│           │   ├── helmrelease.yaml   # OR individual manifests
+│           │   └── ...
+│           └── ks.yaml      # Flux Kustomization
+├── components/              # Reusable Kustomize components
+└── flux/                    # Flux system configuration
+```
+
+**Naming Conventions:**
+- Namespaces: lowercase with hyphens (`ai-ml`, `cert-manager`)
+- Apps: lowercase with hyphens (`chat-handler`, `voice-assistant`)
+- Secrets: `{app}-{type}` (e.g., `milvus-credentials`)
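The naming rules above can be checked mechanically. The following is a minimal sketch, not part of either repository: `is_valid_name` and `secret_name` are hypothetical helpers, and the 63-character cap assumes app names follow the same RFC 1123 label rules that Kubernetes applies to namespace names.

```python
import re

# Lowercase alphanumerics separated by single hyphens (RFC 1123 label style).
NAME_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def is_valid_name(name: str) -> bool:
    """Return True for lowercase, hyphen-separated names of at most 63 characters."""
    return len(name) <= 63 and NAME_RE.fullmatch(name) is not None


def secret_name(app: str, kind: str) -> str:
    """Build a secret name following the {app}-{type} convention (hypothetical helper)."""
    name = f"{app}-{kind}"
    if not is_valid_name(name):
        raise ValueError(f"invalid secret name: {name!r}")
    return name
```

For example, `secret_name("milvus", "credentials")` yields the `milvus-credentials` name used above, while a name such as `Chat_Handler` is rejected.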
+
+### llm-workflows (Orchestration)
+
+```
+workflows/                   # Kubernetes Deployments for NATS handlers
+├── {handler}.yaml           # One file per handler
+
+argo/                        # Argo WorkflowTemplates
+├── {workflow-name}.yaml     # One file per workflow
+
+pipelines/                   # Kubeflow Pipeline Python files
+├── {pipeline}_pipeline.py   # Pipeline definition
+└── kfp-sync-job.yaml       # Upload job
+
+{handler}/                   # Python source code
+├── __init__.py
+├── {handler}.py            # Main entry point
+├── requirements.txt
+└── Dockerfile
+```
+
+---
+
+## Python Conventions
+
+### Project Structure
+
+```python
+from dataclasses import dataclass
+
+from nats.aio.msg import Msg
+
+# Use async/await for I/O
+async def handle_message(msg: Msg) -> None:
+    ...
+
+# Use dataclasses for structured data
+@dataclass
+class ChatRequest:
+    user_id: str
+    message: str
+    enable_rag: bool = True
+
+# Use msgpack for NATS messages
+import msgpack
+data = msgpack.packb({"key": "value"})
+```
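The dataclass and MessagePack conventions meet at the point where a request becomes a message payload. A possible bridge is sketched below using only the standard library. `to_payload` and `from_payload` are hypothetical helper names, and the actual `msgpack.packb` / `msgpack.unpackb` calls are left out so the sketch runs without third-party packages.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class ChatRequest:
    user_id: str
    message: str
    enable_rag: bool = True


def to_payload(request: ChatRequest) -> Dict[str, Any]:
    """Convert the dataclass to a plain dict that a serializer can pack."""
    return asdict(request)


def from_payload(data: Dict[str, Any]) -> ChatRequest:
    """Rebuild the dataclass from a decoded dict; unknown keys raise TypeError."""
    return ChatRequest(**data)
```

The dict returned by `to_payload` contains only strings and booleans, so it can be passed to `msgpack.packb` as shown above. Omitted fields fall back to the dataclass defaults on the way back in.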
+
+### Naming
+
+| Element | Convention | Example |
+|---------|------------|---------|
+| Files | snake_case | `chat_handler.py` |
+| Classes | PascalCase | `ChatHandler` |
+| Functions | snake_case | `process_message` |
+| Constants | UPPER_SNAKE | `NATS_URL` |
+| Private | Leading underscore | `_internal_method` |
+
+### Type Hints
+
+```python
+# Always use type hints
+from typing import Optional, List, Dict, Any
+
+async def query_rag(
+    query: str,
+    collection: str = "knowledge_base",
+    top_k: int = 5,
+) -> List[Dict[str, Any]]:
+    ...
+```
+
+### Error Handling
+
+```python
+# Use specific exceptions
+class RAGQueryError(Exception):
+    """Raised when a RAG query fails."""
+
+# Log errors with context
+import logging
+logger = logging.getLogger(__name__)
+
+try:
+    result = await milvus.search(...)
+except Exception as e:
+    logger.error(f"RAG query failed: {e}", extra={"query": query})
+    raise RAGQueryError(f"Failed to query collection {collection}") from e
+```
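To show what `raise ... from e` preserves, here is a self-contained sketch of the same pattern. `failing_search`, `query_collection`, and `run_demo` are hypothetical names, and the simulated `ConnectionError` merely stands in for whatever the vector store client would raise.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class RAGQueryError(Exception):
    """Raised when a RAG query fails."""


async def failing_search(query: str) -> list:
    # Hypothetical stand-in for a vector store call that fails.
    raise ConnectionError("vector store unreachable")


async def query_collection(collection: str, query: str) -> list:
    try:
        return await failing_search(query)
    except Exception as e:
        # Log with context, then re-raise as the domain-specific exception.
        logger.error("RAG query failed: %s", e, extra={"query": query})
        raise RAGQueryError(f"Failed to query collection {collection}") from e


def run_demo() -> RAGQueryError:
    """Run the failing query and return the exception that callers would see."""
    try:
        asyncio.run(query_collection("knowledge_base", "what is NATS?"))
    except RAGQueryError as err:
        return err
    raise AssertionError("expected RAGQueryError")
```

The original failure stays reachable through `__cause__`, so callers can catch the specific `RAGQueryError` while the traceback still shows the underlying error.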
+
+### NATS Message Handling
+
+```python
+import logging
+
+import msgpack
+from nats.aio.msg import Msg
+
+logger = logging.getLogger(__name__)
+
+async def message_handler(msg: Msg) -> None:
+    try:
+        # Decode MessagePack
+        data = msgpack.unpackb(msg.data, raw=False)
+        
+        # Process
+        result = await process(data)
+        
+        # Reply if request-reply pattern
+        if msg.reply:
+            await msg.respond(msgpack.packb(result))
+        
+        # Acknowledge for JetStream
+        await msg.ack()
+        
+    except Exception as e:
+        logger.error(f"Handler error: {e}")
+        # NAK for retry (JetStream)
+        await msg.nak()
+```
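The ack/nak control flow of the handler above can be exercised without a NATS server. The sketch below is an assumption-laden test double, not the nats-py API: `FakeMsg` only records which methods were called, and MessagePack decoding is omitted so that it runs on the standard library alone.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List


@dataclass
class FakeMsg:
    """Hypothetical test double that records calls instead of talking to NATS."""
    data: bytes
    reply: str = ""
    calls: List[str] = field(default_factory=list)

    async def respond(self, payload: bytes) -> None:
        self.calls.append("respond")

    async def ack(self) -> None:
        self.calls.append("ack")

    async def nak(self) -> None:
        self.calls.append("nak")


async def handle(msg: FakeMsg, process: Callable[[bytes], Awaitable[bytes]]) -> None:
    """Same control flow as the handler above: respond, then ack; nak on any error."""
    try:
        result = await process(msg.data)
        if msg.reply:
            await msg.respond(result)
        await msg.ack()
    except Exception:
        await msg.nak()


async def ok(data: bytes) -> bytes:
    return data.upper()


async def boom(data: bytes) -> bytes:
    raise ValueError("bad payload")


def run(process: Callable[[bytes], Awaitable[bytes]], reply: str = "") -> List[str]:
    """Drive the handler once and return the recorded calls."""
    msg = FakeMsg(b"hello", reply=reply)
    asyncio.run(handle(msg, process))
    return msg.calls
```

A successful run records `ack` (preceded by `respond` when a reply subject is set), and a failing processor records only `nak`, which is the retry path the convention describes.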
+
+---
+
+## Kubernetes Manifest Conventions
+
+### Labels
+
+```yaml
+metadata:
+  labels:
+    # Required
+    app.kubernetes.io/name: chat-handler
+    app.kubernetes.io/instance: chat-handler
+    app.kubernetes.io/component: handler
+    app.kubernetes.io/part-of: ai-platform
+    
+    # Optional
+    app.kubernetes.io/version: "1.0.0"
+    app.kubernetes.io/managed-by: flux
+```
+
+### Annotations
+
+```yaml
+metadata:
+  annotations:
+    # Reloader for config changes
+    reloader.stakater.com/auto: "true"
+    
+    # Documentation
+    description: "Handles chat messages via NATS"
+```
+
+### Resource Requests
+
+```yaml
+resources:
+  requests:
+    cpu: 100m
+    memory: 256Mi
+  limits:
+    cpu: 500m
+    memory: 512Mi
+    
+# GPU workloads: request the resource for one vendor per container
+resources:
+  limits:
+    amd.com/gpu: 1          # AMD nodes
+    # nvidia.com/gpu: 1     # NVIDIA nodes (use instead of amd.com/gpu)
+```
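A quick sanity check for the request/limit pairs above is that no request exceeds its limit. The sketch below is illustrative only: `parse_quantity` and `requests_within_limits` are hypothetical helpers that handle just the `m`, `Ki`, `Mi`, and `Gi` suffixes, not the full Kubernetes quantity grammar.

```python
from decimal import Decimal
from typing import Dict

# Only a subset of Kubernetes quantity suffixes is handled in this sketch.
_MULTIPLIERS = {
    "Ki": Decimal(1024),
    "Mi": Decimal(1024**2),
    "Gi": Decimal(1024**3),
    "m": Decimal("0.001"),
}


def parse_quantity(quantity: str) -> Decimal:
    """Convert strings such as '100m' or '256Mi' to a plain number."""
    for suffix, multiplier in _MULTIPLIERS.items():
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * multiplier
    return Decimal(quantity)


def requests_within_limits(requests: Dict[str, str], limits: Dict[str, str]) -> bool:
    """Return True when every request that has a matching limit does not exceed it."""
    return all(
        parse_quantity(value) <= parse_quantity(limits[name])
        for name, value in requests.items()
        if name in limits
    )
```

With the values from the manifest above, `requests_within_limits({"cpu": "100m", "memory": "256Mi"}, {"cpu": "500m", "memory": "512Mi"})` holds. Swapping the two dicts makes the check fail.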
+
+### Health Checks
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /health
+    port: 8080
+  initialDelaySeconds: 10
+  periodSeconds: 30
+
+readinessProbe:
+  httpGet:
+    path: /ready
+    port: 8080
+  initialDelaySeconds: 5
+  periodSeconds: 10
+```
+
+---
+
+## Flux/GitOps Conventions
+
+### Kustomization Structure
+
+```yaml
+# ks.yaml - Flux Kustomization
+apiVersion: kustomize.toolkit.fluxcd.io/v1
+kind: Kustomization
+metadata:
+  name: &app chat-handler
+  namespace: flux-system
+spec:
+  targetNamespace: ai-ml
+  commonMetadata:
+    labels:
+      app.kubernetes.io/name: *app
+  path: ./kubernetes/apps/ai-ml/chat-handler/app
+  prune: true
+  sourceRef:
+    kind: GitRepository
+    name: flux-system
+  wait: true
+  interval: 30m
+  retryInterval: 1m
+  timeout: 5m
+```
+
+### HelmRelease Structure
+
+```yaml
+apiVersion: helm.toolkit.fluxcd.io/v2
+kind: HelmRelease
+metadata:
+  name: milvus
+spec:
+  interval: 30m
+  chart:
+    spec:
+      chart: milvus
+      version: 4.x.x
+      sourceRef:
+        kind: HelmRepository
+        name: milvus
+        namespace: flux-system
+  values:
+    # Values here
+```
+
+### Secret References
+
+```yaml
+# Never hardcode secrets
+env:
+  - name: DATABASE_PASSWORD
+    valueFrom:
+      secretKeyRef:
+        name: postgres-credentials
+        key: password
+```
+
+---
+
+## NATS Subject Conventions
+
+### Hierarchy
+
+```
+ai.{domain}.{scope}[.{id}][.{action}]
+
+Examples:
+ai.chat.user.{userId}.message      # User chat message
+ai.chat.response.{requestId}       # Chat response
+ai.voice.user.{userId}.request     # Voice request
+ai.pipeline.trigger                # Pipeline trigger
+```
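Subjects following this hierarchy can be assembled from tokens rather than by string concatenation. This is a minimal sketch: `build_subject` is a hypothetical helper, and the token character set is an assumption that is deliberately stricter than what NATS itself accepts, so that IDs can never contain `.`, `*`, or `>`.

```python
import re

# Assumed token alphabet: letters, digits, underscore, hyphen.
_TOKEN_RE = re.compile(r"^[A-Za-z0-9_-]+$")


def build_subject(*tokens: str) -> str:
    """Join tokens under the 'ai' root, rejecting separators and wildcards."""
    for token in tokens:
        if not _TOKEN_RE.fullmatch(token):
            raise ValueError(f"invalid subject token: {token!r}")
    return ".".join(("ai", *tokens))
```

For instance, `build_subject("chat", "user", "42", "message")` produces `ai.chat.user.42.message`, while a user ID containing a dot raises `ValueError` instead of silently adding a hierarchy level.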
+
+### Wildcards
+
+```
+ai.chat.>                   # All chat events
+ai.chat.user.*.message      # All user messages
+ai.*.response.{id}          # Any response type
+```
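The wildcard semantics can be made concrete with a small matcher: `*` matches exactly one token, and `>` matches one or more trailing tokens and is only valid as the last token. This is a sketch for reasoning about subscriptions, not the matching code NATS uses. `subject_matches` is a hypothetical name.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True when a concrete subject matches a pattern with * and > wildcards."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the final token and needs at least one token left to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    # Without '>', pattern and subject must have the same number of tokens.
    return len(p_tokens) == len(s_tokens)
```

Under these rules `ai.chat.>` matches `ai.chat.user.42.message` but not the bare `ai.chat`, and `ai.chat.user.*.message` does not match a subject with two tokens between `user` and `message`.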
+
+---
+
+## Git Conventions
+
+### Commit Messages
+
+```
+type(scope): subject
+
+body (optional)
+
+footer (optional)
+```
+
+**Types:**
+- `feat`: New feature
+- `fix`: Bug fix
+- `docs`: Documentation
+- `style`: Formatting
+- `refactor`: Code restructuring
+- `test`: Tests
+- `chore`: Maintenance
+
+**Examples:**
+```
+feat(chat-handler): add streaming response support
+fix(voice): handle empty audio gracefully
+docs(adr): add decision for MessagePack format
+```
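The header format lends itself to a simple check, for example in a commit hook. The sketch below is one possible reading of the convention: `parse_header` is a hypothetical helper, and treating the scope as optional is an assumption based on this commit's own `feat:` header.

```python
import re
from typing import Dict, Optional

TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# type, optional (scope), colon, space, then a non-empty subject.
HEADER_RE = re.compile(
    r"(?P<type>" + "|".join(TYPES) + r")"
    r"(?:\((?P<scope>[a-z0-9-]+)\))?"
    r": (?P<subject>\S.*)"
)


def parse_header(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the header parts, or None when the line does not follow the format."""
    match = HEADER_RE.fullmatch(line)
    return match.groupdict() if match else None
```

A header such as `fix(voice): handle empty audio gracefully` parses into its type, scope, and subject, whereas a free-form line like `update stuff` yields `None`.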
+
+### Branch Naming
+
+```
+feature/short-description
+fix/issue-number-description
+docs/what-changed
+```
+
+---
+
+## Configuration Conventions
+
+### Environment Variables
+
+```python
+# Use pydantic-settings or similar
+from pydantic_settings import BaseSettings, SettingsConfigDict
+
+class Settings(BaseSettings):
+    # No prefix: the nats_url field is read from the NATS_URL environment variable
+    model_config = SettingsConfigDict(env_prefix="")
+
+    nats_url: str = "nats://localhost:4222"
+    vllm_url: str = "http://localhost:8000"
+    milvus_host: str = "localhost"
+    milvus_port: int = 19530
+    log_level: str = "INFO"
+```
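The mapping from field names to environment variables can be illustrated without pydantic. The following standard-library sketch is a stand-in, not how pydantic-settings is implemented: `load_settings` is a hypothetical helper that upper-cases each field name, looks it up, and casts the value to the type of the field's default.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    nats_url: str = "nats://localhost:4222"
    milvus_port: int = 19530
    log_level: str = "INFO"


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for f in fields(Settings):
        raw = env.get(f.name.upper())  # nats_url -> NATS_URL
        if raw is not None:
            # Cast to the type of the default (str or int in this sketch).
            overrides[f.name] = type(f.default)(raw)
    return Settings(**overrides)
```

Passing a plain dict instead of `os.environ` keeps the function easy to test: `load_settings({"MILVUS_PORT": "19531"})` returns a `Settings` whose `milvus_port` is the integer `19531`.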
+
+### ConfigMaps
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: ai-services-config
+data:
+  NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
+  VLLM_URL: "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
+  # ... other non-sensitive config
+```
+
+---
+
+## Documentation Conventions
+
+### ADR Format
+
+See [decisions/0000-template.md](decisions/0000-template.md)
+
+### Code Comments
+
+```python
+from typing import Dict, List
+
+# Use docstrings for public functions
+async def query_rag(query: str) -> List[Dict]:
+    """
+    Query the RAG system for relevant documents.
+    
+    Args:
+        query: The search query string
+        
+    Returns:
+        List of document chunks with scores
+        
+    Raises:
+        RAGQueryError: If the query fails
+    """
+    ...
+```
+
+### README Files
+
+Each application should have a README with:
+1. Purpose
+2. Configuration
+3. Deployment
+4. Local development
+5. API documentation (if applicable)
+
+---
+
+## Anti-Patterns to Avoid
+
+| Don't | Do Instead |
+|-------|------------|
+| `kubectl apply` directly | Commit to Git, let Flux deploy |
+| Hardcode secrets | Use External Secrets Operator |
+| Use `latest` image tags | Pin to specific versions |
+| Skip health checks | Always define liveness/readiness |
+| Ignore resource limits | Set appropriate requests/limits |
+| Use JSON for NATS messages | Use MessagePack (binary) |
+| Synchronous I/O in handlers | Use async/await |
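Some of these rules are easy to lint. As one example, the image tag rule can be sketched as below. `is_pinned` is a hypothetical helper, the image names in the usage note are made up, and the check only distinguishes "explicit non-`latest` tag or digest" from everything else.

```python
def is_pinned(image: str) -> bool:
    """Return True when an image reference has a digest or an explicit non-latest tag."""
    if "@sha256:" in image:
        return True
    # Look only at the last path segment so a registry port is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag at all resolves to 'latest'
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")
```

For example, `ghcr.io/example/chat-handler:1.0.0` passes, while `ghcr.io/example/chat-handler:latest` and the untagged `registry.local:5000/chat-handler` do not.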
+
+---
+
+## Related Documents
+
+- [TECH-STACK.md](TECH-STACK.md) - Technologies used
+- [ARCHITECTURE.md](ARCHITECTURE.md) - System design
+- [decisions/](decisions/) - Why we made certain choices