Billy D. 832cda34bd feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories
and kubectl cluster-info dump data.

2026-02-01 14:30:05 -05:00

📐 Coding Conventions

Patterns, practices, and folder structure conventions for DaviesTechLabs repositories

Repository Conventions

homelab-k8s2 (Infrastructure)

kubernetes/
├── apps/                    # Application deployments
│   └── {namespace}/         # One folder per namespace
│       └── {app}/           # One folder per application
│           ├── app/         # Kubernetes manifests
│           │   ├── kustomization.yaml
│           │   ├── helmrelease.yaml   # OR individual manifests
│           │   └── ...
│           └── ks.yaml      # Flux Kustomization
├── components/              # Reusable Kustomize components
└── flux/                    # Flux system configuration

Naming Conventions:

Namespaces: lowercase with hyphens (ai-ml, cert-manager)
Apps: lowercase with hyphens (chat-handler, voice-assistant)
Secrets: {app}-{type} (e.g., milvus-credentials)
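
The naming rules above can be checked mechanically. The following is a minimal sketch, assuming "lowercase with hyphens" means lowercase alphanumeric words joined by single hyphens; `NAME_RE`, `is_valid_name`, and `secret_name` are illustrative names, not code from either repository.

```python
import re

# Assumption: "lowercase with hyphens" means lowercase alphanumeric words
# joined by single hyphens, e.g. "ai-ml" or "chat-handler".
NAME_RE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def is_valid_name(name: str) -> bool:
    """Return True if a namespace or app name matches the convention."""
    return NAME_RE.fullmatch(name) is not None


def secret_name(app: str, secret_type: str) -> str:
    """Build a secret name following the {app}-{type} pattern."""
    if not (is_valid_name(app) and is_valid_name(secret_type)):
        raise ValueError(f"invalid name parts: {app!r}, {secret_type!r}")
    return f"{app}-{secret_type}"
```

For example, `secret_name("milvus", "credentials")` yields `milvus-credentials`, while a name such as `AI_ML` is rejected by `is_valid_name`.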

llm-workflows (Orchestration)

workflows/                   # Kubernetes Deployments for NATS handlers
└── {handler}.yaml           # One file per handler

argo/                        # Argo WorkflowTemplates
└── {workflow-name}.yaml     # One file per workflow

pipelines/                   # Kubeflow Pipeline Python files
├── {pipeline}_pipeline.py   # Pipeline definition
└── kfp-sync-job.yaml        # Upload job

{handler}/                   # Python source code
├── __init__.py
├── {handler}.py             # Main entry point
├── requirements.txt
└── Dockerfile

Python Conventions

Project Structure

from dataclasses import dataclass

import msgpack
from nats.aio.msg import Msg

# Use async/await for I/O
async def handle_message(msg: Msg) -> None:
    ...

# Use dataclasses for structured data
@dataclass
class ChatRequest:
    user_id: str
    message: str
    enable_rag: bool = True

# Use msgpack for NATS messages
data = msgpack.packb({"key": "value"})

Naming

| Element | Convention | Example |
|---------|------------|---------|
| Files | snake_case | `chat_handler.py` |
| Classes | PascalCase | `ChatHandler` |
| Functions | snake_case | `process_message` |
| Constants | UPPER_SNAKE | `NATS_URL` |
| Private | Leading underscore | `_internal_method` |

Type Hints

# Always use type hints
from typing import Any, Dict, List

async def query_rag(
    query: str,
    collection: str = "knowledge_base",
    top_k: int = 5,
) -> List[Dict[str, Any]]:
    ...

Error Handling

# Use specific exceptions
class RAGQueryError(Exception):
    """Raised when a RAG query fails."""

# Log errors with context
import logging
logger = logging.getLogger(__name__)

try:
    result = await milvus.search(...)
except Exception as e:
    logger.error("RAG query failed: %s", e, extra={"query": query})
    raise RAGQueryError(f"Failed to query collection {collection}") from e
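
The fragment above relies on an enclosing coroutine and a Milvus client. A self-contained sketch of the same wrap-and-re-raise pattern follows; `fake_search` is a hypothetical stand-in that always fails, and `run_demo` exists only to make the exception chain easy to inspect.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class RAGQueryError(Exception):
    """Raised when a RAG query fails."""


async def fake_search(query: str) -> list[dict]:
    # Hypothetical stand-in for the vector store call; it always fails.
    raise ConnectionError("vector store unreachable")


async def query_rag(query: str, collection: str = "knowledge_base") -> list[dict]:
    try:
        return await fake_search(query)
    except Exception as e:
        # Log with context, then re-raise as a domain-specific error.
        logger.error("RAG query failed: %s", e, extra={"query": query})
        raise RAGQueryError(f"Failed to query collection {collection}") from e


def run_demo() -> RAGQueryError:
    """Run the failing query and return the wrapped exception."""
    try:
        asyncio.run(query_rag("what is NATS?"))
    except RAGQueryError as err:
        return err
    raise AssertionError("expected RAGQueryError")
```

Because of `raise ... from e`, the original `ConnectionError` stays available as `__cause__` on the `RAGQueryError`, so callers can catch the specific type without losing the root cause.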

NATS Message Handling

import logging

import msgpack
import nats
from nats.aio.msg import Msg

logger = logging.getLogger(__name__)

# process() is the handler's own coroutine, defined elsewhere.
async def message_handler(msg: Msg) -> None:
    try:
        # Decode MessagePack
        data = msgpack.unpackb(msg.data, raw=False)

        # Process
        result = await process(data)

        # Reply if request-reply pattern
        if msg.reply:
            await msg.respond(msgpack.packb(result))

        # Acknowledge for JetStream
        # (ack()/nak() apply to JetStream deliveries only)
        await msg.ack()

    except Exception as e:
        logger.error(f"Handler error: {e}")
        # NAK for retry (JetStream)
        await msg.nak()
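
The ack-on-success, NAK-on-failure flow can be exercised without a NATS server. This is a sketch only: `FakeMsg` is a hypothetical test double that records calls, and `handle`, `ok`, `boom`, and `outcome` are illustrative names. It is not the nats-py API.

```python
import asyncio
from typing import Any, Awaitable, Callable


class FakeMsg:
    """Hypothetical test double exposing only ack() and nak()."""

    def __init__(self, data: Any) -> None:
        self.data = data
        self.events: list[str] = []

    async def ack(self) -> None:
        self.events.append("ack")

    async def nak(self) -> None:
        self.events.append("nak")


async def handle(msg: FakeMsg, process: Callable[[Any], Awaitable[Any]]) -> None:
    """Ack only after processing succeeds; NAK so the message is redelivered."""
    try:
        await process(msg.data)
    except Exception:
        await msg.nak()
    else:
        await msg.ack()


async def ok(data: Any) -> Any:
    return data


async def boom(data: Any) -> Any:
    raise ValueError("bad payload")


def outcome(process: Callable[[Any], Awaitable[Any]]) -> list[str]:
    """Run one message through the handler and report what it recorded."""
    msg = FakeMsg({"key": "value"})
    asyncio.run(handle(msg, process))
    return msg.events
```

Placing `ack()` in the `else` branch is a deliberate choice in this sketch: a failure while acknowledging is not caught by the `except` block, so it cannot trigger a NAK on a message that was already processed.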

Kubernetes Manifest Conventions

Labels

metadata:
  labels:
    # Required
    app.kubernetes.io/name: chat-handler
    app.kubernetes.io/instance: chat-handler
    app.kubernetes.io/component: handler
    app.kubernetes.io/part-of: ai-platform
    
    # Optional
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/managed-by: flux
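
When manifests are generated or linted from Python, the label set above can be built in one place. A small sketch, assuming the instance label defaults to the app name as in the example; `standard_labels` is a hypothetical helper.

```python
from typing import Optional


def standard_labels(
    name: str,
    component: str,
    part_of: str,
    instance: Optional[str] = None,
    version: Optional[str] = None,
    managed_by: Optional[str] = None,
) -> dict[str, str]:
    """Build the required labels and add optional ones only when provided."""
    labels = {
        "app.kubernetes.io/name": name,
        # Assumption: instance defaults to the app name, as in the example.
        "app.kubernetes.io/instance": instance or name,
        "app.kubernetes.io/component": component,
        "app.kubernetes.io/part-of": part_of,
    }
    if version is not None:
        labels["app.kubernetes.io/version"] = version
    if managed_by is not None:
        labels["app.kubernetes.io/managed-by"] = managed_by
    return labels
```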

Annotations

metadata:
  annotations:
    # Reloader for config changes
    reloader.stakater.com/auto: "true"
    
    # Documentation
    description: "Handles chat messages via NATS"

Resource Requests

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
    
# GPU workloads: request the key that matches the node's GPU vendor
resources:
  limits:
    amd.com/gpu: 1        # AMD
    nvidia.com/gpu: 1     # NVIDIA
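
A quick sanity check that requests do not exceed limits can catch copy-paste mistakes before a manifest is committed. The sketch below handles only the suffixes used in this document plus `Ki` and `Gi`; `parse_quantity` and `requests_within_limits` are hypothetical helpers, not a full implementation of Kubernetes quantity parsing.

```python
from fractions import Fraction

# Assumption: only these suffixes are needed ("m" = milli, binary Ki/Mi/Gi).
_MULTIPLIERS = {
    "m": Fraction(1, 1000),
    "Ki": Fraction(2**10),
    "Mi": Fraction(2**20),
    "Gi": Fraction(2**30),
}


def parse_quantity(value: str) -> Fraction:
    """Convert a quantity such as '100m' or '256Mi' to an exact number."""
    for suffix, factor in _MULTIPLIERS.items():
        if value.endswith(suffix):
            return Fraction(value[: -len(suffix)]) * factor
    return Fraction(value)


def requests_within_limits(resources: dict) -> bool:
    """Return True if every request that has a matching limit is <= that limit."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    return all(
        parse_quantity(str(requests[key])) <= parse_quantity(str(limits[key]))
        for key in requests
        if key in limits
    )
```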

Health Checks

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
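
The probes above expect the container to serve `/health` and `/ready` on port 8080. One way a NATS handler without a web framework could do that is a small standard-library server on a background thread. This is a sketch under assumptions: `READY`, `probe_status`, `ProbeHandler`, and `start_probe_server` are hypothetical names, and the readiness rule (ready once the flag is set) is illustrative.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Assumption: the application sets this once its dependencies are connected.
READY = threading.Event()


def probe_status(path: str, ready: bool) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/health":
        return 200  # the process is alive
    if path == "/ready":
        return 200 if ready else 503  # not ready until dependencies are up
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(probe_status(self.path, READY.is_set()))
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format: str, *args: object) -> None:
        """Suppress per-request logging; probes fire repeatedly."""


def start_probe_server(port: int = 8080) -> ThreadingHTTPServer:
    """Serve the probe endpoints on a daemon thread and return the server."""
    server = ThreadingHTTPServer(("", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A handler would call `start_probe_server()` at startup and `READY.set()` after its connections are established. Until then the readiness probe receives 503 and the pod is kept out of service without being restarted.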

Flux/GitOps Conventions

Kustomization Structure

# ks.yaml - Flux Kustomization
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: &app chat-handler
  namespace: flux-system
spec:
  targetNamespace: ai-ml
  commonMetadata:
    labels:
      app.kubernetes.io/name: *app
  path: ./kubernetes/apps/ai-ml/chat-handler/app
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  wait: true
  interval: 30m
  retryInterval: 1m
  timeout: 5m
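
The `path` in `ks.yaml` follows directly from the repository layout (`kubernetes/apps/{namespace}/{app}/app`). The sketch below derives the whole object from a namespace and app name; `flux_kustomization` is a hypothetical helper that mirrors the example's field values, not a tool from the repository.

```python
import json


def flux_kustomization(namespace: str, app: str, interval: str = "30m") -> dict:
    """Build a Flux Kustomization object for one app, mirroring the example."""
    return {
        "apiVersion": "kustomize.toolkit.fluxcd.io/v1",
        "kind": "Kustomization",
        "metadata": {"name": app, "namespace": "flux-system"},
        "spec": {
            "targetNamespace": namespace,
            "commonMetadata": {"labels": {"app.kubernetes.io/name": app}},
            # The path is derived from the folder convention.
            "path": f"./kubernetes/apps/{namespace}/{app}/app",
            "prune": True,
            "sourceRef": {"kind": "GitRepository", "name": "flux-system"},
            "wait": True,
            "interval": interval,
            "retryInterval": "1m",
            "timeout": "5m",
        },
    }


def render(namespace: str, app: str) -> str:
    """Serialize to JSON, which YAML parsers also accept."""
    return json.dumps(flux_kustomization(namespace, app), indent=2)
```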

HelmRelease Structure

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: milvus
spec:
  interval: 30m
  chart:
    spec:
      chart: milvus
      version: 4.x.x
      sourceRef:
        kind: HelmRepository
        name: milvus
        namespace: flux-system
  values:
    # Values here

Secret References

# Never hardcode secrets
env:
  - name: DATABASE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password

NATS Subject Conventions

Hierarchy

ai.{domain}.{scope}.{action}

Examples:
ai.chat.user.{userId}.message      # User chat message
ai.chat.response.{requestId}       # Chat response
ai.voice.user.{userId}.request     # Voice request
ai.pipeline.trigger                # Pipeline trigger

Wildcards

ai.chat.>                   # All chat events
ai.chat.user.*.message      # All user messages
ai.*.response.{id}          # Any response type
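
To make the wildcard semantics concrete: `*` matches exactly one token, and `>` matches one or more trailing tokens. The sketch below reimplements that matching for illustration. The NATS server performs this itself; `subject_matches` and `chat_message_subject` are hypothetical helpers.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a subject matches a pattern with '*' and '>' wildcards."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, token in enumerate(p_tokens):
        if token == ">":
            # '>' must be the last token and needs at least one token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if token != "*" and token != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


def chat_message_subject(user_id: str) -> str:
    """Build ai.chat.user.{userId}.message, rejecting unsafe token characters."""
    if not user_id or any(ch in user_id for ch in ".*> \t"):
        raise ValueError(f"user_id is not a valid subject token: {user_id!r}")
    return f"ai.chat.user.{user_id}.message"
```

Rejecting dots and wildcard characters in `user_id` matters because an identifier containing `.` would silently add hierarchy levels, and subscriptions such as `ai.chat.user.*.message` would stop matching.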

Git Conventions

Commit Messages

type(scope): subject

body (optional)

footer (optional)

Types:

feat: New feature
fix: Bug fix
docs: Documentation
style: Formatting
refactor: Code restructuring
test: Tests
chore: Maintenance

Examples:

feat(chat-handler): add streaming response support
fix(voice): handle empty audio gracefully
docs(adr): add decision for MessagePack format
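
The subject-line format can be enforced in a commit hook or CI step. A minimal sketch, assuming the scope is optional and consists of lowercase letters, digits, and hyphens; `COMMIT_TYPES`, `SUBJECT_RE`, and `parse_commit_subject` are illustrative names.

```python
import re
from typing import Optional

COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# Assumption: the scope is optional and uses lowercase letters, digits, hyphens.
SUBJECT_RE = re.compile(
    r"(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(?:\((?P<scope>[a-z0-9-]+)\))?"
    r": (?P<subject>\S.*)"
)


def parse_commit_subject(line: str) -> Optional[dict]:
    """Return the type, scope, and subject of a commit line, or None if invalid."""
    match = SUBJECT_RE.fullmatch(line)
    return match.groupdict() if match else None
```

For instance, `parse_commit_subject("fix(voice): handle empty audio gracefully")` returns the three parts as a dictionary, while a line with an unknown type such as `feature: ...` returns `None`.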

Branch Naming

feature/short-description
fix/issue-number-description
docs/what-changed

Configuration Conventions

Environment Variables

# Use pydantic-settings or similar
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # No prefix: NATS_URL maps to nats_url, MILVUS_PORT to milvus_port, etc.
    model_config = SettingsConfigDict(env_prefix="")

    nats_url: str = "nats://localhost:4222"
    vllm_url: str = "http://localhost:8000"
    milvus_host: str = "localhost"
    milvus_port: int = 19530
    log_level: str = "INFO"
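
To illustrate what the settings class does with an empty prefix, here is a standard-library approximation of the "or similar" option: each field is read from the upper-cased environment variable of the same name and converted to the type of its default. `load_settings` is a hypothetical helper, and only three of the fields are repeated to keep the sketch short.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    nats_url: str = "nats://localhost:4222"
    milvus_port: int = 19530
    log_level: str = "INFO"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Override defaults from variables named after the upper-cased fields."""
    overrides = {}
    for field in fields(Settings):
        raw = env.get(field.name.upper())  # no prefix: nats_url -> NATS_URL
        if raw is not None:
            # Convert using the type of the default value (str or int here).
            overrides[field.name] = type(field.default)(raw)
    return Settings(**overrides)
```

Passing the environment as a parameter keeps the function easy to test, for example `load_settings({"MILVUS_PORT": "19531"})`.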

ConfigMaps

apiVersion: v1
kind: ConfigMap
metadata:
  name: ai-services-config
data:
  NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
  VLLM_URL: "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
  # ... other non-sensitive config

Documentation Conventions

ADR Format

See decisions/0000-template.md

Code Comments

# Use docstrings for public functions
from typing import Any, Dict, List

async def query_rag(query: str) -> List[Dict[str, Any]]:
    """
    Query the RAG system for relevant documents.

    Args:
        query: The search query string

    Returns:
        List of document chunks with scores

    Raises:
        RAGQueryError: If the query fails
    """
    ...

README Files

Each application should have a README with:

Purpose
Configuration
Deployment
Local development
API documentation (if applicable)

Anti-Patterns to Avoid

| Don't | Do Instead |
|-------|------------|
| `kubectl apply` directly | Commit to Git, let Flux deploy |
| Hardcode secrets | Use External Secrets Operator |
| Use `latest` image tags | Pin to specific versions |
| Skip health checks | Always define liveness/readiness |
| Ignore resource limits | Set appropriate requests/limits |
| Use JSON for NATS messages | Use MessagePack (binary) |
| Synchronous I/O in handlers | Use async/await |
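
The image-tag rule is easy to lint. The sketch below treats an image as pinned if it carries a digest or an explicit tag other than `latest`; `is_pinned` is a hypothetical helper, and the image names in the usage note are made up for illustration.

```python
def is_pinned(image: str) -> bool:
    """Return True if an image reference has a digest or a non-latest tag."""
    if "@sha256:" in image:
        return True  # digest references are always pinned
    # Only the last path segment can carry a tag; a colon earlier in the
    # reference belongs to a registry port such as registry:5000.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means an implicit "latest"
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")
```

With this check, `example.io/org/chat-handler:1.0.0` passes, while `example.io/org/chat-handler`, `chat-handler:latest`, and `registry:5000/chat-handler` are flagged.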

Related Documents

TECH-STACK.md - Technologies used
ARCHITECTURE.md - System design
decisions/ - Why we made certain choices
