daviestechlabs/homelab-design

Billy D. 100ba21eba

updates to adrs and fixing to reflect go refactor.

2026-02-23 06:14:30 -05:00

📐 Coding Conventions

Patterns, practices, and folder structure conventions for DaviesTechLabs repositories

Repository Conventions

homelab-k8s2 (Infrastructure)

kubernetes/
├── apps/                    # Application deployments
│   └── {namespace}/         # One folder per namespace
│       └── {app}/           # One folder per application
│           ├── app/         # Kubernetes manifests
│           │   ├── kustomization.yaml
│           │   ├── helmrelease.yaml   # OR individual manifests
│           │   └── ...
│           └── ks.yaml      # Flux Kustomization
├── components/              # Reusable Kustomize components
└── flux/                    # Flux system configuration

Naming Conventions:

Namespaces: lowercase with hyphens (ai-ml, cert-manager)
Apps: lowercase with hyphens (chat-handler, voice-assistant)
Secrets: {app}-{type} (e.g., milvus-credentials)
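
The naming rules above can be checked mechanically, for example in a pre-commit hook. The following is a minimal sketch, assuming names must also fit the 63-character Kubernetes DNS label limit; `is_valid_name` and `secret_name` are hypothetical helpers, not functions from any DaviesTechLabs repository.

```python
import re

# Lowercase alphanumeric tokens joined by single hyphens, e.g. "ai-ml".
_NAME_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def is_valid_name(name: str) -> bool:
    """Return True for lowercase-with-hyphens names such as 'cert-manager'."""
    # Assumption: names are also capped at 63 characters (DNS label limit).
    return len(name) <= 63 and _NAME_RE.fullmatch(name) is not None


def secret_name(app: str, kind: str) -> str:
    """Build a secret name following the {app}-{type} convention."""
    if not (is_valid_name(app) and is_valid_name(kind)):
        raise ValueError(f"invalid name component: {app!r}, {kind!r}")
    return f"{app}-{kind}"
```

For instance, `secret_name("milvus", "credentials")` yields `milvus-credentials`, while `is_valid_name("Chat_Handler")` is rejected because of the uppercase letters and the underscore.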

AI/ML Repos (git.daviestechlabs.io/daviestechlabs)

handler-base/                # Shared Go module for all NATS handlers
├── clients/                 #   HTTP clients (LLM, STT, TTS, embeddings, reranker)
├── config/                  #   Env-based configuration (struct tags)
├── gen/messagespb/          #   Generated protobuf stubs
├── handler/                 #   Typed NATS message handler with OTel + health wiring
├── health/                  #   HTTP health + readiness server
├── messages/                #   Type aliases from generated protobuf stubs
├── natsutil/                #   NATS publish/request with protobuf encoding
├── proto/messages/v1/       #   .proto schema source
├── go.mod
└── buf.yaml                 #   buf protobuf toolchain config

chat-handler/                # Text chat service (Go)
voice-assistant/             # Voice pipeline service (Go)
pipeline-bridge/             # Workflow engine bridge (Go)
stt-module/                  # Speech-to-text bridge (Go)
tts-module/                  # Text-to-speech bridge (Go)
├── main.go                  # Service entry point
├── main_test.go             # Unit tests
├── e2e_test.go              # End-to-end tests
├── go.mod                   # Go module (depends on handler-base)
├── Dockerfile               # Distroless container (~20 MB)
└── renovate.json            # Dependency update config

argo/                        # Argo WorkflowTemplates
└── {workflow-name}.yaml

kubeflow/                    # Kubeflow Pipelines
└── {pipeline}_pipeline.py

kuberay-images/              # GPU worker images
├── dockerfiles/
└── ray-serve/

Python Conventions

Package Management (ADR-0012)

Use uv for local development and pip in Docker for reproducibility:

# Install uv (one-time)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Or use uv sync with lock file
uv sync

# Update lock file after changing pyproject.toml
uv lock

# Run tests
uv run pytest

Code Formatting & Linting (Ruff)

All Python code must pass ruff check and ruff format before merge. Ruff is configured in each repo's pyproject.toml:

[tool.ruff]
line-length = 100
target-version = "py311"

[tool.ruff.lint]
select = ["E", "F", "W", "I", "UP", "B", "C4", "SIM"]
ignore = ["E501"]  # Line length handled by formatter

[tool.ruff.format]
quote-style = "double"

Required dev dependencies:

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-asyncio>=0.23.0",
    "pytest-cov>=4.0.0",  # For coverage in handler-base
    "ruff>=0.1.0",
]

Local workflow:

# Check and auto-fix
uv run ruff check --fix .

# Format code
uv run ruff format .

# Verify before commit
uv run ruff check . && uv run ruff format --check .

CI enforcement: All repos run ruff in the lint job. Commits that fail linting will not pass CI.

Kubeflow pipeline variables: For Kubeflow DSL pipelines, terminal task assignments that appear unused should have # noqa: F841 comments, as these define the DAG structure:

# Step 6: Final step (defines DAG dependency)
tts_task = synthesize_speech(text=llm_task.output)  # noqa: F841

Project Structure

// Go handler services use handler-base shared module
import (
    "git.daviestechlabs.io/daviestechlabs/handler-base/clients"
    "git.daviestechlabs.io/daviestechlabs/handler-base/config"
    "git.daviestechlabs.io/daviestechlabs/handler-base/handler"
    "git.daviestechlabs.io/daviestechlabs/handler-base/health"
    "git.daviestechlabs.io/daviestechlabs/handler-base/messages"
    "git.daviestechlabs.io/daviestechlabs/handler-base/natsutil"
)

# Python remains for Ray Serve, Kubeflow pipelines, Gradio UIs
from dataclasses import dataclass

from nats.aio.msg import Msg


# Use async/await for I/O
async def handle_message(msg: Msg) -> None:
    ...


# Use dataclasses for structured data
@dataclass
class ChatRequest:
    user_id: str
    message: str
    enable_rag: bool = True

Naming

Element	Convention	Example
Files	snake_case	`chat_handler.py`
Classes	PascalCase	`ChatHandler`
Functions	snake_case	`process_message`
Constants	UPPER_SNAKE	`NATS_URL`
Private	Leading underscore	`_internal_method`

Type Hints

# Always use type hints. With target-version py311, use built-in generics
# (list, dict); the Ruff "UP" rules flag typing.List and typing.Dict.
from typing import Any

async def query_rag(
    query: str,
    collection: str = "knowledge_base",
    top_k: int = 5,
) -> list[dict[str, Any]]:
    ...

Error Handling

import logging

logger = logging.getLogger(__name__)

# Use specific exceptions
class RAGQueryError(Exception):
    """Raised when a RAG query fails."""

# Log errors with context, then re-raise as the specific exception
# (fragment: this runs inside an async function where query and collection are defined)
try:
    result = await milvus.search(...)
except Exception as e:
    logger.error("RAG query failed: %s", e, extra={"query": query})
    raise RAGQueryError(f"Failed to query collection {collection}") from e

NATS Message Handling

All NATS handler services use Go with Protocol Buffers encoding (see ADR-0061):

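// Go NATS handler (production pattern)
// Requires the "context" package in addition to nats and proto.
func (h *Handler) handleMessage(msg *nats.Msg) {
    ctx := context.Background()

    var req messages.ChatRequest
    if err := proto.Unmarshal(msg.Data, &req); err != nil {
        h.logger.Error("failed to unmarshal", "error", err)
        return
    }

    // Process
    result, err := h.process(ctx, &req)
    if err != nil {
        h.logger.Error("handler error", "error", err)
        _ = msg.Nak() // Ack/Nak only apply to JetStream messages
        return
    }

    // Reply if request-reply pattern
    if msg.Reply != "" {
        data, err := proto.Marshal(result)
        if err != nil {
            h.logger.Error("failed to marshal response", "error", err)
            _ = msg.Nak()
            return
        }
        if err := msg.Respond(data); err != nil {
            h.logger.Error("failed to respond", "error", err)
        }
    }
    _ = msg.Ack()
}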

Python NATS is still used in Ray Serve runtime_env and Kubeflow pipeline components where needed, but all dedicated NATS handler services are Go.

Kubernetes Manifest Conventions

Labels

metadata:
  labels:
    # Required
    app.kubernetes.io/name: chat-handler
    app.kubernetes.io/instance: chat-handler
    app.kubernetes.io/component: handler
    app.kubernetes.io/part-of: ai-platform
    
    # Optional
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/managed-by: flux
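
A check for the required labels could look like the sketch below. It assumes the manifest has already been parsed into a plain dictionary (for example by a YAML loader); `REQUIRED_LABELS` and `missing_labels` are hypothetical names used only for illustration.

```python
# The four labels marked "Required" above.
REQUIRED_LABELS = (
    "app.kubernetes.io/name",
    "app.kubernetes.io/instance",
    "app.kubernetes.io/component",
    "app.kubernetes.io/part-of",
)


def missing_labels(manifest: dict) -> list[str]:
    """Return the required label keys that are absent from a parsed manifest."""
    # Tolerate manifests where metadata or labels is missing or null.
    labels = (manifest.get("metadata") or {}).get("labels") or {}
    return [key for key in REQUIRED_LABELS if key not in labels]
```

An empty result means the manifest carries every required label; optional labels such as `app.kubernetes.io/version` are ignored by this check.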

Annotations

metadata:
  annotations:
    # Reloader for config changes
    reloader.stakater.com/auto: "true"
    
    # Documentation
    description: "Handles chat messages via NATS"

Resource Requests

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
    
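# GPU workloads: request the resource for one vendor, matching the node's GPU
resources:
  limits:
    amd.com/gpu: 1        # AMD nodes
    # nvidia.com/gpu: 1   # NVIDIA nodes (use instead of amd.com/gpu)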

Health Checks

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
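
For the Python services that remain (Ray Serve, Gradio UIs), the two probe paths above could be served with the standard library alone. This is a sketch under assumptions, not the `health` package from handler-base: `make_health_server` is a hypothetical helper, and readiness is modelled as a `threading.Event` that the service sets once its dependencies are connected.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_health_server(
    ready: threading.Event, host: str = "0.0.0.0", port: int = 8080
) -> ThreadingHTTPServer:
    """Build an HTTP server exposing /health (liveness) and /ready (readiness)."""

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path == "/health":
                status = 200  # the process is up and serving requests
            elif self.path == "/ready":
                status = 200 if ready.is_set() else 503  # 503 until dependencies connect
            else:
                status = 404
            self.send_response(status)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, format: str, *args: object) -> None:
            """Suppress per-request logging; probes fire every few seconds."""

    return ThreadingHTTPServer((host, port), ProbeHandler)
```

A service would typically run `server.serve_forever` in a daemon thread and call `ready.set()` after its startup work completes, so the readiness probe fails until then while the liveness probe already passes.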

Flux/GitOps Conventions

Kustomization Structure

# ks.yaml - Flux Kustomization
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: &app chat-handler
  namespace: flux-system
spec:
  targetNamespace: ai-ml
  commonMetadata:
    labels:
      app.kubernetes.io/name: *app
  path: ./kubernetes/apps/ai-ml/chat-handler/app
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  wait: true
  interval: 30m
  retryInterval: 1m
  timeout: 5m

HelmRelease Structure

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: milvus
spec:
  interval: 30m
  chart:
    spec:
      chart: milvus
      version: 4.x.x
      sourceRef:
        kind: HelmRepository
        name: milvus
        namespace: flux-system
  values:
    # Values here

Secret References

# Never hardcode secrets
env:
  - name: DATABASE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password

NATS Subject Conventions

Hierarchy

ai.{domain}.{scope}.{action}

Examples:
ai.chat.user.{userId}.message      # User chat message
ai.chat.response.{requestId}       # Chat response
ai.voice.user.{userId}.request     # Voice request
ai.pipeline.trigger                # Pipeline trigger

Wildcards

ai.chat.>                   # All chat events
ai.chat.user.*.message      # All user messages
ai.*.response.{id}          # Any response type
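
The wildcard semantics can be made concrete with a small matcher: `*` stands for exactly one token and `>` for one or more trailing tokens. The sketch below is illustrative only (the NATS server does this matching itself) and assumes well-formed, dot-separated subjects; `subject_matches` and `chat_message_subject` are hypothetical helpers.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a concrete subject matches a subscription pattern."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, token in enumerate(p_tokens):
        if token == ">":
            # '>' must be the last pattern token and consume at least one subject token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False  # subject is shorter than the pattern
        if token != "*" and token != s_tokens[i]:
            return False  # literal token mismatch
    return len(p_tokens) == len(s_tokens)


def chat_message_subject(user_id: str) -> str:
    """Build ai.chat.user.{userId}.message, rejecting IDs that would break the subject."""
    if not user_id or any(ch in user_id for ch in ".*> \t"):
        raise ValueError(f"user_id is not a valid subject token: {user_id!r}")
    return f"ai.chat.user.{user_id}.message"
```

Validating the user ID before interpolation matters because a dot or wildcard inside it would silently change the subject's token count and therefore which subscriptions receive the message.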

Git Conventions

Commit Messages

type(scope): subject

body (optional)

footer (optional)

Types:

feat: New feature
fix: Bug fix
docs: Documentation
style: Formatting
refactor: Code restructuring
test: Tests
chore: Maintenance

Examples:

feat(chat-handler): add streaming response support
fix(voice): handle empty audio gracefully
docs(adr): add decision for MessagePack format
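
The header format can be enforced with a regular expression, for instance in a commit-msg hook. A minimal sketch, assuming the scope is mandatory and written in lowercase with hyphens (the format above does not say whether it may be omitted); the function names are hypothetical.

```python
import re

COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# type(scope): subject  (scope is treated as required here, which is an assumption)
_HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")\((?P<scope>[a-z0-9-]+)\): (?P<subject>\S.*)$"
)


def parse_commit_header(header: str) -> dict[str, str]:
    """Split a commit header into type, scope, and subject, or raise ValueError."""
    match = _HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"not a valid commit header: {header!r}")
    return match.groupdict()


def is_valid_commit_header(header: str) -> bool:
    """Return True if the header follows the type(scope): subject format."""
    try:
        parse_commit_header(header)
    except ValueError:
        return False
    return True
```

Only the first line of the message is checked; the optional body and footer are free-form in this sketch.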

Branch Naming

feature/short-description
fix/issue-number-description
docs/what-changed
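
Branch names can be validated the same way. This sketch assumes the description part is lowercase with hyphens, matching the other naming conventions in this document; `is_valid_branch` is a hypothetical helper.

```python
import re

# Prefixes come from the three patterns above; the description format is an assumption.
_BRANCH_RE = re.compile(r"^(?:feature|fix|docs)/[a-z0-9]+(?:-[a-z0-9]+)*$")


def is_valid_branch(name: str) -> bool:
    """Return True for names like 'feature/streaming-responses' or 'fix/123-empty-audio'."""
    return _BRANCH_RE.fullmatch(name) is not None
```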

Configuration Conventions

Environment Variables

# Use pydantic-settings or similar
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # pydantic-settings v2 style; replaces the deprecated inner `class Config`
    model_config = SettingsConfigDict(env_prefix="")  # No prefix

    nats_url: str = "nats://localhost:4222"
    vllm_url: str = "http://localhost:8000"
    milvus_host: str = "localhost"
    milvus_port: int = 19530
    log_level: str = "INFO"

ConfigMaps

apiVersion: v1
kind: ConfigMap
metadata:
  name: ai-services-config
data:
  NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
  VLLM_URL: "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
  # ... other non-sensitive config
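
The ConfigMap keys map onto settings fields by upper-casing the field name (`nats_url` reads `NATS_URL`). The standard-library sketch below illustrates that mapping for the "or similar" case where pydantic-settings is not used; it is an assumption-laden illustration, and `load_settings` is a hypothetical helper rather than existing project code.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    nats_url: str = "nats://localhost:4222"
    vllm_url: str = "http://localhost:8000"
    milvus_host: str = "localhost"
    milvus_port: int = 19530
    log_level: str = "INFO"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Override defaults with environment values named after the upper-cased field."""
    overrides = {}
    for field in fields(Settings):
        raw = env.get(field.name.upper())
        if raw is None:
            continue  # keep the default
        # Environment values are always strings; convert the integer fields.
        overrides[field.name] = int(raw) if field.type in (int, "int") else raw
    return Settings(**overrides)
```

With the ConfigMap above injected into the pod via `envFrom`, `load_settings()` would return the in-cluster NATS URL while fields without a matching key keep their local-development defaults.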

Documentation Conventions

ADR Format

See decisions/0000-template.md

Code Comments

# Use docstrings for public functions
async def query_rag(query: str) -> list[dict]:
    """
    Query the RAG system for relevant documents.
    
    Args:
        query: The search query string
        
    Returns:
        List of document chunks with scores
        
    Raises:
        RAGQueryError: If the query fails
    """
    ...

README Files

Each application should have a README with:

Purpose
Configuration
Deployment
Local development
API documentation (if applicable)

Anti-Patterns to Avoid

Don't	Do Instead
`kubectl apply` directly	Commit to Git, let Flux deploy
Hardcode secrets	Use External Secrets Operator
Use `latest` image tags	Pin to specific versions
Skip health checks	Always define liveness/readiness
Ignore resource limits	Set appropriate requests/limits
Use JSON for NATS messages	Use Protocol Buffers (see ADR-0061)
Write handler services in Python	Use Go with handler-base module (ADR-0061)
Synchronous I/O in handlers	Use goroutines / async patterns
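
The rule against `latest` image tags lends itself to an automated check. A minimal sketch, assuming an image counts as pinned when it carries a digest or any explicit tag other than `latest`; `is_pinned` is a hypothetical helper and the image names in the usage note are made up.

```python
def is_pinned(image: str) -> bool:
    """Return True if an image reference has a digest or a non-latest tag."""
    if "@sha256:" in image:
        return True  # digest references are always pinned
    # Look for the tag only after the last '/', so a registry port is not mistaken for one.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means an implicit 'latest'
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")
```

For example, `registry.example:5000/app` is reported as unpinned even though it contains a colon, while `registry.example:5000/app:v1.2.3` passes.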

TECH-STACK.md - Technologies used
ARCHITECTURE.md - System design
decisions/ - Why we made certain choices
