# 📐 Coding Conventions

> **Patterns, practices, and folder structure conventions for DaviesTechLabs repositories**

## Repository Conventions

### homelab-k8s2 (Infrastructure)

```
kubernetes/
├── apps/                    # Application deployments
│   └── {namespace}/         # One folder per namespace
│       └── {app}/           # One folder per application
│           ├── app/         # Kubernetes manifests
│           │   ├── kustomization.yaml
│           │   ├── helmrelease.yaml   # OR individual manifests
│           │   └── ...
│           └── ks.yaml      # Flux Kustomization
├── components/              # Reusable Kustomize components
└── flux/                    # Flux system configuration
```

**Naming Conventions:**
- Namespaces: lowercase with hyphens (`ai-ml`, `cert-manager`)
- Apps: lowercase with hyphens (`chat-handler`, `voice-assistant`)
- Secrets: `{app}-{type}` (e.g., `milvus-credentials`)
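
The naming rules above can be checked mechanically. The following is a minimal sketch, not tooling that exists in these repositories: `is_valid_name` and `secret_name` are hypothetical helpers, and the pattern assumes names should also satisfy the Kubernetes DNS label rules (lowercase alphanumerics and hyphens, starting and ending with an alphanumeric, at most 63 characters).

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics and hyphens,
# starting and ending with an alphanumeric, at most 63 characters.
_NAME_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def is_valid_name(name: str) -> bool:
    """Return True if `name` is lowercase-with-hyphens and a valid DNS label."""
    return _NAME_RE.fullmatch(name) is not None


def secret_name(app: str, secret_type: str) -> str:
    """Build a secret name following the `{app}-{type}` convention."""
    name = f"{app}-{secret_type}"
    if not is_valid_name(name):
        raise ValueError(f"invalid secret name: {name!r}")
    return name
```

For example, `secret_name("milvus", "credentials")` yields `milvus-credentials`, while a name such as `AI_ML` is rejected.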

### AI/ML Repos (git.daviestechlabs.io/daviestechlabs)

```
handler-base/                # Shared Go module for all NATS handlers
├── clients/                 #   HTTP clients (LLM, STT, TTS, embeddings, reranker)
├── config/                  #   Env-based configuration (struct tags)
├── gen/messagespb/          #   Generated protobuf stubs
├── handler/                 #   Typed NATS message handler with OTel + health wiring
├── health/                  #   HTTP health + readiness server
├── messages/                #   Type aliases from generated protobuf stubs
├── natsutil/                #   NATS publish/request with protobuf encoding
├── proto/messages/v1/       #   .proto schema source
├── go.mod
└── buf.yaml                 #   buf protobuf toolchain config

chat-handler/                # Text chat service (Go)
voice-assistant/             # Voice pipeline service (Go)
pipeline-bridge/             # Workflow engine bridge (Go)
stt-module/                  # Speech-to-text bridge (Go)
tts-module/                  # Text-to-speech bridge (Go)
│                            # Each service repo listed above follows this layout:
├── main.go                  # Service entry point
├── main_test.go             # Unit tests
├── e2e_test.go              # End-to-end tests
├── go.mod                   # Go module (depends on handler-base)
├── Dockerfile               # Distroless container (~20 MB)
└── renovate.json            # Dependency update config

argo/                        # Argo WorkflowTemplates
└── {workflow-name}.yaml

kubeflow/                    # Kubeflow Pipelines
└── {pipeline}_pipeline.py

kuberay-images/              # GPU worker images
├── dockerfiles/
└── ray-serve/
```

---

## Python Conventions

### Package Management (ADR-0012)

Use **uv** for local development and **pip** in Docker for reproducibility:

```bash
# Install uv (one-time)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Or use uv sync with lock file
uv sync

# Update lock file after changing pyproject.toml
uv lock

# Run tests
uv run pytest
```

### Code Formatting & Linting (Ruff)

All Python code must pass `ruff check` and `ruff format` before merge. Ruff is configured in each repo's `pyproject.toml`:

```toml
[tool.ruff]
line-length = 100
target-version = "py311"

[tool.ruff.lint]
select = ["E", "F", "W", "I", "UP", "B", "C4", "SIM"]
ignore = ["E501"]  # Line length handled by formatter

[tool.ruff.format]
quote-style = "double"
```

**Required dev dependencies:**
```toml
[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-asyncio>=0.23.0",
    "pytest-cov>=4.0.0",  # For coverage in handler-base
    "ruff>=0.1.0",
]
```

**Local workflow:**
```bash
# Check and auto-fix
uv run ruff check --fix .

# Format code
uv run ruff format .

# Verify before commit
uv run ruff check . && uv run ruff format --check .
```

**CI enforcement:** All repos run ruff in the lint job. Commits that fail linting will not pass CI.

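**Kubeflow pipeline variables:** In Kubeflow DSL pipelines, the final task in a chain is often assigned to a variable that is never read. Keep the assignment, since these calls define the DAG structure, and mark it with `# noqa: F841` so Ruff does not flag the unused variable: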
```python
# Step 6: Final step (defines DAG dependency)
tts_task = synthesize_speech(text=llm_task.output)  # noqa: F841
```

### Project Structure

```go
// Go handler services use handler-base shared module
import (
    "git.daviestechlabs.io/daviestechlabs/handler-base/clients"
    "git.daviestechlabs.io/daviestechlabs/handler-base/config"
    "git.daviestechlabs.io/daviestechlabs/handler-base/handler"
    "git.daviestechlabs.io/daviestechlabs/handler-base/health"
    "git.daviestechlabs.io/daviestechlabs/handler-base/messages"
    "git.daviestechlabs.io/daviestechlabs/handler-base/natsutil"
)
```

```python
# Python remains for Ray Serve, Kubeflow pipelines, Gradio UIs
from dataclasses import dataclass

from nats.aio.msg import Msg  # message type, assuming the nats-py client


# Use async/await for I/O
async def handle_message(msg: Msg) -> None:
    ...


# Use dataclasses for structured data
@dataclass
class ChatRequest:
    user_id: str
    message: str
    enable_rag: bool = True
```

### Naming

| Element | Convention | Example |
|---------|------------|---------|
| Files | snake_case | `chat_handler.py` |
| Classes | PascalCase | `ChatHandler` |
| Functions | snake_case | `process_message` |
| Constants | UPPER_SNAKE | `NATS_URL` |
| Private | Leading underscore | `_internal_method` |

### Type Hints

```python
# Always use type hints; prefer built-in generics (list, dict) over typing.List/Dict,
# which Ruff's "UP" rules flag when targeting py311
from typing import Any

async def query_rag(
    query: str,
    collection: str = "knowledge_base",
    top_k: int = 5,
) -> list[dict[str, Any]]:
    ...
```

### Error Handling

```python
import logging

logger = logging.getLogger(__name__)


# Use specific exceptions
class RAGQueryError(Exception):
    """Raised when a RAG query fails."""


# Log errors with context, then re-raise with the original error as the cause
async def search_collection(query: str, collection: str):  # illustrative wrapper
    try:
        return await milvus.search(...)  # `milvus` is a client created elsewhere
    except Exception as e:
        logger.error("RAG query failed: %s", e, extra={"query": query})
        raise RAGQueryError(f"Failed to query collection {collection}") from e
```

### NATS Message Handling

All NATS handler services use Go with Protocol Buffers encoding (see [ADR-0061](decisions/0061-go-handler-refactor.md)):

```go
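// Go NATS handler (production pattern)
func (h *Handler) handleMessage(msg *nats.Msg) {
    // Context passed to processing; wrap with a timeout if the work must be bounded
    ctx := context.Background()

    var req messages.ChatRequest
    if err := proto.Unmarshal(msg.Data, &req); err != nil {
        h.logger.Error("failed to unmarshal", "error", err)
        return
    }

    // Process
    result, err := h.process(ctx, &req)
    if err != nil {
        h.logger.Error("handler error", "error", err)
        msg.Nak()
        return
    }

    // Reply if request-reply pattern
    if msg.Reply != "" {
        data, err := proto.Marshal(result)
        if err != nil {
            // Do not send an empty reply when encoding fails
            h.logger.Error("failed to marshal response", "error", err)
            msg.Nak()
            return
        }
        msg.Respond(data)
    }
    msg.Ack()
}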
```

> **Python NATS** is still used in Ray Serve `runtime_env` and Kubeflow pipeline components where needed, but all dedicated NATS handler services are Go.

---

## Kubernetes Manifest Conventions

### Labels

```yaml
metadata:
  labels:
    # Required
    app.kubernetes.io/name: chat-handler
    app.kubernetes.io/instance: chat-handler
    app.kubernetes.io/component: handler
    app.kubernetes.io/part-of: ai-platform

    # Optional
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/managed-by: flux
```

### Annotations

```yaml
metadata:
  annotations:
    # Reloader for config changes
    reloader.stakater.com/auto: "true"

    # Documentation
    description: "Handles chat messages via NATS"
```

### Resource Requests

```yaml
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

# GPU workloads: set only the resource that matches the node's GPU vendor
resources:
  limits:
    amd.com/gpu: 1        # AMD
    nvidia.com/gpu: 1     # NVIDIA
```

### Health Checks

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```
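
Go services get these endpoints from the `health/` package in handler-base. For a Python service that needs its own probe endpoints, the following is a minimal standard-library sketch of what the probes above expect: `/health` answers as soon as the process is up, and `/ready` returns 503 until startup work has finished. The names `ready`, `HealthHandler`, and `start_health_server` are hypothetical and not part of any existing repository.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once startup work (model load, connections) has finished.
ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/health":
            self._reply(200, {"status": "ok"})
        elif self.path == "/ready":
            if ready.is_set():
                self._reply(200, {"status": "ready"})
            else:
                self._reply(503, {"status": "starting"})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code: int, body: dict) -> None:
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format: str, *args) -> None:
        # Silence per-request logging; probes fire every few seconds.
        pass


def start_health_server(port: int = 8080) -> ThreadingHTTPServer:
    """Serve the probe endpoints on a background daemon thread."""
    server = ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A service would call `start_health_server()` early in startup and `ready.set()` once its dependencies are connected, so the readiness probe only passes when the pod can actually take traffic.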

---

## Flux/GitOps Conventions

### Kustomization Structure

```yaml
# ks.yaml - Flux Kustomization
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: &app chat-handler
  namespace: flux-system
spec:
  targetNamespace: ai-ml
  commonMetadata:
    labels:
      app.kubernetes.io/name: *app
  path: ./kubernetes/apps/ai-ml/chat-handler/app
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  wait: true
  interval: 30m
  retryInterval: 1m
  timeout: 5m
```

### HelmRelease Structure

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: milvus
spec:
  interval: 30m
  chart:
    spec:
      chart: milvus
      version: 4.x.x
      sourceRef:
        kind: HelmRepository
        name: milvus
        namespace: flux-system
  values:
    # Values here
```

### Secret References

```yaml
# Never hardcode secrets
env:
  - name: DATABASE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password
```

---

## NATS Subject Conventions

### Hierarchy

```
ai.{domain}.{scope}.{action}

Examples:
ai.chat.user.{userId}.message      # User chat message
ai.chat.response.{requestId}       # Chat response
ai.voice.user.{userId}.request     # Voice request
ai.pipeline.trigger                # Pipeline trigger
```

### Wildcards

```
ai.chat.>                   # All chat events
ai.chat.user.*.message      # All user messages
ai.*.response.{id}          # Any response type
```
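
To make the wildcard semantics concrete, here is a small sketch that builds subjects from tokens and tests them against a pattern. It follows the standard NATS rules (`*` matches exactly one token, `>` matches one or more trailing tokens and must come last). `build_subject` and `subject_matches` are hypothetical helpers for illustration; the NATS server performs this matching itself.

```python
def build_subject(*tokens: str) -> str:
    """Join tokens into a dot-separated subject, rejecting unsafe tokens."""
    for token in tokens:
        # Dots, wildcards, and whitespace inside a token would change the subject's shape.
        if not token or any(c in token for c in ".*> \t\r\n"):
            raise ValueError(f"invalid subject token: {token!r}")
    return ".".join(tokens)


def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if `subject` matches `pattern` under NATS wildcard rules."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # Must be the last pattern token and must consume at least one subject token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)
```

Building subjects through a helper like this keeps user-supplied values such as `{userId}` from injecting extra tokens or wildcards into a subject.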

---

## Git Conventions

### Commit Messages

```
type(scope): subject

body (optional)

footer (optional)
```

**Types:**
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation
- `style`: Formatting
- `refactor`: Code restructuring
- `test`: Tests
- `chore`: Maintenance

**Examples:**
```
feat(chat-handler): add streaming response support
fix(voice): handle empty audio gracefully
docs(adr): add decision for Protocol Buffers encoding
```
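
The header format can be enforced in a commit hook or CI step. The sketch below is one possible check, not an existing tool in these repositories: `parse_commit_header` is a hypothetical helper, and it assumes the scope is mandatory and lowercase with hyphens, since every example above includes one. Loosen the pattern if scopeless commits are allowed.

```python
import re

COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# type(scope): subject
_HEADER_RE = re.compile(
    r"(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"\((?P<scope>[a-z0-9-]+)\)"
    r": (?P<subject>\S.*)"
)


def parse_commit_header(message: str) -> dict[str, str]:
    """Parse the first line of a commit message or raise ValueError."""
    lines = message.splitlines()
    header = lines[0] if lines else ""
    match = _HEADER_RE.fullmatch(header)
    if match is None:
        raise ValueError(f"commit header does not follow the convention: {header!r}")
    return match.groupdict()
```

Only the first line is validated; the optional body and footer are left free-form.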

### Branch Naming

```
feature/short-description
fix/issue-number-description
docs/what-changed
```
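
Branch names can be checked the same way. This is a sketch under the assumption that the part after the slash is lowercase words or numbers joined by hyphens; `is_valid_branch` is a hypothetical helper.

```python
import re

# prefix/kebab-case-description, using the three prefixes listed above
_BRANCH_RE = re.compile(r"(feature|fix|docs)/[a-z0-9]+(-[a-z0-9]+)*")


def is_valid_branch(name: str) -> bool:
    """Return True if `name` follows the branch naming convention."""
    return _BRANCH_RE.fullmatch(name) is not None
```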

---

## Configuration Conventions

### Environment Variables

```python
# Use pydantic-settings or similar
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # model_config replaces the inner `class Config`, which pydantic v2 deprecates
    model_config = SettingsConfigDict(env_prefix="")  # No prefix

    nats_url: str = "nats://localhost:4222"
    vllm_url: str = "http://localhost:8000"
    milvus_host: str = "localhost"
    milvus_port: int = 19530
    log_level: str = "INFO"
```
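
Where adding a dependency is not worthwhile, the same idea can be approximated with the standard library. The sketch below is an illustrative alternative, not an existing module: `EnvSettings` and `from_env` are hypothetical names, only a few of the fields are shown, and each field is read from the upper-cased environment variable of the same name (for example `NATS_URL`), matching the ConfigMap keys used in this document.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EnvSettings:
    nats_url: str = "nats://localhost:4222"
    milvus_port: int = 19530
    log_level: str = "INFO"

    @classmethod
    def from_env(cls, environ: Mapping[str, str] = os.environ) -> "EnvSettings":
        """Override defaults with values from upper-cased environment variables."""
        overrides = {}
        for field in fields(cls):
            raw = environ.get(field.name.upper())
            if raw is not None:
                # Coerce the string to the type of the field's default (str or int here).
                overrides[field.name] = type(field.default)(raw)
        return cls(**overrides)
```

Passing a plain dict as `environ` makes the loader easy to exercise in tests without touching the real process environment.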

### ConfigMaps

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ai-services-config
data:
  NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
  VLLM_URL: "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
  # ... other non-sensitive config
```

---

## Documentation Conventions

### ADR Format

See [decisions/0000-template.md](decisions/0000-template.md)

### Code Comments

```python
# Use docstrings for public functions
async def query_rag(query: str) -> list[dict]:
    """
    Query the RAG system for relevant documents.

    Args:
        query: The search query string

    Returns:
        List of document chunks with scores

    Raises:
        RAGQueryError: If the query fails
    """
    ...
```

### README Files

Each application should have a README with:
1. Purpose
2. Configuration
3. Deployment
4. Local development
5. API documentation (if applicable)

---

## Anti-Patterns to Avoid

| Don't | Do Instead |
|-------|------------|
| `kubectl apply` directly | Commit to Git, let Flux deploy |
| Hardcode secrets | Use External Secrets Operator |
| Use `latest` image tags | Pin to specific versions |
| Skip health checks | Always define liveness/readiness |
| Ignore resource limits | Set appropriate requests/limits |
| Use JSON for NATS messages | Use Protocol Buffers (see ADR-0061) |
| Write handler services in Python | Use Go with handler-base module (ADR-0061) |
| Synchronous I/O in handlers | Use goroutines / async patterns |

---

## Related Documents

- [TECH-STACK.md](TECH-STACK.md) - Technologies used
- [ARCHITECTURE.md](ARCHITECTURE.md) - System design
- [decisions/](decisions/) - Why we made certain choices