- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of the homelab-k8s2 and llm-workflows repositories and kubectl cluster-info dump data.
📐 Coding Conventions
Patterns, practices, and folder structure conventions for DaviesTechLabs repositories
Repository Conventions
homelab-k8s2 (Infrastructure)
kubernetes/
├── apps/                          # Application deployments
│   └── {namespace}/               # One folder per namespace
│       └── {app}/                 # One folder per application
│           ├── app/               # Kubernetes manifests
│           │   ├── kustomization.yaml
│           │   ├── helmrelease.yaml   # OR individual manifests
│           │   └── ...
│           └── ks.yaml            # Flux Kustomization
├── components/                    # Reusable Kustomize components
└── flux/                          # Flux system configuration
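Because the layout is fixed, the path to any application's manifests can be derived from its namespace and name. A minimal sketch, assuming hypothetical helper names (`app_manifest_dir`, `flux_ks_path`) that do not exist in either repository:

```python
from pathlib import PurePosixPath

# Root of the application tree, as shown in the layout above.
APPS_ROOT = PurePosixPath("kubernetes/apps")


def app_manifest_dir(namespace: str, app: str) -> PurePosixPath:
    """Directory holding kustomization.yaml and the app's manifests."""
    return APPS_ROOT / namespace / app / "app"


def flux_ks_path(namespace: str, app: str) -> PurePosixPath:
    """Location of the Flux Kustomization (ks.yaml) for an app."""
    return APPS_ROOT / namespace / app / "ks.yaml"
```

For example, `app_manifest_dir("ai-ml", "chat-handler")` yields `kubernetes/apps/ai-ml/chat-handler/app`, which is the value a `ks.yaml` would point its `path` at.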
Naming Conventions:
- Namespaces: lowercase with hyphens (ai-ml, cert-manager)
- Apps: lowercase with hyphens (chat-handler, voice-assistant)
- Secrets: {app}-{type} (e.g., milvus-credentials)
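These naming rules are simple enough to check mechanically, for instance in a pre-commit hook. The sketch below is one possible reading of "lowercase with hyphens" (lowercase alphanumerics joined by single hyphens); the function names are hypothetical.

```python
import re

# Assumed interpretation: lowercase alphanumeric tokens joined by single hyphens.
_KEBAB = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def is_valid_name(name: str) -> bool:
    """True for names such as 'ai-ml' or 'chat-handler'."""
    return _KEBAB.fullmatch(name) is not None


def secret_name(app: str, kind: str) -> str:
    """Build a secret name following the {app}-{type} convention."""
    name = f"{app}-{kind}"
    if not is_valid_name(name):
        raise ValueError(f"invalid secret name: {name!r}")
    return name
```

`secret_name("milvus", "credentials")` returns `milvus-credentials`, while names with uppercase letters, underscores, or doubled hyphens are rejected.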
llm-workflows (Orchestration)
workflows/ # Kubernetes Deployments for NATS handlers
├── {handler}.yaml # One file per handler
argo/ # Argo WorkflowTemplates
├── {workflow-name}.yaml # One file per workflow
pipelines/ # Kubeflow Pipeline Python files
├── {pipeline}_pipeline.py # Pipeline definition
└── kfp-sync-job.yaml # Upload job
{handler}/ # Python source code
├── __init__.py
├── {handler}.py # Main entry point
├── requirements.txt
└── Dockerfile
Python Conventions
Project Structure
from dataclasses import dataclass

import msgpack
from nats.aio.msg import Msg

# Use async/await for I/O
async def handle_message(msg: Msg) -> None:
    ...

# Use dataclasses for structured data
@dataclass
class ChatRequest:
    user_id: str
    message: str
    enable_rag: bool = True

# Use msgpack for NATS messages
data = msgpack.packb({"key": "value"})
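The dataclass and MessagePack conventions meet at the point where a decoded message (a plain dict) becomes a typed object. A minimal sketch of that step, using only the standard library; `from_payload` is a hypothetical helper, and dropping unknown keys is an assumption rather than a documented rule:

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Dict


@dataclass
class ChatRequest:
    user_id: str
    message: str
    enable_rag: bool = True

    @classmethod
    def from_payload(cls, payload: Dict[str, Any]) -> "ChatRequest":
        """Build a request from a decoded message, ignoring unknown keys."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})


# `payload` stands in for the dict returned by msgpack.unpackb(msg.data, raw=False).
payload = {"user_id": "u1", "message": "hello", "trace": "ignored"}
request = ChatRequest.from_payload(payload)

# asdict() gives a plain dict that can be handed back to msgpack.packb().
reply_body = asdict(request)
```

A payload missing a required field such as `message` raises `TypeError` from the dataclass constructor, which surfaces malformed messages early.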
Naming
| Element | Convention | Example |
|---|---|---|
| Files | snake_case | chat_handler.py |
| Classes | PascalCase | ChatHandler |
| Functions | snake_case | process_message |
| Constants | UPPER_SNAKE | NATS_URL |
| Private | Leading underscore | _internal_method |
Type Hints
# Always use type hints
from typing import Optional, List, Dict, Any

async def query_rag(
    query: str,
    collection: str = "knowledge_base",
    top_k: int = 5,
) -> List[Dict[str, Any]]:
    ...
Error Handling
# Use specific exceptions
class RAGQueryError(Exception):
    """Raised when RAG query fails."""
    pass

# Log errors with context
import logging

logger = logging.getLogger(__name__)

try:
    result = await milvus.search(...)
except Exception as e:
    logger.error(f"RAG query failed: {e}", extra={"query": query})
    raise RAGQueryError(f"Failed to query collection {collection}") from e
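The fragment above relies on surrounding context (`milvus`, `query`, `collection`). The following self-contained sketch shows the same wrap-and-reraise pattern end to end; `fake_search` is a hypothetical stand-in for the vector-store call and always fails so the error path is exercised.

```python
import asyncio
import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


class RAGQueryError(Exception):
    """Raised when RAG query fails."""


async def fake_search(query: str) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for the vector-store call; always fails."""
    raise ConnectionError("vector store unreachable")


async def query_rag(query: str, collection: str = "knowledge_base") -> List[Dict[str, Any]]:
    try:
        return await fake_search(query)
    except Exception as e:
        # `extra` attaches the query to the log record for structured logging.
        logger.error("RAG query failed: %s", e, extra={"query": query})
        # `from e` keeps the original exception available as __cause__.
        raise RAGQueryError(f"Failed to query collection {collection}") from e


def run_example() -> RAGQueryError:
    """Run the failing query and return the wrapped exception."""
    try:
        asyncio.run(query_rag("what is NATS?"))
    except RAGQueryError as err:
        return err
    raise AssertionError("expected RAGQueryError")
```

Callers only need to catch `RAGQueryError`; the underlying cause stays reachable through `err.__cause__` for debugging.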
NATS Message Handling
import logging

import msgpack
from nats.aio.msg import Msg

logger = logging.getLogger(__name__)

async def message_handler(msg: Msg) -> None:
    try:
        # Decode MessagePack
        data = msgpack.unpackb(msg.data, raw=False)
        # Process
        result = await process(data)
        # Reply if request-reply pattern
        if msg.reply:
            await msg.respond(msgpack.packb(result))
        # Acknowledge for JetStream
        await msg.ack()
    except Exception as e:
        logger.error(f"Handler error: {e}")
        # NAK for retry (JetStream)
        await msg.nak()
Kubernetes Manifest Conventions
Labels
metadata:
  labels:
    # Required
    app.kubernetes.io/name: chat-handler
    app.kubernetes.io/instance: chat-handler
    app.kubernetes.io/component: handler
    app.kubernetes.io/part-of: ai-platform
    # Optional
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/managed-by: flux
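When manifests are generated or linted from Python, the required label set can be produced in one place. A small sketch under the assumption that `instance` defaults to the app name, as in the example above; `standard_labels` is a hypothetical helper:

```python
from typing import Dict, Optional


def standard_labels(
    name: str,
    component: str,
    part_of: str = "ai-platform",
    instance: Optional[str] = None,
    version: Optional[str] = None,
) -> Dict[str, str]:
    """Return the required app.kubernetes.io labels, plus version if given."""
    labels = {
        "app.kubernetes.io/name": name,
        "app.kubernetes.io/instance": instance or name,
        "app.kubernetes.io/component": component,
        "app.kubernetes.io/part-of": part_of,
    }
    if version is not None:
        labels["app.kubernetes.io/version"] = version
    return labels
```

`standard_labels("chat-handler", "handler", version="1.0.0")` reproduces the label block shown above, minus `managed-by`.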
Annotations
metadata:
  annotations:
    # Reloader for config changes
    reloader.stakater.com/auto: "true"
    # Documentation
    description: "Handles chat messages via NATS"
Resource Requests
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

# GPU workloads
resources:
  limits:
    amd.com/gpu: 1 # AMD
    nvidia.com/gpu: 1 # NVIDIA
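A quick sanity check for CPU and memory settings is that requests never exceed limits. The sketch below parses only the quantity forms used above (millicores and binary suffixes such as `Mi`); it is not a full Kubernetes quantity parser, and the function names are hypothetical.

```python
from typing import Dict

# Binary suffixes only; decimal suffixes (k, M, G) are intentionally not handled.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(value: str) -> float:
    """Convert a CPU quantity such as '100m' or '2' to cores."""
    if value.endswith("m"):
        return int(value[:-1]) / 1000
    return float(value)


def parse_memory(value: str) -> int:
    """Convert a memory quantity such as '256Mi' to bytes."""
    for suffix, factor in _BINARY.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)


def requests_within_limits(resources: Dict[str, Dict[str, str]]) -> bool:
    """True if CPU and memory requests do not exceed their limits."""
    req, lim = resources["requests"], resources["limits"]
    return (
        parse_cpu(req["cpu"]) <= parse_cpu(lim["cpu"])
        and parse_memory(req["memory"]) <= parse_memory(lim["memory"])
    )
```

Applied to the example values (100m/256Mi requested, 500m/512Mi limit) the check passes.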
Health Checks
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
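On the application side, the two probes need matching endpoints. A minimal standard-library sketch, assuming liveness always succeeds while the process is up and readiness depends on a flag the handler flips once its dependencies are connected; `READY`, `probe_status`, and `serve` are hypothetical names, and a real handler would likely run this in a background thread next to its asyncio loop.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Flipped to True once NATS and other dependencies are connected (assumption).
READY = {"value": False}


def probe_status(path: str, ready: bool) -> int:
    """HTTP status for a probe path."""
    if path == "/health":
        return 200  # liveness: the process is running
    if path == "/ready":
        return 200 if ready else 503  # readiness: dependencies are available
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(probe_status(self.path, READY["value"]))
        self.end_headers()

    def log_message(self, *args) -> None:
        pass  # keep probe traffic out of the logs


def serve(port: int = 8080) -> None:
    """Blocking call; serves /health and /ready on the probe port."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Keeping the status logic in `probe_status` makes it testable without opening a socket.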
Flux/GitOps Conventions
Kustomization Structure
# ks.yaml - Flux Kustomization
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: &app chat-handler
  namespace: flux-system
spec:
  targetNamespace: ai-ml
  commonMetadata:
    labels:
      app.kubernetes.io/name: *app
  path: ./kubernetes/apps/ai-ml/chat-handler/app
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  wait: true
  interval: 30m
  retryInterval: 1m
  timeout: 5m
HelmRelease Structure
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: milvus
spec:
  interval: 30m
  chart:
    spec:
      chart: milvus
      version: 4.x.x
      sourceRef:
        kind: HelmRepository
        name: milvus
        namespace: flux-system
  values:
    # Values here
Secret References
# Never hardcode secrets
env:
  - name: DATABASE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password
NATS Subject Conventions
Hierarchy
ai.{domain}.{scope}.{action}
Examples:
ai.chat.user.{userId}.message # User chat message
ai.chat.response.{requestId} # Chat response
ai.voice.user.{userId}.request # Voice request
ai.pipeline.trigger # Pipeline trigger
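Subjects built from runtime values such as a user ID should be assembled token by token, since a stray dot or wildcard in a token would change the subject's meaning. A small sketch; `build_subject` is a hypothetical helper, not part of any NATS client:

```python
def build_subject(*tokens: str) -> str:
    """Join tokens into a NATS subject, rejecting characters that alter routing."""
    for token in tokens:
        # '.', '*' and '>' are structural in NATS subjects; whitespace is not allowed.
        if not token or any(ch in token for ch in ".*> \t\r\n"):
            raise ValueError(f"invalid subject token: {token!r}")
    return ".".join(tokens)
```

`build_subject("ai", "chat", "user", "u1", "message")` gives `ai.chat.user.u1.message`; a user ID containing a dot is rejected instead of silently adding a subject level.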
Wildcards
ai.chat.> # All chat events
ai.chat.user.*.message # All user messages
ai.*.response.{id} # Any response type
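The wildcard semantics can be made concrete with a small matcher: `*` stands for exactly one token and `>` for one or more trailing tokens. This is an illustrative sketch for reasoning about subscriptions and for unit tests; the NATS server performs the real matching, and `subject_matches` is a hypothetical name.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style match: '*' is one token, '>' is one or more trailing tokens."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last pattern token and needs at least one token to consume.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)
```

With this definition `ai.chat.>` matches `ai.chat.user.u1.message` but not the bare `ai.chat`, and `ai.chat.user.*.message` matches only subjects with exactly one token in the user ID position.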
Git Conventions
Commit Messages
type(scope): subject
body (optional)
footer (optional)
Types:
- feat: New feature
- fix: Bug fix
- docs: Documentation
- style: Formatting
- refactor: Code restructuring
- test: Tests
- chore: Maintenance
Examples:
feat(chat-handler): add streaming response support
fix(voice): handle empty audio gracefully
docs(adr): add decision for MessagePack format
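The header format lends itself to a commit-msg hook. The sketch below parses the first line of a message; treating the scope as optional is an assumption (the convention only shows the scoped form), and `parse_header` is a hypothetical name.

```python
import re
from typing import Optional, Tuple

TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# type, optional (scope), then ': ' and a non-empty subject.
_HEADER = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^()\s]+)\))?: (?P<subject>\S.*)$"
)


def parse_header(line: str) -> Optional[Tuple[str, Optional[str], str]]:
    """Return (type, scope, subject), or None if the header is not conventional."""
    m = _HEADER.match(line)
    if m is None or m.group("type") not in TYPES:
        return None
    return m.group("type"), m.group("scope"), m.group("subject")
```

A hook would read the first line of the commit message file and reject the commit when `parse_header` returns `None`.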
Branch Naming
feature/short-description
fix/issue-number-description
docs/what-changed
Configuration Conventions
Environment Variables
# Use pydantic-settings or similar
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    nats_url: str = "nats://localhost:4222"
    vllm_url: str = "http://localhost:8000"
    milvus_host: str = "localhost"
    milvus_port: int = 19530
    log_level: str = "INFO"

    class Config:
        env_prefix = ""  # No prefix
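Where pulling in pydantic-settings is not desirable, the "or similar" option can be covered with the standard library. A minimal sketch, assuming the same fields and the same no-prefix mapping (`NATS_URL` to `nats_url`); `load_settings` is a hypothetical helper and handles only `str` and `int` fields.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    nats_url: str = "nats://localhost:4222"
    vllm_url: str = "http://localhost:8000"
    milvus_host: str = "localhost"
    milvus_port: int = 19530
    log_level: str = "INFO"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read each field from the upper-cased variable of the same name."""
    overrides = {}
    for f in fields(Settings):
        raw = env.get(f.name.upper())
        if raw is not None:
            # Cast using the type of the default value (str or int here).
            overrides[f.name] = type(f.default)(raw)
    return Settings(**overrides)
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test; unset variables fall back to the defaults.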
ConfigMaps
apiVersion: v1
kind: ConfigMap
metadata:
  name: ai-services-config
data:
  NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
  VLLM_URL: "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
  # ... other non-sensitive config
Documentation Conventions
ADR Format
See decisions/0000-template.md
Code Comments
# Use docstrings for public functions
async def query_rag(query: str) -> List[Dict]:
    """
    Query the RAG system for relevant documents.

    Args:
        query: The search query string

    Returns:
        List of document chunks with scores

    Raises:
        RAGQueryError: If the query fails
    """
    ...
README Files
Each application should have a README with:
- Purpose
- Configuration
- Deployment
- Local development
- API documentation (if applicable)
Anti-Patterns to Avoid
| Don't | Do Instead |
|---|---|
| kubectl apply directly | Commit to Git, let Flux deploy |
| Hardcode secrets | Use External Secrets Operator |
| Use latest image tags | Pin to specific versions |
| Skip health checks | Always define liveness/readiness |
| Ignore resource limits | Set appropriate requests/limits |
| Use JSON for NATS messages | Use MessagePack (binary) |
| Synchronous I/O in handlers | Use async/await |
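Some of these rules can be enforced automatically. As one example, the image-tag rule can be checked with a few lines of string handling. This sketch treats an image as pinned when it carries a digest or an explicit tag other than `latest`; `is_pinned` and the image names in the usage note are hypothetical.

```python
def is_pinned(image: str) -> bool:
    """True if the image has a digest or an explicit tag other than 'latest'."""
    if "@sha256:" in image:
        return True
    # Only look for the tag after the last '/', so a registry port such as
    # 'registry:5000/app' is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means the runtime defaults to 'latest'
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")
```

For instance, `ghcr.io/example/chat-handler:1.0.0` passes, while `ghcr.io/example/chat-handler:latest` and an untagged `registry:5000/app` do not.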
Related Documents
- TECH-STACK.md - Technologies used
- ARCHITECTURE.md - System design
- decisions/ - Why we made certain choices