feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories and kubectl cluster-info dump data.
424
CODING-CONVENTIONS.md
Normal file
@@ -0,0 +1,424 @@

# 📐 Coding Conventions

> **Patterns, practices, and folder structure conventions for DaviesTechLabs repositories**

## Repository Conventions

### homelab-k8s2 (Infrastructure)

```
kubernetes/
├── apps/                    # Application deployments
│   └── {namespace}/         # One folder per namespace
│       └── {app}/           # One folder per application
│           ├── app/         # Kubernetes manifests
│           │   ├── kustomization.yaml
│           │   ├── helmrelease.yaml  # OR individual manifests
│           │   └── ...
│           └── ks.yaml      # Flux Kustomization
├── components/              # Reusable Kustomize components
└── flux/                    # Flux system configuration
```

**Naming Conventions:**

- Namespaces: lowercase with hyphens (`ai-ml`, `cert-manager`)
- Apps: lowercase with hyphens (`chat-handler`, `voice-assistant`)
- Secrets: `{app}-{type}` (e.g., `milvus-credentials`)
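
The naming rules above lend themselves to a mechanical check. The following is a minimal sketch, not tooling that exists in either repository: `is_valid_name` and `secret_name` are hypothetical helpers, and the 63-character cap is an assumption borrowed from the DNS label limit that Kubernetes applies to namespace names.

```python
import re

# Lowercase alphanumeric words joined by single hyphens, e.g. "ai-ml".
_NAME_RE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")

# Assumption: apply the DNS label length limit to every name.
MAX_NAME_LENGTH = 63


def is_valid_name(name: str) -> bool:
    """Return True if `name` is lowercase-with-hyphens and short enough."""
    return len(name) <= MAX_NAME_LENGTH and _NAME_RE.fullmatch(name) is not None


def secret_name(app: str, kind: str) -> str:
    """Build a secret name following the `{app}-{type}` convention."""
    name = f"{app}-{kind}"
    if not is_valid_name(name):
        raise ValueError(f"invalid secret name: {name!r}")
    return name
```

A check like this could run in CI or a pre-commit hook before manifests reach Flux.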

### llm-workflows (Orchestration)

```
workflows/                   # Kubernetes Deployments for NATS handlers
├── {handler}.yaml           # One file per handler

argo/                        # Argo WorkflowTemplates
├── {workflow-name}.yaml     # One file per workflow

pipelines/                   # Kubeflow Pipeline Python files
├── {pipeline}_pipeline.py   # Pipeline definition
└── kfp-sync-job.yaml        # Upload job

{handler}/                   # Python source code
├── __init__.py
├── {handler}.py             # Main entry point
├── requirements.txt
└── Dockerfile
```

---

## Python Conventions

### Project Structure

```python
from dataclasses import dataclass

import msgpack
from nats.aio.msg import Msg

# Use async/await for I/O
async def handle_message(msg: Msg) -> None:
    ...

# Use dataclasses for structured data
@dataclass
class ChatRequest:
    user_id: str
    message: str
    enable_rag: bool = True

# Use msgpack for NATS messages
data = msgpack.packb({"key": "value"})
```

### Naming

| Element | Convention | Example |
|---------|------------|---------|
| Files | snake_case | `chat_handler.py` |
| Classes | PascalCase | `ChatHandler` |
| Functions | snake_case | `process_message` |
| Constants | UPPER_SNAKE | `NATS_URL` |
| Private | Leading underscore | `_internal_method` |

### Type Hints

```python
# Always use type hints
from typing import Optional, List, Dict, Any

async def query_rag(
    query: str,
    collection: str = "knowledge_base",
    top_k: int = 5,
) -> List[Dict[str, Any]]:
    ...
```

### Error Handling

```python
# Use specific exceptions
class RAGQueryError(Exception):
    """Raised when RAG query fails."""
    pass

# Log errors with context
import logging
logger = logging.getLogger(__name__)

try:
    result = await milvus.search(...)
except Exception as e:
    logger.error(f"RAG query failed: {e}", extra={"query": query})
    raise RAGQueryError(f"Failed to query collection {collection}") from e
```
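
The fragment above is not runnable on its own, so here is a self-contained sketch of the same pattern. `fake_search` is a hypothetical stand-in for the vector store call and always fails, and `run_demo` exists only to exercise the error path; neither comes from the repositories.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class RAGQueryError(Exception):
    """Raised when a RAG query fails."""


async def fake_search(query: str) -> list[dict]:
    """Hypothetical stand-in for a vector store call; always fails."""
    raise ConnectionError("vector store unreachable")


async def query_rag(query: str, collection: str = "knowledge_base") -> list[dict]:
    try:
        return await fake_search(query)
    except Exception as e:
        # Log with context, then re-raise as the domain-specific exception.
        # `from e` keeps the original error reachable via __cause__.
        logger.error("RAG query failed: %s", e, extra={"query": query})
        raise RAGQueryError(f"Failed to query collection {collection}") from e


def run_demo() -> RAGQueryError:
    """Run one failing query and return the exception that was raised."""
    try:
        asyncio.run(query_rag("what is NATS?"))
    except RAGQueryError as err:
        return err
    raise AssertionError("expected RAGQueryError")
```

Callers can then catch `RAGQueryError` alone while the underlying cause stays available for debugging.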

### NATS Message Handling

```python
import logging

import msgpack
from nats.aio.msg import Msg

logger = logging.getLogger(__name__)

async def message_handler(msg: Msg) -> None:
    try:
        # Decode MessagePack
        data = msgpack.unpackb(msg.data, raw=False)

        # Process (process() is the handler-specific coroutine)
        result = await process(data)

        # Reply if request-reply pattern
        if msg.reply:
            await msg.respond(msgpack.packb(result))

        # Acknowledge for JetStream (ack/nak apply only to JetStream deliveries)
        await msg.ack()

    except Exception as e:
        logger.error(f"Handler error: {e}")
        # NAK for retry (JetStream)
        await msg.nak()
```

---

## Kubernetes Manifest Conventions

### Labels

```yaml
metadata:
  labels:
    # Required
    app.kubernetes.io/name: chat-handler
    app.kubernetes.io/instance: chat-handler
    app.kubernetes.io/component: handler
    app.kubernetes.io/part-of: ai-platform

    # Optional
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/managed-by: flux
```
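
A linter could enforce the required labels listed above. The sketch below assumes the manifest has already been parsed into a plain dictionary (for example by a YAML loader); `missing_labels` is a hypothetical helper name, not an existing tool.

```python
# The four labels marked "Required" in the convention above.
REQUIRED_LABELS = (
    "app.kubernetes.io/name",
    "app.kubernetes.io/instance",
    "app.kubernetes.io/component",
    "app.kubernetes.io/part-of",
)


def missing_labels(manifest: dict) -> list[str]:
    """Return the required label keys absent from manifest['metadata']['labels']."""
    labels = (manifest.get("metadata") or {}).get("labels") or {}
    return [key for key in REQUIRED_LABELS if key not in labels]
```

An empty result means the manifest carries every required label; optional labels are ignored.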

### Annotations

```yaml
metadata:
  annotations:
    # Reloader for config changes
    reloader.stakater.com/auto: "true"

    # Documentation
    description: "Handles chat messages via NATS"
```

### Resource Requests

```yaml
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

# GPU workloads
resources:
  limits:
    amd.com/gpu: 1      # AMD
    nvidia.com/gpu: 1   # NVIDIA
```

### Health Checks

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```

---

## Flux/GitOps Conventions

### Kustomization Structure

```yaml
# ks.yaml - Flux Kustomization
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: &app chat-handler
  namespace: flux-system
spec:
  targetNamespace: ai-ml
  commonMetadata:
    labels:
      app.kubernetes.io/name: *app
  path: ./kubernetes/apps/ai-ml/chat-handler/app
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  wait: true
  interval: 30m
  retryInterval: 1m
  timeout: 5m
```

### HelmRelease Structure

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: milvus
spec:
  interval: 30m
  chart:
    spec:
      chart: milvus
      version: 4.x.x
      sourceRef:
        kind: HelmRepository
        name: milvus
        namespace: flux-system
  values:
    # Values here
```

### Secret References

```yaml
# Never hardcode secrets
env:
  - name: DATABASE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password
```

---

## NATS Subject Conventions

### Hierarchy

```
ai.{domain}.{scope}.{action}

Examples:
ai.chat.user.{userId}.message    # User chat message
ai.chat.response.{requestId}     # Chat response
ai.voice.user.{userId}.request   # Voice request
ai.pipeline.trigger              # Pipeline trigger
```
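
Building subjects through one helper keeps the hierarchy consistent and stops IDs from injecting extra tokens or wildcards. This is a minimal sketch: `subject`, `user_chat_message`, and `chat_response` are hypothetical names, and the allowed token alphabet is an assumption that is stricter than what NATS itself accepts.

```python
import re

# Assumption: tokens are limited to letters, digits, "_" and "-".
# This rules out ".", "*", ">" and whitespace, which have special
# meaning (or are invalid) in NATS subjects.
_TOKEN_RE = re.compile(r"[A-Za-z0-9_-]+")


def subject(*tokens: str) -> str:
    """Join tokens under the `ai.` root, rejecting unsafe tokens."""
    if not tokens:
        raise ValueError("at least one token is required")
    for token in tokens:
        if not _TOKEN_RE.fullmatch(token):
            raise ValueError(f"invalid subject token: {token!r}")
    return ".".join(("ai", *tokens))


def user_chat_message(user_id: str) -> str:
    """Subject for a chat message sent by one user."""
    return subject("chat", "user", user_id, "message")


def chat_response(request_id: str) -> str:
    """Subject for the response to one chat request."""
    return subject("chat", "response", request_id)
```

For instance, `user_chat_message("u123")` yields `ai.chat.user.u123.message`, while a user ID containing a dot raises `ValueError` instead of silently changing the subject depth.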

### Wildcards

```
ai.chat.>                  # All chat events
ai.chat.user.*.message     # All user messages
ai.*.response.{id}         # Any response type
```
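
In NATS, `*` matches exactly one token and `>` matches one or more trailing tokens. The sketch below reimplements that matching in plain Python so subscription patterns can be unit-tested without a server; `subject_matches` is a hypothetical helper, not part of the NATS client library.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if `subject` matches the NATS-style `pattern`."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # ">" is only valid as the last token and needs at least
            # one subject token at or after its position.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False  # subject is shorter than the pattern
        if p != "*" and p != s_tokens[i]:
            return False  # literal token mismatch
    # Without ">", pattern and subject must have the same token count.
    return len(p_tokens) == len(s_tokens)
```

Note that `ai.chat.>` does not match the bare subject `ai.chat`, because `>` requires at least one further token.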

---

## Git Conventions

### Commit Messages

```
type(scope): subject

body (optional)

footer (optional)
```

**Types:**

- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation
- `style`: Formatting
- `refactor`: Code restructuring
- `test`: Tests
- `chore`: Maintenance

**Examples:**

```
feat(chat-handler): add streaming response support
fix(voice): handle empty audio gracefully
docs(adr): add decision for MessagePack format
```
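
A commit-msg hook could check the subject line against this format. The following is a sketch only: `parse_commit_subject` is a hypothetical helper, and treating the scope as optional and restricted to lowercase letters, digits, and hyphens is an assumption, not something the convention states.

```python
import re
from typing import Optional

# The commit types listed above.
COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

_SUBJECT_RE = re.compile(
    r"(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(?:\((?P<scope>[a-z0-9-]+)\))?"  # assumption: scope is optional
    r": (?P<subject>\S.*)"
)


def parse_commit_subject(line: str) -> Optional[dict]:
    """Split a subject line into type, scope, and subject, or return None."""
    match = _SUBJECT_RE.fullmatch(line)
    return match.groupdict() if match else None
```

A return value of `None` signals a subject line that does not follow the convention; when the scope is omitted, the `scope` entry is `None`.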

### Branch Naming

```
feature/short-description
fix/issue-number-description
docs/what-changed
```
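
Branch names can be validated the same way. This sketch assumes the part after the prefix is a lowercase, hyphen-separated slug, which the patterns above suggest but do not state; `is_valid_branch` is a hypothetical helper.

```python
import re

# Prefixes taken from the patterns above; the slug rule is an assumption.
_BRANCH_RE = re.compile(r"(feature|fix|docs)/[a-z0-9]+(-[a-z0-9]+)*")


def is_valid_branch(name: str) -> bool:
    """Return True if `name` looks like `<prefix>/<lowercase-hyphen-slug>`."""
    return _BRANCH_RE.fullmatch(name) is not None
```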

---

## Configuration Conventions

### Environment Variables

```python
# Use pydantic-settings or similar
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # No prefix: nats_url is read from the NATS_URL environment variable
    model_config = SettingsConfigDict(env_prefix="")

    nats_url: str = "nats://localhost:4222"
    vllm_url: str = "http://localhost:8000"
    milvus_host: str = "localhost"
    milvus_port: int = 19530
    log_level: str = "INFO"
```
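
To make the field-to-variable mapping concrete, here is a standard-library sketch of the same idea. It is an illustration of the "or similar" option, not how pydantic-settings works internally: `EnvSettings` and `load_settings` are hypothetical names, only upper-case variable names are read, and values are coerced using the type of each field's default.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class EnvSettings:
    nats_url: str = "nats://localhost:4222"
    milvus_port: int = 19530
    log_level: str = "INFO"


def load_settings(environ: Optional[Mapping[str, str]] = None) -> EnvSettings:
    """Build settings from environment variables, falling back to defaults."""
    env = os.environ if environ is None else environ
    overrides = {}
    for field in fields(EnvSettings):
        raw = env.get(field.name.upper())  # nats_url -> NATS_URL
        if raw is not None:
            # Coerce the string using the type of the field's default value.
            overrides[field.name] = type(field.default)(raw)
    return EnvSettings(**overrides)
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to test.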

### ConfigMaps

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ai-services-config
data:
  NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
  VLLM_URL: "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
  # ... other non-sensitive config
```

---

## Documentation Conventions

### ADR Format

See [decisions/0000-template.md](decisions/0000-template.md)

### Code Comments

```python
from typing import Dict, List

# Use docstrings for public functions
async def query_rag(query: str) -> List[Dict]:
    """
    Query the RAG system for relevant documents.

    Args:
        query: The search query string

    Returns:
        List of document chunks with scores

    Raises:
        RAGQueryError: If the query fails
    """
    ...
```

### README Files

Each application should have a README with:

1. Purpose
2. Configuration
3. Deployment
4. Local development
5. API documentation (if applicable)

---

## Anti-Patterns to Avoid

| Don't | Do Instead |
|-------|------------|
| `kubectl apply` directly | Commit to Git, let Flux deploy |
| Hardcode secrets | Use External Secrets Operator |
| Use `latest` image tags | Pin to specific versions |
| Skip health checks | Always define liveness/readiness |
| Ignore resource limits | Set appropriate requests/limits |
| Use JSON for NATS messages | Use MessagePack (binary) |
| Synchronous I/O in handlers | Use async/await |
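
Some of these rules can be checked automatically. As one example, the sketch below flags image references that are not pinned; `is_pinned` is a hypothetical helper, and counting a digest reference as pinned is an assumption that goes slightly beyond the table.

```python
def is_pinned(image: str) -> bool:
    """Return True if the image reference uses a digest or a non-latest tag."""
    if "@" in image:
        return True  # digest reference, e.g. app@sha256:...
    # Look for the tag only in the last path segment, so that a registry
    # port such as "registry:5000/app" is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means an implicit "latest"
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")
```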

---

## Related Documents

- [TECH-STACK.md](TECH-STACK.md) - Technologies used
- [ARCHITECTURE.md](ARCHITECTURE.md) - System design
- [decisions/](decisions/) - Why we made certain choices