- Add FastAPI ingress to TTSDeployment with two routes (sketched below):
  - POST / — JSON API with base64-encoded audio (backward compatible)
  - GET /api/tts?text=&language_id= — raw WAV bytes (no base64 overhead)
- Add GET /speakers endpoint for speaker listing
- Use _fastapi naming to avoid a collision with the Ray Serve binding
- Expose app = TTSDeployment.bind() for rayservice.yaml compatibility
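A minimal sketch of the ingress shape, assuming a hypothetical _synthesize helper that returns WAV bytes (model loading and the speaker list are elided):

```python
import base64

from fastapi import FastAPI, Response
from ray import serve

# Named _fastapi so it doesn't collide with the Ray Serve binding `app` below.
_fastapi = FastAPI()

@serve.deployment
@serve.ingress(_fastapi)
class TTSDeployment:
    def _synthesize(self, text: str, language_id: str) -> bytes:
        """Hypothetical helper: run TTS and return WAV bytes."""
        raise NotImplementedError

    @_fastapi.post("/")
    async def tts_json(self, payload: dict) -> dict:
        # Backward-compatible JSON API: audio travels as base64.
        wav = self._synthesize(payload["text"], payload.get("language_id", "en"))
        return {"audio": base64.b64encode(wav).decode()}

    @_fastapi.get("/api/tts")
    async def tts_raw(self, text: str, language_id: str = "en") -> Response:
        # Raw WAV bytes: skips the base64 encode/decode round trip.
        return Response(content=self._synthesize(text, language_id),
                        media_type="audio/wav")

    @_fastapi.get("/speakers")
    async def speakers(self) -> list:
        # Hypothetical: return available speaker names.
        return []

# Module-level binding that rayservice.yaml imports (import_path: module:app).
app = TTSDeployment.bind()
```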
- Add Llama 3 stop token IDs (128001, 128009) to SamplingParams
  as a safety net for the V1 engine max_tokens bug on ROCm/gfx1151
- Clamp max_tokens to min(requested, max_model_len)
- Support a DEFAULT_MAX_TOKENS env var (default 256)
- Prevents runaway generation when the V1 engine ignores max_tokens
  (see the sketch below)
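A sketch of the resulting sampling setup; the helper name and the way max_model_len is obtained are illustrative, while the token IDs and env var come from the change itself:

```python
import os

from vllm import SamplingParams

# Llama 3 end-of-text / end-of-turn token IDs, passed explicitly as a
# safety net for the V1 engine max_tokens bug on ROCm/gfx1151.
LLAMA3_STOP_TOKEN_IDS = [128001, 128009]

DEFAULT_MAX_TOKENS = int(os.environ.get("DEFAULT_MAX_TOKENS", "256"))

def build_sampling_params(requested: int | None, max_model_len: int) -> SamplingParams:
    # Clamp so a huge (or ignored) max_tokens can't produce runaway generation.
    max_tokens = min(requested or DEFAULT_MAX_TOKENS, max_model_len)
    return SamplingParams(
        max_tokens=max_tokens,
        stop_token_ids=LLAMA3_STOP_TOKEN_IDS,
    )
```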
The previous code unconditionally set VLLM_USE_TRITON_AWQ=0, overriding
the value from the RayService runtime_env env_vars. On gfx1151:
- Triton AWQ kernels work (VLLM_USE_TRITON_AWQ=1)
- the C++ awq_dequantize op does NOT exist (VLLM_USE_TRITON_AWQ=0 → crash)
Changed to os.environ.setdefault('VLLM_USE_TRITON_AWQ', '1') so the
operator-configured value is preserved, defaulting to Triton AWQ.
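For reference, the resulting one-line guard:

```python
import os

# Keep any value set via the RayService runtime_env env_vars; only fall
# back to Triton AWQ ('1') when nothing is configured.
os.environ.setdefault("VLLM_USE_TRITON_AWQ", "1")
```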
A torch._dynamo.exc.Unsupported error crashes EngineCore during graph
tracing of LlamaDecoderLayer on gfx1151. Setting ENFORCE_EAGER=true
bypasses torch.compile and CUDA graph capture entirely.
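A sketch of how the flag might be plumbed into vLLM (the env-var parsing and model name are illustrative; enforce_eager is a real vLLM engine argument):

```python
import os

from vllm import LLM

# ENFORCE_EAGER=true skips torch.compile and CUDA graph capture, avoiding
# the torch._dynamo.exc.Unsupported crash while tracing LlamaDecoderLayer.
enforce_eager = os.environ.get("ENFORCE_EAGER", "false").lower() == "true"

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model
          enforce_eager=enforce_eager)
```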
The strixhalo LLM worker uses py_executable, which bypasses the pip runtime_env.
Module-level try/except still fails because cloudpickle on the head node
resolves the real InferenceLogger class and serializes a module reference.
Moving the import inside __init__ means it runs at actor construction time
on the worker, where the ImportError is caught gracefully.
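A sketch of the placement (the deployment class and module path are hypothetical):

```python
from ray import serve

@serve.deployment
class LLMDeployment:
    def __init__(self):
        # Deferred import: this runs at actor construction time on the
        # worker, not at module import time on the head node, so cloudpickle
        # never resolves the real InferenceLogger class into the payload.
        try:
            from ray_serve_apps.inference_logger import InferenceLogger  # hypothetical path
            self._logger = InferenceLogger()
        except ImportError:
            # The py_executable venv lacks the updated package; run without logging.
            self._logger = None
```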
The strixhalo LLM worker uses py_executable pointing at the Docker
image venv, which doesn't have the updated ray-serve-apps package.
Wrap all InferenceLogger imports in try/except and guard usage with
None checks so apps degrade gracefully without MLflow logging.
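Continuing the sketch above, every call site guards on None so inference proceeds without MLflow logging:

```python
    def generate(self, prompt: str) -> str:
        output = self._run_engine(prompt)  # hypothetical inference call
        # Logging is optional; inference is not.
        if self._logger is not None:
            self._logger.log(prompt=prompt, output=output)  # hypothetical API
        return output
```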
Implements ADR-0024: Ray Repository Structure
- Ray Serve deployments for GPU-shared AI inference
- Published as a PyPI package for dynamic code loading (sketched below)
- Deployments: LLM, embeddings, reranker, whisper, TTS
- CI/CD workflow publishes to Gitea PyPI on push to main
Extracted from the kuberay-images repo per ADR-0024.
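A sketch of the dynamic-loading side, assuming the cluster's pip is configured to resolve the package from the Gitea PyPI index (the version pin is a placeholder):

```python
import ray

# Workers install the published package at runtime, so deployments can be
# updated without rebuilding container images.
ray.init(runtime_env={"pip": ["ray-serve-apps==0.1.0"]})
```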