ray-serve

Author	SHA1	Message	Date
Billy D.	79dbaa6d2c	fix: add stop_token_ids and clamp max_tokens [CI: build-and-publish (push) successful in 12s] Add Llama 3 stop token IDs (128001, 128009) to SamplingParams as a safety net for the V1 engine max_tokens bug on ROCm/gfx1151; clamp max_tokens to min(requested, max_model_len); support the DEFAULT_MAX_TOKENS env var (default 256). Prevents runaway generation when the V1 engine ignores max_tokens.	2026-02-13 09:19:20 -05:00
Billy D.	96f7650b23	fix: respect VLLM_USE_TRITON_AWQ from runtime_env instead of hardcoding 0 [CI: build-and-publish (push) successful in 14s] The previous code unconditionally set VLLM_USE_TRITON_AWQ=0, overriding the value from the RayService runtime_env env_vars. On gfx1151, Triton AWQ kernels work (TRITON_AWQ=1), but the C++ awq_dequantize op does NOT exist (TRITON_AWQ=0 → crash). Changed to os.environ.setdefault('VLLM_USE_TRITON_AWQ', '1') so the operator-configured value is preserved, defaulting to Triton AWQ.	2026-02-13 07:29:57 -05:00
Billy D.	f66de251eb	fix: add ENFORCE_EAGER env var to skip torch.compile on ROCm [CI: build-and-publish (push) successful in 12s] torch._dynamo.exc.Unsupported crashes EngineCore during graph tracing of LlamaDecoderLayer on gfx1151. ENFORCE_EAGER=true bypasses torch.compile and CUDA graph capture entirely.	2026-02-13 06:56:29 -05:00
Billy D.	6a391147a6	minor: refactoring big changes. [CI: build-and-publish (push) successful in 12s]	2026-02-12 18:47:50 -05:00
Billy D.	297b0d8ebd	fix: move mlflow import inside __init__ to avoid cloudpickle serialization failure [CI: build-and-publish (push) successful in 16s] The strixhalo LLM worker uses py_executable, which bypasses the pip runtime_env. A module-level try/except still fails because cloudpickle on the head node resolves the real InferenceLogger class and serializes a module reference. Moving the import inside __init__ means it runs at actor construction time on the worker, where ImportError is caught gracefully.	2026-02-12 07:06:49 -05:00
Billy D.	15e4b8afa3	fix: make mlflow_logger import optional with no-op fallback [CI: build-and-publish (push) successful in 11s] The strixhalo LLM worker uses py_executable pointing to the Docker image venv, which doesn't have the updated ray-serve-apps package. Wrap all InferenceLogger imports in try/except and guard usage with None checks so apps degrade gracefully without MLflow logging.	2026-02-12 07:01:17 -05:00
Billy D.	7ec2107e0c	feat: add MLflow inference logging to all Ray Serve apps [CI: build-and-publish (push) successful in 16s] Add mlflow_logger.py: lightweight REST-based MLflow logger (no mlflow dep); instrument serve_llm.py with latency, token counts, tokens/sec metrics; instrument serve_embeddings.py with latency, batch_size, total_tokens; instrument serve_whisper.py with latency, audio_duration, realtime_factor; instrument serve_tts.py with latency, audio_duration, text_chars; instrument serve_reranker.py with latency, num_pairs, top_k.	2026-02-12 06:14:30 -05:00
Billy D.	2edafc33c0	async vllm is better. [CI: build-and-publish (push) successful in 1m3s]	2026-02-11 06:05:50 -05:00
Billy D.	c9d7a2b5b7	fixing coqui [CI: build-and-publish (push) failed after 20s]	2026-02-09 09:14:30 -05:00
Billy D.	4549295a07	trigger: test package upload after gitea temp fix [CI: build-and-publish (push) successful in 5m16s]	2026-02-03 20:12:30 -05:00
Billy D.	665416bb0e	chore: trigger build with repo secrets [CI: build-and-publish (push) failed after 50s]	2026-02-03 19:33:45 -05:00
Billy D.	e853b805ae	chore: trigger pipeline with org-level runner [CI: build-and-publish (push) failed after 2m50s]	2026-02-03 19:22:34 -05:00
Billy D.	9bc40cfd20	chore: trigger rebuild after gitea storage migration [CI: build-and-publish (push) failed after 46s]	2026-02-03 16:07:27 -05:00
Billy D.	4a560f9b9e	chore: retrigger pipeline after runner restart [CI: build-and-publish (push) failed after 12m51s]	2026-02-03 15:49:43 -05:00
Billy D.	baf86e5609	ci: semver based on commit message keywords [CI: build-and-publish (push) failed after 14m11s] Rules: 'major' in message -> increment major, reset minor/patch; 'minor' or 'feature' -> increment minor, reset patch; 'bug', 'chore', anything else -> increment patch. Release number from git rev-list commit count. Format: major.minor.patch+release.	2026-02-03 15:25:15 -05:00
Billy D.	3fb6d8f9c2	chore: trigger rebuild after S3 storage migration	2026-02-03 15:12:54 -05:00
Billy D.	8ef914ec12	feat: initial ray-serve-apps PyPI package [CI: lint (push) failed after 11m2s; publish (push) cancelled] Implements ADR-0024: Ray Repository Structure. Ray Serve deployments for GPU-shared AI inference; published as PyPI package for dynamic code loading; deployments: LLM, embeddings, reranker, whisper, TTS; CI/CD workflow publishes to Gitea PyPI on push to main. Extracted from kuberay-images repo per ADR-0024.	2026-02-03 07:03:39 -05:00
Billy Davies	eac8f27f2e	Initial commit	2026-02-03 11:59:56 +00:00
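
The versioning rule described in commit baf86e5609 can be sketched in a few lines of Python. This is an illustration only, not the repository's CI script: the function name `next_version`, the case-insensitive substring match, and checking 'major' before 'minor'/'feature' are assumptions. In a pipeline the release number would presumably come from `git rev-list --count HEAD`; here it is passed in as a plain integer.

```python
def next_version(current: str, message: str, commit_count: int) -> str:
    """Return the next 'major.minor.patch+release' string (hypothetical helper)."""
    # Drop any existing '+release' suffix, then split the core version.
    major, minor, patch = (int(p) for p in current.split("+")[0].split("."))
    text = message.lower()  # assumption: keywords match case-insensitively
    if "major" in text:
        major, minor, patch = major + 1, 0, 0
    elif "minor" in text or "feature" in text:
        minor, patch = minor + 1, 0
    else:
        # 'bug', 'chore', and anything else bump the patch number.
        patch += 1
    return f"{major}.{minor}.{patch}+{commit_count}"


print(next_version("1.2.3", "minor: refactoring big changes.", 14))  # 1.3.0+14
```

Under this reading, the subject line "minor: refactoring big changes." in commit 6a391147a6 would trigger a minor bump, while a "chore:" or "fix:" subject would only bump the patch number.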

18 Commits
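
The environment-variable handling described in commits 79dbaa6d2c, 96f7650b23, and f66de251eb can be sketched as below. This is a hedged illustration, not the code in serve_llm.py: the helper names, the `env` parameter (added so the logic can be exercised without touching the real process environment), the accepted spellings of a true value, and the `false` default for ENFORCE_EAGER are all assumptions. The variable names, the stop token IDs, and the default of 256 come from the commit messages.

```python
import os

# Llama 3 stop token IDs quoted in commit 79dbaa6d2c.
LLAMA3_STOP_TOKEN_IDS = [128001, 128009]


def resolve_max_tokens(requested, max_model_len, env=None):
    """Clamp max_tokens to min(requested, max_model_len)."""
    env = os.environ if env is None else env
    # Fall back to DEFAULT_MAX_TOKENS (default 256) when the request omits a value.
    default = int(env.get("DEFAULT_MAX_TOKENS", "256"))
    wanted = default if requested is None else int(requested)
    return min(wanted, max_model_len)


def apply_awq_default(env=None):
    """Keep an operator-set VLLM_USE_TRITON_AWQ; otherwise default to '1'."""
    env = os.environ if env is None else env
    env.setdefault("VLLM_USE_TRITON_AWQ", "1")
    return env["VLLM_USE_TRITON_AWQ"]


def enforce_eager(env=None):
    """Read ENFORCE_EAGER as a boolean (accepted spellings are an assumption)."""
    env = os.environ if env is None else env
    return env.get("ENFORCE_EAGER", "false").strip().lower() in {"1", "true", "yes"}


print(resolve_max_tokens(100000, 4096, env={}))          # 4096
print(apply_awq_default({"VLLM_USE_TRITON_AWQ": "0"}))   # 0
```

Because `setdefault` writes only when the key is absent, a value supplied through the RayService runtime_env is left alone, which is the behavior the second commit describes.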