Commit Graph

25 Commits

194a431e8c feat(tts): add streaming SSE endpoint and sentence splitter
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 6m53s
- Add POST /stream SSE endpoint that splits text into sentences,
  synthesizes each individually, and streams base64 WAV via SSE events
- Add _split_sentences() helper for robust sentence boundary detection
- Enables progressive audio playback for lower time-to-first-audio
2026-02-22 10:45:58 -05:00
0fb325fa05 feat: FastAPI ingress for TTS — GET /api/tts returns raw WAV
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 2m5s
- Add FastAPI ingress to TTSDeployment with two routes:
  POST / — JSON API with base64 audio (backward compat)
  GET /api/tts?text=&language_id= — raw WAV bytes (zero overhead)
- GET /speakers endpoint for speaker listing
- Properly uses _fastapi naming to avoid collision with Ray binding
- app = TTSDeployment.bind() for rayservice.yaml compatibility
2026-02-21 12:49:44 -05:00
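The "zero overhead" claim for the raw-WAV GET route comes from skipping base64: the JSON route must base64-encode the audio, inflating it by roughly a third. A small stdlib sketch (sample rate and PCM data are arbitrary here) illustrates the difference:

```python
import base64
import io
import wave


def make_wav(samples: bytes, rate: int = 22050) -> bytes:
    # Wrap raw 16-bit mono PCM in a minimal WAV container.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(samples)
    return buf.getvalue()


pcm = b"\x00\x00" * 1000            # 1000 silent 16-bit samples
wav = make_wav(pcm)                 # what GET /api/tts would return directly
b64 = base64.b64encode(wav)         # what POST / embeds in its JSON body
# base64 inflates the payload ~33%, hence the dedicated raw-bytes route
```

In the FastAPI ingress itself, the raw route would return these bytes with an `audio/wav` content type, while the backward-compatible POST route keeps the base64 JSON shape.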
59655e3dcf feat: add SSE streaming support to LLM endpoint
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 2m9s
2026-02-20 16:52:08 -05:00
a973768aee fixing serve-llm stuff.
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 12s
2026-02-18 07:30:00 -05:00
969e93cdd4 chore: trigger rebuild for new package registry
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 11s
2026-02-17 08:30:02 -05:00
fd3234b79c chore: trigger rebuild for new package registry
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 13s
2026-02-17 08:29:35 -05:00
6f8b3241de chore: add Renovate config for automated dependency updates
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 36s
Ref: ADR-0057
2026-02-13 15:34:08 -05:00
79dbaa6d2c fix: add stop_token_ids and clamp max_tokens
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 12s
- Add Llama 3 stop token IDs (128001, 128009) to SamplingParams
  as safety net for V1 engine max_tokens bug on ROCm/gfx1151
- Clamp max_tokens to min(requested, max_model_len)
- Support DEFAULT_MAX_TOKENS env var (default 256)
- Prevents runaway generation when V1 engine ignores max_tokens
2026-02-13 09:19:20 -05:00
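The safety net this commit describes amounts to two small guards before building `SamplingParams`. A sketch of that logic, using the stop-token IDs and env-var default named in the message (the function name and signature here are illustrative):

```python
import os

# Llama 3 end-of-sequence token IDs: 128001 <|end_of_text|>, 128009 <|eot_id|>
LLAMA3_STOP_TOKEN_IDS = [128001, 128009]
DEFAULT_MAX_TOKENS = int(os.environ.get("DEFAULT_MAX_TOKENS", "256"))


def build_sampling_kwargs(requested_max_tokens=None, max_model_len=8192):
    # Clamp to the model's context length so a huge request can't exceed it,
    # and always attach stop ids in case the engine ignores max_tokens.
    max_tokens = min(requested_max_tokens or DEFAULT_MAX_TOKENS, max_model_len)
    return {
        "max_tokens": max_tokens,
        "stop_token_ids": LLAMA3_STOP_TOKEN_IDS,
    }
```

Even if the V1 engine drops `max_tokens` on ROCm/gfx1151, generation still halts when the model emits one of the stop tokens, which bounds runaway output.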
96f7650b23 fix: respect VLLM_USE_TRITON_AWQ from runtime_env instead of hardcoding 0
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 14s
The previous code unconditionally set VLLM_USE_TRITON_AWQ=0, overriding
the value from the RayService runtime_env env_vars.  On gfx1151:
- Triton AWQ kernels work (TRITON_AWQ=1)
- C++ awq_dequantize op does NOT exist (TRITON_AWQ=0 → crash)

Changed to os.environ.setdefault('VLLM_USE_TRITON_AWQ', '1') so the
operator-configured value is preserved, defaulting to Triton AWQ.
2026-02-13 07:29:57 -05:00
f66de251eb fix: add ENFORCE_EAGER env var to skip torch.compile on ROCm
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 12s
torch._dynamo.exc.Unsupported crashes EngineCore during graph tracing
of LlamaDecoderLayer on gfx1151.  ENFORCE_EAGER=true bypasses
torch.compile and CUDA graph capture entirely.
2026-02-13 06:56:29 -05:00
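An `ENFORCE_EAGER` switch like this usually reduces to parsing a truthy env var and passing it into the engine arguments. A minimal sketch, assuming the flag accepts the usual truthy spellings (the helper name is illustrative):

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    # Treat "1", "true", "yes" (any case) as true; everything else as false.
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")


enforce_eager = env_flag("ENFORCE_EAGER")
# Passed through to the vLLM engine args (e.g. enforce_eager=enforce_eager)
# so torch.compile and CUDA-graph capture are skipped entirely on gfx1151.
```

Eager mode trades some steady-state throughput for avoiding the `torch._dynamo.exc.Unsupported` crash during graph tracing of `LlamaDecoderLayer`.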
6a391147a6 minor: refactoring big changes.
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 12s
2026-02-12 18:47:50 -05:00
297b0d8ebd fix: move mlflow import inside __init__ to avoid cloudpickle serialization failure
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 16s
The strixhalo LLM worker uses py_executable which bypasses pip runtime_env.
Module-level try/except still fails because cloudpickle on the head node
resolves the real InferenceLogger class and serializes a module reference.
Moving the import inside __init__ means it runs at actor construction time
on the worker, where ImportError is caught gracefully.
2026-02-12 07:06:49 -05:00
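The pattern from these two commits combined — import inside `__init__`, None fallback, None guard at every use — can be sketched as follows. The class and method names here are illustrative, not the package's actual API:

```python
class LLMApp:
    """Sketch of the lazy-import pattern; class name is illustrative."""

    def __init__(self):
        # The import runs at actor construction time on the worker, not at
        # module import time on the head node, so cloudpickle never has to
        # serialize a reference to the real InferenceLogger class.
        try:
            from mlflow_logger import InferenceLogger
            self.inference_logger = InferenceLogger()
        except ImportError:
            self.inference_logger = None  # degrade gracefully: no MLflow logging

    def record(self, **metrics):
        if self.inference_logger is not None:  # None guard before any use
            self.inference_logger.log_metrics(**metrics)  # hypothetical method
```

The key distinction from the earlier module-level try/except (the previous commit) is *where* the import resolves: at module level the head node still pickles a live class reference, while inside `__init__` a missing package on the worker surfaces as a plain, catchable `ImportError`.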
15e4b8afa3 fix: make mlflow_logger import optional with no-op fallback
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 11s
The strixhalo LLM worker uses py_executable pointing to the Docker
image venv which doesn't have the updated ray-serve-apps package.
Wrap all InferenceLogger imports in try/except and guard usage with
None checks so apps degrade gracefully without MLflow logging.
2026-02-12 07:01:17 -05:00
7ec2107e0c feat: add MLflow inference logging to all Ray Serve apps
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 16s
- Add mlflow_logger.py: lightweight REST-based MLflow logger (no mlflow dep)
- Instrument serve_llm.py with latency, token counts, tokens/sec metrics
- Instrument serve_embeddings.py with latency, batch_size, total_tokens
- Instrument serve_whisper.py with latency, audio_duration, realtime_factor
- Instrument serve_tts.py with latency, audio_duration, text_chars
- Instrument serve_reranker.py with latency, num_pairs, top_k
2026-02-12 06:14:30 -05:00
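A "lightweight REST-based MLflow logger (no mlflow dep)" can be done with the stdlib alone by POSTing to MLflow's documented metric-logging route. A sketch, assuming a reachable tracking server and an existing run ID (the tracking URI and run ID below are placeholders, and the real `mlflow_logger.py` surely differs in detail):

```python
import json
import time
import urllib.request


class InferenceLogger:
    """Minimal metric logger speaking MLflow's REST API directly."""

    def __init__(self, tracking_uri: str, run_id: str):
        # MLflow's REST route for logging a single metric to a run.
        self.url = f"{tracking_uri}/api/2.0/mlflow/runs/log-metric"
        self.run_id = run_id

    def log_metric(self, key: str, value: float, step: int = 0) -> None:
        body = json.dumps({
            "run_id": self.run_id,
            "key": key,
            "value": value,
            "timestamp": int(time.time() * 1000),  # MLflow expects ms
            "step": step,
        }).encode()
        req = urllib.request.Request(
            self.url, data=body, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(req, timeout=5)
```

Skipping the `mlflow` client keeps the serving image slim and avoids pulling its heavy dependency tree into every Ray worker; per-app metrics like `latency`, `tokens/sec`, or `realtime_factor` are then just `log_metric` calls around each request.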
2edafc33c0 async vllm is better.
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 1m3s
2026-02-11 06:05:50 -05:00
c9d7a2b5b7 fixing coqui
Some checks failed
Build and Publish ray-serve-apps / build-and-publish (push) Failing after 20s
2026-02-09 09:14:30 -05:00
4549295a07 trigger: test package upload after gitea temp fix
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 5m16s
2026-02-03 20:12:30 -05:00
665416bb0e chore: trigger build with repo secrets
Some checks failed
Build and Publish ray-serve-apps / build-and-publish (push) Failing after 50s
2026-02-03 19:33:45 -05:00
e853b805ae chore: trigger pipeline with org-level runner
Some checks failed
Build and Publish ray-serve-apps / build-and-publish (push) Failing after 2m50s
2026-02-03 19:22:34 -05:00
9bc40cfd20 chore: trigger rebuild after gitea storage migration
Some checks failed
Build and Publish ray-serve-apps / build-and-publish (push) Failing after 46s
2026-02-03 16:07:27 -05:00
4a560f9b9e chore: retrigger pipeline after runner restart
Some checks failed
Build and Publish ray-serve-apps / build-and-publish (push) Failing after 12m51s
2026-02-03 15:49:43 -05:00
baf86e5609 ci: semver based on commit message keywords
Some checks failed
Build and Publish ray-serve-apps / build-and-publish (push) Failing after 14m11s
- 'major' in message -> increment major, reset minor/patch
- 'minor' or 'feature' -> increment minor, reset patch
- 'bug', 'chore', anything else -> increment patch
- Release number from git rev-list commit count
- Format: major.minor.patch+release
2026-02-03 15:25:15 -05:00
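The keyword rules above are simple enough to capture in a few lines. A sketch of the bump logic (the actual workflow is likely shell, but the precedence is the same; the function name is illustrative):

```python
def bump(version: str, message: str) -> str:
    # Bump major.minor.patch based on commit-message keywords, resetting the
    # lower components as described: major > minor/feature > everything else.
    major, minor, patch = map(int, version.split("."))
    msg = message.lower()
    if "major" in msg:
        return f"{major + 1}.0.0"
    if "minor" in msg or "feature" in msg:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # 'bug', 'chore', anything else
```

The published version string then appends the release number, e.g. `f"{bump(prev, msg)}+{count}"` where `count` comes from `git rev-list --count HEAD`, matching the `major.minor.patch+release` format in the commit body.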
3fb6d8f9c2 chore: trigger rebuild after S3 storage migration
2026-02-03 15:12:54 -05:00
8ef914ec12 feat: initial ray-serve-apps PyPI package
Some checks failed
Build and Publish ray-serve-apps / lint (push) Failing after 11m2s
Build and Publish ray-serve-apps / publish (push) Has been cancelled
Implements ADR-0024: Ray Repository Structure

- Ray Serve deployments for GPU-shared AI inference
- Published as PyPI package for dynamic code loading
- Deployments: LLM, embeddings, reranker, whisper, TTS
- CI/CD workflow publishes to Gitea PyPI on push to main

Extracted from kuberay-images repo per ADR-0024
2026-02-03 07:03:39 -05:00
eac8f27f2e Initial commit
2026-02-03 11:59:56 +00:00