Commit Graph

7 Commits

Author SHA1 Message Date
f66de251eb fix: add ENFORCE_EAGER env var to skip torch.compile on ROCm
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 12s
torch._dynamo.exc.Unsupported crashes EngineCore during graph tracing
of LlamaDecoderLayer on gfx1151.  ENFORCE_EAGER=true bypasses
torch.compile and CUDA graph capture entirely.
2026-02-13 06:56:29 -05:00
6a391147a6 minor: refactoring big changes.
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 12s
2026-02-12 18:47:50 -05:00
297b0d8ebd fix: move mlflow import inside __init__ to avoid cloudpickle serialization failure
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 16s
The strixhalo LLM worker uses py_executable which bypasses pip runtime_env.
Module-level try/except still fails because cloudpickle on the head node
resolves the real InferenceLogger class and serializes a module reference.
Moving the import inside __init__ means it runs at actor construction time
on the worker, where ImportError is caught gracefully.
2026-02-12 07:06:49 -05:00
15e4b8afa3 fix: make mlflow_logger import optional with no-op fallback
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 11s
The strixhalo LLM worker uses py_executable pointing to the Docker
image venv which doesn't have the updated ray-serve-apps package.
Wrap all InferenceLogger imports in try/except and guard usage with
None checks so apps degrade gracefully without MLflow logging.
2026-02-12 07:01:17 -05:00
7ec2107e0c feat: add MLflow inference logging to all Ray Serve apps
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 16s
- Add mlflow_logger.py: lightweight REST-based MLflow logger (no mlflow dep)
- Instrument serve_llm.py with latency, token counts, tokens/sec metrics
- Instrument serve_embeddings.py with latency, batch_size, total_tokens
- Instrument serve_whisper.py with latency, audio_duration, realtime_factor
- Instrument serve_tts.py with latency, audio_duration, text_chars
- Instrument serve_reranker.py with latency, num_pairs, top_k
2026-02-12 06:14:30 -05:00
2edafc33c0 async vllm is better.
All checks were successful
Build and Publish ray-serve-apps / build-and-publish (push) Successful in 1m3s
2026-02-11 06:05:50 -05:00
8ef914ec12 feat: initial ray-serve-apps PyPI package
Some checks failed
Build and Publish ray-serve-apps / lint (push) Failing after 11m2s
Build and Publish ray-serve-apps / publish (push) Has been cancelled
Implements ADR-0024: Ray Repository Structure

- Ray Serve deployments for GPU-shared AI inference
- Published as PyPI package for dynamic code loading
- Deployments: LLM, embeddings, reranker, whisper, TTS
- CI/CD workflow publishes to Gitea PyPI on push to main

Extracted from kuberay-images repo per ADR-0024
2026-02-03 07:03:39 -05:00