ray-serve/ray_serve/serve_llm.py
Billy D. 79dbaa6d2c
fix: add stop_token_ids and clamp max_tokens
- Add Llama 3 stop token IDs (128001, 128009) to SamplingParams
  as a safety net for the V1 engine max_tokens bug on ROCm/gfx1151
- Clamp max_tokens to min(requested, max_model_len)
- Support DEFAULT_MAX_TOKENS env var (default 256)
- Prevents runaway generation when V1 engine ignores max_tokens
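The clamping and stop-token logic described above can be sketched as follows. This is an illustrative reconstruction, not the actual contents of serve_llm.py: the helper name `build_sampling_kwargs` and its signature are hypothetical, though the token IDs 128001 (`<|end_of_text|>`) and 128009 (`<|eot_id|>`) and the 256-token default come from the commit message.

```python
import os

# Llama 3 end-of-sequence token IDs, per the commit message:
# 128001 = <|end_of_text|>, 128009 = <|eot_id|>
LLAMA3_STOP_TOKEN_IDS = [128001, 128009]

# Fallback cap when the client does not request a max_tokens value.
DEFAULT_MAX_TOKENS = int(os.environ.get("DEFAULT_MAX_TOKENS", "256"))


def build_sampling_kwargs(requested_max_tokens, max_model_len):
    """Hypothetical helper: clamp max_tokens and attach stop token IDs.

    The stop_token_ids act as a safety net so generation still terminates
    even if the engine ignores max_tokens (the V1/ROCm bug the commit
    works around).
    """
    max_tokens = min(requested_max_tokens or DEFAULT_MAX_TOKENS, max_model_len)
    return {
        "max_tokens": max_tokens,
        "stop_token_ids": LLAMA3_STOP_TOKEN_IDS,
    }
```

These kwargs would then be passed to vLLM's `SamplingParams`, which accepts both `max_tokens` and `stop_token_ids`.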
2026-02-13 09:19:20 -05:00
