ray-serve/ray_serve/serve_llm.py
Billy D. 79dbaa6d2c
fix: add stop_token_ids and clamp max_tokens
- Add Llama 3 stop token IDs (128001, 128009) to SamplingParams
  as a safety net for the V1 engine max_tokens bug on ROCm/gfx1151
- Clamp max_tokens to min(requested, max_model_len)
- Support DEFAULT_MAX_TOKENS env var (default 256)
- Prevents runaway generation when V1 engine ignores max_tokens
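The clamping and stop-token logic described above can be sketched as follows. This is an illustrative reconstruction, not the actual contents of serve_llm.py: the helper name `build_sampling_kwargs` and its signature are hypothetical, though the token IDs 128001 (`<|end_of_text|>`) and 128009 (`<|eot_id|>`) and the 256-token default come from the commit message.

```python
import os

# Llama 3 end-of-sequence token IDs, per the commit message:
# 128001 = <|end_of_text|>, 128009 = <|eot_id|>
LLAMA3_STOP_TOKEN_IDS = [128001, 128009]

# Fallback cap when the client does not request a max_tokens value.
DEFAULT_MAX_TOKENS = int(os.environ.get("DEFAULT_MAX_TOKENS", "256"))


def build_sampling_kwargs(requested_max_tokens, max_model_len):
    """Hypothetical helper: clamp max_tokens and attach stop token IDs.

    The stop_token_ids act as a safety net so generation still terminates
    even if the engine ignores max_tokens (the V1/ROCm bug the commit
    works around).
    """
    max_tokens = min(requested_max_tokens or DEFAULT_MAX_TOKENS, max_model_len)
    return {
        "max_tokens": max_tokens,
        "stop_token_ids": LLAMA3_STOP_TOKEN_IDS,
    }
```

These kwargs would then be passed to vLLM's `SamplingParams`, which accepts both `max_tokens` and `stop_token_ids`.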
2026-02-13 09:19:20 -05:00
