LLM API Daily 2026-05-27: sglang DeepSeek V4 stability patch, TensorRT-LLM Gemma4/Qwen3.5, crush Bedrock fix

🚨 Breaking

Nothing to report today.

🗑️ Deprecations

Nothing to report today.

💰 Pricing

Nothing to report today.

🆕 New

sglang v0.5.12.post1 — Stability patch on top of v0.5.12, cherry-picking 12 fixes onto the release branch. Highest-priority fix: DeepSeek V4-Pro emits garbled text during single-token decode on B200/B300 hardware. The root cause is the deep_gemm UE8M0 scale-packing path, fixed by ceiling activation scales before packing. EAGLE/MTP speculative decoding is also patched. If you run DeepSeek V4 inference in production, this is a mandatory upgrade: the garbled output is silent, so latency metrics won't catch it; only output-quality audits will. → https://github.com/sgl-project/sglang/releases/tag/v0.5.12.post1
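
Since only output-quality audits catch this class of failure, here is a minimal sketch of a cheap first-pass check you could run on sampled completions. It is not part of sglang; the function names and both thresholds are assumptions chosen for illustration. It flags decode debris (replacement, control, unassigned, and private-use characters) and long runs of one repeated token. Garbled text that is valid Unicode but semantically wrong will pass this check, so treat it as a tripwire rather than a full audit.

```python
import unicodedata


def garble_score(text: str) -> float:
    """Return the fraction of characters that look suspicious (0.0 to 1.0)."""
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        category = unicodedata.category(ch)
        # U+FFFD marks a failed decode; Cc/Cn/Co are control, unassigned,
        # and private-use code points that rarely appear in normal prose.
        if ch == "\ufffd" or (category in {"Cc", "Cn", "Co"} and ch not in "\n\r\t"):
            suspicious += 1
    return suspicious / len(text)


def longest_repeat_run(text: str) -> int:
    """Return the longest run of one identical whitespace-separated token."""
    best = 0
    run = 0
    previous = None
    for token in text.split():
        run = run + 1 if token == previous else 1
        best = max(best, run)
        previous = token
    return best


def looks_garbled(text: str, max_score: float = 0.02, max_run: int = 8) -> bool:
    """Flag text for manual review. Both thresholds are illustrative guesses."""
    return garble_score(text) > max_score or longest_repeat_run(text) > max_run
```

Run looks_garbled over a sample of production completions before and after the upgrade, and tune the thresholds against output you know to be clean.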

TensorRT-LLM v1.3.0rc16 — RC adds Gemma4 multimodal support with native vision and audio towers, Qwen3.5 MTP, Qwen3.6-27B-FP8, EXAONE-4.5, and Laguna. DeepSeek, NemotronH, Qwen3, and Qwen3.5-MoE switch to sharding-IR canonical models. API layer refactored (details in release notes). Still RC — validate in staging before promoting. → https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v1.3.0rc16

llama.cpp b9330 — Fixes a backend selection bug for nemotron-h: ffn_latent tensors are declared as GGML_OP_MUL in LLM_TENSOR_INFOS but are fed through ggml_mul_mat. The buft probe was testing an elementwise MUL on a q8_0 weight, which returned true unconditionally — causing incorrect backend assignment. If you serve nemotron-h with q8_0 quantization, upgrade. → https://github.com/ggml-org/llama.cpp/releases/tag/b9330

llama.cpp b9329 — CUDA backend gains a fast Walsh-Hadamard transform. Relevant if your model workload hits FWHT ops. → https://github.com/ggml-org/llama.cpp/releases/tag/b9329

codex 0.134.0 — --profile is now the primary profile selector across CLI, TUI permissions, and sandbox flows. Legacy profile configs are explicitly rejected with migration guidance. Audit any automation scripts using legacy profile configuration before upgrading. → https://github.com/openai/codex/releases/tag/rust-v0.134.0

crush v0.73.0 — Bedrock region routing behavior changed. Previously, AWS_REGION or AWS_DEFAULT_REGION could instruct Crush to reach Bedrock in a region where most models are unavailable. New behavior overrides this to ensure model availability by default. If your deployment sets these env vars explicitly for Bedrock routing, verify the new behavior is correct for your setup. → https://github.com/charmbracelet/crush/releases/tag/v0.73.0
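
To see whether a deployment is affected, a small pre-upgrade check can report which of the two region variables are set explicitly. This is a sketch only: the helper name is hypothetical, and it inspects the process environment rather than anything inside Crush.

```python
import os

# The two variables named in the release note.
BEDROCK_REGION_VARS = ("AWS_REGION", "AWS_DEFAULT_REGION")


def explicit_region_settings(env=None):
    """Return {name: value} for each region variable that is set and non-empty."""
    env = os.environ if env is None else env
    return {name: env[name] for name in BEDROCK_REGION_VARS if env.get(name)}
```

An empty result means the deployment never set a region explicitly. A non-empty result means Bedrock routing may have depended on those values, so check it against the new default before upgrading.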

💡 Today's Action

If sglang is in your stack with DeepSeek V4 on B200/B300 hardware: upgrade and pin to v0.5.12.post1 now. The single-token decode garbling is a data-corruption-class bug: it raises no errors, so it will pass through your error-rate monitors undetected.
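
A minimal sketch of a startup or CI guard for the pin, assuming sglang is installed as the Python distribution named "sglang" and that any release at or above 0.5.12.post1 carries the fix. The helper names are hypothetical, and the parser deliberately handles only plain X.Y.Z and X.Y.Z.postN versions; it raises on anything else (for example release candidates) instead of guessing.

```python
import re
from importlib import metadata

MIN_VERSION = "0.5.12.post1"  # first release with the fix, per the note above


def parse_version(version: str):
    """Parse 'X.Y.Z' or 'X.Y.Z.postN' into a comparable tuple of ints."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?", version)
    if match is None:
        raise ValueError(f"unsupported version format: {version!r}")
    major, minor, patch, post = match.groups()
    return (int(major), int(minor), int(patch), int(post or 0))


def has_fix(installed: str, minimum: str = MIN_VERSION) -> bool:
    """Return True if the installed version is at or above the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_sglang_version():
    """Return the installed sglang version string, or None if not installed."""
    try:
        return metadata.version("sglang")
    except metadata.PackageNotFoundError:
        return None
```

A deployment script could call installed_sglang_version() and refuse to start serving when has_fix returns False.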