Daily LLM Advisory: llama.cpp, Ollama QAT, Nemotron Safety

ApiDelta · 2026-06-08 · 286 words · apidelta.maxiaworld.app

🚨 Breaking: None

🗑️ Deprecations: None

💰 Pricing: None

🆕 New:
- llama.cpp: b9553 relaxes sampler name matching, so alternative spellings such as top-k are accepted alongside the canonical top_k. b9551 optimizes the KV cache to avoid cell copies. b9547 skips the mmproj download when the user supplies one. b9544 fixes reasoning round-trip issues for LFM2/LFM2.5 models. b9543 adds video support for Qwen3.5-based models via frame merge.
- Ollama: v0.30.5 fixes the floating point exception crash in gemma4:12b on x86/CUDA/Linux/Windows. v0.30.6 introduces Gemma 4 QAT quantized weights (tags ending in -qat), which reduce memory requirements for on-device inference.
- CohereLabs/BLS-Mini-Code-1.0 (Hugging Face) is a compact code model built on an MoE architecture.
- NVIDIA Nemotron 3.5 Content Safety (free on OpenRouter) is a 4B multimodal guardrail model fine-tuned from Gemma-3-4B that moderates LLM/VLM inputs and outputs.
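To make the sampler-name change concrete, here is a minimal Python sketch of what relaxed matching can look like. It is not llama.cpp's actual code: the function names and the small canonical set are assumptions chosen for illustration, and the real project may accept a different set of aliases.

```python
# Illustrative only: NOT llama.cpp's implementation.
# CANONICAL_SAMPLERS is an assumed subset used for this example.
CANONICAL_SAMPLERS = {"top_k", "top_p", "min_p", "temperature"}


def normalize_sampler_name(name: str) -> str:
    """Map an alternative spelling such as 'top-k' to the canonical 'top_k'."""
    key = name.strip().lower().replace("-", "_")
    if key not in CANONICAL_SAMPLERS:
        raise ValueError(f"unknown sampler: {name!r}")
    return key


def parse_sampler_chain(spec: str, sep: str = ";") -> list[str]:
    """Split a separator-delimited sampler list and normalize each entry."""
    return [normalize_sampler_name(part) for part in spec.split(sep) if part.strip()]
```

With this sketch, parse_sampler_chain("top-k;top_p") yields the canonical names regardless of which spelling the caller used, which is the behavior the release note describes.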

🌐 AI News:
- The research paper "When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents" introduces ToolMaze, a benchmark that evaluates how well LLM agents recover from tool failures.

💡 Tip of the day: If you deploy Gemma 4 models in production, test the new QAT weights (available in Ollama v0.30.6, tags ending in -qat) to reduce memory footprint, and confirm on your own evaluation set that accuracy holds. For local inference with llama.cpp, upgrade to b9553 or later to benefit from relaxed sampler naming and the KV cache improvements.
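One way to run that comparison is to send the same prompts to the regular and the QAT tag through Ollama's local REST API and diff the answers. The sketch below assumes an Ollama server on its default address; the helper names are hypothetical, and a tag such as gemma4:12b-qat is only an assumed example of the "-qat" naming, so check the actual tag names in your installation first.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def qat_tags(tag_names):
    """Keep only the model tags that follow the '-qat' suffix convention."""
    return [name for name in tag_names if name.endswith("-qat")]


def build_generate_request(model, prompt, base_url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=300):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate() once with the baseline tag and once with the QAT tag on a fixed prompt set gives a quick side-by-side check; memory usage still has to be measured separately on the host.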

#api #llm #en