llama.cpp buft probe fix for nemotron-h, Cline adds DeepSeek V4 + GPT-5.5, CUDA WHT — 2026-05-26

ApiDelta · 2026-05-26 · 412 words · apidelta.maxiaworld.app

2026-05-26 · No breaking changes, no deprecations, no pricing changes.


🚨 Breaking

None today.


🗑️ Deprecations

None announced.


💰 Pricing

No pricing changes in today's brief.


🆕 What's new

llama.cpp b9330: Silent correctness bug fixed for nemotron-h. ffn_latent_down and ffn_latent_up were declared as GGML_OP_MUL in LLM_TENSOR_INFOS, but nemotron-h routes them through ggml_mul_mat at runtime. The backend buffer type (buft) probe tests the declared op, and the GGML_OP_MUL probe returned true unconditionally for q8_0 weights, so these tensors were silently assigned to the wrong backend. Both are now tagged GGML_OP_MUL_MAT. If you run nemotron-h locally with quantized weights, upgrade. → b9330
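To make the mismatch concrete, here is a minimal toy model of a probe that asks about the declared op rather than the op that actually runs. Everything in it is hypothetical: the ACCEL_SUPPORT table, the pick_backend helper, and the "accel"/"cpu" labels are illustrative assumptions, not llama.cpp code, and the direction of the misassignment is assumed for the sake of the example.

```python
# Toy model of a declared-op backend probe. All names are hypothetical
# and only illustrate the mismatch; this is not llama.cpp code.

# Assumed capability table: which (op, weight type) pairs a backend accepts.
ACCEL_SUPPORT = {
    ("MUL", "q8_0"): True,       # the elementwise probe passes
    ("MUL_MAT", "q8_0"): False,  # assumption: no q8_0 matmul kernel here
}

def pick_backend(declared_op: str, weight_type: str) -> str:
    """Probe the accelerator using the op declared in the tensor table."""
    if ACCEL_SUPPORT.get((declared_op, weight_type), False):
        return "accel"
    return "cpu"

# Declared as MUL while the graph runs a matmul: the probe checks the wrong op.
before = pick_backend("MUL", "q8_0")
# Declared as MUL_MAT: the probe now checks the op that actually executes.
after = pick_backend("MUL_MAT", "q8_0")

print(before, after)  # → accel cpu
```

The point of the sketch is that nothing fails at probe time: the probe answers truthfully, just about an op the model never runs on those tensors.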

llama.cpp b9329 — Fast Walsh-Hadamard transform added for CUDA, with unrolls and warp-size-64 tuning. Pure throughput win for ops that use it; no API surface change. → b9329
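For readers unfamiliar with the transform itself, a plain-Python reference version of the unnormalized fast Walsh-Hadamard transform is sketched below. It shows the standard O(n log n) butterfly only; it says nothing about how the CUDA kernel in b9329 is written, and the function name fwht is our own.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Combine pairs that sit h apart inside each block of size 2*h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out

signal = [1, 0, 1, 0, 0, 1, 1, 0]
spectrum = fwht(signal)
print(spectrum)  # → [4, 2, 0, -2, 0, 2, 0, 2]

# Applying the transform twice returns the input scaled by n.
roundtrip = [v // len(signal) for v in fwht(spectrum)]
```

The inner loop uses only additions and subtractions, which is why the transform is a natural target for unrolling on a GPU.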

Cline v3.85.0: Three new model families added: GPT-5.5 on SAP AI Core, DeepSeek V4 Flash and Pro, and Gemini 3.5 Flash (on both the Gemini and Vertex providers). Also fixes Vertex AI global endpoint handling for Claude models; if you route Claude through the Vertex global endpoint, this patch is worth pulling now. → v3.85.0

browser-use 0.12.9 — Session ID is now passed through to judge LLM calls (improves traceability in multi-session agent runs). New-tab pages no longer trigger spurious screenshots. → 0.12.9


🌐 AI news

DVAO (HF Papers): Extends Group Relative Policy Optimization with dynamic variance-adaptive advantage weighting for multi-reward RL settings — relevant if you're blending multiple reward signals in RLHF pipelines. → 2605.25604
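As a rough illustration of the idea, the sketch below computes GRPO-style group-standardized advantages per reward signal and then blends them. The blending rule (weights proportional to each reward's within-group standard deviation) is an assumption made for this example; this brief does not give the paper's actual weighting formula, and the function and reward names are hypothetical.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def blended_advantages(reward_table, eps=1e-8):
    """Blend per-reward advantages with variance-proportional weights.

    reward_table maps a reward name to its scores for one group of samples.
    The weighting rule is an illustrative assumption, not the DVAO formula.
    """
    stds = {name: pstdev(scores) for name, scores in reward_table.items()}
    total = sum(stds.values()) + eps
    weights = {name: s / total for name, s in stds.items()}
    size = len(next(iter(reward_table.values())))
    blended = [0.0] * size
    for name, scores in reward_table.items():
        for i, adv in enumerate(group_advantages(scores, eps)):
            blended[i] += weights[name] * adv
    return blended, weights

# Hypothetical group of four samples scored by two reward signals.
table = {"correct": [1, 0, 1, 0], "format": [1, 1, 1, 1]}
blended, weights = blended_advantages(table)
```

In this toy group the "format" reward is identical for every sample, so it carries no ranking information and receives zero weight under the assumed rule.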

ParaVT (HF Papers): Addresses the sequential-tool-call bottleneck in video-agent RL — enables parallel tool dispatch (multiple tools per turn) rather than one per turn. Watch if you're building multi-tool agentic systems. → 2605.20342
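The latency argument behind parallel dispatch can be sketched with asyncio. This is a generic illustration, not ParaVT's implementation: the tool names and delays are made up, and call_tool simply sleeps in place of real work.

```python
import asyncio
import time

async def call_tool(name: str, delay: float) -> str:
    """Stand-in for a tool call; sleeps instead of doing real work."""
    await asyncio.sleep(delay)
    return f"{name}:done"

async def sequential_turns(calls):
    # One tool per turn: total latency is the sum of the delays.
    return [await call_tool(name, delay) for name, delay in calls]

async def parallel_turn(calls):
    # Several tools in one turn: total latency is roughly the slowest call.
    return await asyncio.gather(*(call_tool(n, d) for n, d in calls))

def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

# Hypothetical tools for a video agent, each taking 0.2 s.
calls = [("detect_objects", 0.2), ("transcribe_audio", 0.2), ("sample_frames", 0.2)]
seq_result, seq_time = timed(sequential_turns(calls))
par_result, par_time = timed(parallel_turn(calls))
```

Both paths return the same results in the same order; only the wall-clock time differs, which is the bottleneck the paper targets.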

HN trending — "Using AI to write better code more slowly" hit 295 points and 117 comments. Community debate on deliberate vs. velocity-first AI coding workflows — worth a skim if your team is calibrating AI-assist norms. → nolanlawson.com


💡 Tip of the day

If nemotron-h is in your local inference stack: update llama.cpp to b9330 before your next run. The buft probe bug produced no visible error — it silently assigned q8_0 weight tensors to the wrong backend. Wrong backend = wrong numerics, not a crash.
