LLM API Daily — 2026-05-25
🚨 Breaking
Nothing today. Zero breaking changes across all scanned providers.
🗑️ Deprecations
None announced.
💰 Pricing
No pricing changes in today's brief.
🆕 New
llama.cpp b9305 — cmake build fix for the UI layer: adds -fPIC to the llama-ui static lib and renames the host-compiled embed helper. macOS Apple Silicon (arm64) and KleidiAI-enabled arm64 builds available. Actionable only if you build llama.cpp from source on macOS and hit UI-related link errors. Release
Lens (3.8B T2I) — new text-to-image model claiming competitive or better performance vs. SOTA models with >6B parameters across benchmarks, while requiring only ~19.3% of their training compute. No hosted API mentioned in the brief. Relevant if you benchmark open T2I models against fine-tuning budgets. Paper
SkillOpt — framework treating agent skill improvement as an optimization loop rather than one-shot generation or manual crafting. Claims reliable improvement under feedback vs. existing self-revision approaches. Worth a read if you maintain long-running agent pipelines where skill drift is a problem. Paper
StepAudio 2.5 — unified audio-language model targeting ASR and reasoning in a single foundation, positioning against specialized systems. Paper
charmbracelet/crush nightly — nightly build, sigstore-signed checksums. No functional changelog in the brief. Release
🌐 AI Landscape
Memory now ~two-thirds of AI chip component cost — Epoch AI analysis finds that memory has grown to nearly two-thirds of AI chip component costs. A cost share is not a performance measurement, but it is consistent with the common observation that LLM decoding is often limited by memory capacity and bandwidth rather than raw compute. Re-examine instance selection with that in mind. epoch.ai · 338 HN points
AI washing accelerating — The Guardian reports firms scrambling to rebrand as "tech-focused" to capture the AI narrative. Useful signal for procurement due diligence: demand API documentation and uptime SLAs, not press releases. Guardian · 153 HN points
💡 Tip of the Day
The Epoch AI memory-cost finding is the one concrete data point worth acting on today: if you're sizing GPU instances for inference, re-run your cost model with memory bandwidth as the primary constraint rather than raw FLOPs. For decode-heavy workloads at small batch sizes, an instance with higher memory bandwidth will often match or beat a compute-optimized one on cost per token. Large-batch and prefill-heavy workloads can still be compute-bound, so measure your own traffic before switching. If memory keeps taking a growing share of chip cost, the gap is likely to widen.
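One rough way to re-run that cost model is a roofline-style estimate: take the smaller of a memory-bandwidth bound and a compute bound on decode throughput, then convert to cost per million tokens. The sketch below is a back-of-envelope illustration, not a benchmark. It assumes each decode step streams the full model weights once (shared across the batch), roughly 2 FLOPs per parameter per generated token, and it ignores KV-cache traffic, prefill, and real-world utilization. The instance names, bandwidths, TFLOP/s figures, and prices are hypothetical placeholders; substitute your provider's numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    mem_bandwidth_gb_s: float  # peak memory bandwidth, GB/s
    peak_tflops: float         # peak dense compute, TFLOP/s
    usd_per_hour: float        # hourly price


def decode_bounds(inst, params_b, bytes_per_param, batch=1):
    """Return (memory_bound, compute_bound) in tokens/s, summed over the batch."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    # Assumption: one full pass over the weights per decode step,
    # and that pass produces one token for every sequence in the batch.
    mem_bound = inst.mem_bandwidth_gb_s * 1e9 / weight_bytes * batch
    # Assumption: about 2 FLOPs per parameter per generated token.
    compute_bound = inst.peak_tflops * 1e12 / (2 * params_b * 1e9)
    return mem_bound, compute_bound


def estimate(inst, params_b, bytes_per_param, batch=1):
    """Return (tokens_per_s, limiting_resource, usd_per_million_tokens)."""
    mem_bound, compute_bound = decode_bounds(inst, params_b, bytes_per_param, batch)
    if mem_bound <= compute_bound:
        tokens_per_s, limiter = mem_bound, "memory"
    else:
        tokens_per_s, limiter = compute_bound, "compute"
    usd_per_million = inst.usd_per_hour / (tokens_per_s * 3600) * 1e6
    return tokens_per_s, limiter, usd_per_million


# Hypothetical instances with identical prices, for illustration only.
INSTANCES = [
    Instance("mem-optimized (hypothetical)", 3000, 400, 4.0),
    Instance("compute-optimized (hypothetical)", 1500, 800, 4.0),
]

if __name__ == "__main__":
    # 8B-parameter model stored in 16-bit weights (2 bytes per parameter).
    for batch in (1, 512):
        for inst in INSTANCES:
            tps, limiter, cost = estimate(inst, params_b=8, bytes_per_param=2, batch=batch)
            print(f"batch={batch:<4} {inst.name:<34} "
                  f"{tps:>10.1f} tok/s  {limiter}-bound  ${cost:.4f}/M tokens")
```

Under these placeholder numbers the ranking flips with batch size: at batch 1 both instances are memory-bound and the higher-bandwidth one is cheaper per token, while at batch 512 both hit the compute bound and the compute-optimized one wins. That crossover is the thing to look for in your own workload before changing instance types.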