Researchers baked 3x inference speedups directly into LLM weights — without speculative decoding
Ben Dickson
February 23, 2026
Image credit: VentureBeat with Nano Banana Pro

As agentic AI workflows multiply the cost and latency of long reasoning chains, a team from the University of Maryland, Lawrence Livermore National Laboratory, Columbia University and TogetherAI has found a way to bake 3x throughput gains directly into a model’s weights.

Unlike speculative decoding, which requires a separate drafting model, this approach requires no additional infrastructure — just a single special token added to the model’s existing architecture.

The limits of next-token prediction

Next-token prediction — generating text one token per forward pass — creates a throughput ceiling that becomes painfully expensive when models need to produce thousands of tokens. This bottleneck is especially…