Universal YOCO for Efficient Depth Scaling

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard Transformers are computationally inefficient at inference time, and their key-value (KV) cache grows with model depth. This work proposes YOCO-U, the first architecture to integrate the YOCO decoder-decoder design with parameter-shared recurrent computation. By employing a universal self-decoder that iteratively refines representations within shallow, efficient-attention layers, YOCO-U achieves substantially greater effective representation depth and improved token utilization while maintaining a constant global KV cache size and linear prefill cost. Experimental results show strong performance on both general and long-context benchmarks, validating the combination of efficient attention mechanisms with recurrence as a path to scalable large language models.
📝 Abstract
The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies suffer from high computational overhead and a KV cache that inflates alongside model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergistic effect greater than either alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency tradeoff that neither YOCO nor recursion achieves independently. The YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion enhances representational depth with limited overhead. Together, YOCO-U improves token utility and scaling behavior while maintaining efficient inference. Empirical results confirm that YOCO-U remains highly competitive in general and long-context benchmarks, demonstrating that the integration of efficient-attention architectures and recursive computation is a promising direction for scalable LLMs.
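The core mechanism in the abstract, a self-decoder that reuses one set of weights across multiple iterations, can be sketched in a few lines. This is a minimal illustration under assumed simplifications (a single token vector, a tanh residual block standing in for an efficient-attention layer); none of the class or function names come from the paper.

```python
import math
import random

random.seed(0)

class SharedLayer:
    """One self-decoder block whose weights are reused on every iteration
    (illustrative stand-in for an efficient-attention layer)."""
    def __init__(self, dim):
        self.W = [[random.gauss(0, 1) / math.sqrt(dim) for _ in range(dim)]
                  for _ in range(dim)]

    def __call__(self, x):
        # Linear map + residual + tanh, standing in for the real block.
        h = [sum(x[i] * self.W[i][j] for i in range(len(x)))
             for j in range(len(x))]
        return [xi + math.tanh(hi) for xi, hi in zip(x, h)]

def universal_self_decoder(x, layer, n_iters):
    """Apply the SAME layer n_iters times: effective depth grows with
    n_iters while the parameter count stays fixed (parameter sharing)."""
    for _ in range(n_iters):
        x = layer(x)
    return x

dim = 8
layer = SharedLayer(dim)
x = [random.gauss(0, 1) for _ in range(dim)]  # one token's hidden state

deep = universal_self_decoder(x, layer, n_iters=6)
params = dim * dim  # one weight matrix, independent of iteration count
print(len(deep), params)  # → 8 64
```

The point of the sketch is the trade the abstract describes: six applications of the shared layer give an effective depth of six, but the parameter count (and, in the YOCO setting, the KV footprint of the shallow iterated layers) does not grow with the iteration count.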
Problem

Research questions and friction points this paper is trying to address.

test-time scaling
computational overhead
KV cache
inference efficiency
model depth
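The KV-cache friction point above can be made concrete with back-of-the-envelope arithmetic (the constants and function names here are illustrative, not from the paper): a standard Transformer caches K and V per layer, so the cache grows linearly with depth, whereas a YOCO-style global cache holds a single KV set shared across layers and is depth-independent.

```python
def standard_kv_cache_floats(n_layers, seq_len, dim):
    # Per-layer K and V tensors: 2 * layers * tokens * hidden_dim values.
    return 2 * n_layers * seq_len * dim

def global_kv_cache_floats(seq_len, dim):
    # YOCO-style single global KV set shared by the cross-decoder,
    # so the footprint does not scale with decoder depth.
    return 2 * seq_len * dim

seq_len, dim = 4096, 1024
print(standard_kv_cache_floats(32, seq_len, dim))  # → 268435456 floats
print(global_kv_cache_floats(seq_len, dim))        # → 8388608 floats, 32x smaller
```

Under these assumed dimensions, a 32-layer per-layer cache is 32 times larger than the global one, which is why making iteration depth free of KV growth matters for test-time scaling.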
Innovation

Methods, ideas, or system contributions that make the work stand out.

YOCO-U
recursive computation
efficient attention
KV cache optimization
depth scaling