LoopViT: Scaling Visual ARC with Looped Transformers

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes LoopViT, a recurrent Vision Transformer that addresses the limitations of feedforward architectures in modeling the iterative and algorithmic nature of human-like inductive reasoning. By employing a weight-sharing recurrent mechanism, LoopViT decouples reasoning depth from model capacity, enabling deeper inference while maintaining parameter efficiency. Its Hybrid Block integrates local convolution with global attention to construct chain-of-thought-like reasoning pathways. Furthermore, an adaptive computation strategy is introduced through a parameter-free dynamic exit mechanism based on prediction entropy. On the ARC-AGI-1 benchmark, LoopViT achieves 65.8% accuracy with only 18M parameters, outperforming an ensemble model with 73M parameters and demonstrating substantial gains in both reasoning efficiency and performance.

📝 Abstract
Recent advances in visual reasoning have leveraged vision transformers to tackle the ARC-AGI benchmark. However, we argue that the feed-forward architecture, where computational depth is strictly bound to parameter size, falls short of capturing the iterative, algorithmic nature of human induction. In this work, we propose a recursive architecture called Loop-ViT, which decouples reasoning depth from model capacity through weight-tied recurrence. Loop-ViT iterates a weight-tied Hybrid Block, combining local convolutions and global attention, to form a latent chain of thought. Crucially, we introduce a parameter-free Dynamic Exit mechanism based on predictive entropy: the model halts inference when its internal state "crystallizes" into a low-uncertainty attractor. Empirical results on the ARC-AGI-1 benchmark validate this perspective: our 18M model achieves 65.8% accuracy, outperforming massive 73M-parameter ensembles. These findings demonstrate that adaptive iterative computation offers a far more efficient scaling axis for visual reasoning than simply increasing network width. The code is available at https://github.com/WenjieShu/LoopViT.
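The two core ideas from the abstract — a weight-tied block iterated for variable depth, and a parameter-free exit triggered when predictive entropy falls below a threshold — can be sketched in a few lines. This is a toy NumPy illustration, not the paper's implementation: the `step` function, the entropy threshold `tau`, and all dimensions are hypothetical stand-ins (the real Hybrid Block combines local convolution with global attention).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny weight-tied model: ONE shared weight matrix is reused at
# every loop iteration, so reasoning depth is decoupled from parameter count.
D, C = 16, 10                       # hidden size, number of output classes
W = rng.normal(0, 0.1, (D, D))      # shared (weight-tied) recurrent weights
W_out = rng.normal(0, 0.1, (D, C))  # readout to class logits

def step(h):
    """One loop iteration of the shared block (stand-in for conv + attention)."""
    return h + np.tanh(h @ W)       # residual update with tied weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy of a predictive distribution (nats)."""
    return -np.sum(p * np.log(p + 1e-12))

def loop_infer(x, max_steps=32, tau=0.5):
    """Iterate the shared block; dynamic exit once entropy drops below tau.

    No extra parameters are needed for halting -- the exit criterion reads
    only the model's own predictive distribution.
    """
    h = x
    for t in range(1, max_steps + 1):
        h = step(h)
        p = softmax(h @ W_out)
        if entropy(p) < tau:        # state has "crystallized": stop early
            return p, t
    return p, max_steps             # fall back to the fixed depth budget

p, steps_used = loop_infer(rng.normal(size=D))
print(f"exited after {steps_used} steps, prediction = {p.argmax()}")
```

Note the design point the sketch makes concrete: adding loop iterations raises effective depth without adding a single parameter, and easy inputs can exit in fewer iterations than hard ones.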
Problem

Research questions and friction points this paper is trying to address.

visual reasoning
ARC-AGI
iterative computation
algorithmic induction
reasoning depth
Innovation

Methods, ideas, or system contributions that make the work stand out.

Looped Transformer
Weight-tied Recurrence
Dynamic Exit
Visual Reasoning
ARC-AGI