🤖 AI Summary
This work addresses the limited robustness of multimodal large language models (MLLMs) under viewpoint variations, which hinders their ability to generalize visual understanding to nearby perspectives. To tackle this challenge, the study shifts viewpoint transformation from the pixel level to the Vision Transformer (ViT) image-token level, proposing a backward token warping method. This technique emulates the part-level structural representations posited in theories of human mental imagery, improving viewpoint invariance while preserving semantic consistency. Evaluated on the ViewBench benchmark, the proposed method significantly outperforms baseline strategies, including pixel-level warping, spatial fine-tuning, and generative warping, demonstrating more reliable visual reasoning across adjacent viewpoints.
📝 Abstract
Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, and the natural remedy of pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping and find that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, is more stable and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines, including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
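The core mechanism described above, backward warping over a token grid rather than a pixel grid, can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes a precomputed backward coordinate map (the geometric step of deriving that map from depth and relative camera pose is omitted), and all function and variable names here are hypothetical.

```python
import numpy as np

def backward_token_warp(src_tokens, coords):
    """Backward-warp a ViT token grid.

    For each cell of a dense grid defined on the *target* view, retrieve
    the source-view token the cell maps back to (nearest-neighbor sampling).

    src_tokens: (H, W, D) source-view token grid.
    coords:     (H, W, 2) per target cell, its (row, col) position in the
                source grid; assumed precomputed from depth and camera pose.
    Returns:    (H, W, D) target-view token grid.
    """
    H, W, _ = src_tokens.shape
    # Round to the nearest source token and clamp to the grid boundary.
    rows = np.clip(np.round(coords[..., 0]).astype(int), 0, H - 1)
    cols = np.clip(np.round(coords[..., 1]).astype(int), 0, W - 1)
    return src_tokens[rows, cols]

# Toy example: a viewpoint shift whose backward map moves every target
# cell one token column to the right in the source view.
H, W, D = 4, 4, 3
src = np.arange(H * W * D, dtype=float).reshape(H, W, D)
rr, cc = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
coords = np.stack([rr, cc + 1], axis=-1)
tgt = backward_token_warp(src, coords)
```

Because every target cell pulls a token rather than source tokens pushing forward, the output grid has no holes or collisions, which is the stability argument the abstract makes for backward over forward warping.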