🤖 AI Summary
This work addresses the tendency of vision-language models (VLMs) to rely on linguistic priors and lose fine-grained geometric cues in cross-view spatial reasoning. The authors study visual chain-of-thought in unified multimodal models, where the model generates an intermediate "thinking" image before answering. They introduce View Dropout (VDrop), a training strategy that hides parts of one input view from the answer tokens so that the model must depend on the thinking image when answering. Comparing three thinking-image forms (top-down, panoramic, and point-matching renderings), the study finds that panoramic visual thoughts offer the best trade-off between learnability and informativeness. Trained on synthetic scenes, panoramic thinking with VDrop achieves the best generalization across five real-world out-of-domain benchmarks.
📝 Abstract
Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask how to make visual thinking matter, and what kind of visual thinking works best. We study these questions in unified multimodal models (UMMs), which natively support interleaved image-text generation. For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. This encourages the model to use the thinking image when answering, instead of relying only on the input views. Once the thinking image is used for answer prediction, we study which type of visual thinking is most effective. We frame this as a learnability-informativeness tradeoff and compare three thinking-image variants: top-down, panoramic, and point-matching renderings. Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.
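To make the VDrop idea concrete, the following is a minimal sketch of one way such hiding could be expressed as an attention mask. The abstract does not say how the hiding is implemented, so everything here is an assumption for illustration: the function name `vdrop_attention_mask`, the token layout (view A patches, view B patches, thinking-image tokens, answer tokens), the purely causal base mask, the random choice of patches, and the `drop_frac` parameter are all hypothetical.

```python
import random


def vdrop_attention_mask(n_view_a, n_view_b, n_think, n_answer,
                         drop_frac=0.5, seed=0):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed (hypothetical) sequence layout:
        [view A patches | view B patches | thinking-image tokens | answer tokens]
    """
    rng = random.Random(seed)
    total = n_view_a + n_view_b + n_think + n_answer

    # Assumption: start from a plain causal mask (each token sees itself and the past).
    mask = [[k <= q for k in range(total)] for q in range(total)]

    # Choose which patches of one input view (here view B) to hide.
    n_drop = round(drop_frac * n_view_b)
    view_b_positions = range(n_view_a, n_view_a + n_view_b)
    dropped = sorted(rng.sample(view_b_positions, n_drop))

    # Hide the chosen patches from the answer span only; the rows of the
    # thinking-image tokens are left unchanged, so they still see every patch.
    answer_start = n_view_a + n_view_b + n_think
    for q in range(answer_start, total):
        for k in dropped:
            mask[q][k] = False

    return mask, dropped


if __name__ == "__main__":
    mask, dropped = vdrop_attention_mask(4, 4, 3, 2)
    print("hidden view-B positions:", dropped)
    for row in mask:
        print("".join("1" if allowed else "0" for allowed in row))
```

Under this sketch, the answer tokens can recover the hidden content only through the thinking-image tokens, which matches the stated goal of making the answer depend on the thinking image. Since the abstract describes VDrop as a training-time intervention, a mask like this would be applied during training only.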