Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension

📅 2026-02-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a limitation of current large multimodal models in visual understanding: they rely heavily on deep sequential reasoning and are thus prone to cognitive biases and constrained exploration. To overcome this, we introduce, for the first time, a parallel reasoning paradigm for multimodal vision tasks, proposing a parallelized divide-and-conquer inference framework. Our approach employs a visual partitioning strategy to decouple reasoning paths and incorporates Pa-Attention and LPRoPE mechanisms to ensure diversity and independence among these paths. Built on a native multimodal parallel architecture based on vLLM, the framework supports this new paradigm efficiently. Experiments demonstrate significant performance gains across multiple benchmarks, including V*, CountBench, RefCOCO, and HallusionBench, successfully extending the advantages of parallel reasoning into the visual domain.

๐Ÿ“ Abstract
Existing LLM test-time scaling laws emphasize the emergence of self-reflective behaviors through extended reasoning length. However, this vertical scaling strategy often encounters plateaus in exploration as the model becomes locked into a specific thinking pattern. By shifting from depth to parallelism, parallel thinking mitigates this narrowing of exploration, but extending the paradigm to the visual domain remains an open research question. In this paper, we first examine the role of visual partitioning in parallelized reasoning and propose two distinct partitioning strategies. Building on these, we introduce Visual Para-Thinker, the first parallel reasoning framework for MLLMs. To maintain path independence and promote diversity in reasoning, our approach integrates Pa-Attention alongside LPRoPE. Leveraging the vLLM framework, we develop a native multimodal implementation that supports high-efficiency parallel processing. Empirical results on benchmarks including V*, CountBench, RefCOCO, and HallusionBench confirm that Visual Para-Thinker successfully extends the benefits of parallel reasoning to the visual domain.
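The divide-and-conquer inference loop described above can be sketched minimally. This is an illustrative assumption, not the paper's actual implementation: the names `partition_image`, `reason_over`, `aggregate`, and `para_think` are hypothetical, the toy "reasoning" (summing region values, as in a counting task) stands in for a real MLLM reasoning path, and Pa-Attention/LPRoPE are not modeled here.

```python
# Hypothetical sketch of parallelized divide-and-conquer visual reasoning.
# All function names are illustrative assumptions; a real system would run
# independent MLLM reasoning paths (e.g. via vLLM) instead of reason_over.
from concurrent.futures import ThreadPoolExecutor

def partition_image(image, grid=(2, 2)):
    """Split an image (a 2-D list of rows) into grid cells (visual partitioning)."""
    rows, cols = grid
    h, w = len(image), len(image[0])
    rh, rw = h // rows, w // cols
    return [
        [row[c * rw:(c + 1) * rw] for row in image[r * rh:(r + 1) * rh]]
        for r in range(rows) for c in range(cols)
    ]

def reason_over(region):
    """Stand-in for one independent reasoning path (here: count nonzero cells)."""
    return sum(1 for row in region for cell in row if cell)

def aggregate(partial_answers):
    """Merge per-region answers into a final answer (here: a total count)."""
    return sum(partial_answers)

def para_think(image, grid=(2, 2)):
    """Partition, reason over each region in parallel, then aggregate."""
    regions = partition_image(image, grid)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(reason_over, regions))
    return aggregate(partials)
```

For example, on a 4×4 binary map with six nonzero cells, `para_think` splits it into four quadrants, counts each quadrant concurrently, and sums the partial counts. The aggregation step is task-dependent: counting sums, grounding would select the best-scoring region instead.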
Problem

Research questions and friction points this paper is trying to address.

parallel reasoning
visual comprehension
multimodal large language models
reasoning diversity
visual partitioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

parallel reasoning
visual partitioning
Pa-Attention
LPRoPE
multimodal LLM
Haoran Xu
Zhejiang University
Embodied AI, Robotics, Computer Vision, 3D Vision
Hongyu Wang
Hunan University
Jiaze Li
Zhejiang University
MLLM, Federated Learning
Shunpeng Chen
Independent Researcher
Zizhao Tong
University of Chinese Academy of Sciences
Jianzhong Ju
MiLMPlus, Xiaomi Inc
Zhenbo Luo
Xiaomi
Vision Language Model, Computer Vision
Jian Luan
Toshiba, Microsoft, Xiaomi
LLM, VLM, TTS, Singing Synthesis