🤖 AI Summary
This work addresses a limitation of current large language models in visual understanding: they rely heavily on deep sequential reasoning and are thus prone to cognitive biases and constrained exploration. To overcome this, we introduce, for the first time, a parallel reasoning paradigm for multimodal vision tasks, proposing a parallelized divide-and-conquer inference framework. Our approach employs a visual partitioning strategy to decouple reasoning paths and incorporates Pa-Attention and LPRoPE mechanisms to ensure diversity and independence among these paths. Built as a native multimodal parallel architecture on top of vLLM, the framework supports this new paradigm efficiently. Experiments demonstrate significant performance gains across multiple benchmarks, including V*, CountBench, RefCOCO, and HallusionBench, successfully extending the advantages of parallel reasoning into the visual domain.
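The summary does not give implementation details of the divide-and-conquer pipeline. A minimal sketch of the general idea, with entirely hypothetical names (`partition_image`, `reason_over_region`, `parallel_infer`) and a toy aggregation in place of an actual MLLM, might look like:

```python
# Illustrative sketch of grid-based visual partitioning feeding parallel
# reasoning paths. None of these functions come from the paper; the real
# framework runs model-based reasoning paths inside vLLM, not thread pools.
from concurrent.futures import ThreadPoolExecutor


def partition_image(image, rows=2, cols=2):
    """Split an image (nested list of pixel rows) into a rows x cols grid of tiles."""
    h, w = len(image), len(image[0])
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tile = [row[c * w // cols:(c + 1) * w // cols]
                    for row in image[r * h // rows:(r + 1) * h // rows]]
            tiles.append(tile)
    return tiles


def reason_over_region(tile):
    # Stand-in for one independent reasoning path over a region;
    # here it just sums pixel values.
    return sum(sum(row) for row in tile)


def parallel_infer(image, rows=2, cols=2):
    # Divide: one tile per path. Conquer: run paths concurrently. Merge: aggregate.
    tiles = partition_image(image, rows, cols)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(reason_over_region, tiles))
    return sum(partials)
```

The point of the sketch is only the control flow: regions decouple the paths, so each path can explore its region without being biased by the others' intermediate conclusions.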
📝 Abstract
Existing LLM test-time scaling laws emphasize the emergence of self-reflective behaviors through extended reasoning length. However, this vertical scaling strategy often hits exploration plateaus as the model becomes locked into specific thinking patterns. By shifting from depth to parallelism, parallel thinking mitigates this narrowing of exploration, yet extending the paradigm to the visual domain remains an open research question. In this paper, we first examine the role of visual partitioning in parallelized reasoning and propose two distinct partitioning strategies. Building on these, we introduce Visual Para-Thinker, the first parallel reasoning framework for MLLMs. To maintain path independence and promote diversity in reasoning, our approach integrates Pa-Attention alongside LPRoPE. Leveraging the vLLM framework, we develop a native multimodal implementation that enables high-efficiency parallel inference. Empirical results on benchmarks including V*, CountBench, RefCOCO, and HallusionBench confirm that Visual Para-Thinker successfully extends the benefits of parallel reasoning to the visual domain.
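The abstract names Pa-Attention as the mechanism keeping reasoning paths independent, but does not define it here. One common way to realize path independence is an attention mask in which every token sees the shared prompt while each path attends only within itself; the sketch below illustrates that generic masking pattern, not the paper's actual Pa-Attention:

```python
# Hedged illustration of a path-independence attention mask (assumption:
# a shared prefix of prefix_len tokens followed by num_paths equal-length
# paths laid out contiguously). mask[q][k] is True iff query token q may
# attend to key token k.
def build_path_mask(prefix_len, path_len, num_paths):
    total = prefix_len + num_paths * path_len
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # causal: no attending to future tokens
            if k < prefix_len:
                mask[q][k] = True           # every token sees the shared prompt
            elif q >= prefix_len:
                q_path = (q - prefix_len) // path_len
                k_path = (k - prefix_len) // path_len
                mask[q][k] = (q_path == k_path)  # within-path attention only
    return mask
```

Because no path can read another path's tokens, each path's exploration stays decoupled, which is the property the abstract attributes to Pa-Attention (together with LPRoPE on the positional side).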