AI Summary
This work addresses the challenge that existing unified multimodal models struggle to dynamically coordinate comprehension and generation capabilities during inference, often constrained by fixed or tightly coupled coordination strategies. The authors propose UniPath, a novel framework that reveals, for the first time, the existence of diverse coordination pathways in multimodal reasoning, such as direct answering, textual reasoning, visual thought construction, and hypothesis exploration. By training path-conditioned executors through role-aligned trajectory learning and integrating a lightweight planner for input-adaptive path selection, UniPath achieves both high performance and interpretability within a unified architecture. Experiments demonstrate significant improvements over fixed coordination strategies across a range of multimodal tasks, underscoring the critical role of coordination pathway diversity in effective multimodal reasoning.
Abstract
Unified multimodal models (UMMs) aim to integrate understanding and generation within a single architecture. However, how to coordinate these two capabilities for more effective and efficient reasoning remains underexplored. Existing coordination approaches either couple the two capabilities during training, without explicit inference-time coordination, or impose a fixed coordination pattern on all inputs. In this work, we show that multimodal tasks exhibit substantial coordination-path diversity: different inputs favor different coordination paths. This suggests that exploiting such diversity is key to improving performance. We propose UniPath, a framework for adaptively modeling and exploiting coordination-path diversity. Instead of enforcing a single coordination pattern, we represent task solving as the selection and execution of a path, ranging from direct answering to textual inference, visual-thought construction, and hypothesis-based exploration. We construct role-aligned trajectories to train a path-conditioned executor and introduce a lightweight planner to enable input-dependent path selection. Experiments show that leveraging coordination-path diversity improves performance over fixed coordination strategies while providing interpretable intermediate behaviors. The code is available at: https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/unipath.
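The select-then-execute idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not reflect the API of the linked repository. The four path names follow the abstract, but everything else is a hypothetical stand-in: the `Query` type, the keyword rules in `toy_planner` (the paper describes a learned lightweight planner, not hand-written rules), and the step labels returned by `execute` (the paper trains a path-conditioned executor rather than looking up a fixed step list).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Path(Enum):
    # The four coordination paths named in the abstract.
    DIRECT_ANSWER = "direct_answer"
    TEXTUAL_INFERENCE = "textual_inference"
    VISUAL_THOUGHT = "visual_thought"
    HYPOTHESIS_EXPLORATION = "hypothesis_exploration"


@dataclass
class Query:
    # Hypothetical input container, assumed for illustration only.
    text: str
    has_image: bool = False


def toy_planner(query: Query) -> Path:
    """Pick a path per input. Keyword rules are an assumption standing in
    for a learned planner."""
    words = query.text.lower()
    if "what if" in words or "which of" in words:
        return Path.HYPOTHESIS_EXPLORATION
    if query.has_image and any(k in words for k in ("rotate", "draw", "fold")):
        return Path.VISUAL_THOUGHT
    if any(k in words for k in ("why", "how many", "explain")):
        return Path.TEXTUAL_INFERENCE
    return Path.DIRECT_ANSWER


def execute(query: Query, path: Path) -> List[str]:
    """Return illustrative intermediate step labels for the chosen path.
    The labels are hypothetical; a real executor would run the model."""
    steps = {
        Path.DIRECT_ANSWER: ["answer"],
        Path.TEXTUAL_INFERENCE: ["text_reasoning", "answer"],
        Path.VISUAL_THOUGHT: ["generate_image", "inspect_image", "answer"],
        Path.HYPOTHESIS_EXPLORATION: ["propose_hypotheses", "verify_each", "answer"],
    }
    return steps[path]


def solve(query: Query) -> Tuple[Path, List[str]]:
    # Input-dependent selection followed by path-conditioned execution.
    path = toy_planner(query)
    return path, execute(query, path)
```

Calling `solve(Query("What color is the car?", has_image=True))` takes the direct-answer path in this sketch, while a query that mentions folding a shape in an image is routed to the visual-thought path. The point is only that different inputs trigger different intermediate behaviors, and that the selected path makes those behaviors inspectable.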