🤖 AI Summary
Existing multimodal large language models (MLLMs) predominantly focus on forward reasoning paths, overlooking the diagnostic value of negative reasoning paths for identifying erroneous inference patterns. To address this, we propose Answer-Oriented Chain-of-Thought (AoT), a novel framework that automatically constructs high-quality positive and negative multimodal reasoning chains from correct and misleading answers, enabling the first self-aligned, iterative training paradigm that requires no human annotations. AoT integrates vision-language joint modeling, prompt engineering, self-generated reasoning, and preference learning. It consistently improves reasoning accuracy across diverse model architectures and scales, outperforms human-annotated baselines, and supports continual refinement. Our core contributions are: (1) the systematic incorporation of negative reasoning paths to enhance robustness against spurious correlations and hallucinations; and (2) the establishment of an end-to-end self-supervised paradigm for multimodal reasoning alignment. Experimental results demonstrate significant gains on benchmark reasoning tasks, validating AoT's effectiveness and generalizability.
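The pairing of correct and misleading answers described above can be sketched as follows. This is a minimal illustration, not the paper's released code: the prompt wording, the `generate` callable, and all function names are hypothetical stand-ins for an MLLM decoding step.

```python
# Hypothetical sketch of AoT preference-pair construction.
# All names here are illustrative assumptions, not the paper's actual API.

def build_aot_prompt(question, answer):
    """Answer-oriented prompt: the provided answer steers the model toward
    the visual evidence that links the question to that answer."""
    return (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Explain step by step, citing visual evidence from the image, "
        "why this answer follows from the question."
    )

def make_preference_pair(question, correct, misleading, generate):
    """Build a (chosen, rejected) rationale pair for preference learning.

    `generate` is any callable mapping a prompt string to a rationale
    string (e.g. an MLLM decode); it is injected so the sketch is testable
    without a real model.
    """
    positive = generate(build_aot_prompt(question, correct))    # grounded in the true answer
    negative = generate(build_aot_prompt(question, misleading)) # compelling but flawed
    return {"prompt": question, "chosen": positive, "rejected": negative}

# Toy stand-in for the MLLM: echoes the prompt's answer line as a "rationale".
toy_generate = lambda prompt: "Rationale for: " + prompt.splitlines()[1]

pair = make_preference_pair(
    "What animal is on the left?", "a cat", "a dog", toy_generate
)
```

The resulting dictionaries match the `(prompt, chosen, rejected)` triple format that common preference-optimization trainers consume, which is one plausible way the generated chains could feed the alignment stage.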
📝 Abstract
Achieving human-like reasoning capabilities in Multimodal Large Language Models (MLLMs) has long been a goal. Current methodologies primarily focus on synthesizing positive rationales, while overlooking the critical role of negative rationales in training models to discern flawed reasoning patterns. To address this gap, we propose a novel framework: **S**elf-Aligning **M**ultimodal Reasoning with **A**nswer-O**r**iented Chain-of-**T**hought (SMART). This framework enables models to use Answer-Oriented Chain-of-Thought (AoT) prompts to automatically generate high-quality positive and negative reasoning paths, followed by self-alignment to enhance their reasoning abilities. Inspired by human strategies for solving proof-based problems, AoT uses answers as a guide to help the model extract the critical visual information that links questions and answers. When provided with ground-truth answers, the model produces strong positive rationales. Conversely, when correct answers are replaced with misleading alternatives, the model generates erroneous yet compelling reasoning paths that serve as discriminative negative rationales. Models trained with AoT-generated data outperform those trained on manually annotated datasets, demonstrating superior reasoning capabilities. This, in turn, allows the improved models to generate higher-quality preference data for further optimization. Consequently, SMART establishes an iterative generation-optimization method that continually enhances the model's reasoning skills. Experiments indicate that the SMART framework significantly improves various MLLMs, regardless of model architecture, parameter size, or pre-training dataset. The code, datasets, and models will be released.
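The self-alignment step operates on such (chosen, rejected) rationale pairs via preference learning. As one hedged illustration of what that objective could look like, the sketch below computes a DPO-style loss on a single pair from scalar sequence log-probabilities; whether SMART uses exactly this objective is an assumption, and the function is a simplified stand-in for a batched implementation over a real policy and frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style preference loss on one (chosen, rejected) rationale pair.

    Inputs are sequence log-probabilities of the two rationales under the
    current policy and a frozen reference model. The loss is
    -log(sigmoid(margin)), which pushes the policy to prefer the chosen
    (answer-grounded) rationale over the rejected (misleading) one.
    Illustrative only; not the paper's verified training objective.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy equals the reference, the margin is 0 and the loss is log 2.
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
```

In the iterative generation-optimization loop described above, each round would regenerate preference pairs with the improved model and re-optimize this objective, so the quality of both rationales and policy can rise together.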