Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding

📅 2025-12-19
🤖 AI Summary
Current multimodal large language models (MLLMs) are not robust enough under severe real-world visual degradations. Method: Robust-R1 is a structured reasoning framework that explicitly models degradation factors through a chained “degradation parameters → perceptual impact → semantic reasoning” mechanism. The approach combines degradation-aware supervised fine-tuning, reward-driven alignment for perceiving degradation parameters, and reasoning depth that scales with degradation intensity. The authors construct an 11K-sample dataset of realistic degradations synthesized across four real-world visual processing stages, with each sample annotated with a structured reasoning chain. Contribution/Results: On the real-world degradation benchmark R-Bench, Robust-R1 outperforms all general-purpose and robust MLLM baselines, and it maintains stronger anti-degradation performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.

📝 Abstract
Multimodal Large Language Models struggle to maintain reliable performance under extreme real-world visual degradations, which impede their practical robustness. Existing robust MLLMs predominantly rely on implicit training/adaptation that focuses solely on visual encoder generalization, suffering from limited interpretability and isolated optimization. To overcome these limitations, we propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity. To facilitate this approach, we introduce a specialized 11K dataset featuring realistic degradations synthesized across four critical real-world visual processing stages, each annotated with structured chains connecting degradation parameters, perceptual influence, pristine semantic reasoning chain, and conclusion. Comprehensive evaluations demonstrate state-of-the-art robustness: Robust-R1 outperforms all general and robust baselines on the real-world degradation benchmark R-Bench, while maintaining superior anti-degradation performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.
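The abstract describes each dataset sample as a structured chain linking degradation parameters, perceptual influence, a pristine semantic reasoning chain, and a conclusion. A minimal sketch of how such a record could be represented follows. The class names, field names, tag names, and the normalized `[0, 1]` intensity scale are all assumptions made for illustration; the source does not specify the actual annotation format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Degradation:
    """One degradation applied to the image (hypothetical schema)."""
    kind: str         # illustrative label, e.g. "gaussian_noise"
    intensity: float  # assumed to be normalized to [0, 1] in this sketch


@dataclass
class ReasoningChain:
    """The four chain elements named in the abstract, as plain fields."""
    degradations: List[Degradation]
    perceptual_influence: str
    pristine_reasoning: str
    conclusion: str

    def render(self) -> str:
        # Serialize the chain as tagged text; the tag names are hypothetical.
        params = "; ".join(
            f"{d.kind}={d.intensity:.2f}" for d in self.degradations
        )
        return (
            f"<degradation>{params}</degradation>\n"
            f"<influence>{self.perceptual_influence}</influence>\n"
            f"<reasoning>{self.pristine_reasoning}</reasoning>\n"
            f"<answer>{self.conclusion}</answer>"
        )


example = ReasoningChain(
    degradations=[Degradation("gaussian_noise", 0.6)],
    perceptual_influence="Fine texture is obscured; large shapes remain visible.",
    pristine_reasoning="The object has four legs and a flat top, so it is a table.",
    conclusion="table",
)
rendered = example.render()
```

Keeping the degradation parameters as structured fields rather than free text makes it straightforward to compare a model's predicted parameters against the annotation, which is what a reward on degradation perception would need.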
Problem

Research questions and friction points this paper is trying to address.

Addresses unreliable performance of multimodal models under extreme visual degradations.
Overcomes limitations of implicit training by introducing explicit degradation-aware reasoning chains.
Enhances interpretability and robustness through structured reasoning and dynamic depth scaling.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Degradation-aware reasoning chains for explicit modeling.
Reward-driven alignment for accurate degradation perception.
Dynamic reasoning depth scaling adapted to degradation intensity.
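To make the last two items more concrete, here is a rough sketch of what a degradation-perception reward and an intensity-dependent reasoning length target might look like. This is not the paper's formulation: the function names, the averaging scheme, the linear length schedule, and the token bounds are all hypothetical, chosen only to illustrate the general idea of rewarding accurate parameter perception and scaling reasoning depth with degradation intensity.

```python
from typing import Dict


def perception_reward(predicted: Dict[str, float],
                      reference: Dict[str, float]) -> float:
    """Score in [0, 1] for predicted degradation parameters (assumed form).

    Each degradation type present in both dicts earns 1 - |error| on its
    intensity. Missing or spurious types earn 0 and still count in the
    denominator, so over- and under-prediction are both penalized.
    """
    kinds = set(predicted) | set(reference)
    if not kinds:
        return 1.0  # nothing to perceive and nothing predicted
    total = 0.0
    for kind in kinds:
        if kind in predicted and kind in reference:
            total += 1.0 - min(1.0, abs(predicted[kind] - reference[kind]))
    return total / len(kinds)


def target_chain_length(intensity: float,
                        min_tokens: int = 64,
                        max_tokens: int = 512) -> int:
    """Linear schedule: stronger degradation gets a longer reasoning budget."""
    clipped = max(0.0, min(1.0, intensity))
    return round(min_tokens + clipped * (max_tokens - min_tokens))


def length_reward(actual_tokens: int, intensity: float) -> float:
    """Reward in [0, 1] that peaks when the chain length matches the target."""
    target = target_chain_length(intensity)
    return max(0.0, 1.0 - abs(actual_tokens - target) / target)
```

In this sketch a model that hallucinates an extra degradation type gets a lower score than one that reports only the true type, and a lightly degraded image is assigned a shorter reasoning target than a heavily degraded one.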
Authors
Jiaqi Tang
Hong Kong University of Science and Technology
Jianmin Chen
Northwestern Polytechnical University
Wei Wei
Northwestern Polytechnical University
Xiaogang Xu
CUHK
Large Model, Multi-Modality AI, AIGC, Generative Photography, AI Security
Runtao Liu
Hong Kong University of Science and Technology
computer vision, ai safety, RLHF, reasoning
Xiangyu Wu
Nanjing University of Science and Technology
Qipeng Xie
Unknown affiliation
Jiafei Wu
University of Hong Kong
Lei Zhang
Northwestern Polytechnical University
Qifeng Chen
HKUST
Computational Photography, Image Synthesis, Generative AI, Autonomous Driving, Embodied AI