🤖 AI Summary
This work addresses the sharp drop in understanding performance that unified multimodal models suffer under real-world image corruptions—such as blur, noise, compression artifacts, and low-light conditions—and the failure of existing approaches to coordinate their generative and reasoning capabilities. To overcome this, the authors propose CLEAR, a three-stage framework that first establishes a “generate-then-answer” reasoning paradigm via degradation-aware supervised fine-tuning, then introduces a latent representation bridging mechanism that replaces the conventional decode-and-re-encode pathway for improved efficiency, and finally employs interleaved GRPO-based reinforcement learning to jointly optimize visual generation and textual reasoning. CLEAR is the first method to explicitly align generation and understanding within a unified multimodal model, improving intermediate visual representations without requiring pixel-level reconstruction supervision. Evaluated on MMD-Bench—a benchmark spanning six multimodal datasets at three degradation severity levels—CLEAR substantially improves robustness to corrupted inputs while maintaining strong performance on clean images.
📝 Abstract
Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-and-re-encode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-and-re-encode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
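To make the data flow concrete, here is a minimal toy sketch of the generate-then-answer pattern with a latent bridge. Every name (`encode`, `generate_latent`, `bridge`, `answer`) is illustrative and not the paper's API; the point is only that generated latents feed the reasoning pathway directly, instead of being decoded to pixels and re-encoded.

```python
# Hypothetical sketch of a generate-then-answer pipeline with a latent bridge.
# Real models would use learned neural modules; plain lists stand in for tensors.

def encode(image_tokens):
    # Encode a (possibly degraded) image into latent features.
    return [t * 2 for t in image_tokens]

def generate_latent(features):
    # Generative pathway: produce a restored intermediate visual state in latent space.
    return [f + 1 for f in features]

def bridge(latent):
    # Latent bridge: hand generated latents straight to the reasoning pathway,
    # skipping the decode-to-pixels / re-encode detour. Identity here; in
    # practice this would be a learned, differentiable projection.
    return latent

def answer(features, question):
    # Reasoning pathway conditioned on the bridged latents.
    return f"answer({question}) over {len(features)} features"

def generate_then_answer(image_tokens, question):
    feats = encode(image_tokens)
    restored = bridge(generate_latent(feats))
    return answer(restored, question)

print(generate_then_answer([1, 2, 3], "what is shown?"))
```

Because the bridge keeps everything in latent space, a single task reward (e.g., answer correctness under GRPO-style RL) can in principle backpropagate through both the reasoning and generation stages, which is the joint optimization the decode-and-re-encode pathway blocks.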