🤖 AI Summary
This work addresses the sharp drop in understanding performance that unified multimodal models suffer under real-world image corruptions—such as blur, noise, compression artifacts, and low-light conditions—and the failure of existing approaches to coordinate their generative and reasoning capabilities. To overcome this, the authors propose CLEAR, a three-stage framework that first establishes a “generate-then-answer” reasoning paradigm via degradation-aware supervised fine-tuning, then introduces a latent representation bridging mechanism that replaces the conventional decode-and-re-encode pathway for improved efficiency, and finally employs interleaved GRPO-based reinforcement learning to jointly optimize visual generation and textual reasoning. CLEAR is the first method to explicitly align generation and understanding within a unified multimodal model, improving intermediate visual representations without requiring pixel-level reconstruction supervision. Evaluated on MMD-Bench—a benchmark spanning six multimodal datasets at three degradation severity levels—CLEAR substantially improves robustness to corrupted inputs while maintaining strong performance on clean images.
📝 Abstract
Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-and-re-encode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-and-re-encode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
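To make the data flow concrete, here is a minimal toy sketch of the generate-then-answer pattern with a latent bridge. Every name (`encode`, `generate_latent`, `bridge`, `answer`) is illustrative and not the paper's API; the point is only that generated latents feed the reasoning pathway directly, instead of being decoded to pixels and re-encoded.

```python
# Hypothetical sketch of a generate-then-answer pipeline with a latent bridge.
# Real models would use learned neural modules; plain lists stand in for tensors.

def encode(image_tokens):
    # Encode a (possibly degraded) image into latent features.
    return [t * 2 for t in image_tokens]

def generate_latent(features):
    # Generative pathway: produce a restored intermediate visual state in latent space.
    return [f + 1 for f in features]

def bridge(latent):
    # Latent bridge: hand generated latents straight to the reasoning pathway,
    # skipping the decode-to-pixels / re-encode detour. Identity here; in
    # practice this would be a learned, differentiable projection.
    return latent

def answer(features, question):
    # Reasoning pathway conditioned on the bridged latents.
    return f"answer({question}) over {len(features)} features"

def generate_then_answer(image_tokens, question):
    feats = encode(image_tokens)
    restored = bridge(generate_latent(feats))
    return answer(restored, question)

print(generate_then_answer([1, 2, 3], "what is shown?"))
```

Because the bridge keeps everything in latent space, a single task reward (e.g., answer correctness under GRPO-style RL) can in principle backpropagate through both the reasoning and generation stages, which is the joint optimization the decode-and-re-encode pathway blocks.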