Understanding Degradation with Vision Language Model

📅 2026-02-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that vision-language models struggle to comprehend the physical mechanisms underlying image degradation. To this end, it formalizes degradation understanding as a hierarchical structured prediction task that jointly predicts degradation types, associated parameter keys, and their continuous physical values. The authors propose DU-VLM, a multimodal chain-of-thought architecture trained via supervised fine-tuning and reinforcement learning with structured rewards, augmented by quantized grid constraints to mitigate value-space errors. A new physically annotated dataset, DU-110k, comprising 110,000 samples, is introduced to support this framework. Experiments demonstrate that DU-VLM significantly outperforms general-purpose baselines in both accuracy and robustness, exhibits strong out-of-distribution generalization, and functions effectively as a zero-shot controller to guide diffusion models for high-fidelity image restoration.

📝 Abstract
Understanding visual degradations is a critical yet challenging problem in computer vision. While recent Vision-Language Models (VLMs) excel at qualitative description, they often fall short in understanding the parametric physics underlying image degradations. In this work, we redefine degradation understanding as a hierarchical structured prediction task, necessitating the concurrent estimation of degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in disparate spaces, we prove that they can be unified under one autoregressive next-token prediction paradigm, whose error is bounded by the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning using structured rewards. Furthermore, we show that DU-VLM can serve as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations. Extensive experiments demonstrate that our approach significantly outperforms generalist baselines in both accuracy and robustness, exhibiting generalization to unseen distributions.
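The abstract's key technical claim is that continuous physical values can be emitted as ordinary tokens by snapping them onto a discrete value grid, which bounds the worst-case value error by half the grid step. A minimal sketch of that quantization idea follows; the parameter (a Gaussian-blur sigma), its range, and the grid step are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of value-space quantization: a continuous degradation parameter
# (here a hypothetical blur sigma in [0, 5]) is snapped to the nearest
# point on a uniform grid so an autoregressive model can predict it as a
# token. The quantization error is at most half the grid step.

def make_grid(lo: float, hi: float, step: float) -> list[float]:
    """Uniform quantization grid over [lo, hi]."""
    n = int(round((hi - lo) / step))
    return [lo + i * step for i in range(n + 1)]

def quantize(value: float, grid: list[float]) -> float:
    """Snap a continuous value to the nearest grid point."""
    return min(grid, key=lambda g: abs(g - value))

sigma_grid = make_grid(0.0, 5.0, 0.1)   # hypothetical sigma grid, step 0.1
true_sigma = 2.37
pred_sigma = quantize(true_sigma, sigma_grid)
# worst-case quantization error is step / 2 = 0.05
assert abs(pred_sigma - true_sigma) <= 0.05
```

With a finer grid the bound tightens proportionally, which is consistent with the abstract's statement that the paradigm's error is bounded by the value-space quantization grid.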
Problem

Research questions and friction points this paper is trying to address.

image degradation
vision-language models
physical parameters
structured prediction
degradation understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model
Structured Prediction
Autoregressive Modeling
Zero-shot Image Restoration
Degradation Understanding
Guan-Wei Lan
School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University; Shanghai AI Laboratory
Chenyi Liao
Honors College, Northwestern Polytechnical University
Yuqi Yang
Nankai University
Computer Vision, Semantic Segmentation
Qianli Ma
Shanghai Jiao Tong University
Deep Learning, Generative AI, LLMs, MLLMs
Zhigang Wang
Shanghai AI Laboratory
Dong Wang
Shanghai AI Laboratory
Embodied AI, Robot Vision, Robot Foundation Model
Bin Zhao
Northwestern Polytechnical University, Shanghai AI Laboratory
Computer Vision, Embodied Artificial Intelligence
Xuelong Li
TeleAI, China Telecom; School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University