🤖 AI Summary
This work addresses three core challenges in low-level vision: poor generalization in multi-task settings, task ambiguity, and degraded detail recovery caused by interference from generative tasks. We propose the first unified multimodal multi-task framework supporting over 100 sub-tasks, spanning image restoration, enhancement, weak-semantic dense prediction, and stylization. Methodologically, we design a dual-branch architecture that encodes text and visual prompts separately, introduce shallow-feature collaborative modulation, and build a resolution-agnostic generative prior on a Diffusion Transformer backbone. We further uncover, for the first time, the mechanism by which high-level generative tasks impair detail-sensitive restoration; this insight enables fine-grained fidelity preservation and task-decoupled generalization. Our method achieves state-of-the-art performance at 1K resolution, substantially improves cross-task generalization, covers all four major categories of low-level vision tasks, and sets new standards for detail retention and reconstruction fidelity.
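To make the dual-branch design concrete, here is a minimal PyTorch sketch of encoding text and visual prompts in separate branches rather than through a shared encoder. The module names, dimensions, and the visual-prompt format (a channel-concatenated input/output example pair) are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch only: separate encoding branches for text and visual prompts.
# All names/dims are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn

class DualBranchPromptEncoder(nn.Module):
    def __init__(self, text_dim=768, vis_dim=1024, model_dim=1152):
        super().__init__()
        # Branch 1: project precomputed text-instruction embeddings
        # (e.g., from a frozen text encoder) into the backbone width.
        self.text_proj = nn.Sequential(
            nn.Linear(text_dim, model_dim), nn.SiLU(),
            nn.Linear(model_dim, model_dim),
        )
        # Branch 2: a lightweight conv stem for visual prompts, assumed
        # here to be an example input/output pair stacked channel-wise (6ch).
        self.vis_stem = nn.Sequential(
            nn.Conv2d(6, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, vis_dim, 3, stride=2, padding=1),
        )
        self.vis_proj = nn.Linear(vis_dim, model_dim)

    def forward(self, text_emb, visual_prompt):
        # text_emb: (B, L, text_dim); visual_prompt: (B, 6, H, W)
        text_tokens = self.text_proj(text_emb)                     # (B, L, D)
        v = self.vis_stem(visual_prompt)                           # (B, C, h, w)
        vis_tokens = self.vis_proj(v.flatten(2).transpose(1, 2))  # (B, h*w, D)
        # Returned as two separate token streams so the backbone can attend
        # to each independently -- the point of separate encoding is to keep
        # the two instruction modalities from blurring into one signal.
        return text_tokens, vis_tokens
```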
📝 Abstract
We present Lumina-OmniLV (abbreviated as OmniLV), a universal multimodal multi-task framework for low-level vision that addresses over 100 sub-tasks across four major categories: image restoration, image enhancement, weak-semantic dense prediction, and stylization. OmniLV leverages both textual and visual prompts to offer flexible, user-friendly interaction. Built on Diffusion Transformer (DiT)-based generative priors, our framework supports arbitrary resolutions, achieving optimal performance at 1K resolution, while preserving fine-grained details and high fidelity. Through extensive experiments, we demonstrate that encoding text and visual instructions separately, combined with co-training using shallow-feature control, is essential for mitigating task ambiguity and enhancing multi-task generalization. Our findings also reveal that integrating high-level generative tasks into low-level vision models can compromise detail-sensitive restoration. These insights pave the way for more robust and generalizable low-level vision systems.
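The shallow-feature control mentioned above can be pictured as conditioning injected only into the first few DiT blocks. The sketch below, again in PyTorch with assumed names and a stand-in transformer block, uses zero-initialized projections so the control path contributes nothing at the start of co-training; the actual architecture may differ.

```python
# Sketch only: "shallow-feature control" as condition injection restricted
# to the first few DiT blocks. Block internals and names are stand-ins.
import torch
import torch.nn as nn

class ShallowControlDiT(nn.Module):
    def __init__(self, dim=1152, depth=24, n_shallow=4, n_heads=16):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
            for _ in range(depth)
        )
        # One zero-initialized linear per shallow block: the control path
        # starts as a no-op and grows during co-training.
        self.ctrl_proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_shallow))
        for proj in self.ctrl_proj:
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)
        self.n_shallow = n_shallow

    def forward(self, x, ctrl_tokens):
        # x: (B, N, dim) noisy latent tokens;
        # ctrl_tokens: (B, N, dim) control features on the same token grid.
        for i, block in enumerate(self.blocks):
            if i < self.n_shallow:
                x = x + self.ctrl_proj[i](ctrl_tokens)  # shallow-only modulation
            x = block(x)
        return x
```

Restricting injection to the shallow layers is one plausible way to reconcile strong conditioning with fine-detail preservation, consistent with the paper's finding that interference from generative tasks harms detail-sensitive restoration.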