Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision

📅 2025-04-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses three core challenges in low-level vision: poor generalization across multi-task settings, task ambiguity, and degradation of detail recovery due to interference from generative tasks. We propose the first unified multimodal multi-task framework supporting over 100 subtasks—including image restoration, enhancement, weak-semantic dense prediction, and stylization. Methodologically, we design a dual-branch architecture with separated text and visual prompt encoding, introduce shallow-feature collaborative modulation, and build a resolution-agnostic generative prior based on the Diffusion Transformer. We further uncover, for the first time, the detrimental mechanism by which high-order generative tasks impair detail-sensitive restoration—enabling fine-grained fidelity preservation and task-decoupled generalization. Our method achieves state-of-the-art performance at 1K resolution, demonstrates significantly improved cross-task generalization, comprehensively covers four major categories of low-level vision tasks, and sets new benchmarks in detail retention and reconstruction fidelity.

Technology Category

Application Category

📝 Abstract
We present Lunima-OmniLV (abbreviated as OmniLV), a universal multimodal multi-task framework for low-level vision that addresses over 100 sub-tasks across four major categories: image restoration, image enhancement, weak-semantic dense prediction, and stylization. OmniLV leverages both textual and visual prompts to offer flexible and user-friendly interactions. Built on Diffusion Transformer (DiT)-based generative priors, our framework supports arbitrary resolutions -- achieving optimal performance at 1K resolution -- while preserving fine-grained details and high fidelity. Through extensive experiments, we demonstrate that separately encoding text and visual instructions, combined with co-training using shallow feature control, is essential to mitigate task ambiguity and enhance multi-task generalization. Our findings also reveal that integrating high-level generative tasks into low-level vision models can compromise detail-sensitive restoration. These insights pave the way for more robust and generalizable low-level vision systems.
Problem

Research questions and friction points this paper is trying to address.

Unified framework for 100+ low-level vision tasks
Leverages text and visual prompts for interaction
Balances detail preservation and task generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal framework for 100+ low-level vision tasks
Text and visual prompts enable flexible interactions
DiT-based generative priors support arbitrary resolutions
🔎 Similar Papers
No similar papers found.