OMG-Agent: Toward Robust Missing Modality Generation with Decoupled Coarse-to-Fine Agentic Workflows

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two failure modes in reconstructing missing modalities for multimodal systems: parametric generative models hallucinate because they over-rely on internal memory, while retrieval-augmented frameworks retrieve too rigidly. In both cases, end-to-end architectures entangle semantic reasoning with the synthesis of fine-grained details. To this end, the authors propose a dynamic coarse-to-fine agentic workflow that decouples missing modality generation into logical reasoning and signal synthesis, realized as three stages. A multimodal large language model (MLLM)-driven semantic planner produces a deterministic, structured semantic plan; a non-parametric evidence retriever uses that plan to fetch relevant external knowledge; and a retrieval-injected executor uses the retrieved evidence as feature prompts for high-fidelity synthesis. The authors report that the method consistently surpasses state-of-the-art approaches across multiple benchmarks and stays robust under extreme missingness, for example a 2.6-point gain on the CMU-MOSI dataset at a 70% missing rate.

📝 Abstract
Data incompleteness severely impedes the reliability of multimodal systems. Existing reconstruction methods face distinct bottlenecks: conventional parametric/generative models are prone to hallucinations due to over-reliance on internal memory, while retrieval-augmented frameworks struggle with retrieval rigidity. Critically, these end-to-end architectures are fundamentally constrained by Semantic-Detail Entanglement, a structural conflict between logical reasoning and signal synthesis that compromises fidelity. In this paper, we present Omni-Modality Generation Agent (OMG-Agent), a novel framework that shifts the paradigm from static mapping to a dynamic coarse-to-fine Agentic Workflow. By mimicking a deliberate-then-act cognitive process, OMG-Agent explicitly decouples the task into three synergistic stages: (1) an MLLM-driven Semantic Planner that resolves input ambiguity via Progressive Contextual Reasoning, creating a deterministic structured semantic plan; (2) a non-parametric Evidence Retriever that grounds abstract semantics in external knowledge; and (3) a Retrieval-Injected Executor that utilizes retrieved evidence as flexible feature prompts to overcome rigidity and synthesize high-fidelity details. Extensive experiments on multiple benchmarks demonstrate that OMG-Agent consistently surpasses state-of-the-art methods, maintaining robustness under extreme missingness, e.g., a 2.6-point gain on CMU-MOSI at 70% missing rates.
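The three-stage workflow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`SemanticPlan`, `plan`, `retrieve`, `execute`, `generate_missing`, the toy vocabulary, and the memory bank) is hypothetical. The planner is a keyword extractor standing in for an MLLM, the retriever is a plain cosine nearest-neighbour lookup, and the executor simply averages retrieved features instead of conditioning a generator on them. The sketch only shows how a plan-then-retrieve-then-synthesize pipeline keeps reasoning separate from signal synthesis.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class SemanticPlan:
    """Coarse, structured description of the missing modality (hypothetical schema)."""
    target_modality: str
    keywords: tuple


def plan(available: dict, target_modality: str) -> SemanticPlan:
    """Stage 1 stand-in: a real planner would query an MLLM; here we only
    collect lowercase keywords from the available text so the sketch runs."""
    words = available.get("text", "").lower().split()
    return SemanticPlan(target_modality, tuple(sorted(set(words))))


def embed(keywords) -> list:
    """Toy bag-of-words embedding over a fixed, made-up vocabulary."""
    vocab = ("happy", "sad", "angry", "calm")
    return [float(w in keywords) for w in vocab]


def cosine(a, b) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(p: SemanticPlan, bank: list, k: int = 1) -> list:
    """Stage 2 stand-in: non-parametric nearest-neighbour lookup in a memory bank."""
    query = embed(p.keywords)
    ranked = sorted(bank, key=lambda item: cosine(query, embed(item["tags"])), reverse=True)
    return ranked[:k]


def execute(p: SemanticPlan, evidence: list) -> list:
    """Stage 3 stand-in: average the retrieved features (assumes non-empty evidence).
    A real executor would use them as prompts for a generator."""
    feats = [e["feature"] for e in evidence]
    return [sum(col) / len(feats) for col in zip(*feats)]


def generate_missing(available: dict, target_modality: str, bank: list) -> list:
    """Run the plan -> retrieve -> execute pipeline end to end."""
    p = plan(available, target_modality)
    evidence = retrieve(p, bank)
    return execute(p, evidence)


# Made-up memory bank: each entry pairs semantic tags with a stored feature vector.
bank = [
    {"tags": ("happy",), "feature": [1.0, 0.0]},
    {"tags": ("sad",), "feature": [0.0, 1.0]},
]
result = generate_missing({"text": "I am so happy today"}, "audio", bank)
```

In this toy setup the plan is the only interface between reasoning and synthesis: the retriever and executor never see the raw input, which is one way to read the "decoupled" design the abstract describes.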
Problem

Research questions and friction points this paper is trying to address.

missing modality
multimodal systems
data incompleteness
semantic-detail entanglement
hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Missing Modality Generation
Decoupled Coarse-to-Fine Workflow
Semantic-Detail Entanglement
Agentic Reasoning
Retrieval-Augmented Generation
Ruiting Dai
University of Electronic Science and Technology of China
Zheyu Wang
University of Electronic Science and Technology of China
Haoyu Yang
University of Electronic Science and Technology of China
Yihan Liu
University of Electronic Science and Technology of China
Chengzhi Wang
University of Electronic Science and Technology of China
Zekun Zhang
Stony Brook University
Computer Vision, Machine Learning
Zishan Huang
University of Electronic Science and Technology of China
Jiaman Cen
University of Electronic Science and Technology of China
Lisi Mo
University of Electronic Science and Technology of China