A Versatile Multimodal Agent for Multimedia Content Generation

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AIGC models are often confined to single modalities or specific scenarios, struggling to generate complex multimodal content that seamlessly integrates audio, video, and text in an end-to-end manner. This work proposes a multimodal agent system grounded in skill acquisition theory to guide data construction and training. The approach features a two-stage plan optimization strategy, comprising self-correlation modeling and preference alignment, and a three-stage fine-tuning pipeline involving base training, successful-plan fine-tuning, and preference optimization. Experimental results demonstrate that the proposed method significantly outperforms current state-of-the-art models, achieving notable improvements in both generation quality and alignment with human preferences.

📝 Abstract
With the advancement of AIGC (AI-generated content) technologies, an increasing number of generative models are revolutionizing fields such as video editing, music generation, and even film production. However, due to the limitations of current AIGC models, most can only serve as individual components within specific application scenarios and cannot complete tasks end-to-end in real-world applications. In practice, editing experts work with a wide variety of image and video inputs and produce multimodal outputs -- a video typically includes audio, text, and other elements. This level of integration across multiple modalities is something current models cannot achieve effectively. The rise of agent-based systems, however, has made it possible to use AI tools to tackle complex content generation tasks. To handle these complex scenarios, in this paper we propose a MultiMedia-Agent designed to automate complex content creation. Our agent system includes a data generation pipeline, a tool library for content creation, and a set of metrics for evaluating preference alignment. Notably, we introduce skill acquisition theory to model the training data curation and agent training. We design a two-stage correlation strategy for plan optimization, comprising self-correlation and model preference correlation. Additionally, we use the generated plans to train the MultiMedia-Agent via a three-stage approach consisting of base training, successful-plan fine-tuning, and preference optimization. The comparison results demonstrate that our approaches are effective and that the MultiMedia-Agent generates better multimedia content than current state-of-the-art models.
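The pipeline described above ends in a preference optimization stage over generation plans. The abstract does not specify the exact objective, so as a minimal sketch, here is one common instantiation of preference optimization, a DPO-style pairwise loss; the function name, arguments, and the choice of DPO itself are illustrative assumptions, not the paper's confirmed method:

```python
import math


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def preference_loss(
    logp_chosen: float,      # log-prob of the preferred plan under the policy
    logp_rejected: float,    # log-prob of the dispreferred plan under the policy
    ref_logp_chosen: float,  # same plans scored by the frozen reference model
    ref_logp_rejected: float,
    beta: float = 0.1,       # strength of the implicit KL constraint
) -> float:
    """DPO-style loss for one preference pair (hypothetical sketch).

    The loss pushes the policy to raise the chosen plan's probability
    relative to the reference model more than the rejected plan's.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(_sigmoid(beta * margin))


# When the policy has not moved from the reference, the margin is zero
# and the loss sits at log(2); improving the chosen plan lowers it.
baseline = preference_loss(-1.0, -1.0, -1.0, -1.0)
improved = preference_loss(-0.5, -2.0, -1.0, -1.0)
```

In a full training run this per-pair loss would be averaged over batches of (chosen, rejected) plan pairs produced by the paper's two-stage correlation strategy, with the two earlier stages (base training and successful-plan fine-tuning) done as ordinary supervised fine-tuning beforehand.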
Problem

Research questions and friction points this paper is trying to address.

AIGC
multimodal generation
content creation
agent-based systems
multimedia editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Agent
Skill Acquisition Theory
Plan Optimization
Preference Alignment
AIGC