GEMS: Agent-Native Multimodal Generation with Memory and Skills

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing general-purpose multimodal generative models perform poorly on complex instructions and specialized downstream tasks. This work proposes GEMS, an agent-native multimodal generation framework that improves output quality and task adaptability through multi-agent closed-loop optimization, trajectory-level hierarchical memory, and modular domain-specific skills loaded on demand. The unified framework bridges the gap between the general-purpose and task-specific capabilities of foundation models. Across five mainstream tasks and four downstream applications, evaluated on multiple generative backends, GEMS delivers consistent performance gains; most notably, it enables the lightweight 6B-parameter Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2.
📝 Abstract
Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to handle diverse downstream applications effectively. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of an agent harness in extending model capabilities beyond their original limits.
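
The abstract names three components (Agent Loop, Agent Memory, Agent Skill) but gives no interface details. Purely as a rough illustration of how such a generate–critique loop with trajectory-level memory and on-demand skills could be wired together, here is a minimal Python sketch; every class, function, and field name (AgentMemory, load_skill, agent_loop, the feedback dictionary keys) is hypothetical and not drawn from the paper.

```python
# Hypothetical sketch of an agent-native generation loop with trajectory-level
# memory and on-demand skill loading. Names and interfaces are illustrative
# only; the paper does not specify this API.
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Trajectory-level memory: raw factual states plus compressed summaries."""
    states: list = field(default_factory=list)      # factual states (prompts, critiques, outputs)
    summaries: list = field(default_factory=list)   # compressed experiential summaries

    def record(self, state: dict) -> None:
        self.states.append(state)

    def compress(self) -> None:
        # Placeholder: a real system would summarize recent states with an LLM.
        self.summaries.append({"n_states": len(self.states)})


def load_skill(task: str) -> str:
    """On-demand skill loading: return a domain-specific hint for the backend."""
    skills = {
        "poster": "apply layout and typography constraints",
        "default": "no extra constraints",
    }
    return skills.get(task, skills["default"])


def agent_loop(instruction: str, task: str, generate, critique, max_iters: int = 3):
    """Closed-loop refinement: generate, critique, and revise until accepted."""
    memory = AgentMemory()
    skill = load_skill(task)
    prompt = f"{instruction}\n[skill hint: {skill}]"
    output = None
    for step in range(max_iters):
        output = generate(prompt)                 # backend model call (e.g. a T2I model)
        feedback = critique(instruction, output)  # verifier/critic agent, returns a dict
        memory.record({"step": step, "prompt": prompt, "feedback": feedback})
        if feedback.get("accept"):
            break
        prompt = f"{instruction}\n[revise: {feedback.get('advice', '')}]\n[skill hint: {skill}]"
        memory.compress()
    return output
```

How often memory is compressed and how skills are represented (prompt bundles, tools, or adapters) are design choices the paper itself addresses; the sketch only makes the overall control flow concrete.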
Problem

Research questions and friction points this paper is trying to address.

multimodal generation
complex instructions
downstream tasks
foundational models
agent frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent-Native Framework
Multimodal Generation
Memory-Augmented Learning
Skill Modularity
Closed-Loop Optimization