X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again

📅 2025-07-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key bottlenecks in discrete autoregressive image generation—including low visual fidelity, structural distortion, and weak adherence to complex instructions—this work pioneers the integration of reinforcement learning (RL) into this paradigm to mitigate cumulative autoregressive errors and discrete tokenization-induced information loss. Methodologically, we propose a semantic image tokenizer, a unified language–image autoregressive architecture, and an offline diffusion-based decoder, coupled with a sequence-level RL optimization objective. Experiments demonstrate that our 7B-parameter model achieves state-of-the-art performance in multi-scale detail reconstruction, long-text rendering, and multi-turn instruction following, producing images with high aesthetic quality and strong semantic consistency. Our core contributions are threefold: (1) the first application of RL to optimize discrete autoregressive image generation; (2) significant improvements in generation quality and controllability; and (3) a unified multimodal modeling framework grounded in discrete autoregression.

📝 Abstract
Numerous efforts have been made to extend the "next token prediction" paradigm to visual content, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributable to cumulative errors during autoregressive inference or to information loss incurred during discretization. Probably because of these challenges, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and substantially improve the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework, termed X-Omni, comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.
Problem

Research questions and friction points this paper is trying to address.

Improves visual fidelity in autoregressive image generation
Reduces distortion in discrete token-based image outputs
Enhances adherence to complex instructions for details
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning enhances autoregressive image generation
Semantic tokenizer and unified model for language and images
Offline diffusion decoder improves image quality
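
The summary above says reinforcement learning is applied with a sequence-level objective, but this page does not say which algorithm is used. As a rough illustration of the general idea only, the sketch below runs plain REINFORCE with a batch-mean baseline on a toy autoregressive policy over discrete tokens. Everything in it is an assumption made for illustration: the table-based policy, the `TARGET` sequence standing in for a "good" image, the match-count reward, and all function names. It is not X-Omni's model, reward, or training recipe.

```python
import math
import random

VOCAB, SEQ_LEN = 4, 4
TARGET = [1, 2, 3, 0]  # hypothetical stand-in for a high-reward token sequence
START = VOCAB          # extra index used as the "previous token" at step 0


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def new_policy():
    # logits[t][prev][tok]: next-token logits given position and previous token
    return [[[0.0] * VOCAB for _ in range(VOCAB + 1)] for _ in range(SEQ_LEN)]


def sample(policy, rng):
    # Autoregressive sampling: each token is drawn conditioned on the previous one.
    seq, prev = [], START
    for t in range(SEQ_LEN):
        probs = softmax(policy[t][prev])
        tok = rng.choices(range(VOCAB), weights=probs)[0]
        seq.append(tok)
        prev = tok
    return seq


def reward(seq):
    # Sequence-level reward: computed once, on the finished sequence.
    return sum(a == b for a, b in zip(seq, TARGET)) / SEQ_LEN


def train(policy, rng, steps=300, batch=32, lr=1.0):
    for _ in range(steps):
        seqs = [sample(policy, rng) for _ in range(batch)]
        rewards = [reward(s) for s in seqs]
        baseline = sum(rewards) / batch  # batch mean reduces gradient variance
        grads = {}
        for seq, r in zip(seqs, rewards):
            adv = r - baseline  # one advantage shared by every token in the sequence
            prev = START
            for t, tok in enumerate(seq):
                probs = softmax(policy[t][prev])
                g = grads.setdefault((t, prev), [0.0] * VOCAB)
                for k in range(VOCAB):
                    # d log p(tok) / d logit_k = 1[k == tok] - p_k
                    g[k] += adv * ((1.0 if k == tok else 0.0) - probs[k]) / batch
                prev = tok
        # Apply the accumulated gradient-ascent step after the whole batch is scored.
        for (t, prev), g in grads.items():
            for k in range(VOCAB):
                policy[t][prev][k] += lr * g[k]
    return policy


def mean_reward(policy, rng, n=500):
    return sum(reward(sample(policy, rng)) for _ in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(0)
    policy = new_policy()
    print("mean reward before:", mean_reward(policy, rng))
    train(policy, rng)
    print("mean reward after: ", mean_reward(policy, rng))
```

The point of the toy is that the reward is only available for the completed sequence, yet it still shifts the per-token probabilities, which is what lets this kind of objective act on errors that build up over a long autoregressive rollout.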

Zigang Geng, Tencent Hunyuan X
Yibing Wang, Tencent Hunyuan X
Yeyao Ma, Tencent Hunyuan X
Chen Li, Tencent Hunyuan X
Yongming Rao, Tencent Hunyuan (computer vision, deep learning)
Shuyang Gu, Microsoft Research Asia (computer vision, generative model)
Zhao Zhong, Tencent Hunyuan X
Qinglin Lu, Tencent Hunyuan X
Han Hu, Tencent Hunyuan X
Xiaosong Zhang, Tencent
Linus, Tencent Hunyuan X
Di Wang, Tencent Hunyuan X
Jie Jiang, Tencent Hunyuan X