LongCat-Image Technical Report

📅 2025-12-08
🤖 AI Summary
To address the limitations of mainstream image generation models in multilingual text rendering (particularly Chinese), photorealism, deployment efficiency, and developer accessibility, this paper introduces a pioneering open-source bilingual (Chinese-English) foundation model for image generation. Methodologically, it employs a 6B-parameter diffusion architecture, a refined multi-stage training paradigm (pretraining → mid-training → supervised fine-tuning), reward-model-guided reinforcement learning, and a high-quality bilingual paired dataset alongside a fully open-sourced toolchain. Contributions and results: the model achieves state-of-the-art performance in Chinese character rendering, including complex and rare glyphs, as well as in image aesthetics and photorealism, significantly outperforming existing open-source and leading commercial models. It supports efficient inference, low-resource deployment, and high-fidelity, consistent image editing, thereby advancing multilingual AI-generated content infrastructure.

📝 Abstract
We introduce LongCat-Image, a pioneering open-source, bilingual (Chinese-English) foundation model for image generation, designed to address core challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility prevalent in current leading models. 1) We achieve this through rigorous data curation strategies across the pre-training, mid-training, and SFT stages, complemented by the coordinated use of curated reward models during the RL phase. This strategy establishes the model as a new state of the art (SOTA), delivering superior text-rendering capabilities and remarkable photorealism, and significantly enhancing aesthetic quality. 2) Notably, it sets a new industry standard for Chinese character rendering. By supporting even complex and rare characters, it outperforms both major open-source and commercial solutions in coverage, while also achieving superior accuracy. 3) The model achieves remarkable efficiency through its compact design. With a core diffusion model of only 6B parameters, it is significantly smaller than the ~20B-parameter or larger Mixture-of-Experts (MoE) architectures common in the field. This ensures minimal VRAM usage and rapid inference, significantly reducing deployment costs. Beyond generation, LongCat-Image also excels in image editing, achieving SOTA results on standard benchmarks with superior editing consistency compared to other open-source models. 4) To fully empower the community, we have established the most comprehensive open-source ecosystem to date. We are releasing not only multiple model versions for text-to-image and image editing, including checkpoints after the mid-training and post-training stages, but also the entire training toolchain. We believe that the openness of LongCat-Image will provide robust support for developers and researchers, pushing the frontiers of visual content creation.
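The abstract's efficiency claim can be illustrated with back-of-envelope arithmetic (a sketch only: it assumes 16-bit (bf16/fp16) weights and counts model parameters alone, ignoring activations, the text encoder, the VAE, and any optimizer state):

```python
def weight_vram_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """VRAM (GiB) needed just to hold model weights.

    Assumes `bytes_per_param` = 2, i.e. bf16/fp16 storage; this is
    an illustrative lower bound, not a measured figure.
    """
    return num_params * bytes_per_param / 1024**3

longcat_core = weight_vram_gib(6e9)   # 6B-parameter diffusion core
large_moe = weight_vram_gib(20e9)     # ~20B-parameter MoE baseline

print(f"6B core weights:  {longcat_core:.1f} GiB")  # ~11.2 GiB
print(f"20B MoE weights:  {large_moe:.1f} GiB")     # ~37.3 GiB
```

Under these assumptions the 6B core fits comfortably on a single consumer GPU, which is the substance of the paper's low-resource-deployment claim.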
Problem

Research questions and friction points this paper is trying to address.

Enhances multilingual text rendering and photorealism in image generation
Improves deployment efficiency through compact model architecture design
Increases developer accessibility via comprehensive open-source ecosystem
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual data curation and reward models for SOTA quality
Compact 6B parameter diffusion model for efficient deployment
Comprehensive open-source ecosystem with full training toolchain
Meituan LongCat Team

Hanghang Ma · Haoxian Tan · Jiale Huang · Junqiang Wu · Jun-Yan He (Tongyi Lab, Alibaba Group) · Lishuai Gao · Songlin Xiao · Xiaoming Wei (Meituan) · Xiaoqi Ma · Xunliang Cai · Yayong Guan · Jie Hu