CharaConsist: Fine-Grained Consistent Character Generation

📅 2025-07-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the fine-grained character-consistency problem in text-to-image generation, in particular the degradation of identity, attire, and background under large pose changes or cross-scene transitions. It proposes CharaConsist, a training-free enhancement framework for DiT models. The method comprises: (1) point-tracking attention, which enables stable cross-frame modeling of key semantic points; (2) adaptive token merging, which dynamically compresses redundant features to improve structural stability; and (3) decoupled foreground-background control, which regulates character and background consistency independently without modifying model weights. Experiments on multi-action and multi-scene benchmarks demonstrate significant improvements over state-of-the-art methods, achieving both high visual fidelity and generation robustness. The implementation is publicly available.
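The adaptive token merging described above can be illustrated with a minimal ToMe-style sketch: tokens are split into two alternating sets, cosine similarity identifies the most redundant pairs, and the top-r pairs are averaged together. This is an assumption-laden illustration of the general technique, not the paper's actual implementation; the function name and merging policy are hypothetical.

```python
import numpy as np

def adaptive_token_merge(tokens: np.ndarray, r: int) -> np.ndarray:
    """Merge the r most similar token pairs (ToMe-style bipartite matching).

    tokens: (N, D) array of token features; returns (N - r, D).
    Hypothetical stand-in for CharaConsist's adaptive token merging.
    """
    a, b = tokens[0::2], tokens[1::2]            # alternating bipartite split
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = an @ bn.T                              # cosine similarity (|A|, |B|)
    best_b = sim.argmax(axis=1)                  # each A token's closest B token
    best_sim = sim.max(axis=1)
    merge_idx = np.argsort(-best_sim)[:r]        # r most redundant A tokens
    keep_idx = np.setdiff1d(np.arange(len(a)), merge_idx)

    merged_b = b.copy().astype(float)
    counts = np.ones(len(b))
    for i in merge_idx:                          # fold each merged A token into
        j = best_b[i]                            # its partner by running average
        merged_b[j] = (merged_b[j] * counts[j] + a[i]) / (counts[j] + 1)
        counts[j] += 1
    return np.concatenate([a[keep_idx], merged_b], axis=0)
```

Merging redundant tokens before cross-frame attention reduces the token count the attention layers must reconcile, which is one plausible route to the structural stability the summary describes.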

📝 Abstract
In text-to-image generation, producing a series of consistent contents that preserve the same identity is highly valuable for real-world applications. Although a few works have explored training-free methods to enhance the consistency of generated subjects, we observe that they suffer from the following problems. First, they fail to maintain consistent background details, which limits their applicability. Furthermore, when the foreground character undergoes large motion variations, inconsistencies in identity and clothing details become evident. To address these problems, we propose CharaConsist, which employs point-tracking attention and adaptive token merge along with decoupled control of the foreground and background. CharaConsist enables fine-grained consistency for both foreground and background, supporting the generation of one character in continuous shots within a fixed scene or in discrete shots across different scenes. Moreover, CharaConsist is the first consistent-generation method tailored for text-to-image DiT models. Its ability to maintain fine-grained consistency, combined with the larger capacity of the latest base models, enables it to produce high-quality visual outputs, broadening its applicability to a wider range of real-world scenarios. The source code has been released at https://github.com/Murray-Wang/CharaConsist
Problem

Research questions and friction points this paper is trying to address.

Maintain consistent background details in generated images
Ensure identity and clothing consistency during large motions
Achieve fine-grained consistency in text-to-image DiT models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses point-tracking attention for consistency
Implements adaptive token merge technique
Decouples foreground and background control
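The third innovation, decoupled foreground and background control, can be sketched as an attention-masking rule: given a per-token foreground mask over the latent grid, a query token is only allowed to attend to reference tokens of the same stream, so the character and the scene are regulated independently. The function name, mask convention, and masking policy below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def decoupled_attention_mask(fg_mask: np.ndarray) -> np.ndarray:
    """Build a cross-frame attention mask separating fg and bg streams.

    fg_mask: (N,) boolean, True where a latent token belongs to the character.
    Returns an (N, N) boolean mask: query token i may attend to reference
    token j only if both are foreground or both are background.
    """
    fg = fg_mask.astype(bool)
    same_fg = fg[:, None] & fg[None, :]          # fg queries -> fg references
    same_bg = ~fg[:, None] & ~fg[None, :]        # bg queries -> bg references
    return same_fg | same_bg
```

In practice such a mask would be passed to the cross-frame attention layers (e.g. as an additive -inf bias on the disallowed entries), so that edits to the character prompt cannot leak into the background tokens and vice versa.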
Mengyu Wang
Institute of Information Science, Beijing Jiaotong University, Beijing, China

Henghui Ding
Fudan University
Computer Vision · Machine Learning · Segmentation · AIGC

Jianing Peng
Institute of Information Science, Beijing Jiaotong University, Beijing, China

Yao Zhao
Institute of Information Science, Beijing Jiaotong University, Beijing, China

Yunpeng Chen
National University of Singapore
Computer Vision · Machine Learning · Deep Learning

Yunchao Wei
Professor, Beijing Jiaotong University, UTS, UIUC, NUS
Computer Vision · Machine Learning