InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses multi-condition coordinated control in e-commerce poster generation, where subject distortion, inaccurate text rendering, and style inconsistency commonly arise. The authors propose InnoAds-Composer, a single-stage generation framework that jointly controls subject appearance, typography, and visual style through tri-conditional control tokens. To reduce the computational overhead of concatenating all three condition token sets, they analyze condition importance across layers and timesteps and route each condition only to the positions that respond to it most. They also design a Text Feature Enhancement Module (TFEM) that fuses features from glyph images and glyph crops to improve Chinese text rendering, and they construct a high-quality e-commerce poster dataset and benchmark, described as the first to jointly contain subject, text, and style conditions. According to the authors, the method significantly outperforms existing product poster approaches in subject fidelity, text accuracy, and style consistency without an obvious increase in inference latency.

📝 Abstract
E-commerce product poster generation aims to automatically synthesize a single image that effectively conveys product information by presenting a subject, text, and a designed style. Recent diffusion models with fine-grained and efficient controllability have advanced product poster synthesis, yet they typically rely on multi-stage pipelines, and simultaneous control over subject, text, and style remains underexplored. Such naive multi-stage pipelines also show three issues: poor subject fidelity, inaccurate text, and inconsistent style. To address these issues, we propose InnoAds-Composer, a single-stage framework that enables efficient tri-conditional control tokens over subject, glyph, and style. To alleviate the quadratic overhead introduced by naive tri-conditional token concatenation, we perform importance analysis over layers and timesteps and route each condition only to the most responsive positions, thereby shortening the active token sequence. Besides, to improve the accuracy of Chinese text rendering, we design a Text Feature Enhancement Module (TFEM) that integrates features from both glyph images and glyph crops. To support training and evaluation, we also construct a high-quality e-commerce product poster dataset and benchmark, which is the first dataset that jointly contains subject, text, and style conditions. Extensive experiments demonstrate that InnoAds-Composer significantly outperforms existing product poster methods without obviously increasing inference latency.
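
The routing idea in the abstract can be illustrated with a toy cost model. The sketch below is not the authors' implementation: the importance scores, token counts, the `keep_ratio` parameter, and the function names `top_positions` and `total_attention_cost` are all hypothetical and chosen only for illustration. It assumes self-attention cost grows with the square of the active sequence length, and compares concatenating all three condition token sets at every (layer, timestep) position against routing each condition only to its highest-scoring positions.

```python
import math
from typing import Dict, List, Set, Tuple

Position = Tuple[int, int]  # (layer index, timestep index)


def top_positions(scores: Dict[Position, float], keep_ratio: float) -> Set[Position]:
    """Keep the highest-scoring fraction of (layer, timestep) positions."""
    k = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def total_attention_cost(
    base_tokens: int,
    cond_tokens: Dict[str, int],
    routes: Dict[str, Set[Position]],
    positions: List[Position],
) -> int:
    """Sum the squared active sequence length over all positions."""
    total = 0
    for pos in positions:
        # A condition's tokens are active only where that condition is routed.
        active = sum(n for name, n in cond_tokens.items() if pos in routes[name])
        total += (base_tokens + active) ** 2
    return total


# Toy setup (all numbers are made up): 2 layers x 2 timesteps.
positions = [(layer, step) for layer in range(2) for step in range(2)]
base_tokens = 4
cond_tokens = {"subject": 2, "glyph": 2, "style": 2}

# Hypothetical importance scores per condition and position.
importance = {
    "subject": {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.1, (1, 1): 0.2},
    "glyph": {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.2, (1, 1): 0.9},
    "style": {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.8, (1, 1): 0.6},
}

# Naive baseline: every condition is active at every position.
full_routes = {name: set(positions) for name in cond_tokens}
# Routed variant: each condition is active at its top 50% of positions.
routes = {name: top_positions(s, keep_ratio=0.5) for name, s in importance.items()}

full_cost = total_attention_cost(base_tokens, cond_tokens, full_routes, positions)
routed_cost = total_attention_cost(base_tokens, cond_tokens, routes, positions)
print(full_cost, routed_cost)  # prints "400 200"
```

In this toy setting the routed variant halves the summed attention cost. The actual savings in the paper depend on its real token counts and routing decisions, which are not given in this summary.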

Problem

Research questions and friction points this paper is trying to address.

e-commerce poster generation
subject fidelity
text accuracy
style consistency
conditional control

Innovation

Methods, ideas, or system contributions that make the work stand out.

tri-conditional control
single-stage diffusion
token routing
text feature enhancement
e-commerce poster generation

Authors

Yuxin Qin
JD.com, Inc., Beijing, China
Ke Cao
JD.com, Inc., Beijing, China
Haowei Liu
TongYi Lab, Alibaba Group
Multimodal Learning
Ao Ma
JD.com
Generative AI, Video Generation
Fengheng Li
JD.com, Inc., Beijing, China
Honghe Zhu
JD.com, Inc., Beijing, China
Zheng Zhang
JD.com, Inc., Beijing, China
Run Ling
JD.com, Inc., Beijing, China
Wei Feng
JD.com, Inc., Beijing, China
Xuanhua He
The Hong Kong University of Science and Technology
Low-Level Vision, Video Generation
Zhanjie Zhang
Zhejiang University
Computer Vision
Zhen Guo
JD.com, Inc., Beijing, China
Haoyi Bian
JD.com, Inc., Beijing, China
Jingjing Lv
JD.com, Inc., Beijing, China
Junjie Shen
JD.com, Inc., Beijing, China
Ching Law
MIT