Salient Concept-Aware Generative Data Augmentation

📅 2025-10-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing generative data augmentation methods struggle to simultaneously preserve image fidelity and diversity under joint vision-language prompting, primarily due to entanglement between image representations and non-essential attributes (e.g., background), which causes conflicts with text prompts. To address this, we propose a saliency-aware image generation framework that explicitly disentangles and suppresses interfering visual attributes via a novel embedding model. Our method integrates vision-language conditional generation with a personalized disentanglement architecture to enable fine-grained, semantically consistent, and controllable image synthesis. Evaluated on eight fine-grained vision datasets, our approach improves classification accuracy by 0.73% on average under standard settings and by 6.5% under long-tail scenarios. It also significantly enhances vision-language alignment and preserves discriminative class-specific features.

📝 Abstract
Recent generative data augmentation methods conditioned on both image and text prompts struggle to balance fidelity and diversity, as it is challenging to preserve essential image details while aligning with varied text prompts. This challenge arises because representations in the synthesis process often become entangled with non-essential input image attributes, such as environmental context, creating conflicts with text prompts intended to modify these elements. To address this, we propose a personalized image generation framework that uses a salient concept-aware image embedding model to reduce the influence of irrelevant visual details during synthesis, thereby maintaining intuitive alignment between image and text inputs. By generating images that better preserve class-discriminative features with additional controlled variations, our framework effectively enhances the diversity of training datasets and thereby improves the robustness of downstream models. Our approach demonstrates superior performance across eight fine-grained vision datasets, outperforming state-of-the-art augmentation methods with average classification accuracy improvements of 0.73% and 6.5% under conventional and long-tail settings, respectively.
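Neither the abstract nor this page gives implementation details, but the mechanism reads as: embed the input image, softly gate out dimensions that encode non-essential context, and condition the generator on the remaining salient embedding together with the text prompt. Below is a minimal PyTorch sketch of that reading; the module and function names (SalientImageEmbedder, build_condition) and the gating design are assumptions for illustration, not the authors' actual architecture.

import torch
import torch.nn as nn

class SalientImageEmbedder(nn.Module):
    """Hypothetical module: projects an image embedding and softly gates out
    dimensions presumed to encode non-essential attributes (e.g., background)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        z = self.proj(image_emb)
        # A gate near 1 keeps class-discriminative dimensions; a gate near 0
        # suppresses context dimensions that would conflict with the prompt.
        return z * self.gate(image_emb)

def build_condition(image_emb: torch.Tensor, text_emb: torch.Tensor,
                    embedder: SalientImageEmbedder) -> torch.Tensor:
    # Condition the generator on the salient image embedding plus the text
    # embedding, so prompts can vary context (e.g., background) without
    # fighting entangled image details.
    return torch.cat([embedder(image_emb), text_emb], dim=-1)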
Problem

Research questions and friction points this paper is trying to address.

Balancing fidelity and diversity in generative data augmentation
Preserving essential image details while aligning with varied text prompts
Reducing entanglement of non-essential image attributes during synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a salient concept-aware embedding to suppress irrelevant visual details
Generates images that preserve class-discriminative features with controlled variations
Enhances dataset diversity and downstream model robustness (see the augmentation sketch after this list)
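The reported long-tail gains imply an augmentation loop that generates extra variants for under-represented classes until the class distribution is balanced. A minimal sketch, assuming a generate(image, prompt) callable standing in for the paper's synthesis pipeline (all helper names here are hypothetical):

from collections import Counter

def augment_long_tail(dataset, generate, prompts, target_per_class):
    """Hypothetical sketch: pad each class up to target_per_class examples
    with synthetic images; generate(image, prompt) is assumed to be the
    salient concept-aware synthesis step described above."""
    counts = Counter(label for _, label in dataset)
    synthetic = []
    for image, label in dataset:
        deficit = target_per_class - counts[label]
        if deficit <= 0:
            continue  # class already has enough real examples
        # Ceil division: spread the deficit across the class's real images.
        per_image = (deficit + counts[label] - 1) // counts[label]
        for i in range(per_image):
            prompt = prompts[i % len(prompts)]  # cycle varied context prompts
            synthetic.append((generate(image, prompt), label))
    return list(dataset) + synthetic

In practice one would likely also filter the synthetic images (e.g., by classifier confidence) before adding them to the training set, though the page does not describe such a step.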
👥 Authors
Tianchen Zhao (AWS DS3)
Xuanbai Chen (AWS DS3)
Zhihua Li (Amazon; research interests: Deep Learning, Computer Vision)
Jun Fang (AWS DS3)
Dongsheng An (AWS DS3)
Xiang Xu (AWS DS3)
Zhuowen Tu (Professor, Cognitive Science, Computer Science & Engineering, UC San Diego; research interests: Computer Vision, Machine Learning, Deep Learning, Neural Computation)
Yifan Xing (AWS DS3)