SHYI: Action Support for Contrastive Learning in High-Fidelity Text-to-Image Generation

📅 2025-01-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address semantic distortion and coarse-grained relational modeling in multi-object action description for high-fidelity text-to-image generation, this paper proposes an enhanced method built upon the CONFORM framework. We introduce semantic hypergraph contrastive adjacency learning—a novel paradigm that constructs a “contrasting yet cohesive” structural representation to enable fine-grained alignment between actions and objects, as well as precise modeling of their interaction relationships. Furthermore, we integrate InteractDiffusion to augment Stable Diffusion’s comprehension of verbs and interactive actions. Our approach achieves statistically significant improvements over baselines in CLIP Score, TIFA, and human evaluation—particularly enhancing fidelity for action-oriented prompts where Stable Diffusion inherently underperforms. To our knowledge, this is the first work to incorporate semantic hypergraphs into action-aware text-to-image generation, empirically validating their effectiveness in complex, multi-object interaction scenarios.

Technology Category

Application Category

📝 Abstract
In this project, we address the issue of infidelity in text-to-image generation, particularly for actions involving multiple objects. For this we build on top of the CONFORM framework which uses Contrastive Learning to improve the accuracy of the generated image for multiple objects. However the depiction of actions which involves multiple different object has still large room for improvement. To improve, we employ semantically hypergraphic contrastive adjacency learning, a comprehension of enhanced contrastive structure and"contrast but link"technique. We further amend Stable Diffusion's understanding of actions by InteractDiffusion. As evaluation metrics we use image-text similarity CLIP and TIFA. In addition, we conducted a user study. Our method shows promising results even with verbs that Stable Diffusion understands mediocrely. We then provide future directions by analyzing the results. Our codebase can be found on polybox under the link: https://polybox.ethz.ch/index.php/s/dJm3SWyRohUrFxn
Problem

Research questions and friction points this paper is trying to address.

Improve fidelity in text-to-image generation
Enhance action depiction with multiple objects
Utilize contrastive learning for better accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantically hypergraphic contrastive adjacency learning
Enhanced contrastive structure comprehension
InteractDiffusion amends action understanding
🔎 Similar Papers
No similar papers found.
T
Tianxiang Xia
L
Lin Xiao
Y
Yannick Montorfani
F
Francesco Pavia
Enis Simsar
Enis Simsar
ETH Zurich
Computer Vision
T
Thomas Hofmann