SHYI: Action Support for Contrastive Learning in High-Fidelity Text-to-Image Generation

📅 2025-01-15

📈 Citations: 0

✨ Influential: 0

career value

189K/year

🤖 AI Summary

To address semantic distortion and coarse-grained relational modeling in multi-object action description for high-fidelity text-to-image generation, this paper proposes an enhanced method built upon the CONFORM framework. We introduce semantic hypergraph contrastive adjacency learning—a novel paradigm that constructs a “contrasting yet cohesive” structural representation to enable fine-grained alignment between actions and objects, as well as precise modeling of their interaction relationships. Furthermore, we integrate InteractDiffusion to augment Stable Diffusion’s comprehension of verbs and interactive actions. Our approach achieves statistically significant improvements over baselines in CLIP Score, TIFA, and human evaluation—particularly enhancing fidelity for action-oriented prompts where Stable Diffusion inherently underperforms. To our knowledge, this is the first work to incorporate semantic hypergraphs into action-aware text-to-image generation, empirically validating their effectiveness in complex, multi-object interaction scenarios.

Technology Category

Application Category

📝 Abstract

In this project, we address the issue of infidelity in text-to-image generation, particularly for actions involving multiple objects. For this we build on top of the CONFORM framework which uses Contrastive Learning to improve the accuracy of the generated image for multiple objects. However the depiction of actions which involves multiple different object has still large room for improvement. To improve, we employ semantically hypergraphic contrastive adjacency learning, a comprehension of enhanced contrastive structure and"contrast but link"technique. We further amend Stable Diffusion's understanding of actions by InteractDiffusion. As evaluation metrics we use image-text similarity CLIP and TIFA. In addition, we conducted a user study. Our method shows promising results even with verbs that Stable Diffusion understands mediocrely. We then provide future directions by analyzing the results. Our codebase can be found on polybox under the link: https://polybox.ethz.ch/index.php/s/dJm3SWyRohUrFxn

Problem

Research questions and friction points this paper is trying to address.

Improve fidelity in text-to-image generation

Enhance action depiction with multiple objects

Utilize contrastive learning for better accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantically hypergraphic contrastive adjacency learning

Enhanced contrastive structure comprehension

InteractDiffusion amends action understanding

🔎 Similar Papers

No similar papers found.