🤖 AI Summary
Text-to-image diffusion models struggle to align generated images with complex semantic compositions such as object relations, attributes, and spatial arrangements. To address this without fine-tuning, we propose an inference-time optimization framework that dynamically selects the most suitable reward function via category-aware reward modeling, jointly guiding both initial noise optimization and sampling path exploration. Crucially, our method selects reward signals based on their correlation with human judgments and incorporates them into the reverse diffusion sampling process, enabling precise alignment enhancement. Evaluated on the T2I-CompBench++ and HRS benchmarks, our approach achieves average alignment score improvements of 16% and 11%, respectively, outperforming state-of-the-art methods while preserving image fidelity and diversity. The core innovation lies in unifying reward function selection and noise-space optimization entirely within the inference stage, a first in diffusion-based generation, thereby ensuring robustness and strong generalization across diverse compositional tasks.
📝 Abstract
Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at https://amirkasaei.com/carinox/.
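The overall recipe described above, selecting a reward function per prompt category, then combining exploration over initial noises with local optimization of the best candidate, can be illustrated with a toy sketch. This is not the paper's implementation: the reward functions here are hypothetical stand-ins (real ones would be learned alignment scorers chosen for their correlation with human judgments, applied to generated images), the "noise" is a single scalar rather than a latent tensor, and the optimizer is simple hill climbing rather than gradient-based updates through the sampler.

```python
import random

# Hypothetical per-category rewards; toy stand-ins for learned alignment
# scorers (higher is better). A real system would score generated images.
def reward_attribute(noise):
    return -abs(noise - 0.3)

def reward_spatial(noise):
    return -abs(noise - 0.7)

CATEGORY_REWARDS = {"attribute": reward_attribute, "spatial": reward_spatial}

def carinox_sketch(category, n_explore=32, n_opt_steps=50, step=0.05, seed=0):
    """Toy inference-time loop: explore initial noises, then refine the best one."""
    rng = random.Random(seed)
    reward = CATEGORY_REWARDS[category]  # category-aware reward selection

    # Exploration: sample candidate initial noises, keep the highest-reward one.
    candidates = [rng.random() for _ in range(n_explore)]
    best = max(candidates, key=reward)

    # Optimization: local hill climbing on the selected noise, so a poor
    # initialization from exploration can still be refined.
    for _ in range(n_opt_steps):
        trial = best + rng.uniform(-step, step)
        if reward(trial) > reward(best):
            best = trial
    return best

print(carinox_sketch("spatial"))  # converges near the toy optimum 0.7
```

The point of combining both stages, as the abstract argues, is complementary failure modes: exploration alone may need many samples to land near a good noise, while optimization alone can stall from a bad start; seeding the optimizer with the best explored candidate mitigates both.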