Improving Chain-of-Thought Efficiency for Autoregressive Image Generation

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses "visual overthinking", a phenomenon in autoregressive multimodal large language models (MLLMs) where chain-of-thought (CoT) reasoning for image generation produces redundant inference, higher computational overhead, semantic inconsistency, and degraded output quality. To mitigate this, we propose ShortCoTI, a lightweight optimization framework that introduces, for the first time, an adaptive difficulty-aware reward mechanism within a reinforcement learning paradigm to compress CoT sequences while preserving their semantics. ShortCoTI jointly optimizes prompt conciseness and image fidelity. Evaluated across multiple benchmarks, it reduces average CoT prompt length by 54% while maintaining or slightly improving image quality, thereby significantly enhancing generation efficiency and step-wise consistency.

📝 Abstract
Autoregressive multimodal large language models have recently gained popularity for image generation, driven by advances in foundation models. To enhance alignment and detail, newer approaches employ chain-of-thought (CoT) reasoning, expanding user inputs into elaborated prompts prior to image synthesis. However, this strategy can introduce unnecessary redundancy -- a phenomenon we call visual overthinking -- which increases computational costs and can introduce details that contradict the original prompt. In this work, we explore how to generate more concise CoT sequences for more efficient image generation. We introduce ShortCoTI, a lightweight optimization framework that encourages more concise CoT while preserving output image quality. ShortCoTI rewards more concise prompts with an adaptive function that scales according to an estimated difficulty for each task. Incorporating this reward into a reinforcement learning paradigm reduces prompt reasoning length by 54% while maintaining or slightly improving quality metrics across multiple benchmarks (T2I-CompBench, GenEval). Qualitative analysis shows that our method eliminates verbose explanations and repetitive refinements, producing reasoning prompts that are both concise and semantically rich. As a result, ShortCoTI improves computational efficiency without compromising the fidelity or visual appeal of generated images.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs in autoregressive image generation models
Eliminating visual overthinking from chain-of-thought reasoning
Maintaining image quality while shortening reasoning prompt length
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight optimization framework for concise CoT
Adaptive reward function based on task difficulty
Reinforcement learning reduces reasoning length by 54%
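The paper summary describes the adaptive reward only at a high level (conciseness rewarded more aggressively on easy tasks, a longer reasoning budget granted on hard ones). A minimal sketch of that idea is below; the budget formula, the decay constant `alpha`, and the mixing weight `lam` are illustrative assumptions, not the authors' actual design:

```python
import math

def estimate_budget(difficulty: float, base_len: int = 128) -> float:
    """Token budget for the CoT prompt (assumed form): harder tasks
    (difficulty near 1.0) are granted a longer reasoning allowance."""
    return base_len * (1.0 + difficulty)

def length_reward(cot_len: int, difficulty: float, alpha: float = 2.0) -> float:
    """Conciseness reward in (0, 1]: full reward while the CoT stays
    within budget, smooth exponential decay once it overflows."""
    budget = estimate_budget(difficulty)
    overflow = max(0.0, cot_len - budget) / budget
    return math.exp(-alpha * overflow)

def total_reward(quality: float, cot_len: int, difficulty: float,
                 lam: float = 0.3) -> float:
    """Joint RL objective: image-quality reward plus a weighted
    conciseness term, so shorter CoT is never rewarded at the
    expense of fidelity."""
    return quality + lam * length_reward(cot_len, difficulty)
```

Under this sketch, a 400-token CoT on an easy task (difficulty 0.2) is penalized far more than the same CoT on a hard task (difficulty 0.9), which is the difficulty-aware behavior the summary attributes to ShortCoTI.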