Visual Self-Refinement for Autoregressive Models

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autoregressive models for vision-language generation suffer from an inherent mismatch between sequential token prediction and the spatial structure of visual data, leading to error accumulation and cross-modal semantic inconsistency. To address this, we propose a plug-and-play self-optimizing refinement module that jointly post-processes all generated tokens within a shared sequence prediction framework. Our core innovation is a global context-aware token relational modeling mechanism, which mitigates local prediction biases via autoregressive post-refinement and explicitly enforces spatial semantic consistency between vision and language. Crucially, the module requires no modification to the backbone architecture or retraining, making it compatible with diverse pretrained autoregressive models. Extensive experiments on multi-task vision-generation benchmarks demonstrate significant improvements in generation quality—yielding outputs that are both semantically coherent and spatially well-structured.
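The refinement idea described above can be illustrated with a minimal sketch: a single full (non-causal) self-attention pass over all generated token embeddings, added as a residual update so the pretrained backbone's outputs are preserved. This is an assumption-laden toy in NumPy, not the paper's implementation; the projection matrices, the residual form, and all shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine_tokens(tokens, w_q, w_k, w_v):
    """One global relational pass over ALL generated token embeddings.

    Unlike causal next-token prediction, every token attends to the
    entire sequence, so local prediction biases can be corrected with
    global context. The residual update leaves the backbone's tokens
    intact and only adds a correction term.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # full, not causal, attention
    attn = softmax(scores, axis=-1)            # each row sums to 1
    return tokens + attn @ v                   # residual refinement

# Toy setup: 16 generated visual tokens with 8-dim embeddings.
rng = np.random.default_rng(0)
seq_len, dim = 16, 8
tokens = rng.normal(size=(seq_len, dim))
w = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3)]
refined = refine_tokens(tokens, *w)
print(refined.shape)  # (16, 8)
```

Because the module only reads and re-writes the token sequence, it needs no change to the backbone architecture, which is what makes the plug-and-play claim plausible: any pretrained autoregressive generator that emits a token sequence could be post-processed this way.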

📝 Abstract
Autoregressive models excel at sequential modeling and have proven effective for vision-language data. However, the spatial nature of visual signals conflicts with the sequential dependencies of next-token prediction, leading to suboptimal results. This work proposes a plug-and-play refinement module to enhance the modeling of complex spatial correspondences within the generated visual sequence. The module operates as a post-pretraining step that jointly refines all generated tokens of the autoregressive model, enhancing vision-language modeling under a shared sequential prediction framework. By leveraging global context and relationships across tokens, our method mitigates error accumulation in sequential generation. Experiments demonstrate that the proposed method improves generation quality, enhancing the model's ability to produce semantically consistent results.
Problem

Research questions and friction points this paper is trying to address.

Enhancing spatial correspondence modeling in autoregressive vision-language generation
Mitigating error accumulation during sequential visual token generation
Improving semantic consistency in vision-language autoregressive modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play module refines autoregressive visual sequences
Post-pretraining jointly optimizes all generated tokens
Global context mitigates sequential error accumulation