Non-autoregressive Sequence-to-Sequence Vision-Language Models

📅 2024-03-04
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 3
Influential: 0
🤖 AI Summary
Autoregressive vision-language models suffer from high inference latency that scales linearly with output sequence length, O(L), hindering practical deployment. Method: This paper proposes the first non-autoregressive sequence-to-sequence framework for image-to-text generation. Its core innovation is a Query-CTC loss for vision-language modeling, which directly optimizes the joint token distribution (rather than a chain of conditional distributions) and enables constant-time O(1) parallel decoding. Training marginalizes over latent alignment paths end to end, eliminating the need for predefined alignments or external aligners. Results: Experiments demonstrate that the method matches the accuracy of state-of-the-art autoregressive models across diverse vision-language understanding and generation tasks while significantly accelerating inference, achieving a favorable trade-off between efficiency and performance without compromising fidelity.

📝 Abstract
Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency due to their autoregressive way of generating predictions. We propose a parallel decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss, that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of tokens, rather than restricting to conditional distribution as in an autoregressive model. The resulting model, NARVL, achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time, reducing from the linear complexity associated with the sequential generation of tokens to a paradigm of constant time joint inference.
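The CTC-style marginalization the abstract describes can be made concrete with the standard CTC forward algorithm: sum the probability of every decoder path that collapses (repeats removed, then blanks removed) to the target sequence. This is a minimal pure-Python sketch of that dynamic program, not the paper's Query-CTC implementation; the function name and toy probabilities are illustrative.

```python
def ctc_marginal_prob(probs, target, blank=0):
    """P(target) under CTC: sum over all length-T alignment paths
    that collapse to `target` after removing repeats, then blanks.
    `probs[t][v]` is the probability of token v at decoder slot t."""
    T = len(probs)
    ext = [blank]
    for y in target:
        ext += [y, blank]           # interleave blanks: [_, y1, _, y2, _]
    S = len(ext)

    # forward variables alpha[s] = prob of reaching extended label s at step t
    alpha = [0.0] * S
    alpha[0] = probs[0][blank]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]                          # stay on same label
            if s >= 1:
                a += alpha[s - 1]                 # advance one label
            # skip over a blank only when adjacent labels differ
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new

    # valid paths end on the last label or the trailing blank
    return alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)


# 3 decoder slots, vocab = {0: blank, 1: token}, uniform 50/50 at each slot:
# six of the eight length-3 paths collapse to [1] (all except 000 and 101)
p = ctc_marginal_prob([[0.5, 0.5]] * 3, [1])
print(p)  # 0.75
```

Because the sum runs over all paths at once, the loss never needs a predefined alignment between decoder slots and output tokens, which is what lets the decoder emit all positions in parallel.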
Problem

Research questions and friction points this paper aims to address.

High inference latency from sequential, token-by-token autoregressive generation
Latency that scales linearly, O(L), with output sequence length
Limited practical deployment of sequence-to-sequence vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Non-autoregressive parallel decoding for sequence-to-sequence vision-language models
Query-CTC loss that marginalizes over multiple inference paths in the decoder
Constant-time O(1) joint inference in place of O(L) sequential generation
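The constant-time claim refers to the number of sequential decoder passes, not total compute. A minimal sketch contrasting the two regimes (the function names and the toy stand-in decoders are hypothetical, not the paper's code):

```python
def autoregressive_decode(step_fn, length):
    """Sequential decoding: each token conditions on the prefix,
    so the decoder runs once per output position -> O(L) passes."""
    tokens, calls = [], 0
    for _ in range(length):
        tokens.append(step_fn(tokens))  # next token given prefix
        calls += 1
    return tokens, calls

def parallel_decode(joint_fn, length):
    """NARVL-style decoding: one pass over `length` learnable query
    slots predicts every position at once -> O(1) passes."""
    return joint_fn(length), 1

# hypothetical stand-ins for a trained decoder
step_fn = lambda prefix: len(prefix)   # emits the position index
joint_fn = lambda n: list(range(n))    # emits all positions in one shot

ar_tokens, ar_calls = autoregressive_decode(step_fn, 5)
par_tokens, par_calls = parallel_decode(joint_fn, 5)
print(ar_calls, par_calls)  # 5 1
```

Per-pass cost still grows with sequence length inside the attention layers; the win is removing the L-step sequential dependency, which is what dominates wall-clock latency on parallel hardware.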