Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models

📅 2025-06-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the feasibility of extracting referring expressions from visually grounded dialogue using autoregressive large language models (LLMs) alone, i.e., identifying visually referable objects solely from linguistic context, without image input. Methodologically, the authors propose a next-token-prediction-based span labeling mechanism, coupled with parameter-efficient fine-tuning (e.g., LoRA), that formulates referring expression detection as a purely textual sequence labeling task. Experiments show that a moderately sized LLM, fine-tuned on relatively small datasets, can perform this task effectively, indicating that linguistic cues alone can support coarse-grained referent localization. However, the analysis also highlights the inherently multimodal nature of the task, exposing fundamental limitations of unimodal (text-only) approaches. The study thus provides a purely text-based baseline for visual referring expression understanding and prompts reflection on the relationship between linguistic and visual representations.
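The summary describes casting span labeling as next-token prediction: the model learns to reproduce the dialogue text with boundary markers inserted around each mention. A minimal sketch of how such training targets could be constructed, assuming single-character bracket markers and illustrative character offsets (neither is confirmed to be the paper's exact scheme):

```python
# Sketch: building a next-token-prediction target for mention-span
# detection. The LM would be trained to emit the turn verbatim, with
# hypothetical "[" / "]" boundary tokens wrapped around each mention.

def mark_mentions(turn: str, spans: list[tuple[int, int]]) -> str:
    """Insert bracket markers around non-overlapping (start, end) char spans."""
    out, last = [], 0
    for start, end in sorted(spans):
        out.append(turn[last:start])
        out.append("[" + turn[start:end] + "]")
        last = end
    out.append(turn[last:])
    return "".join(out)

turn = "the red mug next to the laptop"
spans = [(0, 11), (20, 30)]  # "the red mug", "the laptop" (illustrative)
print(mark_mentions(turn, spans))  # [the red mug] next to [the laptop]
```

The marked string serves as the target sequence; detection then reduces to ordinary autoregressive decoding over text.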

πŸ“ Abstract
In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.
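Once the model emits a turn with demarcated span boundaries, the mentions still have to be read back out of the generated text. A minimal decoding sketch, assuming non-nested single-character bracket markers (an illustrative choice, not necessarily the paper's actual marker tokens):

```python
import re

def extract_spans(marked: str) -> list[str]:
    """Recover mention strings from bracket-annotated model output.

    Assumes non-nested "[...]" markers; nested or unbalanced brackets
    would need a more careful parser.
    """
    return re.findall(r"\[([^\[\]]+)\]", marked)

print(extract_spans("[the red mug] next to [the laptop]"))
# ['the red mug', 'the laptop']
```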
Problem

Research questions and friction points this paper is trying to address.

Detecting referring expressions in visually grounded dialogue
Assessing linguistic context's role in identifying visual referents
Evaluating text-only autoregressive models for mention span annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-only autoregressive language modeling approach
Pretrained LLM for mention span annotation
Parameter-efficient fine-tuning with small datasets
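The paper names LoRA as its parameter-efficient fine-tuning method but the listing does not specify an implementation. A hypothetical configuration sketch using Hugging Face's `peft` library (an assumption; hyperparameter values are illustrative, and `target_modules` names depend on the base model's architecture):

```python
# Config-only sketch of LoRA adaptation for a causal (autoregressive) LM.
# Values are illustrative defaults, not the paper's reported settings.
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,   # next-token prediction objective
    r=8,                            # low-rank adapter dimension
    lora_alpha=16,                  # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
)
# model = get_peft_model(base_model, lora_config)  # wraps a frozen base LM
```

Only the small adapter matrices are trained, which is what makes fine-tuning feasible with relatively small datasets, as the abstract notes.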
🔎 Similar Papers
No similar papers found.
Bram Willemsen
Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
Gabriel Skantze
Professor at KTH, PhD in Speech Communication and Technology
Conversational AI · Speech · Human-robot interaction · NLP