Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models

📅 2025-06-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the feasibility of extracting referring expressions from visually grounded dialogue using autoregressive large language models (LLMs) alone, i.e., identifying visually referable objects solely from linguistic context, without image input. Methodologically, the authors propose a next-token-prediction-based span labeling mechanism, coupled with parameter-efficient fine-tuning (e.g., LoRA), that formulates referring expression detection as a purely textual sequence labeling task. Experiments show that a moderately sized LLM, fine-tuned on relatively small datasets, can perform this task effectively, indicating that linguistic cues alone can support coarse-grained referent localization. However, the analysis also highlights the inherently multimodal nature of the task, exposing fundamental limitations of unimodal (text-only) approaches. The study thus provides a purely text-based baseline for visual referring expression understanding and prompts reflection on the relationship between linguistic and visual representations.
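The summary describes casting span labeling as next-token prediction: the model learns to reproduce the dialogue text with boundary markers inserted around each mention. A minimal sketch of how such training targets could be constructed, assuming single-character bracket markers and illustrative character offsets (neither is confirmed to be the paper's exact scheme):

```python
# Sketch: building a next-token-prediction target for mention-span
# detection. The LM would be trained to emit the turn verbatim, with
# hypothetical "[" / "]" boundary tokens wrapped around each mention.

def mark_mentions(turn: str, spans: list[tuple[int, int]]) -> str:
    """Insert bracket markers around non-overlapping (start, end) char spans."""
    out, last = [], 0
    for start, end in sorted(spans):
        out.append(turn[last:start])
        out.append("[" + turn[start:end] + "]")
        last = end
    out.append(turn[last:])
    return "".join(out)

turn = "the red mug next to the laptop"
spans = [(0, 11), (20, 30)]  # "the red mug", "the laptop" (illustrative)
print(mark_mentions(turn, spans))  # [the red mug] next to [the laptop]
```

The marked string serves as the target sequence; detection then reduces to ordinary autoregressive decoding over text.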

πŸ“ Abstract
In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.
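Once the model emits a turn with demarcated span boundaries, the mentions still have to be read back out of the generated text. A minimal decoding sketch, assuming non-nested single-character bracket markers (an illustrative choice, not necessarily the paper's actual marker tokens):

```python
import re

def extract_spans(marked: str) -> list[str]:
    """Recover mention strings from bracket-annotated model output.

    Assumes non-nested "[...]" markers; nested or unbalanced brackets
    would need a more careful parser.
    """
    return re.findall(r"\[([^\[\]]+)\]", marked)

print(extract_spans("[the red mug] next to [the laptop]"))
# ['the red mug', 'the laptop']
```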
Problem

Research questions and friction points this paper is trying to address.

Detecting referring expressions in visually grounded dialogue
Assessing linguistic context's role in identifying visual referents
Evaluating text-only autoregressive models for mention span annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-only autoregressive language modeling approach
Pretrained LLM for mention span annotation
Parameter-efficient fine-tuning with small datasets
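The paper names LoRA as its parameter-efficient fine-tuning method but the listing does not specify an implementation. A hypothetical configuration sketch using Hugging Face's `peft` library (an assumption; hyperparameter values are illustrative, and `target_modules` names depend on the base model's architecture):

```python
# Config-only sketch of LoRA adaptation for a causal (autoregressive) LM.
# Values are illustrative defaults, not the paper's reported settings.
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,   # next-token prediction objective
    r=8,                            # low-rank adapter dimension
    lora_alpha=16,                  # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
)
# model = get_peft_model(base_model, lora_config)  # wraps a frozen base LM
```

Only the small adapter matrices are trained, which is what makes fine-tuning feasible with relatively small datasets, as the abstract notes.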
🔎 Similar Papers
No similar papers found.
Bram Willemsen
Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
Gabriel Skantze
Professor at KTH, PhD in Speech Communication and Technology
Conversational AI · Speech · Human-robot interaction · NLP