🤖 AI Summary
In visual storytelling, large vision-language models frequently produce inconsistent entity references and referential hallucinations because they lack explicit cross-frame entity identity resolution. To address this, we propose a contrastive reinforcement learning framework: it constructs synthetic negative samples that contrast coherent narratives with stories built from unrelated image sequences, and introduces a dual-component reward function to optimize entity-association decisions. Combining Direct Preference Optimization (DPO) with joint entity grounding and re-identification training, we fine-tune Qwen2.5-VL 7B on an extended Story Reasoning dataset, yielding Qwen Storyteller. Experiments demonstrate substantial improvements in entity consistency: grounding mAP increases by 14.8% to 0.31; F1 rises by 17.1% to 0.41; persistence of entities appearing in five or more frames improves by 13.7% to 33.3%; and the rate of well-structured stories reaches 97.5%, a 23.3% relative gain.
📝 Abstract
Visual storytelling systems, particularly large vision-language models, struggle to maintain character and object identity across frames,
often failing to recognize when entities in different images represent the same individuals or objects,
leading to inconsistent references and referential hallucinations.
This occurs because models lack explicit training on when to establish entity connections across frames.
We propose a contrastive reinforcement learning approach that trains models to discriminate between coherent image sequences
and stories from unrelated images.
We extend the Story Reasoning dataset with synthetic negative examples to teach appropriate entity connection behavior.
We employ Direct Preference Optimization with a dual-component reward function that promotes grounding and re-identification of entities
in real stories while penalizing incorrect entity connections in synthetic contexts.
Using this contrastive framework, we fine-tune Qwen2.5-VL 7B, yielding Qwen Storyteller.
Evaluation shows improvements in grounding mAP from 0.27 to 0.31 (+14.8%) and F1 from 0.35 to 0.41 (+17.1%).
Pronoun grounding accuracy improved across all pronoun types except "its",
and cross-frame character and object persistence increased
across all frame counts, with entities appearing in 5 or more frames advancing from 29.3% to 33.3% (+13.7%).
Well-structured stories, containing the chain-of-thought and grounded story, increased from 79.1% to 97.5% (+23.3%).
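The dual-component reward described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the tag representation, and the exact scoring are all assumptions. The idea it captures is that grounding and cross-frame re-identification are rewarded when the image sequence is a real, coherent one, while cross-frame entity connections are penalized when the sequence is a synthetic negative built from unrelated images.

```python
# Hypothetical sketch of a dual-component reward for entity association.
# All names and scoring details are illustrative assumptions.

def dual_component_reward(story_tags, is_real_sequence,
                          w_ground=1.0, w_penalty=1.0):
    """Score a generated story's entity grounding tags.

    story_tags: list of (entity_id, frame_idx) pairs extracted from the
        story's grounding annotations.
    is_real_sequence: True for a coherent image sequence, False for a
        synthetic negative assembled from unrelated images.
    """
    # Collect which frames each entity is referenced in.
    frames_per_entity = {}
    for entity_id, frame_idx in story_tags:
        frames_per_entity.setdefault(entity_id, set()).add(frame_idx)

    # Re-identification: entities linked across more than one frame.
    cross_frame_links = sum(
        1 for frames in frames_per_entity.values() if len(frames) > 1
    )
    grounded_mentions = len(story_tags)

    if is_real_sequence:
        # Component 1: reward grounding and cross-frame re-identification.
        return w_ground * (grounded_mentions + cross_frame_links)
    # Component 2: penalize spurious entity connections across
    # unrelated images.
    return -w_penalty * cross_frame_links
```

Under this sketch, candidate stories for the same input could be ranked by reward, with the higher-scoring candidate used as the "chosen" response and the lower-scoring one as the "rejected" response in a DPO preference pair.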