🤖 AI Summary
In visual storytelling, large vision-language models frequently produce inconsistent entity references and referential hallucinations because they lack explicit cross-frame entity identity resolution. To address this, we propose a contrastive reinforcement learning framework: it constructs synthetic negative samples that contrast coherent narratives with stories built from unrelated image sequences, and introduces a dual-component reward function to optimize entity-association decisions. Combining Direct Preference Optimization (DPO) with joint entity grounding and re-identification training, we fine-tune Qwen2.5-VL 7B on an extended Story Reasoning dataset, yielding Qwen Storyteller. Experiments demonstrate substantial improvements in entity consistency: grounding mAP increases by 14.8% to 0.31; F1 rises by 17.1% to 0.41; persistence of entities appearing in five or more frames improves by 13.7% to 33.3%; and the rate of well-structured stories reaches 97.5%, a 23.3% relative gain.
📝 Abstract
Visual storytelling systems, particularly large vision-language models, struggle to maintain character and object identity across frames,
often failing to recognize when entities in different images represent the same individuals or objects,
leading to inconsistent references and referential hallucinations.
This occurs because models lack explicit training on when to establish entity connections across frames.
We propose a contrastive reinforcement learning approach that trains models to discriminate between coherent image sequences
and stories from unrelated images.
We extend the Story Reasoning dataset with synthetic negative examples to teach appropriate entity connection behavior.
We employ Direct Preference Optimization with a dual-component reward function that promotes grounding and re-identification of entities
in real stories while penalizing incorrect entity connections in synthetic contexts.
Using this contrastive framework, we fine-tune Qwen2.5-VL 7B, yielding Qwen Storyteller.
Evaluation shows improvements in grounding mAP from 0.27 to 0.31 (+14.8%) and F1 from 0.35 to 0.41 (+17.1%).
Pronoun grounding accuracy improved across all pronoun types except "its",
and cross-frame character and object persistence increased
across all frame counts, with entities appearing in 5 or more frames advancing from 29.3% to 33.3% (+13.7%).
Well-structured stories, containing the chain-of-thought and grounded story, increased from 79.1% to 97.5% (+23.3%).
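The dual-component reward described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the tag representation, and the exact scoring are all assumptions. The idea it captures is that grounding and cross-frame re-identification are rewarded when the image sequence is a real, coherent one, while cross-frame entity connections are penalized when the sequence is a synthetic negative built from unrelated images.

```python
# Hypothetical sketch of a dual-component reward for entity association.
# All names and scoring details are illustrative assumptions.

def dual_component_reward(story_tags, is_real_sequence,
                          w_ground=1.0, w_penalty=1.0):
    """Score a generated story's entity grounding tags.

    story_tags: list of (entity_id, frame_idx) pairs extracted from the
        story's grounding annotations.
    is_real_sequence: True for a coherent image sequence, False for a
        synthetic negative assembled from unrelated images.
    """
    # Collect which frames each entity is referenced in.
    frames_per_entity = {}
    for entity_id, frame_idx in story_tags:
        frames_per_entity.setdefault(entity_id, set()).add(frame_idx)

    # Re-identification: entities linked across more than one frame.
    cross_frame_links = sum(
        1 for frames in frames_per_entity.values() if len(frames) > 1
    )
    grounded_mentions = len(story_tags)

    if is_real_sequence:
        # Component 1: reward grounding and cross-frame re-identification.
        return w_ground * (grounded_mentions + cross_frame_links)
    # Component 2: penalize spurious entity connections across
    # unrelated images.
    return -w_penalty * cross_frame_links
```

Under this sketch, candidate stories for the same input could be ranked by reward, with the higher-scoring candidate used as the "chosen" response and the lower-scoring one as the "rejected" response in a DPO preference pair.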