🤖 AI Summary
Visual storytelling models often misattribute dialogue, hallucinate character interactions, or infer inaccurate emotional states, even when they ground entities in the images correctly. To address this, the work introduces StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles using Longest Common Subsequence (LCS) matching. The alignment links character names from scripts to timestamps from subtitles, which enables dialogue attribution and lets generated stories carry authentic names, dialogue, and relationship dynamics. The authors fine-tune Qwen Storyteller3 on this dataset, building on prior work in visual grounding and entity re-identification. With DeepSeek V3 as the judge, Storyteller3 achieves an 89.9% win rate against the base Qwen2.5-VL 7B on subtitle alignment, and on dialogue attribution it reaches 48.5% versus 38.0% for Storyteller, which was trained without script grounding.
📝 Abstract
Visual storytelling models that correctly ground entities in images may still hallucinate semantic relationships, generating incorrect dialogue attribution, character interactions, or emotional states. We introduce StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles through LCS matching. Our alignment pipeline synchronizes screenplay dialogue with subtitle timestamps, enabling dialogue attribution by linking character names from scripts to temporal positions from subtitles. Using this aligned content, we generate stories that maintain visual grounding tags while incorporating authentic character names, dialogue, and relationship dynamics. We fine-tune Qwen Storyteller3 on this dataset, building on prior work in visual grounding and entity re-identification. Evaluation using DeepSeek V3 as judge shows that Storyteller3 achieves an 89.9% win rate against base Qwen2.5-VL 7B on subtitle alignment. Compared to Storyteller, trained without script grounding, Storyteller3 achieves 48.5% versus 38.0%, confirming that semantic alignment progressively improves dialogue attribution beyond visual grounding alone.
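To make the LCS matching step concrete, the following is a minimal sketch of how screenplay dialogue could be aligned with subtitle timestamps to attribute speakers. It is an illustration under stated assumptions, not the authors' pipeline: the token-level matching, the majority vote per subtitle, and the names `ScriptLine`, `Subtitle`, `normalize`, `lcs_pairs`, and `attribute_speakers` are all hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class ScriptLine:      # hypothetical container: one screenplay dialogue line
    character: str
    text: str


@dataclass
class Subtitle:        # hypothetical container: one timed subtitle cue
    start: float
    end: float
    text: str


def normalize(text):
    """Lowercase, drop punctuation, and split into word tokens."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()


def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to recover the matched positions.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def attribute_speakers(script, subtitles):
    """Assign each subtitle the character whose script tokens it matched most."""
    s_tokens, s_owner = [], []
    for idx, line in enumerate(script):
        for tok in normalize(line.text):
            s_tokens.append(tok)
            s_owner.append(idx)
    t_tokens, t_owner = [], []
    for idx, sub in enumerate(subtitles):
        for tok in normalize(sub.text):
            t_tokens.append(tok)
            t_owner.append(idx)
    votes = [Counter() for _ in subtitles]
    for i, j in lcs_pairs(s_tokens, t_tokens):
        votes[t_owner[j]][script[s_owner[i]].character] += 1
    # Subtitles with no matched tokens keep speaker None.
    return [
        (sub.start, sub.end, v.most_common(1)[0][0] if v else None, sub.text)
        for sub, v in zip(subtitles, votes)
    ]
```

Calling `attribute_speakers(script, subtitles)` returns one `(start, end, speaker, text)` tuple per subtitle, so each timestamped line carries a character name where a match was found. The dynamic-programming table uses memory proportional to the product of the two token counts, so a full-length film would likely need windowed or chunked matching; that choice is outside what the abstract specifies.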