StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual storytelling systems frequently suffer from referential hallucinations caused by cross-frame character identity drift and action–subject mismatches. To address this, the authors introduce StoryReasoning, a visual narrative dataset of 4,178 stories (derived from 52,016 movie images) with explicit cross-frame character consistency. The approach combines multi-frame object re-identification (visual similarity matching plus face detection and recognition), chain-of-thought reasoning for explicit narrative modeling, structured tabular scene representations, and a grounding scheme that links textual elements to visual entities across frames. Fine-tuning Qwen2.5-VL 7B on this data yields Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while keeping object references consistent; compared to the non-fine-tuned model, it reduces average hallucinations per story from 4.06 to 3.56 (−12.3%).
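The cross-frame re-identification step can be illustrated with a minimal sketch: detections in consecutive frames are matched greedily by cosine similarity of their appearance embeddings, and a match above a threshold inherits the earlier frame's stable character ID. This is an illustrative simplification, not the paper's implementation; the function name `reidentify`, the dictionary layout, and the threshold value are assumptions, and the paper additionally uses face detection and recognition.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reidentify(frame_a, frame_b, threshold=0.8):
    """Greedily match detections across two frames by embedding similarity.

    frame_a, frame_b: dicts mapping detection ID -> appearance embedding.
    Returns a dict mapping IDs in frame_a to their best match in frame_b;
    detections with no match above `threshold` are left unmatched
    (e.g. a character who left the scene).
    """
    matches, used = {}, set()
    for id_a, emb_a in frame_a.items():
        best_id, best_sim = None, threshold
        for id_b, emb_b in frame_b.items():
            if id_b in used:
                continue
            sim = cosine_sim(emb_a, emb_b)
            if sim > best_sim:
                best_id, best_sim = id_b, sim
        if best_id is not None:
            matches[id_a] = best_id
            used.add(best_id)  # enforce one-to-one matching
    return matches
```

Chaining these pairwise matches across all frames of a story yields the stable entity IDs that the grounded narrative can then refer to.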

📝 Abstract
Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects, frequently leading to referential hallucinations. These issues can be addressed through grounding of characters, objects, and other entities on the visual elements. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames. We establish baseline performance by fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story. Evaluation demonstrates a reduction from 4.06 to 3.56 (-12.3%) hallucinations on average per story when compared to a non-fine-tuned model.
Problem

Research questions and friction points this paper is trying to address.

Maintaining character identity across visual storytelling frames
Linking actions to correct subjects to avoid hallucinations
Grounding textual elements to visual entities consistently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-frame object re-identification via visual similarity
Chain-of-thought reasoning for narrative modeling
Visual-textual grounding across multiple frames
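The grounding idea above can be sketched as a consistency check: if every entity mention in the generated story must resolve to an entry in the structured entity table, unresolved mentions surface as candidate referential hallucinations. The `<char1>`-style tag syntax, the table layout, and the function name `check_grounding` are illustrative assumptions, not the paper's actual format.

```python
import re

def check_grounding(story_text, entity_table):
    """Return entity tags in the story that do not resolve to a known entity.

    story_text: grounded story with inline tags such as "<char1>".
    entity_table: list of dicts, each with a stable "entity_id" key
    (the IDs produced by cross-frame re-identification).
    """
    tags = re.findall(r"<(\w+)>", story_text)
    known = {row["entity_id"] for row in entity_table}
    # Any tag outside the table is an ungrounded reference.
    return [t for t in tags if t not in known]
```

For example, if the entity table lists only `char1` and `obj1`, a story mentioning `<char2>` would be flagged, mirroring the kind of referential hallucination the dataset is designed to reduce.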