🤖 AI Summary
Existing visual storytelling approaches predominantly rely on textual inputs alone, struggling to effectively integrate multimodal conditions such as character identity images, scene backgrounds, and shot types, which limits their customizability and cinematic expressiveness. To address this, this work proposes VstoryGen, a novel framework that, for the first time, jointly models these three conditioning modalities within a unified multimodal large language model and introduces an explicit shot-type control mechanism. Leveraging a parameter-efficient prompt-tuning strategy trained on cinematic data, VstoryGen enables customizable visual story generation with high consistency and diversity. Experimental results on two newly established evaluation benchmarks demonstrate that VstoryGen significantly outperforms existing methods in character-scene consistency, image-text alignment, and shot control, thereby enhancing narrative coherence and cinematic language expression.
📝 Abstract
Multimodal story customization aims to generate coherent story flows conditioned on textual descriptions, reference identity images, and shot types. While recent progress in story generation has shown promising results, most approaches rely on text-only inputs. A few studies incorporate character identity cues (e.g., facial ID), but lack broader multimodal conditioning. In this work, we introduce VstoryGen, a multimodal framework that integrates descriptions with character and background references to enable customizable story generation. To enhance cinematic diversity, we introduce shot-type control via parameter-efficient prompt tuning on movie data, enabling the model to generate sequences that more faithfully reflect cinematic grammar. To evaluate our framework, we establish two new benchmarks that assess multimodal story customization from the perspectives of character and scene consistency, text-visual alignment, and shot-type control. Experiments demonstrate that VstoryGen achieves improved consistency and cinematic diversity compared to existing methods.