Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets identity drift, background inconsistency, and semantic degradation in long-form cinematic content remade through stylization or actor replacement; these failures compound across frequent shot transitions and viewpoint changes. The authors propose Soap2Soap, a multi-agent framework built around a Dual-Bridge Consistency mechanism: a scene-aware JSON screenplay acts as a persistent semantic backbone, and visual reference anchors are allocated dynamically at both the scene and shot levels. To suppress drift before video synthesis, a grid-based formulation generates multiple keyframes jointly in a shared latent context. A closed-loop verification agent then audits identity, stability, and alignment and triggers selective regeneration where needed. On the SoapBench benchmark, the method reports strong improvements over commercial video generation APIs in long-term consistency and narrative fidelity.
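To make the idea of a scene-level JSON script paired with visual reference anchors more concrete, here is a minimal Python sketch. The paper's actual schema is not given in this summary, so every field name (`scene_id`, `anchor_ids`, `shots`, and so on) and the helper `anchors_for` are hypothetical, chosen only to illustrate how scene-level and shot-level anchors might be combined for a single shot.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Shot:
    # All field names are illustrative assumptions, not the paper's schema.
    shot_id: str
    description: str
    characters: List[str]
    anchor_ids: List[str] = field(default_factory=list)  # shot-level anchors


@dataclass
class Scene:
    scene_id: str
    location: str
    anchor_ids: List[str]  # scene-level anchors shared by every shot
    shots: List[Shot]


def anchors_for(scene: Scene, shot_id: str) -> List[str]:
    """Return scene-level anchors followed by shot-level ones, without duplicates."""
    shot = next(s for s in scene.shots if s.shot_id == shot_id)
    merged = list(scene.anchor_ids)
    for anchor in shot.anchor_ids:
        if anchor not in merged:
            merged.append(anchor)
    return merged


scene = Scene(
    scene_id="s1",
    location="kitchen, evening",
    anchor_ids=["bg_kitchen", "char_anna_front"],
    shots=[
        Shot("s1_shot1", "Anna enters and puts down her keys", ["Anna"]),
        Shot(
            "s1_shot2",
            "Close-up on Anna from the side",
            ["Anna"],
            anchor_ids=["char_anna_profile", "char_anna_front"],
        ),
    ],
)

# The serialized script is what would persist across agents as the semantic backbone.
script_json = json.dumps(asdict(scene), indent=2)
shot2_anchors = anchors_for(scene, "s1_shot2")
```

In this sketch the script stays fixed for the whole scene while the anchor list varies per shot, which is one plausible reading of "dynamically allocated" anchors.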

📝 Abstract

We study series-level cinematic remaking, a long-horizon video-to-video generation problem that localizes full episodes or films via stylization or actor replacement while strictly preserving narrative structure, motion choreography, and character identity across hundreds of shots. Existing video generation and editing pipelines often break down in this regime due to compounding identity drift, background mutation, and semantic erosion under large camera motions and viewpoint changes. We propose Soap2Soap, a multi-agent framework that enforces long-term language-visual consistency through a Dual-Bridge Consistency mechanism: a scene-aware JSON screenplay serving as a persistent semantic backbone, and dynamically allocated visual reference anchors at both scene and shot levels. To suppress drift before video synthesis, we introduce batch keyframe consistency, jointly generating multiple keyframes in a shared latent context via a grid-based formulation. A closed-loop verification agent further audits identity, stability, and alignment to trigger selective regeneration. Experiments on SoapBench demonstrate strong improvements over commercial video generation APIs in long-term consistency and narrative fidelity.
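The batch keyframe step and the closed-loop verification described in the abstract can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `grid_layout`, `verify_and_regenerate`, the `threshold` and `max_rounds` parameters, and the toy generator and auditor are all hypothetical, and the real system presumably calls image or video models where the stubs return strings.

```python
from typing import Callable, Dict, List, Tuple


def grid_layout(n_keyframes: int, cols: int) -> Dict[int, Tuple[int, int]]:
    """Assign each keyframe index a (row, col) cell on one shared grid canvas."""
    return {i: divmod(i, cols) for i in range(n_keyframes)}


def verify_and_regenerate(
    shot_ids: List[str],
    generate: Callable[[List[str]], Dict[str, str]],
    audit: Callable[[str], float],
    threshold: float = 0.8,   # assumed pass score
    max_rounds: int = 3,      # assumed regeneration budget
) -> Tuple[Dict[str, str], List[str]]:
    """Generate all keyframes jointly, then regenerate only those failing the audit."""
    keyframes = generate(shot_ids)
    failed: List[str] = []
    for round_idx in range(max_rounds + 1):
        failed = [s for s in shot_ids if audit(keyframes[s]) < threshold]
        if not failed or round_idx == max_rounds:
            break
        # Selective regeneration: passing keyframes are left untouched.
        keyframes.update(generate(failed))
    return keyframes, failed


# Toy stand-ins for the generator and the verification agent.
attempts: Dict[str, int] = {}


def toy_generate(ids: List[str]) -> Dict[str, str]:
    out = {}
    for s in ids:
        attempts[s] = attempts.get(s, 0) + 1
        out[s] = f"{s}-v{attempts[s]}"
    return out


def toy_audit(keyframe: str) -> float:
    # Pretend the first attempt for shot_b drifts in identity.
    return 0.5 if keyframe == "shot_b-v1" else 0.9


cells = grid_layout(5, cols=3)
keyframes, still_failing = verify_and_regenerate(
    ["shot_a", "shot_b", "shot_c"], toy_generate, toy_audit
)
```

With these stubs, only `shot_b` is regenerated; the other two keyframes keep their first version, which is the point of auditing per keyframe rather than rerunning the whole batch.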

Problem

Research questions and friction points this paper is trying to address.

long-horizon video generation

cinematic remaking

identity consistency

narrative structure preservation

video-to-video translation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent collaboration

long-horizon video generation

Dual-Bridge Consistency