🤖 AI Summary
Existing multimodal learning lacks supervisory signals that stay consistent under editing operations, viewpoint changes, and scene interventions. This work proposes SceneForge, a framework that models intervention consistency as structured state transitions within an editable 3D world. By applying explicit interventions (such as object removal or camera variation) and propagating their effects through semantic, geometric, and physical dependencies, SceneForge generates aligned supervision comprising counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections. Built upon editable indoor scenes created with Infinigen and Blender, the framework uses 3D scene graphs and physically based rendering to propagate interventions through the scene. Experiments show that, under matched training budgets, adding SceneForge supervision improves both quantitative and qualitative performance on scene editing and object removal tasks across multiple benchmarks.
📝 Abstract
Many multimodal learning tasks require supervision that remains consistent across edits, viewpoints, and scene-level interventions. However, such supervision is difficult to obtain from observation-level datasets, which do not expose the underlying scene state or how changes propagate through it. We present SceneForge, an intervention-driven framework that generates structured supervision from editable 3D world states. SceneForge represents each scene as a persistent world with semantic, geometric, and physical dependencies. By applying explicit interventions (e.g., object removal or camera variation) and propagating their effects through scene dependencies, SceneForge renders supervision that remains consistent with object structure and scene-level effects. This produces aligned outputs including counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections, all derived from a shared world state rather than post hoc image-space processing. We instantiate SceneForge using Infinigen and Blender to construct a licensing-clean indoor supervision resource with a large number of counterfactual pairs and aligned annotations from over 2K scenes, covering both diverse single-view and registered multi-view settings. Under matched training budgets, incorporating SceneForge supervision improves both object removal and scene editing performance across multiple benchmarks in both quantitative and qualitative evaluation. These results indicate that modeling supervision as structured state transitions in editable worlds provides a practical and scalable foundation for intervention-consistent multimodal learning.
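To make the idea of "supervision as a state transition" concrete, the following is a minimal Python sketch, not the SceneForge implementation. It assumes a toy world state made of named objects and typed dependency edges; the names `WorldState`, `Dependency`, `remove_object`, and `move_camera` are hypothetical and do not come from the paper, Infinigen, or Blender. An object-removal intervention returns a new state plus the list of effects (such as shadows and reflections) that the removed object was causing, which is the information a renderer would need to produce a consistent counterfactual pair.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Dependency:
    """Hypothetical edge: `source` causes an effect of type `kind` on `target`."""
    source: str
    target: str
    kind: str  # e.g. "shadow" or "reflection"


@dataclass(frozen=True)
class WorldState:
    """Toy stand-in for an editable 3D world (assumed structure, not the paper's)."""
    objects: FrozenSet[str]
    dependencies: FrozenSet[Dependency]
    camera: str = "cam_0"


def remove_object(state: WorldState, name: str) -> Tuple[WorldState, List[Tuple[str, str]]]:
    """Apply an object-removal intervention as a state transition.

    Returns the new state and the (kind, target) effects that the removed
    object was causing, so they can be dropped from the counterfactual render.
    """
    if name not in state.objects:
        raise KeyError(f"unknown object: {name}")
    # Effects originating from the removed object must disappear with it.
    affected = sorted((d.kind, d.target) for d in state.dependencies if d.source == name)
    # Drop every dependency edge that touches the removed object.
    remaining = frozenset(d for d in state.dependencies if name not in (d.source, d.target))
    new_state = replace(state, objects=state.objects - {name}, dependencies=remaining)
    return new_state, affected


def move_camera(state: WorldState, camera: str) -> WorldState:
    """Camera variation changes only the viewpoint; the scene content is shared."""
    return replace(state, camera=camera)


# Usage: remove a lamp and see which effects the counterfactual must lose.
before = WorldState(
    objects=frozenset({"lamp", "table", "floor", "mirror"}),
    dependencies=frozenset({
        Dependency("lamp", "table", "shadow"),
        Dependency("lamp", "mirror", "reflection"),
        Dependency("table", "floor", "shadow"),
    }),
)
after, affected = remove_object(before, "lamp")
second_view = move_camera(after, "cam_1")
```

Because both the original and the edited state are immutable values, the "before" and "after" renders, as well as any extra viewpoints, all derive from the same underlying description rather than from image-space edits. A real system would propagate through richer semantic, geometric, and physical dependencies than the one-hop edges assumed here.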