🤖 AI Summary
Existing digital twin approaches predominantly focus on visual modeling, neglecting the critical role of acoustics in spatial realism and interactive experience. This paper introduces the first editable audio-visual digital twin system built entirely on commodity smartphones, overcoming the limitations of vision-only reconstruction by jointly modeling and co-editing geometry, surface materials, and acoustic fields. Our method integrates smartphone-captured room impulse responses (RIRs), vision-guided acoustic field estimation, differentiable acoustic rendering, and neural surface material inversion. It enables real-time interactive editing of geometric layouts and material properties, with synchronized updates of high-fidelity audio-visual renderings. Experiments conducted in real-world rooms demonstrate accurate geometric-acoustic reconstruction and consistent cross-modal editing performance. Crucially, the system requires no specialized acoustic hardware, substantially lowering the barrier to audio-visual digital twin construction.
📝 Abstract
Digital twins today are almost entirely visual, overlooking acoustics, a core component of spatial realism and interaction. We introduce AV-Twin, the first practical system that constructs editable audio-visual digital twins using only commodity smartphones. AV-Twin combines mobile RIR capture with a vision-assisted acoustic field model to efficiently reconstruct room acoustics. It further recovers per-surface material properties through differentiable acoustic rendering, enabling users to modify materials, geometry, and layout while automatically updating both audio and visuals. Together, these capabilities establish a practical path toward fully modifiable audio-visual digital twins for real-world environments.
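The per-surface material inversion described above can be illustrated with a toy sketch. This is not the paper's actual differentiable acoustic renderer: purely for illustration, it fits per-surface absorption coefficients to a target reverberation time (RT60) using Sabine's formula and hand-derived gradients. The room volume, surface areas, and target RT60 are all assumed values, not from the paper.

```python
import numpy as np

# Hypothetical shoebox room: volume and per-surface areas
# (floor, ceiling, four walls) -- assumed values for illustration.
V = 60.0                                             # volume in m^3
S = np.array([20.0, 20.0, 12.0, 12.0, 15.0, 15.0])   # areas in m^2

def rt60(alpha):
    # Sabine's formula: RT60 = 0.161 * V / (total absorption in sabins)
    return 0.161 * V / np.sum(S * alpha)

target_rt60 = 0.5          # "measured" reverberation time in seconds (assumed)
alpha = np.full(6, 0.2)    # initial guess for per-surface absorption

lr = 1e-3
for _ in range(2000):
    A = np.sum(S * alpha)           # total absorption
    pred = 0.161 * V / A
    # Gradient of squared error w.r.t. each alpha_i:
    # d(pred)/d(alpha_i) = -0.161 * V / A^2 * S_i
    grad = 2.0 * (pred - target_rt60) * (-0.161 * V / A**2) * S
    alpha = np.clip(alpha - lr * grad, 0.01, 0.99)   # keep physically valid

print(round(rt60(alpha), 3))  # → 0.5
```

In the real system this loss would be computed over full rendered RIRs rather than a single scalar RT60, but the same gradient-based inversion principle applies: differentiate the acoustic forward model with respect to material parameters and descend on the mismatch with the captured measurement.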