🤖 AI Summary
This work proposes IMU-to-4D, a framework that sidesteps the privacy, safety, power-consumption, and scalability limitations of visual sensors by repurposing large language models for non-visual spatiotemporal perception. Using only inertial measurement unit (IMU) signals from everyday wearable devices such as earbuds, smartwatches, or smartphones, the method performs end-to-end joint reconstruction of 4D human motion and coarse 3D scene layouts. Experiments across diverse human-scene datasets show that IMU-to-4D produces more coherent and temporally stable 4D reconstructions than state-of-the-art cascaded pipelines. These results suggest that wearable IMU signals alone can support rich human-scene understanding without visual input.
📝 Abstract
Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision, where the goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. To this end, we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D takes data from a few inertial sensors in earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than state-of-the-art (SoTA) cascaded pipelines, suggesting that wearable motion sensors alone can support rich 4D understanding.