🤖 AI Summary
State-of-the-art (SOTA) methods for real-time full-body motion generation from sparse sensor inputs in AR/VR rely on long-sequence temporal modeling, which results in high computational overhead, accumulated temporal noise, and poor deployability on edge devices. This paper proposes the first lightweight MLP architecture based on a sliding temporal window: it partitions long sequences into short segments and models local temporal dependencies via a latent-space contextual fusion mechanism, achieving a significant reduction in complexity while preserving motion coherence. The method operates solely on sparse keypoint inputs and requires neither pose priors nor graph-structured representations. Experiments demonstrate that the approach improves motion reconstruction accuracy by 12.3% over SOTA, reduces inference latency by 42%, and cuts memory footprint by 58%, enabling real-time (>30 FPS) full-body motion reconstruction on mobile devices.
📝 Abstract
For a seamless user experience in immersive AR/VR applications, efficient and effective neural network (NN) models are essential: body parts that the limited sensors cannot capture must be generated by these models to obtain a complete 3D full-body reconstruction in the virtual environment. However, state-of-the-art NN models are typically computationally expensive, and they rely on long sequences of sparse tracking inputs to capture the temporal context needed to generate full-body movements. Longer sequences inevitably increase the computational overhead and introduce noise through long-range temporal dependencies, which adversely affects generation performance. In this paper, we propose a novel Multi-Layer Perceptron (MLP)-based method that improves overall performance while balancing computational cost and memory overhead for efficient 3D full-body generation. Specifically, we introduce an NN mechanism that divides the long input sequence into smaller temporal windows. The current motion is then merged with the information from these windows through latent representations, so that past context is used for generation. Our experiments demonstrate that, with this mechanism, our method achieves significantly higher generation accuracy than state-of-the-art methods while greatly reducing computational cost and memory overhead, making it suitable for resource-constrained devices.
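The windowing and latent merging described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify layer sizes, how window latents are combined, or how they are merged with the current frame. The sketch assumes non-overlapping windows, mean pooling over window latents, and concatenation with the current-frame latent. All function names and dimensions (`FEAT`, `WINDOW`, `LATENT`, `POSE`) are hypothetical, and the weights are random and untrained.

```python
import numpy as np


def init_mlp(rng, d_in, d_hidden, d_out):
    """Randomly initialise a two-layer MLP (untrained, for illustration only)."""
    return {
        "w1": rng.standard_normal((d_in, d_hidden)) * 0.1,
        "b1": np.zeros(d_hidden),
        "w2": rng.standard_normal((d_hidden, d_out)) * 0.1,
        "b2": np.zeros(d_out),
    }


def mlp(params, x):
    """Apply the two-layer MLP with a ReLU between the layers."""
    hidden = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return hidden @ params["w2"] + params["b2"]


def split_into_windows(seq, window):
    """Cut a (T, D) sequence into non-overlapping windows of `window` frames.

    The oldest frames are dropped when T is not a multiple of `window`.
    Returns an array of shape (T // window, window, D).
    """
    num_frames, feat_dim = seq.shape
    num_windows = num_frames // window
    recent = seq[num_frames - num_windows * window:]
    return recent.reshape(num_windows, window, feat_dim)


def generate_pose(past, current, window, enc_window, enc_current, decoder):
    """Merge the current frame with past windows in latent space (assumed scheme)."""
    windows = split_into_windows(past, window)
    if windows.shape[0] == 0:
        raise ValueError("need at least one full window of past frames")
    flat = windows.reshape(windows.shape[0], -1)    # (num_windows, window * D)
    window_latents = mlp(enc_window, flat)          # (num_windows, latent)
    context = window_latents.mean(axis=0)           # assumption: mean pooling
    current_latent = mlp(enc_current, current)      # (latent,)
    fused = np.concatenate([current_latent, context])  # assumption: concatenation
    return mlp(decoder, fused)                      # (pose_dim,)


# Hypothetical sizes: sparse-input features, window length, latent size, pose size.
FEAT, WINDOW, LATENT, POSE = 18, 10, 32, 66

rng = np.random.default_rng(0)
enc_window = init_mlp(rng, WINDOW * FEAT, 64, LATENT)
enc_current = init_mlp(rng, FEAT, 64, LATENT)
decoder = init_mlp(rng, 2 * LATENT, 64, POSE)

past = rng.standard_normal((40, FEAT))   # 40 past frames of sparse tracking input
current = rng.standard_normal(FEAT)      # the current frame
pose = generate_pose(past, current, WINDOW, enc_window, enc_current, decoder)
print(pose.shape)  # (66,)
```

In this sketch each window encoder sees only `WINDOW * FEAT` values, so the encoder's input layer scales with the window length rather than with the full sequence length, which is one plausible reading of how short windows keep an MLP small.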