🤖 AI Summary
Existing 3D human reconstruction methods suffer either from sensor-specific biases in single-modal setups or from limited generalizability and robustness in multi-modal fusion approaches that depend on fixed hardware configurations and precise calibration. Conventional transformer architectures that couple point-cloud and image modalities via explicit geometric projection are sensitive to noise, pose variation, and modality dropout. This work proposes a universal adaptive multi-modal fusion framework: for the first time, heterogeneous, uncalibrated, multi-view inputs are uniformly represented as equal tokens; a learnable modality sampling module enables dynamic adaptation to varying input counts, modalities, and noise levels; and geometry-free, view-invariant tokenization replaces explicit projection and modality binding. On large-scale benchmarks, the method significantly outperforms state-of-the-art approaches, maintaining high accuracy and strong robustness under challenging conditions, including low-quality inputs, uncalibrated sensors, and partial modality absence.
📝 Abstract
Recent advancements in sensor technology and deep learning have led to significant progress in 3D human body reconstruction. However, most existing approaches rely on data from a single specific sensor, which can be unreliable due to the inherent limitations of individual sensing modalities. Additionally, existing multi-modal fusion methods generally require customized designs for specific sensor combinations or setups, which limits their flexibility and generality. Furthermore, conventional point-image projection-based and Transformer-based fusion networks are susceptible to noisy modalities and sensor poses. To address these limitations and achieve robust 3D human body reconstruction in various conditions, we propose AdaptiveFusion, a generic adaptive multi-modal multi-view fusion framework that can effectively incorporate arbitrary combinations of uncalibrated sensor inputs. By treating different modalities from various viewpoints as equal tokens and pairing them with a handcrafted modality sampling module that leverages the inherent flexibility of Transformer models, AdaptiveFusion copes with arbitrary numbers of inputs and accommodates noisy modalities with only a single training network. Extensive experiments on large-scale human datasets demonstrate the effectiveness of AdaptiveFusion in achieving high-quality 3D human body reconstruction in various environments. In addition, our method achieves superior accuracy compared to state-of-the-art fusion methods.
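The core interface described above, heterogeneous sensor inputs flattened into equal tokens and whole modalities randomly sampled during training so one network handles arbitrary subsets, can be sketched as follows. This is a minimal illustration in plain Python, not the paper's implementation; the names `tokenize`, `sample_modalities`, and `fuse` are hypothetical, and real tensors and the Transformer backbone are omitted.

```python
import random

def tokenize(features, modality_id):
    # Each input (image patches, point-cloud groups, ...) becomes a flat
    # list of tokens tagged only with a modality id; no camera calibration
    # or geometric projection is attached (the "geometry-free" idea).
    return [(modality_id, f) for f in features]

def sample_modalities(token_groups, keep_prob, rng=None):
    # Randomly drop whole modalities during training so a single network
    # learns to cope with arbitrary sensor subsets at test time.
    rng = rng or random.Random()
    kept = [g for g in token_groups if rng.random() < keep_prob]
    return kept or [token_groups[0]]  # always keep at least one input

def fuse(token_groups):
    # A Transformer would attend over the concatenation as one unordered
    # token set; here we just flatten to show the interface.
    return [tok for group in token_groups for tok in group]

groups = [
    tokenize([0.1, 0.2], modality_id="rgb_view0"),
    tokenize([0.3], modality_id="lidar_view1"),
]
tokens = fuse(sample_modalities(groups, keep_prob=1.0))
print(len(tokens))  # 3 tokens, regardless of which sensors produced them
```

Because every token carries only a modality tag rather than a fixed slot in the architecture, adding a view or losing a sensor only changes the length of the token set, which is exactly what a Transformer's attention can absorb without retraining.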