🤖 AI Summary
Current Large Language Vision Models (LLVMs) face fundamental bottlenecks in understanding Activities of Daily Living (ADL): they struggle to model fine-grained actions, complex human-object interactions (HOI), and view-invariant representations, primarily because ADL-specific instruction-tuning data and deep multimodal integration are both lacking. To address this, we propose a semi-automated curation framework and use it to build ADL-X, the first ADL-oriented, multi-view, multimodal RGB-S (RGB plus 3D skeleton) instruction-tuning dataset. On top of ADL-X we develop LLAVIDAL, a dedicated ADL language-vision model that integrates video, 3D skeletal sequences, and HOI cues to capture ADL's complex spatiotemporal relationships. Because jointly aligning all modalities at once yields suboptimal results, we train LLAVIDAL with MMPro (Multimodal Progressive training), a curriculum strategy that incorporates modalities in stages. We also establish ADL-specific multiple-choice question (MCQ) and video description benchmarks; trained on ADL-X, LLAVIDAL achieves state-of-the-art performance on established ADL benchmarks. The ADL-X dataset, LLAVIDAL model weights, and evaluation benchmarks will be publicly released.
📝 Abstract
Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with the fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. For training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy that incorporates modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance on ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks. Code and data will be made publicly available at: https://adl-x.github.io/.
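The staged curriculum behind MMPro can be sketched as follows. This is a minimal, purely illustrative mock-up of the idea of introducing modalities one stage at a time rather than aligning them all jointly; the class, function, and modality-key names are hypothetical and do not reflect the paper's actual architecture, losses, or training schedule.

```python
# Illustrative sketch of a progressive multimodal curriculum (MMPro-style).
# All names below (ToyLLVM, train_stage, mmpro_train) are hypothetical.

class ToyLLVM:
    """Stand-in model that records which modalities each update saw."""
    def __init__(self):
        self.seen = []

    def update(self, inputs):
        # A real model would compute a loss and take a gradient step here.
        self.seen.append(tuple(sorted(inputs)))

def train_stage(model, batches, active_modalities):
    """Run one curriculum stage, feeding only the active modalities."""
    for sample in batches:
        inputs = {m: sample[m] for m in active_modalities}
        model.update(inputs)

def mmpro_train(model, batches):
    # Stage 1: align the base modality (video) with language first.
    train_stage(model, batches, ["video"])
    # Stage 2: add 3D skeleton features once video alignment is in place.
    train_stage(model, batches, ["video", "skeleton"])
    # Stage 3: finally add human-object interaction (HOI) cues.
    train_stage(model, batches, ["video", "skeleton", "hoi"])

batches = [{"video": 0, "skeleton": 0, "hoi": 0}] * 2
model = ToyLLVM()
mmpro_train(model, batches)
```

The point of the staging is that early updates only have to solve the video-language alignment, and later stages refine an already-aligned model with skeleton and HOI signals, which the abstract reports works better than joint alignment from scratch.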