🤖 AI Summary
Current Large Language Vision Models (LLVMs) face fundamental bottlenecks in understanding Activities of Daily Living (ADL): they struggle to model fine-grained actions, complex human-object interactions (HOI), and view-invariant representations, primarily because ADL-specific instruction-tuning data and deep multimodal integration are both lacking. To address this, we propose a semi-automated curation framework and use it to build ADL-X, the first ADL-oriented, multi-view, multimodal RGB-S (RGB plus 3D skeleton) instruction-tuning dataset. On top of ADL-X we develop LLAVIDAL, a dedicated ADL language-vision model that integrates video, 3D skeletal sequences, and HOI cues to capture ADL's complex spatiotemporal relationships. Because jointly aligning all modalities at once yields suboptimal results, we train LLAVIDAL with MMPro (Multimodal Progressive training), a curriculum strategy that incorporates modalities in stages. We also establish ADL-specific multiple-choice question (MCQ) and video description benchmarks; trained on ADL-X, LLAVIDAL achieves state-of-the-art performance on established ADL benchmarks. The ADL-X dataset, LLAVIDAL model weights, and evaluation benchmarks will be publicly released.
📝 Abstract
Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with the fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. For training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy that incorporates modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance on ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks. Code and data will be made publicly available at: https://adl-x.github.io/.
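The staged curriculum behind MMPro can be sketched as follows. This is a minimal, purely illustrative mock-up of the idea of introducing modalities one stage at a time rather than aligning them all jointly; the class, function, and modality-key names are hypothetical and do not reflect the paper's actual architecture, losses, or training schedule.

```python
# Illustrative sketch of a progressive multimodal curriculum (MMPro-style).
# All names below (ToyLLVM, train_stage, mmpro_train) are hypothetical.

class ToyLLVM:
    """Stand-in model that records which modalities each update saw."""
    def __init__(self):
        self.seen = []

    def update(self, inputs):
        # A real model would compute a loss and take a gradient step here.
        self.seen.append(tuple(sorted(inputs)))

def train_stage(model, batches, active_modalities):
    """Run one curriculum stage, feeding only the active modalities."""
    for sample in batches:
        inputs = {m: sample[m] for m in active_modalities}
        model.update(inputs)

def mmpro_train(model, batches):
    # Stage 1: align the base modality (video) with language first.
    train_stage(model, batches, ["video"])
    # Stage 2: add 3D skeleton features once video alignment is in place.
    train_stage(model, batches, ["video", "skeleton"])
    # Stage 3: finally add human-object interaction (HOI) cues.
    train_stage(model, batches, ["video", "skeleton", "hoi"])

batches = [{"video": 0, "skeleton": 0, "hoi": 0}] * 2
model = ToyLLVM()
mmpro_train(model, batches)
```

The point of the staging is that early updates only have to solve the video-language alignment, and later stages refine an already-aligned model with skeleton and HOI signals, which the abstract reports works better than joint alignment from scratch.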