🤖 AI Summary
This study investigates the performance bottlenecks of multimodal large language models (MLLMs) in fine-grained human action recognition. To probe and address them, we introduce EPIC-KITCHENS-100-MQA, a multiple-choice question-answering benchmark built on real-world, first-person video that systematically reformulates the large-scale EPIC-KITCHENS-100 action dataset into a structured QA format. Methodologically, we propose an end-to-end fine-tuning paradigm that integrates video-text alignment, hard negative sampling, instruction augmentation, and contrastive learning. Experiments show that this approach achieves state-of-the-art performance on the EPIC-KITCHENS-100 validation set and outperforms GPT-4o by 21 percentage points on EPIC-KITCHENS-100-MQA. It also delivers consistent improvements across five further video understanding benchmarks, including EgoSchema and PerceptionTest, validating the efficacy of the QA paradigm in enhancing MLLMs' action comprehension capabilities.
📝 Abstract
Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. Recently developed multi-modal large language models (MLLMs) are promising candidates for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs' ability to perform action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, into a video multiple-choice question-answering benchmark (EPIC-KITCHENS-100-MQA). We show that when difficult incorrect answers are sampled as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' action recognition ability, achieving state-of-the-art performance on the EPIC-KITCHENS-100 validation set and outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME, and MVBench, suggesting that MLLMs are a promising path forward for complex action tasks. Code and models are available at: https://github.com/AdaptiveMotorControlLab/LLaVAction.
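The hard-distractor idea described above, building multiple-choice questions whose incorrect options are deliberately close to the ground-truth action, can be sketched as follows. This is a hedged illustration, not the authors' pipeline: `make_mqa`, the word-overlap similarity proxy, and the toy action list are assumptions standing in for whatever learned similarity the paper actually uses.

```python
# Hypothetical sketch of hard-negative distractor sampling for an MQA benchmark.
# Not the LLaVAction code: similarity is approximated by word overlap here.
import random


def similarity(a: str, b: str) -> float:
    """Crude proxy for semantic similarity: Jaccard overlap of words."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


def make_mqa(correct: str, candidates: list[str], n_distractors: int = 4, seed: int = 0) -> dict:
    """Build one multiple-choice question whose distractors are the
    candidate actions most similar to the correct one (hard negatives)."""
    pool = [c for c in candidates if c != correct]
    # Hard negatives first: sort wrong answers by similarity to the ground truth.
    pool.sort(key=lambda c: similarity(correct, c), reverse=True)
    options = [correct] + pool[:n_distractors]
    random.Random(seed).shuffle(options)  # avoid positional bias
    return {
        "question": "What action is the person performing?",
        "options": options,
        "answer_index": options.index(correct),
    }


# Toy example: distractors share the verb or noun with the ground truth.
actions = ["cut onion", "cut tomato", "peel onion", "wash onion", "open fridge", "pour milk"]
q = make_mqa("cut onion", actions)
```

With a learned embedding in place of `similarity`, the same construction yields the kind of confusable options under which, per the abstract, leading MLLMs struggle.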