While recognizing actions, LMMs struggle to detect core interaction events

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study identifies a critical deficiency in large multimodal models (LMMs): their inability to accurately localize core interaction events, such as hand-object contact and release, in video action recognition, revealing a fundamental misalignment between semantic understanding and visual grounding. To address this, we introduce a first-of-its-kind, large-scale, fine-grained dataset (>20,000 annotated interactions) with frame-level temporal annotations and pixel-level spatial coordinates for contact and release events. Built on Something-Something-V2 and annotated by 250 Amazon Mechanical Turk (AMT) crowdworkers, our benchmark evaluates state-of-the-art LMMs, including Qwen-2.5VL and GPT-4o. Experiments demonstrate that while the models reliably classify actions and objects, their temporal localization error reaches ±3.7 frames and their spatial deviation exceeds 120 pixels. This work provides the first quantitative characterization of LMMs' limitations in interaction grounding, establishing a new benchmark and diagnostic toolkit for spatiotemporal multimodal modeling.
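For intuition, the two reported error measures reduce to simple per-event comparisons: an absolute frame offset for timing and a Euclidean pixel distance for location. The sketch below is a minimal illustration assuming per-event (prediction, annotation) pairs; the record layout and values are hypothetical, not the paper's evaluation code.

```python
import math

def temporal_error(pred_frame: int, gt_frame: int) -> int:
    """Absolute frame offset between a predicted and an annotated event."""
    return abs(pred_frame - gt_frame)

def spatial_error(pred_xy, gt_xy) -> float:
    """Euclidean pixel distance between predicted and annotated locations."""
    return math.dist(pred_xy, gt_xy)

# Hypothetical per-event records: (predicted frame, annotated frame,
# predicted (x, y), annotated (x, y)); all values are made up for illustration.
events = [
    (14, 12, (410.0, 233.0), (305.0, 180.0)),
    (31, 35, (122.0, 98.0), (240.0, 150.0)),
]

mean_dt = sum(temporal_error(pf, gf) for pf, gf, _, _ in events) / len(events)
mean_dx = sum(spatial_error(pxy, gxy) for _, _, pxy, gxy in events) / len(events)
print(f"mean temporal error: {mean_dt:.1f} frames, mean spatial error: {mean_dx:.1f} px")
```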

📝 Abstract
Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings, and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first-of-its-kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 Amazon Mechanical Turk (AMT) human annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release'). We asked two LMMs (Qwen-2.5VL and GPT-4o) to locate these events in short videos, each containing a single event. The results show that although the models can reliably name the target objects, identify the action, and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends and cannot localize the event within the scene. Our findings suggest that, in struggling to pinpoint the moment and location of physical contact that defines an interaction, the models lack the perceptual grounding required for a deeper understanding of dynamic scenes.
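As a rough picture of how such "when and where" queries can be posed to an LMM, here is a minimal sketch that sends sampled frames of one clip to GPT-4o via the official openai Python client and asks for a frame index and a pixel location. The prompt wording, frame paths, and answer format are assumptions for illustration, not the authors' actual protocol.

```python
import base64
from openai import OpenAI  # official openai Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_frame(path: str) -> dict:
    """Pack one video frame as a base64 image part for the chat API."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

# Hypothetical frame paths sampled from one clip at a fixed stride.
frames = [encode_frame(f"clip/frame_{i:03d}.jpg") for i in range(0, 40, 5)]
question = ("These images are consecutive frames of a short video. "
            "In which frame does the hand first make contact with the object, "
            "and at which (x, y) pixel location? Answer as: frame=<i>, x=<x>, y=<y>.")

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": [{"type": "text", "text": question}, *frames]}],
)
print(resp.choices[0].message.content)
```

Parsing the model's "frame=, x=, y=" reply against the human annotation then yields exactly the temporal and spatial errors sketched above.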
Problem

Research questions and friction points this paper is trying to address.

LMMs fail to detect when interactions begin or end
Models struggle to localize physical contact events in videos
LMMs lack perceptual grounding for dynamic scene understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced large-scale dataset with annotated interactions (see the annotation sketch after this list)
Evaluated LMMs on pinpointing contact and release events
Found models lack perceptual grounding for event localization
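As a concrete picture of the kind of record such a dataset implies, here is a hypothetical annotation for a single event, pairing a frame-level timestamp with a pixel-level coordinate. All field names and values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical annotation record for one interaction event; every field
# name and value here is illustrative, not the dataset's actual schema.
annotation = {
    "video_id": "ssv2_000000",      # placeholder id for a Something-Something-V2 clip
    "event_type": "contact",        # 'contact' (attach) or 'release' (detach)
    "frame_index": 23,              # frame-level temporal annotation
    "location_xy": [318.5, 207.0],  # pixel-level spatial coordinate of the event
    "object": "cup",                # target object named by annotators
}
```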
Daniel Harari
Research Associate, Weizmann Institute of Science
Computer vision, Deep and Machine learning, Artificial intelligence, Scene understanding, Human
Michael Sidorov
Weizmann Institute of Science
Liel David
Weizmann Institute of Science
Chen Shterental
Weizmann Institute of Science
Abrham Kahsay Gebreselasie
Mohamed bin Zayed University of Artificial Intelligence
Muhammad Haris Khan
Mohamed bin Zayed University of Artificial Intelligence