Rodent-Bench

📅 2026-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limitations of multimodal large language models (MLLMs) in annotating rodent behavioral videos by introducing the first neuroscience-oriented benchmark dataset, encompassing diverse behavioral paradigms and long-duration video sequences. The authors propose a comprehensive evaluation framework incorporating multiple metrics—second-wise accuracy, macro F1 score, mean average precision, mutual information, and Matthews correlation coefficient—to systematically assess the performance of state-of-the-art models such as Gemini-2.5-Pro and Qwen-VL-Max. Results reveal significant bottlenecks in temporal segmentation and fine-grained behavior recognition, with models showing only marginal success on a few behaviors such as grooming. These findings underscore the current inadequacy of MLLMs at understanding long-form videos and discriminating the subtle behavioral patterns critical to neuroscience research.

📝 Abstract
We present Rodent-Bench, a novel benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to annotate rodent behaviour footage. We evaluate state-of-the-art MLLMs, including Gemini-2.5-Pro, Gemini-2.5-Flash and Qwen-VL-Max, using this benchmark and find that none of these models perform strongly enough to be used as an assistant for this task. Our benchmark encompasses diverse datasets spanning multiple behavioral paradigms including social interactions, grooming, scratching, and freezing behaviors, with videos ranging from 10 minutes to 35 minutes in length. We provide two benchmark versions to accommodate varying model capabilities and establish standardized evaluation metrics including second-wise accuracy, macro F1, mean average precision, mutual information, and Matthews correlation coefficient. While some models show modest performance on certain datasets (notably grooming detection), overall results reveal significant challenges in temporal segmentation, handling extended video sequences, and distinguishing subtle behavioral states. Our analysis identifies key limitations in current MLLMs for scientific video annotation and provides insights for future model development. Rodent-Bench serves as a foundation for tracking progress toward reliable automated behavioral annotation in neuroscience research.
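To illustrate the second-wise evaluation the abstract describes, here is a minimal sketch of how three of the named metrics (second-wise accuracy, macro F1, and a one-vs-rest Matthews correlation coefficient) could be computed from two aligned per-second label sequences. The label names and data are invented for illustration; this is not the paper's evaluation code.

```python
# Hypothetical sketch: per-second behavioral metrics from two aligned
# label sequences (one label per second of video). Labels are illustrative.
import math

def second_wise_accuracy(gt, pred):
    """Fraction of seconds whose predicted label matches ground truth."""
    return sum(g == p for g, p in zip(gt, pred)) / len(gt)

def macro_f1(gt, pred):
    """Unweighted mean of per-class F1 scores over all observed classes."""
    classes = set(gt) | set(pred)
    f1s = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gt, pred))
        fp = sum(g != c and p == c for g, p in zip(gt, pred))
        fn = sum(g == c and p != c for g, p in zip(gt, pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

def mcc_one_vs_rest(gt, pred, positive):
    """Matthews correlation coefficient after binarising one behaviour class."""
    tp = sum(g == positive and p == positive for g, p in zip(gt, pred))
    tn = sum(g != positive and p != positive for g, p in zip(gt, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gt, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gt, pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy ground-truth and predicted annotations, one label per second.
gt   = ["groom", "groom", "freeze", "other", "other", "freeze"]
pred = ["groom", "other", "freeze", "other", "groom", "freeze"]

print(second_wise_accuracy(gt, pred))
print(macro_f1(gt, pred))
print(mcc_one_vs_rest(gt, pred, "groom"))
```

Because every second carries exactly one label, chance agreement on the majority class can inflate raw accuracy; the macro F1 and MCC views are what expose weak performance on rare behaviours such as brief freezing bouts.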
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
rodent behavior annotation
video understanding
temporal segmentation
behavioral neuroscience
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rodent-Bench
Multimodal Large Language Models
behavioral video annotation
temporal segmentation
standardized evaluation metrics
👥 Authors

Thomas Heap — PhD Student, University of Bristol — Artificial Intelligence
Laurence Aitchison — University of Bristol — Deep Learning
Emma N. Cahill — University of Bristol
Adriana Casado Rodriguez — University of Bristol