FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

📅 2026-05-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a pronounced limitation of existing vision-language models in fine-grained human activity understanding: they struggle to discern subtle differences in actions, interactions, and object manipulations within complex scenes. The authors introduce FineBench, a benchmark comprising 64 long-form videos and 199,420 densely annotated multiple-choice questions that combines long-duration video, high-density questioning, and frame-level spatiotemporal grounding. They further propose FineAgent, a plug-and-play modular framework that uses Localizer and Descriptor components to strengthen spatial and temporal reasoning. Experiments show that FineAgent consistently improves the performance of various open-source models on FineBench; the evaluation also exposes critical deficiencies in current models regarding multi-person spatial relationships and fine-grained action discrimination.
📝 Abstract
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs.
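As a rough illustration of the figures quoted in the abstract, the sketch below computes the annotation density implied by 199,420 questions over 64 fifteen-minute videos, and shows one plausible way to score multiple-choice answers per category. The function names, the record layout, and the category labels are assumptions made for this example; the abstract does not describe the benchmark's actual data format or evaluation script.

```python
from collections import defaultdict


def qa_density(num_questions, num_videos, minutes_per_video):
    """Return (questions per video, questions per second of video)."""
    per_video = num_questions / num_videos
    per_second = per_video / (minutes_per_video * 60)
    return per_video, per_second


def accuracy_by_category(records):
    """Score multiple-choice answers per category.

    `records` is an iterable of (category, predicted_option, correct_option)
    tuples. This layout is hypothetical, chosen only for illustration.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, predicted, correct in records:
        totals[category] += 1
        hits[category] += int(predicted == correct)
    return {category: hits[category] / totals[category] for category in totals}


# Figures taken from the abstract: 199,420 QA pairs, 64 videos, 15 minutes each.
per_video, per_second = qa_density(199_420, 64, 15)

# Toy predictions (made up, not benchmark results), using the three focus
# areas named in the abstract as category labels.
toy_records = [
    ("person movement", "A", "A"),
    ("person movement", "B", "C"),
    ("person interaction", "D", "D"),
    ("object manipulation", "B", "B"),
]
toy_scores = accuracy_by_category(toy_records)
```

Under these figures the density works out to roughly 3,116 questions per video, or about 3.5 questions per second of footage, which gives a sense of what "very dense QA coverage" means here. Reporting accuracy per category rather than as a single number is one way to surface the kind of weakness the abstract highlights, such as lower scores on multi-person interaction questions.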
Problem

Research questions and friction points this paper is trying to address.

fine-grained human activity understanding
vision-language models
video question answering
spatial-temporal grounding
human-centric video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained Human Activity Understanding
Vision-Language Models
Video Question Answering
Spatial-Temporal Grounding
Modular Enhancement Framework