🤖 AI Summary
Current vision-language models (VLMs) struggle to count repeated actions in video, such as push-ups, which points to weak temporal modeling. This work introduces PushupBench, a benchmark of 446 long-form video clips (average length 36.7 s) for evaluating fine-grained repetition counting in VLMs. Benchmarking open-source and closed-source models via the lmms-eval framework shows that the best frontier model reaches only 42.1% exact-count accuracy, while open-source 4B-parameter models score about 6%, on par with supervised baselines. Exact accuracy alone is misleading, because weaker models tend to output the most common count instead of reasoning over time. Fine-tuning on counting with 1k samples improves results not only on the counting task but also on general video understanding benchmarks: MVBench (+2.15), PerceptionTest (+1.88), and TVBench (+4.54). This suggests that counting ability is a proxy for broader temporal reasoning.
📝 Abstract
Large vision-language models (VLMs) can recognize \textit{what} happens in a video but fail to count \textit{how many} times it happens. We introduce \textbf{PushupBench}, a set of 446 long-form clips (avg. 36.7s) for evaluating repetition counting. The best frontier model achieves 42.1\% exact accuracy; open-source 4B models score $\sim$6\%, matching supervised baselines. We show that accuracy alone misleads: weaker models exploit the modal count rather than reason temporally. Fine-tuning on counting with 1k samples transfers to general video understanding, with gains on MVBench (+2.15), PerceptionTest (+1.88), and TVBench (+4.54), suggesting that counting is a proxy for broader temporal reasoning. PushupBench is incorporated into \texttt{lmms-eval} (https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/1262) and hosted at pushupbench.com.
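
The claim that exact accuracy alone can mislead is easy to see on a toy example. The sketch below is illustrative only: the data is invented, the function names are hypothetical, and mean absolute error is an assumed companion metric, not necessarily the one the paper reports. It compares a model that always outputs the modal count with one that tracks the true count closely.

```python
from collections import Counter


def exact_accuracy(preds, targets):
    """Fraction of clips whose predicted count equals the true count."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def modal_baseline_accuracy(targets):
    """Return the most frequent true count and the accuracy of always predicting it."""
    mode, freq = Counter(targets).most_common(1)[0]
    return mode, freq / len(targets)


def mean_absolute_error(preds, targets):
    """Average absolute gap between predicted and true counts (assumed metric)."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


# Invented toy data: true repetition counts for eight clips.
targets = [10, 10, 10, 12, 15, 8, 20, 10]

# A "guesser" that always outputs the modal count, and a "counter" that
# follows the true count but is sometimes off by one.
guesser = [10] * len(targets)
counter = [10, 11, 10, 12, 14, 8, 19, 9]

for name, preds in [("guesser", guesser), ("counter", counter)]:
    print(
        name,
        "exact:", exact_accuracy(preds, targets),
        "MAE:", mean_absolute_error(preds, targets),
        "distinct outputs:", len(set(preds)),
    )
print("modal baseline:", modal_baseline_accuracy(targets))
```

On this toy data both models reach the same exact accuracy, which is also the modal-baseline accuracy, yet the guesser's error is several times larger and it emits a single distinct value. Reporting the modal baseline and an error-based metric next to exact accuracy is one way to separate the two behaviors.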