PushupBench: Your VLM is not good at counting pushups

📅 2026-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models (VLMs) struggle with video-based repetitive action counting, such as counting push-ups, which points to weak temporal reasoning. To address this gap, this work introduces PushupBench, a dataset of 446 long-form video clips (36.7 s on average) for evaluating VLMs on repetition counting. Benchmarking open-source and closed-source models via the lmms-eval framework shows that the best frontier model reaches only 42.1% exact-count accuracy, while open-source 4B-parameter models score about 6%, on par with supervised baselines. Accuracy alone is misleading here, because weaker models tend to guess the modal count rather than reason over time. Fine-tuning on counting with 1k samples also improves results on broader video understanding benchmarks, namely MVBench (+2.15), PerceptionTest (+1.88), and TVBench (+4.54), which suggests that counting ability is a proxy for temporal reasoning.

📝 Abstract

Large vision-language models (VLMs) can recognize what happens in video but fail to count how many times. We introduce PushupBench, 446 long-form clips (avg. 36.7s) for evaluating repetition counting. The best frontier model achieves 42.1% exact accuracy; open-source 4B models score ~6%, matching supervised baselines. We show that accuracy alone misleads: weaker models exploit the modal count rather than reason temporally. Fine-tuning on counting with 1k samples transfers to general video understanding: MVBench (+2.15), PerceptionTest (+1.88), TVBench (+4.54), suggesting counting is a proxy for broader temporal reasoning. PushupBench is incorporated in lmms-eval (https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/1262) and hosted at pushupbench.com/
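
The point that exact accuracy can be inflated by guessing the modal count can be illustrated with a minimal Python sketch. This is not the paper's evaluation code: the function names and the toy counts below are hypothetical, chosen only to show how a model that always outputs the most common count can match a constant-guess baseline without any temporal reasoning.

```python
from collections import Counter


def exact_accuracy(preds, targets):
    """Fraction of clips where the predicted count equals the true count."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def modal_count(targets):
    """Most frequent ground-truth count in the dataset."""
    return Counter(targets).most_common(1)[0][0]


def mode_baseline_accuracy(targets):
    """Exact accuracy of a constant predictor that always outputs the mode."""
    mode = modal_count(targets)
    return exact_accuracy([mode] * len(targets), targets)


def mode_prediction_rate(preds, targets):
    """Fraction of predictions that are simply the modal count."""
    mode = modal_count(targets)
    return sum(p == mode for p in preds) / len(preds)


# Hypothetical toy data (not taken from PushupBench).
targets = [10, 10, 10, 12, 15, 20, 8, 10]
preds = [10, 10, 10, 10, 10, 10, 10, 10]  # a model that always answers 10

print(exact_accuracy(preds, targets))        # 0.5
print(mode_baseline_accuracy(targets))       # 0.5
print(mode_prediction_rate(preds, targets))  # 1.0
```

In this toy case the model's exact accuracy equals the constant-guess baseline and every prediction is the mode, so the accuracy figure on its own says nothing about whether the model counted anything. Reporting the baseline and the mode-prediction rate next to accuracy is one way to expose that.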

Problem

Research questions and friction points this paper is trying to address.

repetition counting

vision-language models

video understanding

temporal reasoning

PushupBench

Innovation

Methods, ideas, or system contributions that make the work stand out.

repetition counting

vision-language models

temporal reasoning