IMoRe: Implicit Program-Guided Reasoning for Human Motion Q&A

๐Ÿ“… 2025-08-03
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing action question answering (AQA) methods rely on explicit program execution and manually designed modules, limiting scalability and generalization. To address this, we propose IMoRe, an Implicit Program-guided Reasoning framework that directly conditions reasoning on structured program functions instead of executing them explicitly. IMoRe introduces a program-guided reading mechanism that dynamically selects multi-level motion representations from a pretrained motion ViT backbone, combined with program-driven attention and an iterative memory update module, enabling multi-granularity feature extraction and unified handling of diverse query types. IMoRe achieves state-of-the-art performance on Babel-QA and generalizes well to HuMMan, a newly constructed human motion AQA benchmark. Both the code and the HuMMan dataset are publicly released.

๐Ÿ“ Abstract
Existing human motion Q&A methods rely on explicit program execution, where the requirement for manually defined functional modules may limit scalability and adaptability. To overcome this, we propose an implicit program-guided motion reasoning (IMoRe) framework that unifies reasoning across multiple query types without manually designed modules. Unlike existing implicit reasoning approaches that infer reasoning operations from question words, our model directly conditions on structured program functions, ensuring more precise execution of reasoning steps. Additionally, we introduce a program-guided reading mechanism, which dynamically selects multi-level motion representations from a pretrained motion Vision Transformer (ViT), capturing both high-level semantics and fine-grained motion cues. The reasoning module iteratively refines memory representations, leveraging structured program functions to extract relevant information for different query types. Our model achieves state-of-the-art performance on Babel-QA and generalizes to a newly constructed motion Q&A dataset based on HuMMan, demonstrating its adaptability across different motion reasoning datasets. Code and dataset are available at: https://github.com/LUNAProject22/IMoRe.
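The abstract's core idea can be sketched in miniature: instead of dispatching to hand-written modules, a memory vector is refined over several steps, each step conditioned on the embedding of a structured program function. The following is a minimal illustrative sketch, not the authors' implementation; the program vocabulary, toy embedding, and gated update rule are all assumptions made for demonstration.

```python
# Illustrative sketch of implicit program-guided reasoning (NOT the
# IMoRe implementation): memory is iteratively refined, with each
# step conditioned on a program-function embedding rather than on
# question words. All names and update rules here are hypothetical.
import math

def embed(token: str, dim: int = 4) -> list[float]:
    """Toy hash-based token embedding (stable within one process)."""
    return [math.sin(hash(token) % 1000 + i) for i in range(dim)]

def gated_update(memory: list[float], program_fn: str,
                 motion_feat: list[float]) -> list[float]:
    """One reasoning step: gate the motion features by the program
    function embedding, then blend them into memory residually."""
    p = embed(program_fn, len(memory))
    gate = [1.0 / (1.0 + math.exp(-pi * fi)) for pi, fi in zip(p, motion_feat)]
    return [m + g * f for m, g, f in zip(memory, gate, motion_feat)]

def reason(program: list[str], motion_feats: list[list[float]],
           dim: int = 4) -> list[float]:
    """Run the program's functions in order, refining memory each step."""
    memory = [0.0] * dim
    for fn, feat in zip(program, motion_feats):
        memory = gated_update(memory, fn, feat)
    return memory

# Hypothetical two-step program over per-step motion features.
program = ["filter_action", "query_direction"]
motion_feats = [[0.2, -0.1, 0.5, 0.3], [0.4, 0.0, -0.2, 0.1]]
final_memory = reason(program, motion_feats)
```

In the actual model the memory update is learned (program-driven attention over multi-level ViT features rather than a fixed sigmoid gate), but the control flow is the same: the program supplies the sequence of reasoning steps, and no per-function module is written by hand.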
Problem

Research questions and friction points this paper is trying to address.

Overcoming scalability limits in human motion Q&A methods
Unifying reasoning across query types without manual modules
Ensuring precise execution of reasoning steps via structured programs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit program-guided motion reasoning framework
Program-guided dynamic multi-level motion selection
Iterative memory refinement with structured programs
๐Ÿ”Ž Similar Papers
No similar papers found.
Chen Li
Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore
Chinthani Sugandhika
PhD Student, School of Computer Science and Engineering, Nanyang Technological University, Singapore
Computer Vision · Action Recognition · Video Understanding
Yeo Keat Ee
Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore
Eric Peh
Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore
Hao Zhang
Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore
Hong Yang
Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore
Deepu Rajan
Nanyang Technological University
Image Processing · Computer Vision
Basura Fernando
Scientist at A*STAR Singapore, Assistant Professor at NTU
Visual Reasoning · Action Prediction · Action Recognition · Transfer Learning · Embodied AI