Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction

📅 2025-07-20
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the Visual Planning for Assistance (VPA) task: predicting the structured action sequence required to achieve a given goal, based on a video of the user's progress. Key challenges include: (1) severe scarcity of procedural annotations, hindering learning of action dynamics; and (2) limitations of standard single-token language modeling in capturing the discrete, ordered nature of action spaces. To tackle these, the authors propose an auxiliary-task-enhanced framework with multi-head, multi-token prediction: it introduces auxiliary objectives (e.g., goal prediction) to alleviate data scarcity and explicitly models the structured semantics of action sequences. Built upon multimodal large language models (MLLMs), the approach achieves +7.3% and +3.4% absolute improvements in 3-step action prediction accuracy on COIN and CrossTask, respectively, and attains state-of-the-art performance on Ego4D, demonstrating significantly enhanced long-horizon visual planning capability.

๐Ÿ“ Abstract
Visual Planning for Assistance (VPA) aims to predict a sequence of user actions required to achieve a specified goal based on a video showing the user's progress. Although recent advances in multimodal large language models (MLLMs) have shown promising results in video understanding, long-horizon visual planning remains a challenging problem. We identify two challenges in training large MLLMs for video-based planning tasks: (1) scarcity of procedural annotations, limiting the model's ability to learn procedural task dynamics effectively, and (2) inefficiency of the next-token prediction objective in explicitly capturing the structured action space of visual planning, as compared to free-form natural language. To tackle data scarcity, we introduce Auxiliary Task Augmentation. We design and train our model on auxiliary tasks relevant to long-horizon video-based planning (e.g., goal prediction) to augment the model's planning ability. To more explicitly model the structured action space unique to visual planning tasks, we leverage Multi-token Prediction, extending traditional next-token prediction by using multiple heads to predict multiple future tokens during training. Our approach, VideoPlan, achieves state-of-the-art VPA performance on the COIN and CrossTask datasets, surpassing prior methods by 7.3% and 3.4%, respectively, when predicting 3 future actions. We further extend our method to the challenging Ego4D Long-term Action Anticipation task, and show that it is on par with state-of-the-art approaches despite not using specialized egocentric features. Code will be made available.
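To make the multi-token prediction idea concrete, here is a minimal, illustrative sketch of the training objective described in the abstract: a shared trunk representation feeds several parallel heads, with head *i* predicting the token at offset *t+1+i*, and the per-head cross-entropy losses summed. This is not the paper's implementation; the head count, dimensions, and the NumPy stand-in for the MLLM trunk are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, HIDDEN, N_HEADS = 50, 16, 3  # 3 heads -> predict 3 future tokens per position

# Hypothetical parameters: in the paper the trunk is an MLLM; here a random
# hidden vector stands in for the trunk's output at the current position t.
hidden = rng.normal(size=HIDDEN)
heads = [rng.normal(scale=0.1, size=(VOCAB, HIDDEN)) for _ in range(N_HEADS)]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Multi-token prediction loss: one cross-entropy term per head,
# where head i is supervised with the ground-truth token at t+1+i.
future_tokens = [4, 17, 9]  # hypothetical ground-truth action tokens at t+1..t+3
loss = 0.0
for head, tok in zip(heads, future_tokens):
    probs = softmax(head @ hidden)   # per-head distribution over the vocabulary
    loss += -np.log(probs[tok])      # cross-entropy for this future offset

print(f"combined multi-token loss: {loss:.3f}")
```

At inference time, standard next-token decoding can still be used (extra heads are a training-time signal), which is one common way multi-token prediction objectives are deployed.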
Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of procedural annotations in video-based planning
Improving next-token prediction for structured action space modeling
Enhancing long-horizon visual planning with auxiliary tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Auxiliary Task Augmentation enhances procedural learning
Multi-token Prediction captures structured action space
VideoPlan achieves state-of-the-art VPA performance