🤖 AI Summary
This work addresses a limitation of existing language-guided robotic manipulation methods: they do not explicitly model task-completion states, so they struggle to reliably assess whether an action satisfies both spatial and semantic goals in partially observable environments. The authors propose Forecast-GS, a framework that introduces, for the first time, a predictive 3D Gaussian Splatting representation of task-completion states. By integrating language-conditioned 3D scene modeling, forecast-aware Gaussian Splatting representations, and an automated action ranking mechanism, Forecast-GS establishes an interpretable bridge among language, perception, and action, enabling forward-looking decision-making based on predicted future states. Evaluated on three real-world tasks (Cutter-to-Box, Apple-to-Bowl, and Sponge-to-Tray), the method achieves success rates of 21/25, 23/25, and 16/25 with automatic candidate selection, outperforming the ReKep baseline on all three tasks; in a diagnostic human-assisted setting, success rates rise further to 23/25, 24/25, and 19/25.
📝 Abstract
We introduce Forecast-aware Gaussian Splatting (Forecast-GS), a predictive 3D representation framework for language-conditioned robotic manipulation. While recent manipulation systems have made progress by grounding language instructions into robot affordances, value maps, or relational keypoint constraints, they usually reason over the current scene and do not explicitly model the task-completed state. This limitation is critical when success depends on satisfying spatial and semantic goals under partial observations, where the robot must evaluate whether a candidate action leads to a feasible task-consistent outcome.
We validate Forecast-GS on three real-world pick-and-place manipulation tasks: Cutter-to-Box, Apple-to-Bowl, and Sponge-to-Tray. For each task, we conduct 25 real-world trials under varied initial object configurations using the same robot platform and sensing setup. Forecast-GS with automatic candidate selection achieves success rates of 21/25, 23/25, and 16/25 on the three tasks, respectively, outperforming the ReKep baseline, which achieves 15/25, 19/25, and 10/25. A diagnostic human-assisted setting further improves success rates to 23/25, 24/25, and 19/25, suggesting that candidate generation is effective while automatic ranking remains imperfect. These results suggest that explicitly forecasting task-completed 3D states enables more reliable action evaluation, and the remaining gap between automatic and human-assisted selection marks robust final-state ranking as an important open challenge for fully autonomous manipulation. Overall, Forecast-GS provides an interpretable bridge between language understanding, 3D perception, and robotic manipulation planning.
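To make the reported counts easier to compare, the following minimal sketch converts them to percentages and computes the per-task gaps. The success counts and the 25-trial total are taken from the abstract; the names `SUCCESSES`, `rate`, and `summarize` are illustrative assumptions and are not part of Forecast-GS or ReKep.

```python
# Success counts out of 25 trials per task, as reported in the abstract.
# The dictionary layout and function names are illustrative assumptions.
TRIALS = 25

SUCCESSES = {
    "Cutter-to-Box":  {"rekep": 15, "auto": 21, "human": 23},
    "Apple-to-Bowl":  {"rekep": 19, "auto": 23, "human": 24},
    "Sponge-to-Tray": {"rekep": 10, "auto": 16, "human": 19},
}


def rate(successes, trials=TRIALS):
    """Return the success rate as a percentage."""
    return 100.0 * successes / trials


def summarize(results):
    """Compute percentages and trial-count gaps for each task."""
    summary = {}
    for task, counts in results.items():
        summary[task] = {
            "rekep_pct": rate(counts["rekep"]),
            "auto_pct": rate(counts["auto"]),
            "human_pct": rate(counts["human"]),
            # Extra successful trials of automatic selection over the baseline.
            "auto_minus_rekep": counts["auto"] - counts["rekep"],
            # Trials recovered when a human picks among the candidates.
            "human_minus_auto": counts["human"] - counts["auto"],
        }
    return summary


if __name__ == "__main__":
    for task, row in summarize(SUCCESSES).items():
        print(
            f"{task}: ReKep {row['rekep_pct']:.0f}%, "
            f"auto {row['auto_pct']:.0f}%, "
            f"human-assisted {row['human_pct']:.0f}% "
            f"(+{row['auto_minus_rekep']} trials vs. ReKep, "
            f"+{row['human_minus_auto']} trials from human selection)"
        )
```

With 25 trials per condition, each trial corresponds to 4 percentage points, so small differences between conditions should be read with that granularity in mind.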