🤖 AI Summary
Existing video avatars maintain identity consistency and motion alignment but lack environmental adaptability and long-horizon, goal-directed autonomy. This paper introduces the first video avatar framework supporting active intelligence, built upon a closed-loop OTAR cycle (Observe–Think–Act–Reflect) and a dual-system hierarchical cognitive architecture. It integrates Partially Observable Markov Decision Process (POMDP) modeling, online belief updating, an internal world model (IWM), multi-granularity action captioning, and real-time generative output verification—enabling robust state tracking under uncertainty and synergistic strategic-executive reasoning. Evaluated on the L-IVA benchmark, our framework achieves significant improvements in task success rate and behavioral coherence. It is the first to accomplish open-domain, multi-step autonomous task completion, advancing video avatars from passive motion reproduction toward goal-oriented intelligent agents.
📝 Abstract
Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe–Think–Act–Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture in which System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.
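To make the POMDP framing concrete, the following is a minimal, purely illustrative sketch of one OTAR cycle with discrete Bayesian belief updating. All state, observation, and probability values here are hypothetical toy examples; ORCA's actual belief space is open-domain and its actions are model-specific captions rather than symbolic labels.

```python
# Illustrative sketch only: one OTAR (Observe-Think-Act-Reflect) cycle with
# discrete Bayesian belief updating over a toy two-state POMDP.
# All names and probabilities are hypothetical, not taken from the paper.

STATES = ["holding_cup", "cup_on_table"]

# P(observation | state): generative outputs are stochastic, so the rendered
# frame only probabilistically reflects the intended state.
OBS_MODEL = {
    "holding_cup":  {"cup_in_hand": 0.9, "cup_on_table": 0.1},
    "cup_on_table": {"cup_in_hand": 0.2, "cup_on_table": 0.8},
}

def update_belief(belief, obs):
    """Observe step: Bayesian update, posterior ∝ likelihood × prior."""
    post = {s: OBS_MODEL[s][obs] * p for s, p in belief.items()}
    z = sum(post.values())
    return {s: p / z for s, p in post.items()}

def otar_step(belief, predicted_obs, actual_obs):
    """One cycle after an action has been rendered: Reflect by checking the
    predicted outcome against the actual generation, then fold the actual
    observation back into the belief state."""
    verified = (predicted_obs == actual_obs)   # outcome verification
    return update_belief(belief, actual_obs), verified

# Start from an uninformative prior, predict "cup_in_hand", and observe it.
belief = {s: 0.5 for s in STATES}
belief, ok = otar_step(belief, "cup_in_hand", "cup_in_hand")
```

The key property this sketch illustrates is the closed loop: the belief is never assumed correct after acting; it is re-estimated from the actual generated output, which is what allows state tracking to stay robust under generative uncertainty.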