AI Summary
Current embodied intelligence systems lack a unified benchmark for jointly evaluating perception, decision-making, and dexterous manipulation in dynamic tabletop environments. This work proposes DexHoldem, the first embodied evaluation framework that combines the complex rule structure of Texas Hold'em poker with a high-degree-of-freedom dexterous hand (ShadowHand). The framework establishes a closed-loop assessment protocol through 14 manipulation primitives and 1,470 teleoperated demonstrations, encompassing embodied perception, policy execution, and state recovery. Experiments show that the π₀.₅ policy achieves the highest task completion rate (61.2%), while Opus 4.7 and GPT 5.5 lead on strict problem-level accuracy (34.3%) and average field-wise accuracy (66.8%), respectively. The results further reveal a significant gap between isolated sub-module capabilities and complete state recovery.
Abstract
Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context-appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real-world system-level benchmark built around Texas Hold'em dexterous manipulation with a ShadowHand. DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision making. On primitive execution, $\pi_{0.5}$ obtains the highest task completion rate ($61.2\%$), while $\pi_{0.5}$ and $\pi_0$ tie on scene-preserving success rate ($47.5\%$). On agentic perception, Opus 4.7 obtains the best strict problem-level accuracy ($34.3\%$), while GPT 5.5 obtains the best average field-wise accuracy ($66.8\%$), exposing a gap between isolated visual sub-capabilities and complete routing-relevant state recovery. Finally, we instantiate the full embodied-agent loop in three case studies, where waiting, recovery dispatches, human-help requests, and repeated primitive execution reveal how perception and policy errors accumulate during closed-loop deployment. DexHoldem therefore evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. Project page: https://dexholdem.github.io/Dexholdem/.
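The gap between strict problem-level accuracy and average field-wise accuracy can be illustrated with a small sketch. It is not the benchmark's scoring code: the field names (`board_cards`, `pot_chips`, `to_act`), the toy data, and the uniform averaging over fields are all assumptions made for illustration, since the abstract does not specify the game-state schema or the exact averaging scheme.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical structured game state: field name -> recovered value.
FieldValue = Union[int, str, Tuple[str, ...]]
GameState = Dict[str, FieldValue]


def strict_problem_accuracy(preds: List[GameState], golds: List[GameState]) -> float:
    """Fraction of problems in which every gold field is recovered exactly."""
    correct = sum(
        all(pred.get(field) == value for field, value in gold.items())
        for pred, gold in zip(preds, golds)
    )
    return correct / len(golds)


def average_field_accuracy(preds: List[GameState], golds: List[GameState]) -> float:
    """Per-field accuracy, averaged uniformly over fields (assumed scheme)."""
    fields = list(golds[0].keys())
    per_field = [
        sum(pred.get(field) == gold[field] for pred, gold in zip(preds, golds)) / len(golds)
        for field in fields
    ]
    return sum(per_field) / len(per_field)


# Toy ground truth and predictions (invented for illustration only).
golds: List[GameState] = [
    {"board_cards": ("Ah", "Kd", "7c"), "pot_chips": 30, "to_act": "robot"},
    {"board_cards": ("2s", "2h", "9d"), "pot_chips": 50, "to_act": "human"},
    {"board_cards": ("Qc", "Js", "Td"), "pot_chips": 80, "to_act": "robot"},
]
preds: List[GameState] = [
    {"board_cards": ("Ah", "Kd", "7c"), "pot_chips": 30, "to_act": "robot"},  # all fields right
    {"board_cards": ("2s", "2h", "9d"), "pot_chips": 40, "to_act": "human"},  # pot miscounted
    {"board_cards": ("Qc", "Js", "Tc"), "pot_chips": 80, "to_act": "robot"},  # one card misread
]

strict = strict_problem_accuracy(preds, golds)
fieldwise = average_field_accuracy(preds, golds)
print(f"strict problem-level accuracy: {strict:.3f}")
print(f"average field-wise accuracy:   {fieldwise:.3f}")
```

In this toy case the agent gets 7 of 9 field values right, yet only 1 of 3 problems is fully correct, because a single wrong field invalidates the whole recovered state. This is the same kind of gap the abstract reports between field-wise and strict problem-level scores.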