🤖 AI Summary
Guitar playing demands high temporal precision and coordinated bimanual control, posing significant challenges for end-to-end robotic imitation and reinforcement learning due to the curse of dimensionality in joint policy optimization.
Method: We propose a dual-agent collaborative reinforcement learning framework that decouples left- and right-arm policies for independent training, then enforces temporal synchronization via latent-space alignment—bypassing direct high-dimensional joint policy optimization. The approach integrates motion-capture-driven behavioral modeling, MuJoCo-based physics simulation, and joint rhythm-fingering constraints to enable generalization from unstructured motion data to unseen guitar tablatures.
Contribution/Results: We introduce the first “train-separately–align-in-latent” paradigm for bimanual music execution, achieving 3.2× faster training convergence versus joint training and improving motion naturalness (41% reduction in Fréchet Inception Distance). Crucially, our method is the first to realize end-to-end physically grounded control for complex chord formation and millisecond-precise plucking rhythms.
📝 Abstract
We present a novel approach to synthesize dexterous motions for physically simulated hands in tasks that require coordination between the control of two hands with high temporal precision. Instead of directly learning a joint policy to control two hands, our approach performs bimanual control through cooperative learning where each hand is treated as an individual agent. The individual policies for each hand are first trained separately, and then synchronized through latent space manipulation in a centralized environment to serve as a joint policy for two-hand control. By doing so, we avoid directly performing policy learning in the joint state-action space of two hands with higher dimensions, greatly improving the overall training efficiency. We demonstrate the effectiveness of our proposed approach in the challenging guitar-playing task. The virtual guitarist trained by our approach can synthesize motions from unstructured reference data of general guitar-playing practice motions, and accurately play diverse rhythms with complex chord pressing and string picking patterns based on the input guitar tabs that do not exist in the references. Along with this paper, we provide the motion capture data that we collected as the reference for policy training. Code is available at: https://pei-xu.github.io/guitar.