🤖 AI Summary
In multi-agent motion generation, existing LLM-based autoregressive models suffer from a misalignment between their pretraining objective (token prediction) and human preferences, which normally necessitates costly post-training preference alignment via human annotations; such annotation is especially infeasible in large-scale, high-cost multi-agent scenarios. Conventional methods treat all model-generated samples as negative examples, ignoring the intrinsic preference rankings among trajectories and thereby distorting alignment.
Method: We propose a zero-annotation direct preference alignment framework that, for the first time, mines implicit preference signals from pretraining demonstrations to construct fine-grained trajectory-level rankings. Our approach integrates implicit preference modeling, motion trajectory contrastive learning, and lightweight reward modeling, all tailored to autoregressive architectures.
Contribution/Results: Evaluated on large-scale traffic simulation, our lightweight 1M-parameter model matches the performance of state-of-the-art large imitation-based models and significantly enhances behavioral realism, without human annotations or additional computational overhead.
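As a rough illustration of the Method paragraph above, the sketch below ranks a pretrained model's sampled trajectories by their distance to a pretraining demonstration (a hypothetical stand-in for the paper's implicit preference signal) and scores a preferred/dispreferred pair with a standard DPO-style loss. The function names, the distance-based ranking proxy, and the `beta` value are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def rank_by_demo_similarity(generated, demo):
    """Rank sampled trajectories by mean distance to a demonstration.

    A hypothetical proxy for mining implicit preferences from pretraining
    demonstrations: closer to the demo => more preferred.
    generated: (K, T, D) array of K sampled trajectories
    demo:      (T, D) reference demonstration trajectory
    Returns trajectory indices ordered from most to least preferred.
    """
    dists = np.linalg.norm(generated - demo[None], axis=-1).mean(axis=-1)  # (K,)
    return np.argsort(dists)

def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    logp_*:     sequence log-probabilities under the policy being aligned
    ref_logp_*: the same quantities under the frozen pretrained reference
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# Usage: sample K trajectories, rank them against a demonstration, then
# form (preferred, dispreferred) pairs from the ranking for the loss.
samples = np.array([[[0.0], [0.0]], [[1.0], [1.0]]])  # K=2, T=2, D=1
demo = np.array([[0.0], [0.0]])
order = rank_by_demo_similarity(samples, demo)  # sample 0 is preferred here
```

In this framing, the ranking replaces human preference annotations: any fine-grained ordering over the model's own generations yields preference pairs for post-training alignment at zero annotation cost.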
📝 Abstract
Recent advancements in LLMs have revolutionized motion generation models in embodied applications. While LLM-type auto-regressive motion generation models benefit from training scalability, there remains a discrepancy between their token prediction objectives and human preferences. As a result, models pre-trained solely with token-prediction objectives often generate behaviors that deviate from what humans would prefer, making post-training preference alignment crucial for producing human-preferred motions. Unfortunately, post-training alignment requires extensive preference rankings over motions generated by the pre-trained model, which are costly to annotate, especially in multi-agent settings. Recently, there has been growing interest in leveraging pre-training demonstrations to scalably generate preference data for post-training alignment. However, these methods often adopt an adversarial assumption, treating all samples generated by the pre-trained model as unpreferred. This adversarial approach overlooks the valuable signal provided by preference rankings among the model's own generations, ultimately reducing alignment effectiveness and potentially leading to misaligned behaviors. In this work, instead of treating all generated samples as equally bad, we leverage implicit preferences encoded in pre-training demonstrations to construct preference rankings among the pre-trained model's generations, offering more nuanced preference alignment guidance at zero human cost. We apply our approach to large-scale traffic simulation and demonstrate its effectiveness in improving the realism of the pre-trained model's generated behaviors, making a lightweight 1M-parameter motion generation model comparable to SOTA large imitation-based models by relying solely on implicit feedback from pre-training demonstrations, without additional post-training human preference annotations or high computational costs.