🤖 AI Summary
This work addresses the limited generalization of conventional monocular head pose estimation methods, which regress an absolute pose and are therefore tied to a dataset-specific implicit reference frame. The paper reframes the task, for the first time, as relative pose prediction: the model estimates the rigid transformation between two image frames, and an explicit pose anchor removes the dependence on a fixed coordinate system. Because the anchor can be chosen at inference time, the formulation also lets the application adjust prediction difficulty. Built upon a general-purpose geometric foundation model and fine-tuned exclusively on synthetic facial renderings, the proposed approach outperforms absolute regression methods on the BIWI benchmark, even though those methods are trained on real or mixed data. Additional experiments on easy and hard sample pairs indicate that the advantage of relative prediction grows as the target pose becomes more difficult.
📝 Abstract
Monocular head pose estimation is traditionally formulated as direct regression from a single image to an absolute pose. This paradigm forces the network to implicitly internalize a dataset-specific canonical reference frame. In this work, we argue that predicting the relative rigid transformation between two observed head configurations is a fundamentally easier and more robust formulation. We introduce VGGT-HPE, a relative head pose estimator built upon a general-purpose geometry foundation model. Fine-tuned exclusively on synthetic facial renderings, our method sidesteps the need for an implicit anchor by reducing the problem to estimating a geometric displacement from an explicitly provided anchor with a known pose. As a practical benefit, the relative formulation also allows the anchor to be chosen at test time (for instance, a near-neutral frame or a temporally adjacent one), so that the prediction difficulty can be controlled by the application. Despite using zero real-world training data, VGGT-HPE achieves state-of-the-art results on the BIWI benchmark, outperforming established absolute regression methods trained on mixed and real datasets. Through controlled easy- and hard-pair benchmarks, we also systematically validate our core hypothesis: relative prediction is intrinsically more accurate than absolute regression, with the advantage growing with the difficulty of the target pose. Project page and code: https://vasilikivas.github.io/VGGT-HPE
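To make the anchor idea concrete, the following is a minimal sketch of how a relative prediction could be turned back into an absolute head pose, and of why the choice of anchor changes the size of the displacement the network must estimate. It is not the VGGT-HPE implementation: the abstract does not state the composition convention, so the left-multiplication used here is an assumption, only the rotational part of the rigid transformation is shown, and all function names are hypothetical.

```python
import numpy as np


def rot_y(deg):
    """Rotation matrix for a yaw of `deg` degrees about the y-axis."""
    t = np.radians(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def relative_rotation(R_anchor, R_target):
    """Rotation that takes the anchor orientation to the target orientation.

    Assumed convention: R_target = R_rel @ R_anchor.
    """
    return R_target @ R_anchor.T


def compose_absolute(R_anchor, R_rel):
    """Recover the absolute target orientation from a known anchor pose
    and a relative rotation (for example, one predicted by a network)."""
    return R_rel @ R_anchor


def geodesic_deg(R_a, R_b):
    """Angle in degrees of the rotation separating R_a and R_b."""
    cos_angle = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


# Illustrative poses only: a target at 40 degrees of yaw, seen from two anchors.
R_target = rot_y(40.0)
R_near_neutral = rot_y(5.0)   # near-neutral anchor: large displacement
R_adjacent = rot_y(38.0)      # temporally adjacent anchor: small displacement

for R_anchor in (R_near_neutral, R_adjacent):
    R_rel = relative_rotation(R_anchor, R_target)
    # Displacement the estimator would have to predict for this anchor.
    displacement = geodesic_deg(np.eye(3), R_rel)
    # Composing with the known anchor pose gives back the absolute pose.
    assert np.allclose(compose_absolute(R_anchor, R_rel), R_target)
```

In this toy setup the temporally adjacent anchor leaves a much smaller rotation to estimate than the near-neutral one, which is the sense in which the anchor choice lets an application control prediction difficulty.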