🤖 AI Summary
To address the accuracy-efficiency trade-off in head pose estimation on edge devices, this paper proposes Grouped Attention Deep Sets (GADS), a lightweight architecture. GADS semantically groups facial landmarks into regional clusters and integrates lightweight Deep Sets layers with grouped multi-head attention to enable efficient cross-group feature fusion. Two inference paradigms are introduced: a vanilla landmark-only variant and a Hybrid-GADS variant that fuses RGB image features. The model achieves state-of-the-art (SOTA) accuracy on AFLW2000, BIWI, and 300W-LP while reducing parameter count to just 1/7.5 of the previous lightest SOTA method, accelerating inference by 25×, and remaining 4321× smaller than the best-performing model. These gains underscore GADS's effectiveness in balancing computational efficiency and estimation precision for resource-constrained deployment.
📝 Abstract
In human-computer interaction, head pose estimation profoundly influences application functionality. Although utilizing facial landmarks is valuable for this purpose, existing landmark-based methods prioritize precision over simplicity and model size, limiting their deployment on edge devices and in compute-poor environments. To bridge this gap, we propose **Grouped Attention Deep Sets (GADS)**, a novel architecture based on the Deep Set framework. By grouping landmarks into regions and employing small Deep Set layers, we reduce computational complexity. Our multi-head attention mechanism extracts and combines inter-group information, resulting in a model that is 7.5× smaller and executes 25× faster than the current lightest state-of-the-art model. Notably, our method achieves an impressive reduction, being 4321× smaller than the best-performing model. We introduce vanilla GADS and Hybrid-GADS (landmarks + RGB) and evaluate our models on three benchmark datasets -- AFLW2000, BIWI, and 300W-LP. We envision our architecture as a robust baseline for resource-constrained head pose estimation methods.
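The core idea — per-region Deep Sets encoders whose group embeddings are fused by attention — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the group sizes, layer widths, and random weights are made up, and a single attention head stands in for the grouped multi-head mechanism.

```python
import numpy as np

def deep_set_group(landmarks, W_phi, W_rho):
    """Encode one landmark group: per-point MLP, sum-pool, then a group MLP.
    Sum pooling makes the embedding invariant to landmark ordering."""
    h = np.maximum(landmarks @ W_phi, 0.0)   # phi: applied to each point independently
    pooled = h.sum(axis=0)                   # permutation-invariant aggregation
    return np.maximum(pooled @ W_rho, 0.0)   # rho: transform the pooled summary

def attention_over_groups(G, Wq, Wk, Wv):
    """Single-head self-attention across group embeddings G of shape (num_groups, d)."""
    Q, K, V = G @ Wq, G @ Wk, G @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
d = 16
# Hypothetical regional grouping of 68 2-D landmarks (e.g. eyes, nose, mouth, jaw);
# the actual grouping used in the paper may differ.
groups = [rng.normal(size=(n, 2)) for n in (12, 12, 9, 20, 15)]
W_phi, W_rho = rng.normal(size=(2, d)), rng.normal(size=(d, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

G = np.stack([deep_set_group(g, W_phi, W_rho) for g in groups])  # (5, d) group embeddings
fused = attention_over_groups(G, Wq, Wk, Wv)                     # cross-group fusion, (5, d)
pose = fused.mean(axis=0) @ rng.normal(size=(d, 3))              # toy head: yaw, pitch, roll
print(pose.shape)  # (3,)
```

The Deep Sets structure is what keeps the model small: the per-point encoder is shared across landmarks within a group, so parameter count is independent of how many landmarks a region contains.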