🤖 AI Summary
To address the poor robustness of facial landmark detection under occlusion, extreme illumination, and large pose variations, this paper proposes a novel Transformer-based method. The core innovation is the introduction of learnable “Messenger Tokens” that explicitly model feature compensation from visible regions to occluded ones, enabling consistent modeling of local patches and global context while supporting automatic occlusion identification and feature recovery. The method integrates self-attention–based feature aggregation, heatmap generation, and landmark regression. Evaluated on challenging benchmarks—including WFLW and COFW—it performs favorably against state-of-the-art methods. The generated heatmaps exhibit strong occlusion invariance, and the overall framework substantially improves both the accuracy and robustness of facial landmark detection.
📝 Abstract
Although facial landmark detection (FLD) has made significant progress, existing FLD methods still suffer from performance drops on partially non-visible faces, such as faces with occlusions or under extreme lighting conditions or poses. To address this issue, we introduce ORFormer, a novel transformer-based method that can detect non-visible regions and recover their missing features from visible parts. Specifically, ORFormer associates each image patch token with one additional learnable token called the messenger token. The messenger token aggregates features from all patches except its own. This way, the consensus between a patch and the other patches can be assessed by referring to the similarity between its regular and messenger embeddings, enabling non-visible region identification. Our method then recovers occluded patches with features aggregated by the messenger tokens. Leveraging the recovered features, ORFormer compiles high-quality heatmaps for the downstream FLD task. Extensive experiments show that our method generates heatmaps resilient to partial occlusions. By integrating the resultant heatmaps into existing FLD methods, our method performs favorably against the state of the art on challenging datasets such as WFLW and COFW.
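The messenger-token mechanism described above can be sketched in a few lines of attention arithmetic. The following minimal NumPy sketch is an illustration, not the paper's actual architecture: the function name `messenger_attention`, the single-head dot-product attention, the cosine-similarity test, and the threshold `tau` are all assumptions made for clarity. It shows the core idea of pairing each patch with a messenger that attends to every patch except its own, flagging patches whose regular and messenger embeddings disagree, and recovering the flagged patches from the messenger aggregation.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def messenger_attention(X, tau=0.5):
    """Illustrative sketch of ORFormer-style messenger tokens.

    X:   (N, d) patch embeddings.
    tau: hypothetical similarity threshold below which a patch is
         treated as non-visible (not a value from the paper).
    Returns (recovered embeddings, boolean occlusion mask).
    """
    N, d = X.shape
    scores = (X @ X.T) / np.sqrt(d)          # (N, N) attention logits

    # Regular tokens: attend to all patches, self included.
    regular = softmax(scores, axis=-1) @ X

    # Messenger tokens: identical attention, but each token's own
    # patch is masked out, so the result reflects the other patches.
    masked = scores.copy()
    np.fill_diagonal(masked, -np.inf)
    messenger = softmax(masked, axis=-1) @ X

    # Consensus check: cosine similarity between the two embeddings.
    sim = np.sum(regular * messenger, axis=-1) / (
        np.linalg.norm(regular, axis=-1)
        * np.linalg.norm(messenger, axis=-1) + 1e-8)

    occluded = sim < tau                     # low consensus -> likely non-visible
    out = regular.copy()
    out[occluded] = messenger[occluded]      # recover from visible parts
    return out, occluded
```

In the full model, these recovered features would feed the heatmap generation stage; here the sketch stops at feature recovery, which is the part the messenger tokens are responsible for.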