FaceInsight: A Multimodal Large Language Model for Face Perception

📅 2025-04-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing general-purpose multimodal large language models (MLLMs) exhibit limitations in facial perception tasks, including inaccurate responses and semantic distortions. To address these issues, we propose Face-MLLM—the first MLLM specifically designed for fine-grained facial understanding. Our approach introduces three key innovations: (i) facial-knowledge-driven vision–language alignment modeling, explicitly encoding deterministic and uncertain dependencies among facial features; (ii) integration of face segmentation maps as a structured auxiliary modality to enhance local perceptual capability; and (iii) support for both zero-shot and fine-tuning evaluation paradigms. Evaluated across three facial perception tasks, Face-MLLM consistently outperforms nine state-of-the-art baseline models under both zero-shot and fine-tuned settings, achieving new SOTA performance. It significantly improves the accuracy and robustness of facial semantic understanding while preserving structural and contextual fidelity.

Technology Category

Application Category

📝 Abstract
Recent advances in multimodal large language models (MLLMs) have demonstrated strong capabilities in understanding general visual content. However, these general-domain MLLMs perform poorly in face perception tasks, often producing inaccurate or misleading responses to face-specific queries. To address this gap, we propose FaceInsight, the versatile face perception MLLM that provides fine-grained facial information. Our approach introduces visual-textual alignment of facial knowledge to model both uncertain dependencies and deterministic relationships among facial information, mitigating the limitations of language-driven reasoning. Additionally, we incorporate face segmentation maps as an auxiliary perceptual modality, enriching the visual input with localized structural cues to enhance semantic understanding. Comprehensive experiments and analyses across three face perception tasks demonstrate that FaceInsight consistently outperforms nine compared MLLMs under both training-free and fine-tuned settings.
Problem

Research questions and friction points this paper is trying to address.

Improves face perception in multimodal language models
Addresses inaccurate responses to face-specific queries
Enhances semantic understanding with face segmentation maps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual-textual alignment for facial knowledge
Face segmentation maps as auxiliary modality
Fine-grained facial information modeling
🔎 Similar Papers
No similar papers found.
Jingzhi Li
Jingzhi Li
University of Science and Technology Beijing
Face PrivacyTrustworthy AI
C
Changjiang Luo
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Ruoyu Chen
Ruoyu Chen
Institute of Information Engineering, Chinese Academy of Sciences.
Explainable AITrustworthy AIFoundation Model
H
Hua Zhang
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
W
Wenqi Ren
School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University
J
Jianhou Gan
Key Laboratory of Education Informatization for Nationalities (Yunnan Normal University), Ministry of Education
Xiaochun Cao
Xiaochun Cao
Sun Yat-sen University
Computer VisionArtificial IntelligenceMultimediaMachine Learning