From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing

📅 2026-03-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multimodal large language model (MLLM)-based face anti-spoofing (FAS) methods generalize poorly across domains because they rely on coarse-grained semantic descriptions and struggle to capture fine-grained forgery cues. To address this, the paper proposes TAR-FAS, a framework that reformulates FAS as a chain-of-thought reasoning process augmented with visual tools: the model starts from intuitive observations and adaptively invokes external tools for in-depth analysis of subtle spoofing artifacts. The work makes three contributions: what the authors describe as the first tool-augmented multi-turn reasoning paradigm for FAS; ToolFAS-16K, a dataset of tool-usage trajectories; and Diverse-Tool Group Relative Policy Optimization (DT-GRPO), an algorithm that trains the model to invoke diverse visual tools efficiently and autonomously. Under the challenging one-to-eleven (1-vs-11) cross-domain protocol, TAR-FAS achieves state-of-the-art performance and improves both generalization and interpretability.
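For readers unfamiliar with the optimization family that DT-GRPO builds on, the sketch below shows the group-relative advantage used in standard Group Relative Policy Optimization: each sampled response is scored against the mean and standard deviation of its own group. This is plain GRPO only. The summary does not specify how the diverse-tool component of DT-GRPO shapes the reward, so that part is deliberately left out, and the function name and `eps` parameter are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group (standard GRPO).

    NOTE: this is NOT the paper's DT-GRPO; any diverse-tool reward term
    is not described in the summary and is therefore not modeled here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses sampled for the same input; reward 1.0 means a correct verdict.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that beat their group average receive a positive advantage and the rest a negative one, so no separate value network is needed to form a baseline.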

📝 Abstract

Face recognition remains vulnerable to presentation attacks, calling for robust Face Anti-Spoofing (FAS) solutions. Recent MLLM-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., mask contours) while struggling to perceive fine-grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool-Augmented Reasoning FAS (TAR-FAS) framework, which reformulates the FAS task as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine-grained investigation. To this end, we design a tool-augmented data annotation pipeline and construct the ToolFAS-16K dataset, which contains multi-turn tool-use reasoning trajectories. Furthermore, we introduce a tool-aware FAS training pipeline, where Diverse-Tool Group Relative Policy Optimization (DT-GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one-to-eleven cross-domain protocol demonstrate that TAR-FAS achieves SOTA performance while providing fine-grained visual investigation for trustworthy spoof detection.
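The Chain-of-Thought with Visual Tools idea described above can be pictured as a multi-turn loop: the model states an observation, optionally names a tool, sees the tool output, and repeats until it commits to a verdict. The following is a minimal sketch under stated assumptions, not the authors' implementation. The tool names (`zoom_center`, `horizontal_gradient`), the `Step` structure, the `toy_policy` stand-in for the MLLM, and the 0.5 threshold are all hypothetical and chosen only to make the control flow runnable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Image = List[List[float]]  # grayscale image as nested lists (illustrative only)


@dataclass
class Step:
    thought: str                  # free-text reasoning for this turn
    tool: Optional[str] = None    # tool to invoke next, if any
    answer: Optional[str] = None  # "live" or "spoof" once the model decides


def zoom_center(img: Image) -> Image:
    """Hypothetical tool: crop the central region for a closer look."""
    h, w = len(img), len(img[0])
    return [row[w // 4: w - w // 4] for row in img[h // 4: h - h // 4]]


def horizontal_gradient(img: Image) -> Image:
    """Hypothetical tool: absolute difference between neighboring pixels."""
    return [[abs(row[i + 1] - row[i]) for i in range(len(row) - 1)] for row in img]


TOOLS: Dict[str, Callable[[Image], Image]] = {
    "zoom_center": zoom_center,
    "horizontal_gradient": horizontal_gradient,
}

Policy = Callable[[Image, List[Step]], Step]


def run_episode(image: Image, policy: Policy, max_turns: int = 4) -> Tuple[str, List[Step]]:
    """Alternate between reasoning steps and tool calls until an answer appears."""
    history: List[Step] = []
    current = image
    for _ in range(max_turns):
        step = policy(current, history)
        history.append(step)
        if step.answer is not None:
            return step.answer, history
        if step.tool not in TOOLS:
            break  # unknown or missing tool: stop instead of guessing
        current = TOOLS[step.tool](current)
    return "undecided", history


def toy_policy(img: Image, history: List[Step]) -> Step:
    """Stand-in for the MLLM: one tool call, then a threshold decision."""
    if not history:
        return Step("First impression is inconclusive; inspect local edges.",
                    tool="horizontal_gradient")
    mean_val = sum(map(sum, img)) / (len(img) * len(img[0]))
    # The 0.5 threshold is arbitrary and exists only to end the episode.
    return Step(f"Mean gradient is {mean_val:.2f}.",
                answer="spoof" if mean_val > 0.5 else "live")


verdict, trace = run_episode([[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]], toy_policy)
```

Keeping the tool registry separate from the policy means the same loop can record which tools were called on each turn, which is the kind of trajectory a dataset such as ToolFAS-16K is said to contain.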

Problem

Research questions and friction points this paper is trying to address.

Face Anti-Spoofing

cross-domain generalization

fine-grained visual patterns

presentation attacks

MLLM

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-Augmented Reasoning

Chain-of-Thought with Visual Tools

Face Anti-Spoofing