🤖 AI Summary
This work addresses the limitations of existing synthetic image detection methods, which predominantly rely on end-to-end classification or monolithic reasoning and struggle to model structured forensic reasoning and heterogeneous visual evidence. To overcome these challenges, the authors propose a cognition-inspired multi-skill reasoning framework that decomposes detection into an explicit, configurable sequence of cognitive skills: first extracting perceptual clues, then selecting the optimal forensic skill, and finally performing evidence extraction and decision-making through a skill-conditioned toolchain. The study introduces ClueAegis-Bench, a new evaluation benchmark annotated with forensic cognitive skills, and ClueAegis, a two-stage agentic framework that combines clue-driven heuristic skill selection with evidence-guided reasoning. Experiments demonstrate that the proposed approach achieves state-of-the-art performance across multiple metrics, improving cross-domain generalization and robustness while yielding interpretable reasoning trajectories and structured forensic evidence.
📝 Abstract
The rapid advancement of generative models has made synthetic images increasingly realistic, challenging reliable detection. Existing methods are often limited to end-to-end classification or monolithic reasoning, and thus fail to model structured forensic reasoning and heterogeneous visual evidence. We revisit synthetic image detection from a cognitive perspective and propose a \textit{Heuristic-to-Reasoning} cognitive skill learning framework for evidence-based forensic analysis. Given an input image, our framework first extracts heuristic perceptual clues, selects the optimal forensic skill, and then performs skill-conditioned reasoning for evidence extraction and decision making. To support this paradigm, we introduce \textbf{ClueAegis-Bench}, which decomposes synthetic image detection into explicitly annotated forensic cognitive skills for structured evaluation beyond binary classification. Based on this benchmark, we propose \textbf{ClueAegis} (\underline{C}ognitive-skill \underline{L}earning for \underline{U}nified \underline{E}vidence-based Synthetic Image Detection), a two-stage agentic framework that conducts heuristic skill selection followed by evidence-guided reasoning through skill-conditioned toolchains. This design reformulates synthetic image detection as a configurable multi-skill reasoning process that bridges perception, skill selection, and forensic reasoning. Extensive experiments show that ClueAegis achieves state-of-the-art performance while improving cross-domain generalization and robustness. It also provides transparent reasoning trajectories and structured forensic evidence, offering a more explainable alternative to conventional end-to-end detectors.
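The control flow described in the abstract (perceptual clues, then skill selection, then skill-conditioned reasoning over a toolchain) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name below (`Evidence`, `Verdict`, `SKILLS`, the `check_*` tools, the two skill names, the feature keys, and the 0.5 threshold) is a hypothetical placeholder, and an "image" is represented as a dictionary of precomputed toy scores rather than pixels. In the actual system, these steps are presumably carried out by learned models rather than hand-written rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for an image: a dict of precomputed toy scores in [0, 1].
Image = Dict[str, float]


@dataclass
class Evidence:
    tool: str       # name of the tool that produced this finding
    finding: str    # human-readable description of what was observed
    score: float    # suspicion score in [0, 1]; higher means more suspicious


@dataclass
class Verdict:
    label: str                # "synthetic" or "real"
    skill: str                # forensic skill that was selected
    confidence: float         # mean suspicion score over the evidence
    evidence: List[Evidence]  # structured evidence backing the decision


def make_tool(name: str, feature: str) -> Callable[[Image], Evidence]:
    """Build a toy forensic tool that simply reads one feature from the image."""
    def tool(img: Image) -> Evidence:
        value = img.get(feature, 0.0)
        return Evidence(tool=name, finding=f"{feature}={value:.2f}", score=value)
    return tool


# Hypothetical skills, each bound to its own toolchain (assumed, not from the paper).
SKILLS: Dict[str, List[Callable[[Image], Evidence]]] = {
    "low_level_artifacts": [
        make_tool("check_texture", "texture_smoothness"),
        make_tool("check_noise", "noise_irregularity"),
    ],
    "semantic_consistency": [
        make_tool("check_anatomy", "anatomy_errors"),
        make_tool("check_lighting", "lighting_mismatch"),
    ],
}


def extract_clues(img: Image) -> Dict[str, float]:
    """Stage 1a: cheap heuristic clue strength for each candidate skill."""
    return {
        "low_level_artifacts": max(img.get("texture_smoothness", 0.0),
                                   img.get("noise_irregularity", 0.0)),
        "semantic_consistency": max(img.get("anatomy_errors", 0.0),
                                    img.get("lighting_mismatch", 0.0)),
    }


def select_skill(clues: Dict[str, float]) -> str:
    """Stage 1b: pick the skill with the strongest clue."""
    return max(clues, key=clues.get)


def reason_with_skill(img: Image, skill: str, threshold: float = 0.5) -> Verdict:
    """Stage 2: run the selected skill's toolchain and decide from the evidence."""
    evidence = [tool(img) for tool in SKILLS[skill]]
    confidence = sum(e.score for e in evidence) / len(evidence)
    label = "synthetic" if confidence >= threshold else "real"
    return Verdict(label=label, skill=skill, confidence=confidence, evidence=evidence)


def detect(img: Image) -> Verdict:
    """Full pipeline: clues -> skill selection -> skill-conditioned reasoning."""
    return reason_with_skill(img, select_skill(extract_clues(img)))
```

Calling `detect({"texture_smoothness": 0.9, "noise_irregularity": 0.7})` routes the toy image to the hypothetical `low_level_artifacts` skill and returns a `Verdict` whose `evidence` list records what each tool observed. The point of the sketch is the shape of the output: a decision accompanied by the selected skill and per-tool findings, in contrast to the single score of an end-to-end classifier.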