🤖 AI Summary
To address the dual challenges of low-quality ASR transcripts and the limited discriminative capability of lightweight models in voice phishing (VP) detection, this paper proposes an expert-knowledge-driven lightweight VP detection framework. The authors perform supervised fine-tuning on Llama3-8B, explicitly injecting domain-specific VP assessment criteria as structured prompts, and introduce a VP-specific adversarial evaluation benchmark alongside a multi-source transcription dataset. Experiments show that explicit expert prompt injection outperforms chain-of-thought (CoT) reasoning for small models, with the fine-tuned model achieving performance comparable to a GPT-4-based detector and markedly improved robustness. Key contributions: (1) empirical validation that explicit encoding of expert knowledge substantially enhances VP recognition in compact language models; (2) release of an adversarial VP evaluation suite and a diverse transcription corpus; and (3) high-performance, robust real-time VP detection under resource-constrained conditions.
📝 Abstract
We develop a voice phishing (VP) detector by fine-tuning Llama3, a representative open-source small language model (LM). In the prompt, we provide carefully designed VP evaluation criteria and apply the Chain-of-Thought (CoT) technique. To evaluate the robustness of LMs and highlight differences in their performance, we construct an adversarial test dataset that places the models under challenging conditions. Moreover, to address the lack of VP transcripts, we create transcripts by referencing existing and new types of VP techniques. We compare cases where evaluation criteria are included, the CoT technique is applied, or both are used together. Our experimental results show that the Llama3-8B model, fine-tuned on a dataset whose prompts include VP evaluation criteria, yields the best performance among small LMs and is comparable to a GPT-4-based VP detector. These findings indicate that incorporating human expert knowledge into the prompt is more effective than using the CoT technique for small LMs in VP detection.
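To make the core idea concrete, the sketch below shows how expert VP evaluation criteria could be injected explicitly into a classification prompt, as opposed to asking the model to reason step by step (CoT). The criteria listed and the prompt wording are hypothetical illustrations, not the paper's actual criteria or template:

```python
# Illustrative sketch only: explicit expert-knowledge prompt injection
# for VP detection. The criteria and wording are hypothetical, not the
# paper's actual prompt.

# Hypothetical evaluation criteria a domain expert might encode.
VP_CRITERIA = [
    "Does the caller claim to be from a bank, police, or government agency?",
    "Does the caller create urgency or threaten legal consequences?",
    "Does the caller request transfers, account numbers, or one-time codes?",
    "Does the caller ask the victim to install an app or keep the call secret?",
]

def build_prompt(transcript: str) -> str:
    """Assemble a prompt that lists explicit expert criteria before the
    transcript, rather than eliciting free-form chain-of-thought."""
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(VP_CRITERIA))
    return (
        "You are a voice phishing (VP) detector.\n"
        "Evaluate the call transcript against these criteria:\n"
        f"{criteria}\n\n"
        f"Transcript:\n{transcript}\n\n"
        "Answer with exactly one label: 'phishing' or 'normal'."
    )

prompt = build_prompt(
    "This is your bank. Your account is frozen; "
    "transfer funds now to a safe account."
)
print(prompt)
```

During supervised fine-tuning, each training example would pair such a criteria-laden prompt with its ground-truth label, so the small LM learns to apply the expert checklist rather than derive it.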