🤖 AI Summary
To address the dual challenges of low-quality ASR transcripts and the limited discriminative capability of lightweight models in voice phishing (VP) detection, this paper proposes an expert-knowledge-driven lightweight VP detection framework. The authors perform supervised fine-tuning on Llama3-8B, explicitly injecting domain-specific VP assessment criteria as structured prompts, and introduce a VP-specific adversarial evaluation benchmark alongside a multi-source transcription dataset. Experiments show that explicit expert prompt injection outperforms chain-of-thought (CoT) reasoning for small models, with the fine-tuned model achieving performance comparable to a GPT-4-based detector and markedly improved robustness. Key contributions: (1) empirical validation that explicit encoding of expert knowledge substantially enhances VP recognition in compact language models; (2) release of an adversarial VP evaluation suite and a diverse transcription corpus; and (3) high-performance, robust real-time VP detection under resource-constrained conditions.
📝 Abstract
We develop a voice phishing (VP) detector by fine-tuning Llama3, a representative open-source small language model (LM). In the prompt, we provide carefully designed VP evaluation criteria and apply the Chain-of-Thought (CoT) technique. To evaluate the robustness of LMs and highlight differences in their performance, we construct an adversarial test dataset that places the models under challenging conditions. Moreover, to address the lack of VP transcripts, we create transcripts by referencing existing and new types of VP techniques. We compare cases where evaluation criteria are included, the CoT technique is applied, or both are used together. Our experimental results show that the Llama3-8B model, fine-tuned on a dataset whose prompts include VP evaluation criteria, yields the best performance among small LMs and is comparable to a GPT-4-based VP detector. These findings indicate that incorporating human expert knowledge into the prompt is more effective than using the CoT technique for small LMs in VP detection.
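To make the core idea concrete, the sketch below shows how expert VP evaluation criteria could be injected explicitly into a classification prompt, as opposed to asking the model to reason step by step (CoT). The criteria listed and the prompt wording are hypothetical illustrations, not the paper's actual criteria or template:

```python
# Illustrative sketch only: explicit expert-knowledge prompt injection
# for VP detection. The criteria and wording are hypothetical, not the
# paper's actual prompt.

# Hypothetical evaluation criteria a domain expert might encode.
VP_CRITERIA = [
    "Does the caller claim to be from a bank, police, or government agency?",
    "Does the caller create urgency or threaten legal consequences?",
    "Does the caller request transfers, account numbers, or one-time codes?",
    "Does the caller ask the victim to install an app or keep the call secret?",
]

def build_prompt(transcript: str) -> str:
    """Assemble a prompt that lists explicit expert criteria before the
    transcript, rather than eliciting free-form chain-of-thought."""
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(VP_CRITERIA))
    return (
        "You are a voice phishing (VP) detector.\n"
        "Evaluate the call transcript against these criteria:\n"
        f"{criteria}\n\n"
        f"Transcript:\n{transcript}\n\n"
        "Answer with exactly one label: 'phishing' or 'normal'."
    )

prompt = build_prompt(
    "This is your bank. Your account is frozen; "
    "transfer funds now to a safe account."
)
print(prompt)
```

During supervised fine-tuning, each training example would pair such a criteria-laden prompt with its ground-truth label, so the small LM learns to apply the expert checklist rather than derive it.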