🤖 AI Summary
This study addresses the challenge of precisely localizing forged words in partially fake speech, where only specific words within an utterance have been manipulated. To this end, the authors build a speech LLM on top of a text-pretrained large language model (LLM) and perform word-level localization of falsified content via next-token prediction. By integrating speech-text alignment with deepfake detection, the approach performs strongly on known editing styles, such as word-level polarity substitution, on the AV-Deepfake1M and PartialEdit benchmarks. The work further shows that LLMs rely heavily on learned editing patterns for their judgments, revealing both their potential and their limitations: while effective for familiar manipulations, they struggle to generalize to unseen editing strategies.
📝 Abstract
Large language models (LLMs), trained on large-scale text, have recently attracted significant attention for their strong performance across many tasks. Motivated by this, we investigate whether a text-trained LLM can help localize fake words in partially fake speech, where only specific words within an utterance are edited. We build a speech LLM that performs fake word localization via next-token prediction. Experiments and analyses on AV-Deepfake1M and PartialEdit indicate that the model frequently leverages editing-style patterns learned from the training data, particularly the word-level polarity substitutions present in these two datasets, as cues for localizing fake words. Although such patterns provide useful information in in-domain scenarios, how to avoid over-reliance on them and improve generalization to unseen editing styles remains an open question.
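To make the task framing concrete, here is a minimal illustrative sketch (not the authors' code; the prompt and label formats are assumptions) of how word-level fake localization can be cast as next-token prediction: the transcript words, aligned to the speech, are serialized into a prompt, and the model is trained to emit one real/fake label token per word.

```python
# Hypothetical serialization for casting fake word localization as
# next-token prediction. An actual system would condition on aligned
# speech features; this sketch shows only the text-side framing.

def build_prompt(words):
    """Serialize aligned transcript words into a prompt (assumed format)."""
    return "Transcript: " + " ".join(words) + "\nLabels:"

def build_target(labels):
    """Per-word labels emitted as next tokens: 0 = real word, 1 = fake word."""
    return " ".join(str(label) for label in labels)

# Toy example of word-level polarity substitution:
# "terrible" has been edited in place of an original positive word.
words = ["the", "movie", "was", "terrible"]
labels = [0, 0, 0, 1]

print(build_prompt(words) + " " + build_target(labels))
```

Under this framing, the localization decision is read off from which label tokens the model predicts, which is also why editing-style regularities in the training data (such as polarity flips) can dominate the model's cues.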