Source Tracing of Synthetic Speech Systems Through Paralinguistic Pre-Trained Representations

📅 2025-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses source tracing of synthetic speech generation systems (STSGS), i.e., identifying which generation system produced a given synthetic utterance. Prior work has not used representations from speech models pre-trained for paralinguistic tasks, which are well placed to capture source-specific cues such as pitch, intonation, and rhythm. To address this, we adopt TRILLsson, an existing paralinguistic pre-trained representation, and propose TRIO, a fusion framework that (i) combines TRILLsson and x-vector embeddings via gated adaptive weighting, (ii) aligns the two representations with a canonical correlation loss, and (iii) refines the fused features through self-attention. In the reported experiments, TRIO outperforms the individual models and baseline fusion methods and sets a new state of the art for STSGS. The results support the role of paralinguistic representations in synthetic speech attribution.

📝 Abstract

In this work, we focus on source tracing of synthetic speech generation systems (STSGS). Each source embeds distinctive paralinguistic features (such as pitch, tone, rhythm, and intonation) into its synthesized speech, reflecting the underlying design of the generation model. While previous research has explored representations from speech pre-trained models (SPTMs), representations from SPTMs pre-trained for paralinguistic speech processing, which excel in paralinguistic tasks such as synthetic speech detection and speech emotion recognition, have not been investigated for STSGS. We hypothesize that representations from a paralinguistic SPTM will be more effective because its paralinguistic pre-training enables it to capture source-specific paralinguistic cues. Our comparative study of representations from various SOTA SPTMs, including paralinguistic, monolingual, multilingual, and speaker recognition models, validates this hypothesis. Furthermore, we explore fusion of representations and propose TRIO, a novel framework that fuses SPTMs using a gated mechanism for adaptive weighting, followed by a canonical correlation loss for inter-representation alignment and self-attention for feature refinement. By fusing TRILLsson (paralinguistic SPTM) and x-vector (speaker recognition SPTM), TRIO outperforms individual SPTMs and baseline fusion methods, and sets a new SOTA for STSGS in comparison to previous works.
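
To make the "gated mechanism for adaptive weighting" more concrete, here is a minimal Python sketch of one common form of gated fusion. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the gating equations, so the elementwise sigmoid gate, the function name gated_fusion, and the parameters w_a, w_b, and bias are all hypothetical. The two embeddings are assumed to have already been projected to a common dimension, and the gate parameters would normally be learned rather than hand-set.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(a, b, w_a, w_b, bias):
    """Fuse two equal-length embeddings with an elementwise gate.

    Hypothetical form (not taken from the paper):
        g_i     = sigmoid(w_a[i] * a[i] + w_b[i] * b[i] + bias[i])
        fused_i = g_i * a[i] + (1 - g_i) * b[i]
    """
    if not (len(a) == len(b) == len(w_a) == len(w_b) == len(bias)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for x, y, wa, wb, c in zip(a, b, w_a, w_b, bias):
        g = sigmoid(wa * x + wb * y + c)  # gate value in (0, 1)
        fused.append(g * x + (1.0 - g) * y)
    return fused


# Toy vectors standing in for a paralinguistic and a speaker embedding.
paralinguistic = [0.2, -1.0, 0.5]
speaker = [1.0, 0.0, -0.5]
zeros = [0.0, 0.0, 0.0]

# With all gate parameters at zero, every gate is 0.5, so the result
# is the plain average of the two embeddings.
fused = gated_fusion(paralinguistic, speaker, zeros, zeros, zeros)
print(fused)
```

The gate lets the model weight each dimension of the two representations differently per utterance. A large positive gate input pushes the output toward the first embedding, and a large negative one pushes it toward the second. In the framework described above, a correlation-based alignment loss and self-attention would follow this step, and both are omitted here.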

Problem

Research questions and friction points this paper is trying to address.

Trace synthetic speech sources using paralinguistic features

Compare pre-trained models for speech source identification

Propose TRIO framework for superior source tracing performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses paralinguistic pre-trained representations

Proposes TRIO framework with gated fusion

Combines TRILLsson and x-vector models

🔎 Similar Papers

A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection

2024-09-23 | arXiv.org | Citations: 1