VANPY: Voice Analysis Framework

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of comprehensive automated tools for voice analysis and speaker characterization, this paper introduces VANPY (Voice Analysis in Python), an open-source end-to-end framework for speaker characterization from voice data. VANPY chains voice pre-processing (music/speech separation and voice activity detection), feature extraction (vocal features, OpenSMILE features, and speaker embeddings), and classification components into a configurable pipeline of more than fifteen components, building on libraries such as SpeechBrain and Librosa. Four in-house models extend its characterization capabilities: gender classification, emotion classification, age regression, and height regression, with emotion intensity quantified along the arousal-dominance-valence (ADV) dimensions. Evaluated on diverse datasets and demonstrated on character voices from the movie "Pulp Fiction," the models show robust, though not state-of-the-art, performance.
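As a rough illustration of the pre-processing, feature-extraction, and classification flow described above, the sketch below chains placeholder components into a pipeline. All names here (Pipeline, voice_activity_detection, and so on) are hypothetical illustrations, not VANPY's actual API; consult the VANPY repository for the real interfaces.

```python
# Minimal sketch of a VANPY-style component pipeline. Component names and
# the Pipeline class are illustrative only, not VANPY's actual API.
from typing import Callable


class Pipeline:
    """Chains voice-analysis components; each consumes and returns a payload dict."""

    def __init__(self, components: list[Callable[[dict], dict]]):
        self.components = components

    def process(self, payload: dict) -> dict:
        for component in self.components:
            payload = component(payload)
        return payload


def voice_activity_detection(payload: dict) -> dict:
    # Placeholder: a real component would trim non-speech segments here.
    payload["speech_segments"] = [(0.0, payload.get("duration", 0.0))]
    return payload


def feature_extraction(payload: dict) -> dict:
    # Placeholder: a real component would compute e.g. MFCCs or embeddings.
    payload["features"] = []
    return payload


def speaker_characterization(payload: dict) -> dict:
    # Placeholder for the four in-house models named in the abstract.
    payload["profile"] = {
        "gender": None, "age": None, "height": None, "emotion": None,
        "adv": {"arousal": None, "dominance": None, "valence": None},
    }
    return payload


pipeline = Pipeline([voice_activity_detection, feature_extraction,
                     speaker_characterization])
result = pipeline.process({"path": "clip.wav", "duration": 3.2})
print(result["profile"])
```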

📝 Abstract
Voice data is increasingly being used in modern digital communications, yet there is still a lack of comprehensive tools for automated voice analysis and characterization. To this end, we developed the VANPY (Voice Analysis in Python) framework for automated pre-processing, feature extraction, and classification of voice data. VANPY is an open-source, end-to-end, comprehensive framework that was developed for the purpose of speaker characterization from voice data. The framework is designed with extensibility in mind, allowing for easy integration of new components and adaptation to various voice analysis applications. It currently incorporates over fifteen voice analysis components, including music/speech separation, voice activity detection, speaker embedding, vocal feature extraction, and various classification models. Four of VANPY's components were developed in-house and integrated into the framework to extend its speaker characterization capabilities: gender classification, emotion classification, age regression, and height regression. The models demonstrate robust performance across various datasets, although not surpassing state-of-the-art performance. As a proof of concept, we demonstrate the framework's ability to extract speaker characteristics on a use-case challenge of analyzing character voices from the movie "Pulp Fiction." The results illustrate the framework's capability to extract multiple speaker characteristics, including gender, age, height, emotion type, and emotion intensity measured across three dimensions: arousal, dominance, and valence.
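For a concrete sense of the vocal features such extraction components compute, the standalone librosa sketch below derives a fixed-length utterance vector from frame-level MFCC, zero-crossing-rate, and energy features. The file path is a placeholder, and this particular feature set is illustrative rather than VANPY's exact configuration.

```python
# Standalone illustration of low-level vocal feature extraction with librosa,
# the kind of features a feature-extraction component might expose.
# "clip.wav" is a placeholder path; any mono speech file works.
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=16000)          # resample to 16 kHz mono

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, n_frames) MFCC matrix
zcr = librosa.feature.zero_crossing_rate(y)         # voicing-related rate per frame
rms = librosa.feature.rms(y=y)                      # frame-level energy

# Summarize frame-level features into one fixed-length utterance vector,
# a common input format for downstream classifiers and regressors.
utterance_vector = np.concatenate([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    [zcr.mean(), rms.mean()],
])
print(utterance_vector.shape)  # (28,)
```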
Problem

Research questions and friction points this paper is trying to address.

Lack of comprehensive tools for automated voice analysis and characterization
Automated speaker characterization from voice data
Need for an extensible framework that adapts to diverse voice analysis applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source, end-to-end voice analysis framework (VANPY)
Extensible architecture for integrating new components (see the sketch below)
Four in-house speaker characterization models: gender, emotion, age, and height
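A hypothetical sketch of what such extensibility could look like: a minimal component contract that user code implements and the pipeline consumes. The BaseComponent interface and the AccentClassifier example are invented for illustration, not taken from VANPY.

```python
# Hypothetical sketch of an extension point: new components implement a
# small shared interface. Not VANPY's actual extension API.
from abc import ABC, abstractmethod


class BaseComponent(ABC):
    """Minimal contract: consume a payload dict, return an enriched one."""

    @abstractmethod
    def process(self, payload: dict) -> dict: ...


class AccentClassifier(BaseComponent):
    """Example user-defined component added without touching framework code."""

    def process(self, payload: dict) -> dict:
        # A real implementation would run a trained model on the features.
        payload["accent"] = "unknown"
        return payload


components: list[BaseComponent] = [AccentClassifier()]
payload = {"features": []}
for component in components:
    payload = component.process(payload)
print(payload["accent"])  # -> "unknown"
```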
Gregory Koushnir
Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Israel
Michael Fire
Faculty of Computer and Information Science, The Fire AI Lab, Ben-Gurion University of the Negev
Cyber Security · Applied AI · Safe AI · Data Science · Big Data
G. Alpert
Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Israel
Dima Kagan
PhD, Software and Information Systems Engineering, Ben-Gurion University of the Negev
Social Networks · Data Mining · Information Security · Machine Learning