CommonVoice-SpeechRE and RPG-MoGe: Advancing Speech Relation Extraction with a New Dataset and Multi-Order Generative Framework

📅 2025-09-10
🤖 AI Summary
Speech relation extraction (SpeechRE) currently faces two bottlenecks: real-world speech data is scarce and lacks diversity, and existing models rely on a single-order generation template with weak cross-modal semantic alignment. To address these, we introduce CommonVoice-SpeechRE, a large-scale, real-speech SpeechRE benchmark comprising nearly 20,000 utterances from diverse human speakers. We further propose RPG-MoGe, a generative framework featuring (i) a multi-order triplet generation ensemble strategy that varies the element order of triplets during both training and inference, and (ii) CNN-based latent relation prediction heads that generate explicit relation prompts to guide speech-text cross-modal alignment and triplet generation, which removes the rigid single-template constraint. Experiments show that RPG-MoGe outperforms state-of-the-art methods. Both the code and dataset are publicly released.

📝 Abstract
Speech Relation Extraction (SpeechRE) aims to extract relation triplets directly from speech. However, existing benchmark datasets rely heavily on synthetic data, lacking sufficient quantity and diversity of real human speech. Moreover, existing models also suffer from rigid single-order generation templates and weak semantic alignment, substantially limiting their performance. To address these challenges, we introduce CommonVoice-SpeechRE, a large-scale dataset comprising nearly 20,000 real-human speech samples from diverse speakers, establishing a new benchmark for SpeechRE research. Furthermore, we propose the Relation Prompt-Guided Multi-Order Generative Ensemble (RPG-MoGe), a novel framework that features: (1) a multi-order triplet generation ensemble strategy, leveraging data diversity through diverse element orders during both training and inference, and (2) CNN-based latent relation prediction heads that generate explicit relation prompts to guide cross-modal alignment and accurate triplet generation. Experiments show our approach outperforms state-of-the-art methods, providing both a benchmark dataset and an effective solution for real-world SpeechRE. The source code and dataset are publicly available at https://github.com/NingJinzhong/SpeechRE_RPG_MoGe.
Problem

Research questions and friction points this paper is trying to address.

Lack of real diverse speech data for relation extraction
Rigid single-order generation templates in existing models
Weak semantic alignment affecting triplet extraction accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale real-human speech dataset
Multi-order triplet generation ensemble strategy
CNN-based relation prompts for alignment
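
To make the multi-order ensemble idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `<head>`, `<relation>`, and `<tail>` marker tokens, the function names, and the vote-threshold rule are all assumptions made for illustration, since the abstract does not describe how the per-order outputs are combined. The sketch serializes one triplet in each of the six possible element orders, parses such strings back, and keeps the triplets that enough orders agree on.

```python
import re
from collections import Counter
from itertools import permutations

ROLES = ("head", "relation", "tail")
# All 6 element orders, e.g. ("relation", "head", "tail").
ORDERS = list(permutations(ROLES))


def linearize(triplet, order):
    """Serialize a (head, relation, tail) triplet in the given element order.

    The <role> marker tokens are hypothetical, chosen only for this sketch.
    """
    fields = dict(zip(ROLES, triplet))
    return " ".join(f"<{role}> {fields[role]}" for role in order)


def parse(text):
    """Recover (head, relation, tail) from a linearized string, or None."""
    found = {role: value.strip()
             for role, value in re.findall(r"<(head|relation|tail)>([^<]*)", text)}
    if set(found) != set(ROLES):
        return None  # malformed generation: at least one element is missing
    return tuple(found[role] for role in ROLES)


def ensemble(predictions_per_order, min_votes):
    """Keep triplets predicted under at least `min_votes` element orders.

    Simple vote counting is an assumption, not the paper's stated rule.
    """
    votes = Counter(t for preds in predictions_per_order for t in set(preds))
    return {t for t, n in votes.items() if n >= min_votes}


if __name__ == "__main__":
    gold = ("Marie Curie", "born_in", "Warsaw")
    for order in ORDERS:
        print(linearize(gold, order))
```

In this sketch, a model trained on all six serializations would be decoded once per order, each output passed through `parse`, and the resulting sets passed to `ensemble`. Raising `min_votes` trades recall for precision.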
Jinzhong Ning
School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
Paerhati Tulajiang
School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China; and College of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054, China
Yingying Le
School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Yijia Zhang
School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
Yuanyuan Sun
School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Hongfei Lin
Dalian University of Technology
Natural language processing, sentiment analysis, text mining, social computing
Haifeng Liu
Zhejiang University
Machine learning, data management, information retrieval