Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based speech translation systems primarily focus on input-output modality alignment, neglecting deep semantic consistency between internal speech and text representations. Method: We propose an adaptive inner-layer speech-text alignment method that explicitly models cross-modal semantic alignment within LLM hidden layers. Innovatively integrating optimal transport (OT) theory with cross-modal retrieval, we design a hidden-layer selection mechanism that dynamically identifies and jointly optimizes the optimal alignment layers for fine-grained, adaptive internal representation alignment. Contribution/Results: Our approach significantly improves the performance of large speech-to-text models (LSMs) on speech translation tasks, comprehensively outperforming current state-of-the-art methods across multiple benchmarks. The OT-guided layer selection enables principled, interpretable, and task-aware alignment without architectural modifications or additional inference latency.

📝 Abstract
Recent advances in large language models (LLMs) have led to significant breakthroughs across various tasks, laying the foundation for the development of LLM-based speech translation systems. Existing methods primarily focus on aligning inputs and outputs across modalities while overlooking deeper semantic alignment within model representations. To address this limitation, we propose an Adaptive Inner Speech-Text Alignment (AI-STA) method to bridge the modality gap by explicitly aligning speech and text representations at selected layers within LLMs. To achieve this, we leverage optimal transport (OT) theory to quantify fine-grained representation discrepancies between speech and text. Furthermore, we utilize cross-modal retrieval to identify the layers best suited for alignment and perform joint training on these layers. Experimental results on speech translation (ST) tasks demonstrate that AI-STA significantly improves the translation performance of large speech-text models (LSMs), outperforming previous state-of-the-art approaches. Our findings highlight the importance of inner-layer speech-text alignment in LLMs and provide new insights into enhancing cross-modal learning.
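The abstract's core measurement, using OT to quantify the discrepancy between speech and text hidden states, can be illustrated with an entropic-regularized (Sinkhorn) transport cost. This is a minimal NumPy sketch under assumed choices (squared-Euclidean cost, uniform marginals, fixed iteration count); the paper's exact OT formulation may differ.

```python
import numpy as np

def sinkhorn_ot_cost(speech_h, text_h, eps=0.1, n_iters=100):
    """Entropic-regularized OT cost between two hidden-state sequences.

    speech_h: (m, d) speech hidden states; text_h: (n, d) text hidden states.
    Hypothetical sketch: squared-Euclidean cost, uniform marginals.
    """
    # Pairwise squared-Euclidean cost matrix between the two sequences.
    C = ((speech_h[:, None, :] - text_h[None, :, :]) ** 2).sum(-1)
    C = C / C.max()  # normalize for numerical stability
    K = np.exp(-C / eps)  # Gibbs kernel
    m, n = C.shape
    a, b = np.full(m, 1.0 / m), np.full(n, 1.0 / n)  # uniform marginals
    u = np.ones(m)
    # Sinkhorn iterations: alternate scaling to match the marginals.
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # transport plan
    return float((P * C).sum())    # transport cost under the plan
```

A smaller cost indicates that the two representation sequences are easier to match, i.e. better aligned at that layer.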
Problem

Research questions and friction points this paper is trying to address.

Addresses semantic alignment in LLM-based speech translation
Proposes AI-STA to bridge speech-text modality gap
Improves translation performance using optimal transport theory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Inner Speech-Text Alignment method
Optimal transport theory for representation discrepancies
Cross-modal retrieval for layer alignment
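The retrieval-based layer selection in the bullets above can be sketched as: score each hidden layer by how well paired speech/text representations retrieve each other, then pick the best-scoring layer. The function name, pooling assumption, and cosine-similarity scoring here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def select_alignment_layer(speech_layers, text_layers):
    """Pick the layer where paired speech/text representations retrieve
    each other best (hypothetical sketch of retrieval-based selection).

    speech_layers, text_layers: lists of (N, d) arrays, one per layer;
    row i of each array is a pooled representation of paired example i.
    """
    def retrieval_acc(S, T):
        # Cosine similarity between every speech/text pair.
        S = S / np.linalg.norm(S, axis=1, keepdims=True)
        T = T / np.linalg.norm(T, axis=1, keepdims=True)
        sim = S @ T.T
        # Speech-to-text retrieval: does example i rank its pair first?
        return float((sim.argmax(axis=1) == np.arange(len(S))).mean())

    scores = [retrieval_acc(S, T) for S, T in zip(speech_layers, text_layers)]
    return int(np.argmax(scores)), scores
```

The selected layer(s) would then be the target of the OT-based alignment loss during joint training.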
Henglyu Liu
Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China
Andong Chen
Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China
Kehai Chen
Harbin Institute of Technology (Shenzhen)
LLM · Natural Language Processing · Agent · Multi-model Generation
Xuefeng Bai
Harbin Institute of Technology (Shenzhen)
Natural Language Processing · Semantics · Dialogue
Meizhi Zhong
Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China
Yuan Qiu
Southeast University
Differential Privacy · Database · Sketches
Min Zhang
Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China