A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges in speech-text multimodal fusion—namely, excessively long audio sequences, high computational overhead, and dilution of salient speech information. We propose a lightweight speech token enhancement method: (1) discretizing raw speech into semantic speech tokens via an ASR tokenizer; (2) applying Lasso-based feature selection to identify task-critical tokens; (3) constructing a structured multimodal Bag-of-Words representation; and (4) incorporating a self-supervised language modeling objective for cross-modal alignment, followed by end-to-end fine-tuning. Without significantly increasing model parameters or computational cost, our approach achieves substantial performance gains on argumentative fallacy detection and other classification tasks. It outperforms unimodal LLMs, large-scale SpeechLMs, and state-of-the-art learnable audio fusion methods across multiple benchmarks, establishing new SOTA results. This validates the effectiveness of the “pruned token + structured fusion” paradigm for efficient and interpretable multimodal learning.
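As an illustration of step (2), Lasso's L1 penalty drives most feature weights to exactly zero, so the surviving audio tokens can be read off the nonzero coefficients. The sketch below is hypothetical and not the authors' code: the token vocabulary, counts, and labels are synthetic stand-ins for a real multimodal Bag-of-Words.

```python
# Illustrative sketch (not the authors' implementation): Lasso-based
# selection of task-critical audio tokens from Bag-of-Words counts.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, vocab_size = 200, 50  # hypothetical audio-token vocabulary

# Toy Bag-of-Words: per-example counts of each audio token.
X = rng.poisson(1.0, size=(n_samples, vocab_size)).astype(float)
# Toy binary labels depending on a few "task-critical" tokens.
y = (X[:, 3] + X[:, 17] - X[:, 42] > 1).astype(float)

# The L1 penalty zeroes out weights of uninformative tokens.
lasso = Lasso(alpha=0.05).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0)
print(f"kept {len(selected)} of {vocab_size} tokens:", selected)
```

In the paper's pipeline, the tokens kept this way form the reduced vocabulary that the language model is then adapted to before fine-tuning.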

📝 Abstract
This paper presents a simple method that makes it easy to enhance textual pre-trained large language models with speech information when they are fine-tuned for a specific classification task. A classical issue with fusing many audio embeddings with text is that the audio sequence is much longer than the text one. Our method builds on an existing speech tokenizer trained for Automatic Speech Recognition, which outputs long sequences of tokens from a large vocabulary, making it difficult to integrate into a large language model at low cost. By applying a simple Lasso-based feature selection on a multimodal Bag-of-Words representation, we retain only the audio tokens most important for the task, and adapt the language model to them with a self-supervised language modeling objective before fine-tuning it on the downstream task. We show that this improves performance compared to a unimodal model, a larger SpeechLM, or integrating audio via a learned representation. We demonstrate the effectiveness of our method on two recent Argumentative Fallacy Detection and Classification tasks where the use of audio was believed counterproductive, reaching state-of-the-art results. We also provide an in-depth analysis of the method, showing that even a random audio token selection helps enhance the unimodal model. Our code is available [online](https://github.com/salocinc/EACL26SpeechTokFallacy/).
Problem

Research questions and friction points this paper is trying to address.

Integrating long speech token sequences into text-based language models efficiently
Selecting the most relevant audio tokens for multimodal classification tasks effectively
Enhancing text-only models with speech information for argumentative fallacy detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lasso-based feature selection for audio tokens
Self-supervised language modeling adaptation
Integration of speech tokens into textual LLMs
Nicolas Calbucura
Universidad de Chile, DCC
Valentin Barriere
Telecom ParisTech
Affective Computing