🤖 AI Summary
This work addresses a limitation of existing Japanese speech large language models (SpeechLLMs): because they couple automatic speech recognition (ASR) encoders with text-based large language models, they tend to generate overly formal, written-style text that poorly captures spoken-language characteristics such as honorifics, sentence-final particles, and the syntactic simplicity required for natural speech synthesis. To bridge this gap, the study proposes the first speech-appropriateness alignment framework explicitly designed for the spoken–written register discrepancy in Japanese. The approach fine-tunes SpeechLLMs via Direct Preference Optimization (DPO) and introduces SpokenElyza, the first evaluation benchmark for this task validated by native speakers through auditory assessment. Experiments show that the method substantially improves spoken-language generation quality on SpokenElyza while preserving performance on conventional written-language tasks, thereby advancing Japanese spoken-dialogue systems.
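For readers unfamiliar with DPO, the objective it optimizes can be sketched in a few lines. The snippet below is a minimal illustration of the standard per-pair DPO loss, not the paper's actual training code; the log-probability values in the example are made up for demonstration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l         : summed log-probs of the preferred ("chosen",
                              e.g. spoken-style) and dispreferred ("rejected",
                              e.g. written-style) responses under the policy
                              being fine-tuned.
    ref_logp_w / ref_logp_l : the same quantities under the frozen reference
                              model (the SpeechLLM before alignment).
    beta                    : temperature controlling how far the policy may
                              drift from the reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Loss is -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # computed in a numerically stable form.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

# Example (hypothetical numbers): the policy now favors the spoken-style
# response relative to the reference, so the margin is positive and the
# loss falls below log(2), its value at a zero margin.
loss = dpo_loss(logp_w=-10.0, logp_l=-14.0, ref_logp_w=-12.0, ref_logp_l=-12.0)
```

Minimizing this loss pushes the policy to raise the likelihood of speech-worthy responses relative to written-style ones, without a separate reward model.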
📝 Abstract
SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substantially in politeness markers, sentence-final particles, and syntactic complexity. We propose a preference-based alignment approach to adapt Japanese SpeechLLMs for speech-worthy outputs: text that is concise, conversational, and readily synthesized as natural speech. To rigorously evaluate this task, we introduce SpokenElyza, a benchmark for Japanese speech-worthiness derived from ELYZA-tasks-100 with auditory verification by native experts. Experiments show that our approach achieves substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation. We will release SpokenElyza to support future research on Japanese spoken dialog systems.