Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge

๐Ÿ“… 2024-11-21
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 1
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Cross-modal alignment between automatic speech recognition (ASR) and large language models (LLMs) remains challenging for resource-constrained edge devices due to modality isolation and computational inefficiency. Method: This paper proposes the first lightweight joint alignment framework, featuring a parameter-efficient cross-modal adapter, gradient sparsification during training, and an edge-native fine-tuning strategy. Contribution/Results: The framework bridges the ASRโ€“LLM semantic gap under stringent low-resource constraints while enabling on-device personalized continual learning and real-time online adaptation. It achieves end-to-end deployment optimization on NVIDIA Jetson platforms. Evaluated on Jetson Orin (8 GB RAM), the framework accelerates training by 50ร— and improves cross-modal alignment fidelity by over 50% compared to baseline methods. These advances significantly enhance the practicality and deployment readiness of intelligent, speech-driven edge applications.
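The gradient sparsification mentioned above can be illustrated with a generic magnitude-based top-k sketch. This is not the paper's actual criterion (which is not detailed here); the `keep_ratio` parameter and the top-k selection rule are illustrative assumptions about how such a component commonly works.

```python
def sparsify_gradients(grads, keep_ratio=0.1):
    """Keep only the largest-magnitude fraction of gradient entries,
    zeroing the rest -- a generic top-k sparsification sketch, not
    the paper's exact method."""
    k = max(1, int(len(grads) * keep_ratio))
    # Rank entries by absolute value; keep the indices of the top k.
    ranked = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    keep = set(ranked[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grads)]

grads = [0.5, -0.01, 0.02, -0.9, 0.03, 0.001, 0.2, -0.04, 0.06, 0.1]
sparse = sparsify_gradients(grads, keep_ratio=0.3)
# Only the three largest-magnitude gradients survive; the rest are zeroed,
# shrinking the update that must be computed and stored on-device.
```

Transmitting or applying only the surviving entries is what reduces the training-time memory and compute footprint on a device like the 8 GB Jetson Orin.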

๐Ÿ“ Abstract
The combination of Large Language Models (LLM) and Automatic Speech Recognition (ASR), when deployed on edge devices (called edge ASR-LLM), can serve as a powerful personalized assistant to enable audio-based interaction for users. Compared to text-based interaction, edge ASR-LLM allows accessible and natural audio interactions. Unfortunately, existing ASR-LLM models are mainly trained in high-performance computing environments and produce substantial model weights, making them difficult to deploy on edge devices. More importantly, to better serve users' personalized needs, the ASR-LLM must be able to learn from each distinct user, given that audio input often contains highly personalized characteristics that necessitate personalized on-device training. Since individually fine-tuning the ASR or LLM often leads to suboptimal results due to modality-specific limitations, end-to-end training ensures seamless integration of audio features and language understanding (cross-modal alignment), ultimately enabling a more personalized and efficient adaptation on edge devices. However, due to the complex training requirements and substantial computational demands of existing approaches, cross-modal alignment between ASR audio and LLM can be challenging on edge devices. In this work, we propose a resource-efficient cross-modal alignment framework that bridges ASR and LLMs on edge devices to handle personalized audio input. Our framework enables efficient ASR-LLM alignment on resource-constrained devices like NVIDIA Jetson Orin (8GB RAM), achieving 50x training time speedup while improving the alignment quality by more than 50%. To the best of our knowledge, this is the first work to study efficient ASR-LLM alignment on resource-constrained edge devices.
Problem

Research questions and friction points this paper is trying to address.

Cross-modal ASR-LLM alignment training is computationally prohibitive on resource-constrained edge devices
Personalized audio input requires on-device training, which existing models trained in high-performance environments cannot support
Fine-tuning the ASR or LLM individually yields suboptimal results, yet end-to-end alignment training is too demanding for edge hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Resource-efficient cross-modal ASR-LLM alignment framework
Enables personalized on-device training for edge ASR-LLM
Achieves 50x training speedup and over 50% alignment quality improvement on NVIDIA Jetson Orin (8 GB RAM)
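A parameter-efficient cross-modal adapter of the kind the framework describes can be sketched as a small trainable projection that maps ASR encoder features into the LLM embedding space while both backbones stay frozen. The class name, dimensions, and linear form below are illustrative assumptions, not the paper's actual architecture:

```python
import random

class CrossModalAdapter:
    """Toy linear adapter: projects an ASR feature vector (dim d_asr)
    into the LLM embedding space (dim d_llm). In a parameter-efficient
    setup, only these few weights would be trained on-device; the ASR
    encoder and LLM remain frozen."""

    def __init__(self, d_asr, d_llm, seed=0):
        rng = random.Random(seed)
        # Weight matrix (d_llm x d_asr) and bias (d_llm).
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(d_asr)]
                  for _ in range(d_llm)]
        self.b = [0.0] * d_llm

    def num_params(self):
        return len(self.w) * len(self.w[0]) + len(self.b)

    def __call__(self, x):
        # One ASR feature vector -> one embedding in LLM space.
        return [sum(wi * xi for wi, xi in zip(row, x)) + bi
                for row, bi in zip(self.w, self.b)]

adapter = CrossModalAdapter(d_asr=4, d_llm=6)
emb = adapter([0.1, 0.2, 0.3, 0.4])
```

Because the trainable parameter count scales with d_asr * d_llm rather than with either backbone's size, this is the kind of component that makes personalized on-device fine-tuning tractable.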
๐Ÿ”Ž Similar Papers
No similar papers found.