Real-Time Word-Level Temporal Segmentation in Streaming Speech Recognition

📅 2024-10-13

🏛️ ACM Symposium on User Interface Software and Technology

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current real-time captioning systems render captions in a static format and offer no word-level control over typographic attributes (e.g., capitalization, font size). As a result, they fail to convey prosodic cues such as intonation, stress, and speaker intent, which matter for accessibility and comprehension among Deaf and hard-of-hearing people, second-language learners, and autistic users. This work proposes a word-level adaptive captioning approach for streaming ASR that changes text decorations per word in real time. A prototype application maps the loudness of each spoken word to its font size. User feedback suggests that the system helps convey the speaker's intent and offers a more engaging and accessible captioning experience.

📝 Abstract

Rich-text captions are essential for helping Deaf and hard-of-hearing (DHH) people, second-language learners, and those with autism spectrum disorder (ASD) communicate. They also preserve nuances when converting speech to text, enhancing the realism of presentation scripts and conversation or speech logs. However, current real-time captioning systems cannot alter text attributes (e.g., capitalization, size, and font) at the word level, which hinders accurate conveyance of the speaker intent expressed in the tone or intonation of speech. For example, "YOU should do this" tends to be read with "You" as the focus of the sentence, whereas "You should do THIS" tends to be read with "This" as the focus. This paper proposes a solution that changes text decorations at the word level in real time. As a prototype, we developed an application that adjusts word size based on the loudness of each spoken word. Feedback from users suggests that this system helped to convey the speaker's intent, offering a more engaging and accessible captioning experience.
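The loudness-to-size idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Word` type, the RMS-in-dBFS loudness measure, the linear mapping, and every threshold and pixel value below are assumptions chosen for illustration. It also assumes the recognizer supplies per-word start and end times.

```python
import math
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical container; times are assumed to come from the recognizer.
    text: str
    start: float  # seconds
    end: float    # seconds


def rms_dbfs(samples, floor_db=-80.0):
    """RMS level of float samples in [-1, 1], in dB relative to full scale."""
    if not samples:
        return floor_db
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms <= 0.0:
        return floor_db
    return max(floor_db, 20.0 * math.log10(rms))


def font_size_px(level_db, quiet_db=-40.0, loud_db=-10.0, min_px=16.0, max_px=48.0):
    """Linearly map a level to a font size, clamped to [min_px, max_px].

    The dB range and pixel range are illustrative assumptions.
    """
    t = (level_db - quiet_db) / (loud_db - quiet_db)
    t = min(1.0, max(0.0, t))
    return min_px + t * (max_px - min_px)


def size_words(words, samples, sample_rate):
    """Return (text, font size) pairs using the audio slice under each word."""
    sized = []
    for w in words:
        lo = max(0, int(w.start * sample_rate))
        hi = min(len(samples), int(w.end * sample_rate))
        sized.append((w.text, font_size_px(rms_dbfs(samples[lo:hi]))))
    return sized
```

A louder word yields a larger size, so in "YOU should do this" the stressed "YOU" would be drawn bigger than its neighbors. A real system would likely smooth levels or normalize per speaker, which this sketch omits.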

Problem

Research questions and friction points this paper is trying to address.

Enhancing real-time captions with word-level text attribute adjustments

Improving speaker intent conveyance through dynamic text decorations

Addressing limitations in current captioning systems for diverse user needs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time word-level text attribute adjustment

Dynamic font size based on speech loudness

Enhanced captioning for intent conveyance
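One way to picture word-level text attributes is as one styled element per word. The Python sketch below is an assumption-laden illustration rather than the paper's rendering code: it takes hypothetical (text, font size) pairs, emits an HTML `<span>` per word, and upper-cases words above an arbitrary emphasis threshold.

```python
from html import escape


def render_caption(sized_words, emphasis_px=40.0):
    """Build an HTML fragment with one <span> per word.

    sized_words: iterable of (text, font size in px) pairs.
    Words at or above emphasis_px are also upper-cased; the threshold
    is an illustrative assumption.
    """
    spans = []
    for text, px in sized_words:
        shown = text.upper() if px >= emphasis_px else text
        spans.append(f'<span style="font-size:{px:.0f}px">{escape(shown)}</span>')
    return " ".join(spans)
```

Because each word is its own element, a caption view can restyle a single word when its loudness estimate arrives without redrawing the rest of the line.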
