AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model, IEEE Transactions on Multimedia (TMM), 2024
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations, IEEE/CVF International Conference on Computer Vision (ICCV), 2025
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens, Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL Findings), 2025
Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language, AAAI Conference on Artificial Intelligence (AAAI), 2025
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing, Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP Findings), 2024
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation, ACM International Conference on Multimedia (ACM MM), 2024
Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation, Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL, Oral), 2024
Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge, IEEE/CVF International Conference on Computer Vision (ICCV), 2023
Multi-Temporal Lip-Audio Memory for Visual Speech Recognition, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Background
Ph.D. candidate at the KAIST Integrated Vision & Language Lab. Research interests include audio-knowledge-empowered visual speech recognition and zero-shot audio-visual speech recognition, among related topics.
Miscellany
Contact: sedne246@kaist.ac.kr. Google Scholar and LinkedIn profiles are available on request.