Paper 'SemProTokenizer: Streamlined Dual-Branch Speech Tokenizer for Spoken Language Models', Authors: Se Jin Park, Bella Godiva, Jeonghun Yeo, Junil Won, and Yong Man Ro, Under Review.
Paper 'Long-Form Speech Generation with Spoken Language Models', Authors: Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, and RJ Skerry-Ryan, Published at ICML 2025, Oral Presentation.
Paper 'MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens', Authors: Jeonghun Yeo, Hyeongseop Rha, Se Jin Park, and Yong Man Ro, Published at ACL Findings 2025.
Paper 'Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation', Authors: Minsu Kim, Jeonghun Yeo, Se Jin Park, Hyeongseop Rha, and Yong Man Ro, Published at ACMMM 2024.
Paper 'Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation', Authors: Se Jin Park, Chae Won Kim, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeonghun Yeo, and Yong Man Ro, Published at ACL 2024, Outstanding Paper Award.
Paper 'AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation', Authors: Jeongsoo Choi, Se Jin Park, Minsu Kim, and Yong Man Ro, Published at CVPR 2024, Highlight Presentation.
Paper 'Persona Extraction Through Semantic Similarity For Emotional Support Conversation Generation', Authors: Seunghee Han, Se Jin Park, Chae Won Kim, and Yong Man Ro, Published at ICASSP 2024.
Paper 'Exploring Phonetic Context in Lip Movement for Authentic Talking Face Generation', Authors: Se Jin Park, Minsu Kim, Jeongsoo Choi, and Yong Man Ro, Published at ICASSP 2024.
Paper 'Reprogramming Audio-driven Talking Face Synthesis into Text-driven', Authors: Jeongsoo Choi, Minsu Kim, Se Jin Park, and Yong Man Ro, Published at ICASSP 2024.
Education
Ph.D. candidate at KAIST, Integrated Vision and Language Lab (IVLLab), Advisor: Professor Yong Man Ro, Status: Final year of the Ph.D. program.
Background
A final-year Ph.D. student at KAIST, Integrated Vision and Language Lab (IVLLab), advised by Professor Yong Man Ro. Research focuses on advancing multimodal human-AI interaction by integrating audio, vision, and text on top of Large Language Models. Specific contributions span multimodal integration, unbounded generation, generation fidelity, and full-duplex behaviors for realistic and engaging human-AI dialogue.