Scholar

Minsu Kim

Google Scholar ID: TXB0FyoAAAAJ

Google DeepMind

Multimodal Learning, Audio-Visual Speech Processing, Generative AI

Citations & Impact

Citations (all-time): 929

Resume (English only)

Academic Achievements

International Journal Publications:
- TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages
- Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition
- Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation
- AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
- CroMM-VSR: Cross-Modal Memory Augmented Visual Speech Recognition
- Speech Reconstruction with Reminiscent Sound via Visual Voice Memory

International Conference Papers:
- MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
- Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
- Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
- Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis
- Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach
- Contextual Speech Extraction: Leveraging Textual History as an Implicit Cue for Target Speech Extraction
- Large Language Models are Strong Audio-Visual Speech Recognition Learners

Research Experience