Joanna Hong

Google Scholar ID: wqvP0D8AAAAJ
Google DeepMind
Audio Processing · Speech Processing · Large Language Model · Multimodal
Citations & Impact
All-time
  • Citations: 522
  • H-index: 10
  • i10-index: 11
  • Publications: 18
  • Co-authors: 7
Academic Achievements
  • Conference Papers:
  • 1. Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation, ACL 2024 (Oral)
  • 2. Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model, EMNLP 2023
  • 3. DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding, ICCV 2023
  • 4. Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring, CVPR 2023
  • 5. Lip-to-Speech Synthesis in the Wild with Multi-task Learning, ICASSP 2023
  • 6. VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection, ECCV 2022
  • 7. Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition, Interspeech 2022 (Oral)
  • 8. SyncTalkFace: Talking Face Generation with Precise Lip-syncing via Audio-Lip Memory, AAAI 2022 (Oral)
  • 9. Lip to Speech Synthesis with Visual Context Attentional GAN, NeurIPS 2021
  • 10. Multi-Modality Associative Bridging Through Memory: Speech Sound Recollected From Face Video, ICCV 2021
Research Experience
  • 1. Research Scientist at Google DeepMind, 2025 - present, advancing speech and audio capabilities for Audio Gemini.
  • 2. Member of Technical Staff at Trillion Labs, 2024 - 2025, contributed to the development of the Trillion-7B and Tri-21B models, multilingual large language models designed for practical, real-world applications.
  • 3. Research Scientist Intern at Meta Reality Labs, 2023 - 2024, worked on robust audio-visual representation learning under missing-modality scenarios, enabling recovery of absent information when only a single modality (e.g., audio or video) is available.
Education
  • Ph.D. in Electrical Engineering from KAIST, advised by Professor Yong Man Ro in the Integrated Vision Language Lab. Her thesis focused on human speech understanding through multimodal representation learning and was recognized with the Outstanding Dissertation Award from the School of Electrical Engineering.
Background
  • Research Interests: Building robust and scalable speech and audio technologies for human-AI interaction, including speech enhancement, separation, and speaker diarization. Also interested in multimodal learning that integrates audio, visual, and textual modalities to improve machine understanding.