Asia-Pacific Signal and Information Processing Association Annual Summit and Conference · 2025
Resume
Academic Achievements
ICLR 2025: Audio Large Language Models Can Be Descriptive Speech Quality Evaluators
ICLR 2025: GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling
Preprint: Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
Preprint: Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback
ICASSP 2025: SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis
NeurIPS 2024: Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models
ACL 2024 (Oral): GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators
ACL 2024: Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models
ACL 2024: Overcoming Catastrophic Forgetting by Exemplar Selection in Task-oriented Dialogue System
ICLR 2024 (Spotlight, Top 5%): Large Language Models are Efficient Learners of Noise-Robust Speech Recognition
ICLR 2024: It’s Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition
NeurIPS 2023: HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models
AAAI 2024: Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-modal Speech Representation
ICASSP 2024: Cross-Modality and Within-Modality Regularization for Audio-Visual DeepFake Detection
ICASSP 2024: An Experimental Comparison of Noise-Robust Text-To-Speech Synthesis Systems Based On Self-Supervised Representation
ACL 2023 (Oral): Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition
Research Experience
Final-year Ph.D. student at the School of Computer Science and Engineering, Nanyang Technological University, working on speech processing and multimodal learning.
Background
Research interests include full-duplex spoken dialogue systems; text-to-speech synthesis (RLHF, streaming); generative sequence-to-sequence learning for speech recognition, translation, and enhancement; efficient adaptation of foundation models; and multimodal learning, including video-to-audio generation and audio-visual understanding.