Scholar

Dongchao Yang

Google Scholar ID: WNiojyAAAAAJ

Chinese University of Hong Kong

TTS, TTA, Audio Codec, Multi-modal Audio Foundation Models

Citations & Impact

Citations (all-time): 2,696

Publications

20 items

Audio Interaction Model

2026

UniSRM: A Unified Speech Reward Model for Reasoning-Based Fine-grained Assessment

2026

UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization

2026

HeartMuLa: A Family of Open Sourced Music Foundation Models

2026

Omni-AutoThink: Adaptive Multimodal Reasoning via Reinforcement Learning

2025

SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization

2025

Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs

2025

DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models

2025

Resume (English only)

Academic Achievements

1. Diffsound: Discrete Diffusion Model for Text-to-Sound Generation, IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2023.
2. InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt, IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2024.
3. UniAudio: An Audio Foundation Model Toward Universal Audio Generation, ICML, 2024.
4. UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner, NeurIPS, 2024.
5. ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling, ICML, 2025.
6. SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models, IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2025.
7. SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models, Proc. Interspeech, 2024.
8. Make-An-Audio: Text-to-Audio Generation with Prompt-Enhanced Diffusion Models, ICML, 2023.
9. A Mixed Supervised Learning Framework for Target Sound Detection, DCASE Workshop, 2022.
10. AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head, AAAI, 2024.

Research Experience

1. July 2023 - Sep. 2023, MSRA, Speech Group, Intern, Supervisor: Xu Tan.
2. May 2021 - May 2023, Tencent AI Lab, Speech Group, Intern, Supervisors: Songxiang Liu, Chao Weng, and Bo Wu.

Education

1. The Chinese University of Hong Kong, School of Electronic and Computer Engineering, PhD in progress, August 2023 - Present.
2. Peking University, School of Computer Engineering and Science, Master, August 2020 - August 2023.
3. Shanghai University, August 2016 - July 2020.

Background

Research interests include audio foundation models, generative models, large language models, and audio/speech processing. Currently a PhD student at The Chinese University of Hong Kong, supervised by Prof. Helen Meng. Received a master's degree from Peking University in 2023.

Miscellany