Sep 25, 2025: Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents released on arXiv.
Aug 20, 2025: Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation accepted to EMNLP 2025!
May 20, 2025: STAR and SSR accepted to IWSLT 2025!
Sep 25, 2024: DiffNorm accepted to NeurIPS 2024!
Paper SSR: Alignment-Aware Modality Connector for Speech Language Models proposes a modality connector that segments and compresses speech features, aligning them with text representations for better modality fusion.
Paper DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation introduces a diffusion-based, self-supervised normalization strategy that simplifies target speech distributions, making them easier for non-autoregressive models to learn.
Research Experience
Recently focused on enabling multimodal agents to seamlessly reason, use tools, and interact with users and environments. Also working on LLMs that unify multimodal understanding and generation.
Education
Currently a PhD Candidate in Computer Science at Johns Hopkins University, advised by Prof. Philipp Koehn. Previously completed Bachelor's and Master's degrees in Computer Science at JHU.
Background
Research interests: machine learning and natural language processing, particularly efficient and scalable representation learning methods for cross-modal applications. Also interested in derivatives pricing and hedging strategies.