Publications
Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels? (arXiv 2024)
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms (ICLR 2025)
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset (ICLR 2025)
Multimodal Autoregressive Pre-training of Large Vision Encoders (CVPR 2025)
VinVL: Revisiting Visual Representations in Vision-Language Models (CVPR 2021)
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks (ECCV 2020)
Robust Navigation with Language Pretraining and Stochastic Sampling (EMNLP 2019)
End-to-End Task-Completion Neural Dialogue Systems (IJCNLP 2017)
Composite Task-Completion Dialogue System via Hierarchical Deep Reinforcement Learning (EMNLP 2017)
Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning (ACL 2018)
Research Experience
Worked for five years at Microsoft Research before joining Apple. Research experience spans dialogue systems, deep reinforcement learning, NLP, vision and language, and multimodal LLMs.
Education
PhD from UW CSE in 2024, advised by Yejin Choi.
Background
Currently a Research Scientist at Apple. Research interests include multimodal LLMs, NLP, vision and language, and video generation.