Published several papers, including 'Visual Perception by Large Language Model's Weights' (NeurIPS, 2024), 'Multi-Modal Generative Embedding Model' (arXiv), 'Stare at What You See: Masked Image Modeling without Reconstruction' (CVPR, 2023), and 'CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment' (ICLR, 2023), among others.
Research Experience
Worked as a researcher at ByteDance Research. Prior to that, he held research positions at NUS, Tencent WeChat, Shanghai AI Lab, and Microsoft Research Asia (MSRA). He was a main contributor to WeCLIP, a powerful multi-modal foundation model powering various WeChat applications, and also contributed to PixelDance, a video generation model.
Education
Ph.D. from the University of Science and Technology of China (USTC), advised by Jiebo Luo and Houqiang Li; B.S. from the School of the Gifted Young at USTC.
Background
Research interests include Multi-Modal Learning, Computer Vision, and Machine Learning. Much of his research focuses on Vision-and-Language Pre-training.
Miscellany
His personal website provides more details about his projects and contact information.