1. DeepSeek-VL: Towards Real-World Vision-Language Understanding
2. WenLan (悟道文澜): Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
3. VDT: General-purpose Video Diffusion Transformers via Mask Modeling
4. UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-Modal Modeling
5. LGDN: Language-Guided Denoising Network for Video-Language Modeling
6. BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling
7. Towards Artificial General Intelligence via a Multimodal Foundation Model
8. COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
9. Learning Versatile Neural Architectures by Propagating Network Codes
10. Compressed Video Contrastive Learning
Research Experience
Collaborates closely with Dr. Mingyu Ding (UC Berkeley) and Prof. Bo Zhang (ZJU) on research projects.
Education
Received a B.E. degree in Computer Science from Renmin University of China in 2021; currently pursuing a Ph.D. at Renmin University of China, advised by Prof. Zhiwu Lu.
Background
Research interests: multimodal foundation models and video understanding. Currently a Ph.D. student at Renmin University of China, advised by Prof. Zhiwu Lu.