Xiaohan Wang

Google Scholar ID: iGA10XoAAAAJ

Stanford University

Computer Vision, Video Understanding, Large Multimodal Models

Citations & Impact

All-time citations: 2,569

Publications

7 items

- AutoMem: Automated Learning of Memory as a Cognitive Skill (2026)
- Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents (2026)
- V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think (2026)
- GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing (2026)
- Fine-tuning MLLMs Without Forgetting Is Easier Than You Think (2026)
- Tool Verification for Test-Time Reinforcement Learning (2026)
- RadDiff: Describing Differences in Radiology Image Sets with Natural Language (arXiv.org, 2026)

Resume

Academic Achievements

Publications:
- Temporal Preference Optimization for Long-Form Video Understanding (2025)
- Apollo: An Exploration of Video Understanding in Large Multimodal Models (2024)
- Video-STaR: Bootstrapping Weak Video Supervision for Visual Instruction Tuning (2025)
- Video Action Differencing (2025)
- Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration (2024)
Talks and Project Releases:
- Gave a talk at the 'What is Next in Video Understanding' workshop at CVPR 2024
- Released Temporal Preference Optimization (TPO) framework
- Released Apollo project
- VLM Classifier accepted to NeurIPS 2024
- VideoAgent accepted to ECCV 2024
- VisDiff accepted as an oral presentation at CVPR 2024
- RLCF accepted to ICLR 2024

Research Experience

Collaborated with researchers at Baidu Research and Facebook AI Research during Ph.D. studies. Currently working at Stanford University with Prof. Serena Yeung.

Education

Ph.D.: University of Technology Sydney, advised by Prof. Yi Yang; B.E.: University of Science and Technology of China.

Background

Research interests include Video Understanding, Multimodal Learning, and AI for Healthcare. Currently a Postdoc at Stanford University, affiliated with MARVL and Stanford AI Lab.
