Won first-place awards in two tracks at IROS 2025. Two papers accepted to NeurIPS 2025. The RoomTour3D project was showcased at CVPR 2025, and the Shot2Story project was presented at ICLR 2025. The LongVLM paper received an oral presentation at ECCV 2024. Published multiple papers, including 'Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection' and 'RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation'.
Research Experience
I am currently a postdoctoral researcher at Mohamed Bin Zayed University of Artificial Intelligence. Previously, I worked closely with Heng Wang, Linjie Yang, and Xiaojie Jin on video-language projects at ByteDance Seed.
Education
I received my Ph.D. from the University of Technology Sydney, advised by Prof. Xiaojun Chang; my Master's degree from the University of Chinese Academy of Sciences (UCAS); and my Bachelor's degree, with honors, from Nankai University (NKU). I also spent two years at Monash University and was a visiting student at MMLab, SIAT, Chinese Academy of Sciences, where I worked with Prof. Yu Qiao and Prof. Yali Wang.
Background
My research interests lie at the intersection of computer vision and robotics, particularly large vision–language models, video summarization, and the analysis of hallucination in these models. My recent work spans video–language understanding, with a focus on long-video understanding; video grounding tasks such as Referring Video Object Segmentation; and vision–language navigation and manipulation for robots.