Gengyuan Zhang

Google Scholar ID: LN2tYr0AAAAJ

LMU Munich, MCML

Multimodal Learning, Video Understanding, Vision-Language Models

Citations & Impact

All-time citations: 377

Contact

Email: gengyuanmax@gmail.com

Publications

7 items

ReEXplore: Improving MLLMs for Embodied Exploration with Contextualized Retrospective Experience Replay

2025

AViLA: Asynchronous Vision-Language Agent for Streaming Multimodal Data Interaction

2025

My Answer Is NOT 'Fair': Mitigating Social Bias in Vision-Language Models via Fair and Biased Residuals

2025

Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs

2025

Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries

2024

FedBiP: Heterogeneous One-Shot Federated Learning with Personalized Latent Diffusion Models

arXiv.org · 2024

Multimodal Pragmatic Jailbreak on Text-to-image Models

arXiv.org · 2024

Resume (English only)

Academic Achievements

One paper accepted at the ICLR 2025 workshop on world models; two papers accepted at CVPR 2025; one paper accepted at WACV 2025; a new arXiv preprint, 'Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs'.

Research Experience

Starting an internship at Amazon London; previous research involved video understanding and multimodal queries.

Education

Bachelor's degree (2018) from Zhejiang University, China; Master's degree (2021) from the Technical University of Munich, Germany; currently pursuing a PhD at Ludwig-Maximilians-Universität München (LMU Munich), supervised by Prof. Volker Tresp.

Background

Research interests include Video Understanding and Multimodal Reasoning, at the intersection of Computer Vision and Natural Language Processing. Originally from Hunan, China.

Miscellany

Hobbies include plants, Crusader Kings III, traveling, and cooking; has a cute dachshund; open to collaborations and full-time job opportunities.

Co-authors

3 total

Volker Tresp

Ludwig-Maximilians-Universität München (LMU Munich)

Jindong Gu

Google Research & DeepMind, University of Oxford

Philip Torr

Professor, University of Oxford