Zhiding Yu

Google Scholar ID: 1VI_oYUAAAAJ

Principal Research Scientist & Research Lead, NVIDIA Research

Computer Vision, Deep Learning

Citations & Impact

All-time

Citations

24,839

H-index

i10-index

Publications

Co-authors

list available

Contact

CV, Twitter, GitHub, LinkedIn

Publications

33 items

Vesta: A Generalist Embodied Reasoning Model

2026

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

2026

ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents

2026

Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

2026

Stateful Token Reduction for Long-Video Hybrid VLMs

2026

PhyCritic: Multimodal Critic Models for Physical AI

2026

Nemotron ColEmbed V2: Top-Performing Late Interaction Embedding Models for Visual Document Retrieval

2026

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

2026

Resume

Academic Achievements

Winner, CVPR 2024 Challenge on End-to-End Driving at Scale (Hydra-MDP).
2nd Place, CVPR 2024 Challenge on Driving with Language.
Winner, CVPR 2023 Challenge on 3D Occupancy Prediction (FB-BEV/FB-OCC).
Winner, ECCV 2022 Robust Vision Challenge (RVC) on Semantic Segmentation.
Winner, CVPR 2018 Autonomous Driving Challenge (WAD) on Domain Adaptation.
2nd Place, ICMI 2015 EmotiW Challenge on Static Facial Expression Recognition.
Best Paper Award, BMVC 2020.
Best Paper Award, WACV 2015.
Best Student Paper Award, ISCSLP 2014.
Most Influential NeurIPS Paper Award (SegFormer).
Numerous publications listed on Google Scholar.

Background

Principal Research Scientist & Research Lead at the Learning & Perception Research Group, NVIDIA Research.
Interested in building general autonomy and intelligence across virtual and physical domains.
Recent focus includes Vision Transformers, LLMs, multimodal LLMs, and vision-language-action (VLA) models.
Applications span open-world understanding, reasoning, AV/robot perception-planning, and agentic systems.
His work is characterized by state-of-the-art performance, scalable architectures, and data-centric strategies for real-world generalization.

Co-authors

15 total