Ziyu Guo
Google Scholar ID: S9GLetwAAAAJ
The Chinese University of Hong Kong
Research interests: Multi-modality Learning, LLM/VLMs, 3D Vision
Homepage
Google Scholar
Citations & Impact (all-time)
Citations: 3,872
H-index: 25
i10-index: 30
Publications: 20
Co-authors: 2
Contact
Email: guoziyu86@gmail.com
GitHub
LinkedIn
Publications (27 items)
- MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints (2026, 0 citations)
- GENIUS: Generative Fluid Intelligence Evaluation Suite (2026, 0 citations)
- EditThinker: Unlocking Iterative Reasoning for Any Image Editor (2025, 0 citations)
- DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation (2025, 0 citations)
- Architecture Decoupling Is Not All You Need For Unified Multimodal Model (2025, 0 citations)
- Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation (2025, 0 citations)
- Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark (2025, 0 citations)
- BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities (2025, 0 citations)
Resume
Academic Achievements
- CoT/CoF Reasoning for Visual Generation (arXiv)
- Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark (Technical Report)
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step (Under Review)
- T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT (NeurIPS 2025)
- Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO (NeurIPS 2025)
- SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems (ACL 2025)
- MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency (ICML 2025)
- MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine (ICLR 2025)
- MathVerse: Does your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (ECCV 2024)
- MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines (ICLR 2025)
- Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following (Under Review)
- Exploring the Potential of Encoder-free Architectures in 3D LMMs (Under Review)
- SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners (Technical Report)
- PointCLIP: Point Cloud Understanding by CLIP (CVPR 2022)
Research Experience
- Research Intern at Meta
- Research Intern at Amazon AWS AI Lab
- Research Intern at Roblox
- Research Intern at Tencent
- Research Intern at Shanghai AI Laboratory
Education
- Ph.D. Candidate, The Chinese University of Hong Kong, Department of Computer Science and Engineering, Supervisor: Prof. Pheng-Ann Heng
- Bachelor’s Degree, Peking University, Computer Science, Supervisor: Prof. Bin Cui
Background
- Research Interests: Multi-modal Learning, Large Language/Vision Models, and 3D Vision
- Professional Field: Computer Science and Engineering
Co-authors (2 total)
- Pheng-Ann Heng, Choh-Ming Li Professor of Computer Science and Engineering, The Chinese University of Hong Kong
- Bin Cui, Professor of Computer Science, Peking University