Published several papers, including "MMComposition: Revisiting the Compositionality of Pre-trained Vision-Language Models" (arXiv 2024) and "PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3" (ICCV 2023). Also involved in multiple projects, such as developing diagnostic benchmarks to assess the capabilities of MLLMs and designing new MLLMs with enhanced competencies.
Research Experience
Currently a research scientist at the MIT-IBM Watson AI Lab. Conducted PhD research under the guidance of Prof. Jiebo Luo at the University of Rochester.
Education
PhD from the University of Rochester, advised by Prof. Jiebo Luo (Fellow of ACM/AAAI/IEEE/NAI/AIMBE/IAPR/SPIE); Master's degree from Peking University; Bachelor's degree from South China University of Technology.
Background
Research Interests: Generative AI, particularly Multimodal LLMs (MLLMs) and Pre-trained Language Models (PLMs), with a focus on addressing core limitations such as compositionality, fine-grained visual perception, robustness, and reasoning.
Miscellany
Contact: hhua2 [A-T] cs.rochester [D-O-T] edu. Find more information on Google Scholar, GitHub, and LinkedIn.