HAI-Eval: Measuring Human-AI Synergy in Collaborative Coding

📅 2025-11-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing code evaluation benchmarks, whether human-graded or LLM-based, focus on isolated problem-solving and overlook human-AI collaboration, a rapidly emerging paradigm; they therefore fail to capture the complementary interplay between human reasoning and AI execution. Method: we introduce the first unified benchmark specifically designed to evaluate human-AI collaborative programming: (1) 45 "collaboration-necessity" task templates instantiated into 450 task instances, each solvable only through bidirectional interaction between human guidance and LLM execution; (2) a standardized IDE environment and a reproducible LLM toolkit; and (3) a within-subject experimental design that ensures ecological validity. Contribution/Results: pure-LLM and pure-human success rates are merely 0.67% and 18.89%, respectively, whereas human-AI collaboration raises performance to 31.11%, empirically validating synergistic gains and revealing novel forms of joint reasoning.
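To make the reported numbers concrete: collaboration outperforms the stronger standalone baseline (unaided humans) by 12.22 percentage points and the weaker one (standalone LLMs) by over 30. A minimal sketch of that arithmetic, using only the pass rates quoted above (the snippet is illustrative and not part of the paper's toolkit):

```python
# Pass rates reported by HAI-Eval, as fractions of tasks solved.
pass_rates = {
    "llm_alone": 0.0067,    # 0.67%
    "human_alone": 0.1889,  # 18.89%
    "human_ai": 0.3111,     # 31.11%
}

# Synergy margin: how far collaboration exceeds the best standalone baseline.
best_standalone = max(pass_rates["llm_alone"], pass_rates["human_alone"])
synergy_margin = pass_rates["human_ai"] - best_standalone

print(f"best standalone: {best_standalone:.2%}")         # 18.89%
print(f"collaboration:   {pass_rates['human_ai']:.2%}")  # 31.11%
print(f"synergy margin:  {synergy_margin:.2%}")          # 12.22 percentage points
```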

📝 Abstract
LLM-powered coding agents are reshaping the development paradigm. However, existing evaluation systems, whether traditional tests for humans or benchmarks for LLMs, fail to capture this shift. They remain focused on well-defined algorithmic problems, which excludes problems whose solution depends on human-AI collaboration. Such collaborative problems not only require human reasoning to interpret complex contexts and guide solution strategies, but also demand AI efficiency for implementation. To bridge this gap, we introduce HAI-Eval, a unified benchmark designed to measure the synergy of human-AI partnership in coding. HAI-Eval's core innovation is its "Collaboration-Necessary" problem templates, which are intractable for both standalone LLMs and unaided humans, but solvable through effective collaboration. Specifically, HAI-Eval uses 45 templates to dynamically create tasks. It also provides a standardized IDE for human participants and a reproducible toolkit with 450 task instances for LLMs, ensuring an ecologically valid evaluation. We conduct a within-subject study with 45 participants and benchmark their performance against 5 state-of-the-art LLMs under 4 different levels of human intervention. Results show that standalone LLMs and unaided participants achieve poor pass rates (0.67% and 18.89%, respectively), whereas human-AI collaboration significantly improves performance to 31.11%. Our analysis reveals an emerging co-reasoning partnership. This finding challenges the traditional human-tool hierarchy by showing that strategic breakthroughs can originate from either humans or AI. HAI-Eval establishes not only a challenging benchmark for next-generation coding agents but also a grounded, scalable framework for assessing core developer competencies in the AI era. Our benchmark and interactive demo will be openly accessible.
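The abstract does not detail how 45 templates yield 450 instances; the counts imply about 10 seeded instances per template. Below is a minimal sketch, assuming a hypothetical `TaskTemplate` schema with seeded parameter slots (the names `parameter_space`, `instantiate`, and `build_benchmark` are illustrative, not the paper's actual toolkit API), of how such dynamic instantiation could stay reproducible across LLM runs and human sessions:

```python
from dataclasses import dataclass
import random

@dataclass
class TaskTemplate:
    """One of the 45 'Collaboration-Necessary' templates (hypothetical schema)."""
    template_id: str
    description: str       # problem statement with {placeholder} slots
    parameter_space: dict  # slot name -> list of candidate values

    def instantiate(self, seed: int) -> dict:
        # Fill the slots deterministically from a seed, so the same
        # instance can be replayed for every LLM and every participant.
        rng = random.Random(seed)
        params = {name: rng.choice(values)
                  for name, values in self.parameter_space.items()}
        return {"template_id": self.template_id,
                "seed": seed,
                "prompt": self.description.format(**params)}

def build_benchmark(templates, instances_per_template=10):
    # 45 templates x 10 seeds each would give the paper's 450 task instances.
    return [t.instantiate(seed)
            for t in templates
            for seed in range(instances_per_template)]
```

Per-instance seeding would keep the human IDE condition and the LLM toolkit condition on identical task content, which is presumably what makes the within-subject comparison fair.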
Problem

Research questions and friction points this paper is trying to address.

Existing evaluation systems fail to measure human-AI collaboration in coding tasks.
Current benchmarks focus on algorithmic problems, excluding collaborative problem-solving scenarios.
There is no standardized framework to assess synergy between human reasoning and AI efficiency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the HAI-Eval benchmark for measuring human-AI coding synergy.
Uses "Collaboration-Necessary" problem templates that require joint reasoning.
Provides a standardized IDE and a reproducible toolkit for evaluation.
Hanjun Luo
New York University Abu Dhabi
Trustworthy AI · Large Language Model · Text-to-Image
Chiming Ni
University of Illinois Urbana-Champaign
Jiaheng Wen
Harvard University
Zhimu Huang
New York University Abu Dhabi
Yiran Wang
University of Electronic Science and Technology of China
Bingduo Liao
Beijing University of Technology
Sylvia Chung
Zhejiang University
Yingbin Jin
The Hong Kong Polytechnic University
Xinfeng Li
Nanyang Technological University
Wenyuan Xu
Professor, IEEE Fellow, Zhejiang University, College of EE
Wireless Network Security · Embedded System Security · Analog Cyber Security · IoT Security
XiaoFeng Wang
Chair, ACM SIGSAC
AI-Centered Security · Systems Security and Privacy · Healthcare Privacy · Incentive Engineering
Hanan Salam
SMART lab @NYU Abu Dhabi / Co-founder of Women in AI
Artificial Intelligence · Human-Machine Interaction · Human-Robot Interaction