🤖 AI Summary
This work investigates human-AI collaborative decision-making, focusing on scenarios where humans underperform relative to large language models (LLMs) yet still contribute unique cognitive value. We propose a confidence-weighted, Bayesian-inspired ensemble framework that simplifies Bayesian fusion into a scalable logistic regression form. Our method explicitly models and empirically validates two necessary conditions for synergistic gain: calibration and diversity. Evaluated on a neuroscience prediction task, the human-LLM ensemble significantly outperforms either agent in isolation (mean accuracy increase significant at *p* < 0.01), with the improvement holding up across multiple statistical tests. Our core contributions are threefold: (i) a systematic identification of theoretical constraints governing effective human-LLM collaboration; (ii) a lightweight, interpretable, and deployable integration paradigm; and (iii) empirical confirmation that calibrated, diverse human and LLM judgments, when fused via confidence-weighted aggregation, yield consistent, statistically reliable improvements in collective accuracy.
📝 Abstract
Large language models (LLMs) have emerged as powerful tools in various domains. Recent studies have shown that LLMs can surpass humans in certain tasks, such as predicting the outcomes of neuroscience studies. What role does this leave for humans in the overall decision process? One possibility is that humans, despite performing worse than LLMs, can still add value when teamed with them. A human and machine team can surpass each individual teammate when team members' confidence is well-calibrated and team members diverge in which tasks they find difficult (i.e., calibration and diversity are needed). We simplified and extended a Bayesian approach to combining judgments using a logistic regression framework that integrates confidence-weighted judgments for any number of team members. Using this straightforward method, we demonstrated in a neuroscience forecasting task that, even when humans were inferior to LLMs, their combination with one or more LLMs consistently improved team performance. Our hope is that this simple and effective strategy for integrating the judgments of humans and machines will lead to productive collaborations.
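To make the fusion idea concrete, here is a minimal, self-contained sketch of confidence-weighted aggregation in log-odds space on a simulated binary forecasting task. This is an illustration of the general technique the abstract describes, not the paper's actual pipeline: the signal model, the discriminability values (`0.40` for the "human", `0.60` for the "LLM"), and the equal-weight fusion are all illustrative assumptions; the paper's logistic-regression version would instead fit one weight per team member.

```python
import math
import random

def logit(p):
    """Log-odds of a probability, clipped for numerical safety."""
    p = min(max(p, 1e-9), 1 - 1e-9)
    return math.log(p / (1 - p))

def fuse(probs):
    """Confidence-weighted fusion: sum the members' calibrated
    log-odds and map back to a probability. Equal weights here;
    a fitted logistic regression would learn one weight per member."""
    z = sum(logit(p) for p in probs)
    return 1.0 / (1.0 + math.exp(-z))

# --- Toy simulation (illustrative, not the paper's data) ---
# Binary outcome y in {+1, -1}; each judge sees an independent noisy
# signal x = d*y + N(0, 1) and reports the calibrated posterior
# P(y = +1 | x) = sigmoid(2*d*x). Independent noise supplies the
# diversity condition; the posterior form supplies calibration.
random.seed(42)
N = 20_000

def judge(y, d):
    x = d * y + random.gauss(0.0, 1.0)
    return 1.0 / (1.0 + math.exp(-2.0 * d * x))

truth = [random.choice((1, -1)) for _ in range(N)]
human = [judge(y, 0.40) for y in truth]   # weaker judge
llm   = [judge(y, 0.60) for y in truth]   # stronger judge
team  = [fuse((h, m)) for h, m in zip(human, llm)]

def accuracy(probs):
    return sum((p > 0.5) == (y > 0) for p, y in zip(probs, truth)) / N

print(f"human: {accuracy(human):.3f}")
print(f"llm:   {accuracy(llm):.3f}")
print(f"team:  {accuracy(team):.3f}")
```

Because both judges are calibrated and their errors are independent, the team's fused accuracy exceeds the stronger member's alone, mirroring the paper's claim that a weaker human can still add value when the calibration and diversity conditions hold.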