🤖 AI Summary
This study investigates the adaptability and task-allocation potential of large language models (LLMs) in complex human-AI collaborative crowdsourcing pipelines, moving beyond atomic tasks to examine end-to-end, multi-stage computational workflows.
Method: We systematically evaluate LLM capabilities within real-world crowdsourcing pipelines ("human computation algorithms"), employing prompt engineering, multi-stage task decomposition, comparative experiments across interaction modalities (e.g., text-only vs. structured feedback), and cross-analysis of human and LLM performance on sub-tasks. To make this concrete, a minimal sketch of such a pipeline follows.
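The sketch below is illustrative only, not the authors' code: it shows the general shape of a multi-stage "human computation algorithm" in which an LLM stands in for crowdworkers on individual sub-tasks while a human handles the final stage. The `call_llm` helper and the stage prompts are hypothetical placeholders for whatever LLM client and instructions a requester would actually use.

```python
# Illustrative sketch of a multi-stage crowdsourcing pipeline with an LLM
# replacing crowdworkers on individual sub-tasks. `call_llm` is a hypothetical
# wrapper around a chat-completion API; the prompts are placeholders, not the
# prompts used in the study.

from typing import Callable, List

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API; returns the model's text reply."""
    raise NotImplementedError("Plug in your LLM client here.")

def extract_facts(document: str) -> List[str]:
    # Stage 1: information extraction, a sub-task the summary identifies
    # as a good fit for LLMs.
    reply = call_llm(f"List the key facts in this document, one per line:\n{document}")
    return [line.strip() for line in reply.splitlines() if line.strip()]

def filter_relevant(facts: List[str], question: str) -> List[str]:
    # Stage 2: preliminary filtering, done fact-by-fact so each LLM call
    # stays an atomic sub-task, mirroring the multi-stage decomposition.
    kept = []
    for fact in facts:
        verdict = call_llm(
            f"Question: {question}\nFact: {fact}\nAnswer 'yes' or 'no': is this fact relevant?"
        )
        if verdict.strip().lower().startswith("yes"):
            kept.append(fact)
    return kept

def run_pipeline(
    document: str, question: str, human_review: Callable[[List[str]], List[str]]
) -> List[str]:
    # The final stage is left to a human reviewer, reflecting the finding
    # that oversight for logical closure and accountability stays with people.
    facts = extract_facts(document)
    relevant = filter_relevant(facts, question)
    return human_review(relevant)
```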
Contribution/Results: We find that LLM sub-task success rates depend critically on the skill type required, fidelity of instruction comprehension, and interaction design. LLMs perform robustly on specific high-fit sub-tasks (e.g., information extraction, preliminary filtering), yet human oversight remains essential for logical closure, accountability, and dynamic coordination. Crucially, LLM outputs are highly sensitive to minor instruction perturbations, underscoring the need for safety-aware co-design principles and human-facing safeguards in reliable human-AI collaboration.
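The perturbation sensitivity finding suggests a simple robustness probe, sketched below under the same assumptions as above (it reuses the hypothetical `call_llm` helper; the paraphrased instructions are invented examples, not the study's prompts).

```python
# Illustrative robustness probe for instruction sensitivity: run the same
# sub-task under minor rewordings of the instruction and measure how often
# the answers agree. Reuses the hypothetical `call_llm` from the sketch above.

def instruction_agreement(instructions: list[str], item: str) -> float:
    """Fraction of paraphrased instructions that yield the majority answer."""
    answers = [call_llm(f"{inst}\n\nInput: {item}").strip().lower() for inst in instructions]
    majority = max(set(answers), key=answers.count)
    return answers.count(majority) / len(answers)

paraphrases = [
    "Label the sentiment of the input as positive or negative.",
    "Decide whether the input expresses positive or negative sentiment.",
    "Is the following text positive or negative? Answer with one word.",
]
# Agreement well below 1.0 across paraphrases would indicate the kind of
# instruction-perturbation sensitivity the study reports.
```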
📝 Abstract
LLMs have shown promise in replicating human-like behavior in crowdsourcing tasks that were previously thought to be exclusive to human abilities. However, current efforts focus mainly on simple atomic tasks. We explore whether LLMs can replicate more complex crowdsourcing pipelines. We find that modern LLMs can simulate some of crowdworkers' abilities in these "human computation algorithms," but the level of success is variable and influenced by requesters' understanding of LLM capabilities, the specific skills required for sub-tasks, and the optimal interaction modality for performing these sub-tasks. We reflect on humans' and LLMs' differing sensitivities to instructions, stress the importance of enabling human-facing safeguards for LLMs, and discuss the potential of training humans and LLMs with complementary skill sets. Crucially, we show that replicating crowdsourcing pipelines offers a valuable platform to investigate 1) the relative strengths of LLMs on different tasks (by cross-comparing their performance on sub-tasks) and 2) LLMs' potential in complex tasks, where they can complete part of the tasks while leaving others to humans.