🤖 AI Summary
This study investigates the capability of large language models (LLMs) as autonomous collaborative writers in open-ended writing tasks. Such tasks are characterized by vast solution spaces and subjective success criteria, which pose challenges in exploratory action, human alignment, and iterative optimization.
Method: We introduce the first evaluation framework specifically designed for autonomous writing agents in open-ended tasks, systematically decoupling the intertwined effects of action diversity, human alignment, and progressive improvement. Using Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o, we conduct comparative experiments integrating behavioral trajectory analysis, expert human evaluation, and multi-round iterative rewriting protocols.
Contribution/Results: Results demonstrate that high action diversity combined with strong human alignment significantly improves the efficiency with which texts evolve: expert scores improve by 23% on average across rewriting rounds. The framework thus establishes both empirical validity and theoretical value for evaluating autonomous agents in open-domain writing.
📝 Abstract
Open-ended tasks are particularly challenging for LLMs due to the vast solution space, demanding both expansive exploration and adaptable strategies, especially when success lacks a clear, objective definition. Writing, with its vast solution space and subjective evaluation criteria, provides a compelling testbed for studying such problems. In this paper, we investigate the potential of LLMs to act as collaborative co-writers, capable of suggesting and implementing text improvements autonomously. We analyse three prominent LLMs (Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o), focusing on how their action diversity, human alignment, and iterative improvement capabilities impact overall performance. This work establishes a framework for benchmarking autonomous writing agents and, more broadly, highlights fundamental challenges and potential solutions for building systems capable of excelling in diverse open-ended domains.