MindCopilot: Towards Formalizing and Evaluating Granular Human-LLM Co-Writing

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in how proactive human-AI co-writing systems are evaluated. Existing evaluations measure only output quality and do not capture how users accept, edit, or correct AI suggestions in real time, so they reveal little about true usability. The authors formalize collaborative writing as a Human-in-the-Loop Markov Decision Process, modeling the writing workflow as a sequence of interactions shaped by user acceptance and editing decisions. Building on this formulation, they propose the Co-Writing Fidelity Suite, an interaction-aware metric suite that includes a Hierarchical Acceptance Rate and a Knowledge-aware Editing Distance to quantify user-assistant alignment and cognitive editing effort. The approach is evaluated through a large-scale simulation study covering 16 writing domains and 1,688 controlled continuation queries, followed by a user study with 30 participants. The results show that interaction structure systematically affects suggestion acceptance and editing cost, that these patterns align with real user experience, and that interaction-aware evaluation yields insights that output-quality-only metrics miss.
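This page does not define the Knowledge-aware Editing Distance. As a point of reference only, the sketch below computes a plain token-level Levenshtein distance between an assistant suggestion and the text the user finally kept, normalized to [0, 1] as a crude proxy for editing effort. It is a swapped-in baseline, not the authors' metric, and its knowledge-aware weighting is not reproduced here; the function names `token_edit_distance` and `normalized_edit_burden` are hypothetical.

```python
def token_edit_distance(suggestion: str, final: str) -> int:
    """Levenshtein distance over whitespace tokens (insert/delete/substitute cost 1).

    Hypothetical baseline, not the paper's Knowledge-aware Editing Distance.
    """
    a, b = suggestion.split(), final.split()
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, tok_a in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr[j] = min(
                prev[j] + 1,         # delete tok_a
                curr[j - 1] + 1,     # insert tok_b
                prev[j - 1] + cost,  # substitute or match
            )
        prev = curr
    return prev[len(b)]


def normalized_edit_burden(suggestion: str, final: str) -> float:
    """Edit distance divided by the longer token sequence, giving a value in [0, 1]."""
    longest = max(len(suggestion.split()), len(final.split()))
    return 0.0 if longest == 0 else token_edit_distance(suggestion, final) / longest


if __name__ == "__main__":
    # The user inserted one word into a four-token suggestion.
    print(normalized_edit_burden("the model improves recall",
                                 "the model greatly improves recall"))  # 0.2
```

A purely surface-level distance treats every token change the same; the paper's metric name suggests it distinguishes edits by their knowledge content, which this baseline deliberately does not attempt.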

📝 Abstract

Recent writing assistants are increasingly shifting from passive, prompt-driven interaction to proactive, suggestion-based completion, which integrates localized continuations into the writing flow and reduces coordination burden. However, existing evaluations focus only on output quality, failing to capture how users accept, edit, or repair suggestions in real-time interaction, and thus obscuring the true usability of proactive co-writing systems. To address this gap, we adopt a sequential, behavior-centered view of interactive writing and formalize co-writing as a Human-in-the-Loop Markov Decision Process, modeling writing as an interaction shaped by user acceptance and editing decisions. Based on this formulation, we introduce the Co-Writing Fidelity Suite, an interaction-aware metric suite that captures both user-assistant alignment and cognitive editing effort, including Hierarchical Acceptance Rate and Knowledge-aware Editing Distance. We conduct a large-scale simulation study across 16 writing domains, using 1,688 controlled continuation queries sampled from different writing stages. Our analysis reveals systematic effects of interaction structure on acceptance behavior and editing cost. A follow-up user study with 30 participants confirms that these behavioral patterns align with real user experience. Together, our findings demonstrate that interaction-aware evaluation provides insights beyond output-only metrics and informs the design of more effective proactive writing assistants.
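The human-in-the-loop view described in the abstract can be sketched as a simple episode loop, assuming the state is the draft written so far, the assistant's action is a proposed continuation, and the user's response is one of accept, edit, or reject, which determines the next state. All names (`Episode`, `step`, `acceptance_rates`, `scripted_user`) are hypothetical; rewards and learned policies are omitted, and the strict/lenient rates are only a two-level illustration of measuring acceptance at different granularities, not the paper's Hierarchical Acceptance Rate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical decision labels for a single suggestion.
ACCEPT, EDIT, REJECT = "accept", "edit", "reject"


@dataclass
class Episode:
    draft: str = ""  # state s_t: the text written so far
    log: List[Tuple[str, str]] = field(default_factory=list)  # (suggestion, decision)


def step(ep: Episode,
         suggest: Callable[[str], str],
         user: Callable[[str, str], Tuple[str, str]]) -> None:
    """One transition: assistant proposes, user decides, state updates."""
    suggestion = suggest(ep.draft)                    # assistant action given s_t
    decision, kept_text = user(ep.draft, suggestion)  # user response to the action
    ep.draft += kept_text                             # s_{t+1} depends on the decision
    ep.log.append((suggestion, decision))


def acceptance_rates(log: List[Tuple[str, str]]) -> Dict[str, float]:
    """Strict counts verbatim accepts; lenient also counts edited suggestions."""
    n = len(log)
    if n == 0:
        return {"strict": 0.0, "lenient": 0.0}
    strict = sum(d == ACCEPT for _, d in log) / n
    lenient = sum(d in (ACCEPT, EDIT) for _, d in log) / n
    return {"strict": strict, "lenient": lenient}


def scripted_user(decisions: List[str]) -> Callable[[str, str], Tuple[str, str]]:
    """Hypothetical simulated user that replays a fixed list of decisions."""
    it = iter(decisions)

    def respond(draft: str, suggestion: str) -> Tuple[str, str]:
        d = next(it)
        if d == ACCEPT:
            return d, suggestion
        if d == EDIT:
            return d, suggestion.upper()  # stand-in for a real user edit
        return d, ""  # rejected: nothing is added to the draft

    return respond


ep = Episode(draft="Intro.")
user = scripted_user([ACCEPT, EDIT, REJECT, ACCEPT])
for _ in range(4):
    step(ep, lambda draft: " more", user)

print(acceptance_rates(ep.log))  # {'strict': 0.5, 'lenient': 0.75}
print(ep.draft)                  # Intro. more MORE more
```

Counting edited suggestions separately is a design choice: the strict rate credits only suggestions kept verbatim, while the lenient rate also credits suggestions that needed repair, and the gap between the two is where an editing-cost measure becomes informative.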

Problem

Research questions and friction points this paper is trying to address.

human-LLM co-writing

interactive writing evaluation

proactive writing assistants

user acceptance

editing behavior

Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-in-the-Loop MDP

Co-Writing Fidelity Suite

Interaction-aware Evaluation