🤖 AI Summary
This work investigates how to leverage user-edit data—comprising contextual prompts, model responses, and subsequent user revisions—to efficiently fine-tune large language models for enhanced personalization and adaptability. The authors propose a unified framework that, for the first time, formally models the three types of implicit feedback embedded in user edits: preference, supervision, and reward signals. Through theoretical analysis, they reveal inherent trade-offs among the corresponding learning algorithms and devise an integrated strategy to effectively fuse these heterogeneous signals. The analysis yields generalization error bounds that inform the fine-tuning process. Experimental results on two benchmark domains demonstrate that the proposed approach significantly outperforms fine-tuning strategies based on any single feedback type and exhibits strong robustness and adaptability across diverse user-edit distributions.
📝 Abstract
We study how to fine-tune LLMs using user-edit deployment data, where each example consists of a context, an agent's response, and the user's edits to that response. This deployment data is naturally generated by users in applications such as LLM-based writing assistants and coding agents. The _natural_ origin of user edits makes them a desirable source for adapting and personalizing LLMs. In this setup, there emerges a unification of feedback types, namely preferences, supervised labels, and costs, that are typically studied separately in the literature. In this paper, we initiate the theoretical investigation of learning from user edits. We first derive bounds for learning algorithms that learn from each of these feedback types. We prove that these algorithms have different trade-offs depending on the user, the data distribution, and the model class. We then propose a simple ensembling procedure to jointly learn from these feedback types. On two domains adapted from Gao et al. (2024), we show that our ensembling procedure outperforms methods that learn from any individual feedback type. Further, we show that our proposed procedure can robustly adapt to different user-edit distributions at test time.
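To make the unification concrete, the sketch below shows one plausible way a single user-edit record could be decomposed into the three feedback types the abstract names. This is an illustrative assumption, not the paper's actual construction: the `EditRecord` class, the three `to_*` helpers, and the use of Levenshtein distance as the cost signal are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (stand-in cost signal)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


@dataclass
class EditRecord:
    """One deployment example: context, agent response, user's revision."""
    context: str
    response: str
    user_edit: str


def to_preference(rec: EditRecord) -> tuple:
    # Preference signal: the edited text is implicitly preferred
    # over the agent's original response (chosen, rejected).
    return (rec.context, rec.user_edit, rec.response)


def to_supervised(rec: EditRecord) -> tuple:
    # Supervision signal: treat the edited text as a gold label.
    return (rec.context, rec.user_edit)


def to_cost(rec: EditRecord) -> tuple:
    # Cost signal: edit effort, here proxied by string edit distance.
    return (rec.context, rec.response, edit_distance(rec.response, rec.user_edit))
```

An ensembling procedure in the spirit of the abstract would then train on all three views of the same records rather than on any single one.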