🤖 AI Summary
This work addresses the limitations of existing large language model evaluators on open-domain, multi-dimensional scientific writing assessment, a setting that requires domain-specific knowledge and where task-specific fine-tuning is often prohibitively expensive. To overcome this, the authors propose a two-stage training framework: the model first learns to assign an overall quality score to scientific writing through preference optimization, and is then explicitly trained to reason about multiple evaluation criteria such as logical coherence, factual accuracy, and clarity. Built on an open-source large language model, the approach uses joint multi-dimensional training so that a single reward model generalizes across diverse scientific writing evaluation scenarios without task-specific fine-tuning. Experimental results show that the resulting reward model significantly outperforms existing methods across multiple tasks, exhibiting strong cross-task generalization and adaptability to unseen evaluation criteria.
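The summary names preference optimization for stage one but not the specific objective; one common instantiation is Direct Preference Optimization (DPO) over pairs of higher- and lower-quality responses. A minimal PyTorch sketch under that assumption (the function name, `beta`, and the made-up log-probabilities are illustrative, not the paper's actual setup):

```python
import torch
import torch.nn.functional as F

def dpo_preference_loss(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over (chosen, rejected) response pairs.

    Each input is a 1-D tensor of summed token log-probabilities for a
    batch of responses; beta scales the implicit reward margin.
    """
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the log-sigmoid of the margin between the preferred
    # (higher-quality) and dispreferred scientific-writing response.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Made-up log-probabilities for a batch of two preference pairs.
pc = torch.tensor([-12.3, -20.1]); rc = torch.tensor([-13.0, -20.0])
pr = torch.tensor([-14.8, -19.5]); rr = torch.tensor([-13.5, -20.2])
print(dpo_preference_loss(pc, pr, rc, rr))
```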
📝 Abstract
Scientific writing is an expert-domain task that demands deep domain knowledge, adherence to task-specific requirements, and reasoning that leverages that knowledge to satisfy those requirements. While scientific text generation has been widely studied, its evaluation remains a challenging and open problem. It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks while adhering to their distinct requirements. However, existing LLM-based judges and reward models are primarily optimized for general-purpose benchmarks with fixed scoring rubrics and evaluation criteria. Consequently, they often fail to reason over the sparse knowledge of scientific domains when interpreting task-dependent, multi-faceted criteria. Moreover, fine-tuning for each individual task is costly and impractical in low-resource settings. To bridge these gaps, we propose cost-efficient, open-source reward models tailored for scientific writing evaluation. We introduce a two-stage training framework that first optimizes for scientific evaluation preferences and then refines reasoning capabilities. Our multi-aspect evaluation design and joint training across diverse tasks enable fine-grained assessment and robustness to dynamic criteria and scoring rubrics. Experimental analysis shows that our training regime strongly improves LLM-based scientific writing evaluation. Our models generalize effectively across tasks and to previously unseen scientific writing evaluation settings, allowing a single trained evaluator to be reused without task-specific retraining.
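Robustness to dynamic criteria and rubrics implies the evaluator consumes them at inference time rather than having them baked in during training. A hypothetical sketch of how such a rubric-conditioned prompt might be assembled; the aspect names, scale, and prompt format below are assumptions for illustration, not the paper's actual protocol:

```python
# All names here (ASPECTS, the prompt format, the 1-5 scale) are
# illustrative assumptions, not the paper's actual evaluation protocol.
ASPECTS = ["logical coherence", "factual accuracy", "clarity"]

def build_eval_prompt(task_description: str, candidate_text: str,
                      aspects=ASPECTS, scale=(1, 5)) -> str:
    """Assemble a rubric-conditioned evaluation prompt so the same
    trained evaluator can be reused as criteria and rubrics change."""
    rubric = "\n".join(f"- {a}: rate from {scale[0]} to {scale[1]}"
                       for a in aspects)
    return (f"Task: {task_description}\n\n"
            f"Candidate scientific text:\n{candidate_text}\n\n"
            "Reason step by step about each criterion, then output one "
            "integer score per criterion:\n" + rubric)

# Swapping in a previously unseen criterion requires no retraining,
# only a different rubric in the prompt.
print(build_eval_prompt("Write a related-work section on LLM judges.",
                        "<candidate draft>",
                        aspects=ASPECTS + ["citation appropriateness"]))
```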