🤖 AI Summary
Existing evaluation methods struggle to comprehensively assess the multidimensional capabilities of large language models (LLMs) in open-ended Chinese writing at the thousand-character scale, and both traditional metrics and LLM-as-a-judge approaches exhibit significant biases. To address this, we propose the Tree-of-Writing (ToW) framework, which evaluates long-form writing quality in a structured way by explicitly modeling the hierarchy of sub-features and their dynamic weights. Building on ToW, we construct HoWToBench, a large-scale Chinese writing benchmark comprising 12 genres and 1,302 instructions. The approach is validated through human annotation, Pearson correlation analysis, and robustness testing. Experimental results show that ToW achieves a Pearson correlation of 0.93 with human judgments, substantially outperforming existing methods, and is more robust to textual perturbations.
📝 Abstract
Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. Neither traditional reference-based metrics nor modern LLM-as-a-judge methods adequately assess LLM performance on open-ended writing at the thousand-word scale. We propose Tree-of-Writing (ToW) to resolve the implicit inconsistency that often arises when an LLM-as-a-judge aggregates all sub-features in text evaluation. ToW uses a tree-structured workflow that explicitly models the aggregation weights of sub-features. We also present HowToBench, a large-scale Chinese writing benchmark encompassing 12 genres and 1,302 instructions across three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW mitigates these biases, achieving a 0.93 Pearson correlation with human judgments. Furthermore, we find that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual perturbations, while ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the Guide task, indicating that these scores cannot be improved simply by piling more information into the input.
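To make the idea of explicit aggregation weights concrete, here is a minimal sketch of scoring over a tree of sub-features, followed by a Pearson correlation against human ratings. It is an illustration only, not the authors' implementation: the `Node` class, the feature names (`content`, `relevance`, `depth`, `language`), the fixed weights, and the leaf scores are all assumptions. The abstract does not specify how ToW derives its weights, so they are hard-coded constants here.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List, Optional


@dataclass
class Node:
    """One node of a hypothetical evaluation tree (names are illustrative)."""
    name: str
    weight: float = 1.0             # relative weight among its siblings
    score: Optional[float] = None   # set on leaf nodes only
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> float:
    """Return a leaf's score, or the weighted mean of the children's scores."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name!r} has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    return sum(child.weight * aggregate(child) for child in node.children) / total_weight


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical tree: the weights and leaf scores below are made up.
tree = Node("overall", children=[
    Node("content", weight=0.6, children=[
        Node("relevance", weight=0.5, score=8.0),
        Node("depth", weight=0.5, score=6.0),
    ]),
    Node("language", weight=0.4, score=9.0),
])

overall = aggregate(tree)  # weighted mean of content (7.0) and language (9.0)
```

Because each weight is explicit at its own level of the tree, the contribution of any sub-feature to the final score can be traced, which is the property the abstract contrasts with a judge that aggregates all sub-features implicitly. Agreement with human judgments would then be measured by calling `pearson` on the aggregated scores and the human ratings for the same texts.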