HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

📅 2026-04-21
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing evaluation methods struggle to assess the multidimensional capabilities of large language models (LLMs) in open-ended Chinese writing at the thousand-character scale, and both traditional metrics and LLM-as-a-judge approaches exhibit significant biases. To address this, the work proposes the Tree-of-Writing (ToW) framework, which evaluates long-form writing quality by explicitly modeling the hierarchical structure of sub-features and their dynamic weights. Building on ToW, the authors construct HoWToBench, a large-scale Chinese writing benchmark comprising 12 genres and 1,302 instructions. The approach is validated through human annotation, Pearson correlation analysis, and robustness testing. Experimental results show that ToW achieves a Pearson correlation of 0.93 with human judgments, substantially outperforming existing methods, and is more robust to textual perturbations.

📝 Abstract
Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLM performance on open-ended writing at the thousand-word level is inadequately assessed by both traditional reference-based metrics and modern LLM-as-a-judge methods. We propose Tree-of-Writing (ToW) to resolve the implicit inconsistency that often arises when an LLM-as-a-judge aggregates all sub-features in text evaluation. ToW uses a tree-structured workflow that explicitly models the aggregation weights of sub-features. We also present HoWToBench, a large-scale Chinese writing benchmark encompassing 12 genres and 1,302 instructions across three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW mitigates these biases, achieving a 0.93 Pearson correlation with human judgments. Furthermore, we find that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual disturbances, while ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the Guide task, indicating that these scores cannot be improved simply by piling more information into the input.
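
The abstract does not spell out ToW's procedure, so the following is only a minimal sketch of the general idea it names: sub-feature scores are combined bottom-up through a tree using explicit weights, and the resulting scores are compared with human ratings via Pearson correlation. The `Node` class, the feature names, the weights, and the scores are all hypothetical illustrations, not taken from the paper. The weights are treated here as given inputs; how ToW actually derives them is not covered.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List, Optional


@dataclass
class Node:
    """One feature in a hypothetical writing-quality tree."""
    name: str
    weight: float = 1.0            # relative weight among siblings (assumed)
    score: Optional[float] = None  # leaf score, e.g. assigned by a judge model
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> float:
    """Return a leaf's score, or the weighted mean of its children's scores."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name!r} has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"node {node.name!r} has non-positive total weight")
    weighted = sum(child.weight * aggregate(child) for child in node.children)
    return weighted / total_weight


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical two-level tree; names, weights, and scores are made up.
tree = Node("overall", children=[
    Node("content", weight=3, children=[
        Node("relevance", weight=1, score=8.0),
        Node("depth", weight=1, score=6.0),
    ]),
    Node("language", weight=2, children=[
        Node("fluency", weight=3, score=9.0),
        Node("style", weight=1, score=5.0),
    ]),
])

overall = aggregate(tree)

# Made-up automatic and human scores for four texts, only to show the call.
auto_scores = [7.4, 5.0, 8.1, 6.2]
human_scores = [7.0, 5.5, 8.5, 6.0]
agreement = pearson(auto_scores, human_scores)
```

Making the weights explicit at each node means the final score can be traced back to individual sub-features, which is the property the abstract contrasts with a single judge prompt that aggregates everything implicitly.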
Problem

Research questions and friction points this paper is trying to address.

LLM evaluation
writing capability
holistic assessment
text generation metrics
human-level writing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-of-Writing
LLM evaluation
writing benchmark
human-level writing
robustness to textual perturbations