Decision-oriented Text Evaluation

📅 2025-07-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing intrinsic NLG evaluation metrics, such as n-gram overlap and sentence fluency, correlate weakly with real-world decision outcomes in high-stakes domains. Method: The paper introduces the first decision-oriented text evaluation framework, instantiated for market-briefing generation in finance. It uses the trading performance of human investors and autonomous LLM agents as the primary evaluation criterion, moving beyond surface textual features. Experiments use objective morning summaries and subjective closing-bell commentaries, quantifying how generated texts influence human–AI collaborative decision-making in live trading tasks. Contribution/Results: When relying solely on summary texts, neither humans nor LLMs significantly outperform random baselines; however, texts with analytical depth substantially improve joint human–LLM decision accuracy and financial returns. These findings empirically validate the practical utility of decision-utility-driven evaluation.
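The core idea can be sketched minimally: score a generated text not by surface overlap but by the return of trades made solely from it, compared against a random-decision baseline. The function names, the +1/0/-1 signal encoding, and the sample numbers below are illustrative assumptions, not the paper's actual implementation.

```python
import random
import statistics

def decision_utility(decisions, realized_returns):
    """Mean per-period return from acting on text-derived decisions.

    decisions: list of +1 (long), 0 (hold), -1 (short), one per period.
    realized_returns: next-period market returns, same length.
    """
    return statistics.mean(d * r for d, r in zip(decisions, realized_returns))

def random_baseline(realized_returns, trials=1000, seed=0):
    """Average utility of uniformly random long/hold/short decisions."""
    rng = random.Random(seed)
    utilities = []
    for _ in range(trials):
        decisions = [rng.choice((-1, 0, 1)) for _ in realized_returns]
        utilities.append(decision_utility(decisions, realized_returns))
    return statistics.mean(utilities)

# Hypothetical example: decisions parsed from an agent that read only the text.
returns = [0.01, -0.02, 0.005, 0.015, -0.01]
llm_decisions = [1, -1, 1, 1, -1]
print(decision_utility(llm_decisions, returns))  # should sit above the baseline
print(random_baseline(returns))                  # expected near zero
```

A text then "scores well" exactly when decisions informed by it beat the random baseline, which is the evaluation criterion the paper argues intrinsic metrics fail to capture.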

📝 Abstract
Natural language generation (NLG) is increasingly deployed in high-stakes domains, yet common intrinsic evaluation methods, such as n-gram overlap or sentence plausibility, weakly correlate with actual decision-making efficacy. We propose a decision-oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes. Using market digest texts--including objective morning summaries and subjective closing-bell analyses--as test cases, we assess decision quality based on the financial performance of trades executed by human investors and autonomous LLM agents informed exclusively by these texts. Our findings reveal that neither humans nor LLM agents consistently surpass random performance when relying solely on summaries. However, richer analytical commentaries enable collaborative human-LLM teams to outperform individual human or agent baselines significantly. Our approach underscores the importance of evaluating generated text by its ability to facilitate synergistic decision-making between humans and LLMs, highlighting critical limitations of traditional intrinsic metrics.
Problem

Research questions and friction points this paper is trying to address.

Evaluating NLG text impact on human and LLM decisions
Assessing decision quality using financial trade outcomes
Highlighting limitations of traditional intrinsic evaluation metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decision-oriented framework for text evaluation
Human-LLM collaborative decision-making assessment
Financial performance-based trade outcome measurement
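One simple way the human–LLM collaboration above could be operationalized, offered here as an illustrative assumption rather than the paper's actual protocol, is a consensus rule: trade only when the human and the agent agree, and hold otherwise.

```python
def collaborative_decisions(human, model):
    """Combine two signal streams by consensus: trade only on agreement.

    Signals are +1 (long), 0 (hold), -1 (short); disagreement maps to hold.
    """
    return [h if h == m else 0 for h, m in zip(human, model)]

# Hypothetical signals for five trading periods.
human_signals = [1, 1, -1, 0, -1]
llm_signals   = [1, -1, -1, 1, -1]
print(collaborative_decisions(human_signals, llm_signals))  # [1, 0, -1, 0, -1]
```

Under a rule like this, a text's evaluation score reflects how often it leads both parties to the same profitable call, which matches the paper's finding that analytically rich commentary helps joint teams most.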