Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of existing financial recommendation benchmarks: they treat observed user behavior as ground truth, conflating behavioral mimicry with decision quality while neglecting long-term objectives and individual risk preferences. The authors propose the first financial recommendation benchmark that integrates conversational interaction with longitudinal market context, requiring models to generate stock rankings over fixed investment horizons from user interviews, market dynamics, and advisory dialogues. The framework distinguishes descriptive behavior from normative utility by introducing multi-perspective reference standards that disentangle noisy user actions from rational, risk-aware preferences. Built on real market data and human decision trajectories, the dataset and controllable dialogue pipeline reveal that high-utility models often diverge from actual user choices, whereas behaviorally aligned models tend to overfit short-term noise. Code and data are publicly released.

📝 Abstract
Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.
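The multi-view diagnosis the abstract describes can be illustrated with a minimal sketch: score a model's stock ranking against both a behavioral reference (what the user actually chose) and a utility reference (a risk-aware normative ranking) via Kendall tau rank correlation. The tickers, rankings, and reference construction below are invented for illustration and are not taken from the benchmark's data.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall tau correlation between two rankings of the same items, best first.

    Assumes no ties: returns (concordant - discordant) / total pairs, in [-1, 1].
    """
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant if both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical tickers and references (illustrative only).
model_rank   = ["AAA", "BBB", "CCC", "DDD", "EEE"]
behavior_ref = ["BBB", "AAA", "EEE", "CCC", "DDD"]  # descriptive: user's actual choices
utility_ref  = ["AAA", "CCC", "BBB", "DDD", "EEE"]  # normative: risk-aware preferences

tau_behavior = kendall_tau(model_rank, behavior_ref)
tau_utility  = kendall_tau(model_rank, utility_ref)
# A model can correlate strongly with one reference and weakly with the other,
# which is the imitation-versus-utility tension the benchmark is built to expose.
```

Reporting both correlations per model, rather than a single accuracy-style score against user choices, is what lets the benchmark separate rational analysis from behavioral mimicry.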
Problem

Research questions and friction points this paper is trying to address.

financial recommendation
behavioral imitation
utility-grounded evaluation
longitudinal benchmark
LLM decision quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

utility-grounded recommendation
conversational benchmark
longitudinal evaluation
financial LLMs
multi-view reference