REACT-LLM: A Benchmark for Evaluating LLM Integration with Causal Features in Clinical Prognostic Tasks

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the efficacy of integrating large language models (LLMs) with causal feature learning for clinical prognosis prediction. Covering seven distinct clinical outcomes across two real-world datasets, it benchmarks 15 state-of-the-art LLMs against six conventional machine learning models and three causal discovery algorithms, establishing the first joint LLM–causal benchmark for clinical prognosis. Results show that current LLMs do not outperform traditional models, that direct injection of causal features yields only marginal gains, and that the strong assumptions underlying classical causal discovery methods are frequently violated in complex, high-dimensional clinical data. The core contribution lies in empirically revealing the untapped potential of synergistic LLM–causal reasoning, identifying critical limitations of existing integration strategies, and providing actionable evidence and principled directions for developing interpretable, robust clinical AI systems grounded in causal foundations.

📝 Abstract
Large Language Models (LLMs) and causal learning each hold strong potential for clinical decision making (CDM). However, their synergy remains poorly understood, largely due to the lack of systematic benchmarks evaluating their integration in clinical risk prediction. In real-world healthcare, identifying features with causal influence on outcomes is crucial for actionable and trustworthy predictions. While recent work highlights LLMs' emerging causal reasoning abilities, comprehensive benchmarks that assess their causal learning, and their predictive performance when informed by causal features, are still lacking for clinical risk prediction. To address this, we introduce REACT-LLM, a benchmark designed to evaluate whether combining LLMs with causal features can enhance clinical prognostic performance and potentially outperform traditional machine learning (ML) methods. Unlike existing LLM-clinical benchmarks that often focus on a limited set of outcomes, REACT-LLM evaluates 7 clinical outcomes across 2 real-world datasets, comparing 15 prominent LLMs, 6 traditional ML models, and 3 causal discovery (CD) algorithms. Our findings indicate that while LLMs perform reasonably in clinical prognostics, they have not yet outperformed traditional ML models. Integrating causal features derived from CD algorithms into LLMs offers limited performance gains, primarily because many CD methods rest on strict assumptions that are often violated in complex clinical data. While direct integration yields limited improvement, our benchmark reveals a more promising form of synergy.
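To make the "direct causal-feature injection" strategy from the abstract concrete, here is a minimal sketch (my illustration, not the authors' code): a causal discovery step selects a feature subset, which is then serialized into an LLM prompt for risk prediction. `discover_causal_features` and `llm_predict_risk` are hypothetical placeholders standing in for a real CD algorithm (e.g., a PC-style method) and an LLM API call; feature names and values are invented.

```python
# Minimal sketch of causal-feature injection into an LLM prompt.
# discover_causal_features() and llm_predict_risk() are hypothetical
# placeholders; the paper benchmarks real CD algorithms and 15 LLMs.
from typing import Dict, List

def discover_causal_features(feature_names: List[str]) -> List[str]:
    """Placeholder for a causal discovery step (e.g., a PC-style
    algorithm run on the training split) returning the features
    estimated to causally influence the outcome."""
    return ["age", "creatinine", "lactate"]  # illustrative only

def build_prompt(patient: Dict[str, float], causal: List[str], outcome: str) -> str:
    """Serialize tabular clinical features into a prompt, flagging
    the causally selected subset so the LLM can weight them."""
    lines = [f"Predict the risk of {outcome} (answer with a number in 0.0-1.0)."]
    for name, value in patient.items():
        tag = " [causal]" if name in causal else ""
        lines.append(f"- {name}: {value}{tag}")
    return "\n".join(lines)

patient = {"age": 71, "creatinine": 2.1, "lactate": 3.4, "heart_rate": 104}
causal = discover_causal_features(list(patient))
prompt = build_prompt(patient, causal, outcome="in-hospital mortality")
# risk = llm_predict_risk(prompt)  # hypothetical LLM call, one per benchmarked model
```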
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM integration with causal features for clinical risk prediction
Assessing whether LLMs combined with causal features outperform traditional ML methods
Addressing the lack of systematic benchmarks for LLM-causal learning synergy in healthcare
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates LLM integration with causal features for clinical prognosis
Compares 15 LLMs against 6 traditional ML models and 3 causal discovery algorithms (a baseline sketch follows this list)
Tests whether causal discovery methods can enhance clinical predictions
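For the traditional-ML arm of such a comparison, a minimal scikit-learn sketch of the evaluation protocol, again my illustration rather than the paper's code: fit a conventional model on tabular features and score it with AUROC, the usual prognosis metric. The data below is a synthetic stand-in; the benchmark itself uses two real-world clinical datasets and six ML models.

```python
# Sketch of the traditional-ML baseline arm: train/test split,
# logistic regression, AUROC. Synthetic stand-in data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))  # 8 stand-in clinical features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auroc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"baseline AUROC: {auroc:.3f}")
```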
👥 Authors
Linna Wang (Sichuan University)
Zhixuan You (Sichuan University)
Qihui Zhang (Peking University)
Jiunan Wen (Sichuan University)
Ji Shi (Peking University)
Yimin Chen (City University of Hong Kong)
Yusen Wang (Sichuan University)
Fanqi Ding (Sichuan University)
Ziliang Feng (Sichuan University)
Li Lu (Sichuan University)