Knowledge-Informed Automatic Feature Extraction via Collaborative Large Language Model Agents

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based methods for tabular feature engineering are limited by monolithic model architectures, reliance on quantitative feedback alone, and insufficient integration of domain knowledge. To address these limitations, we propose Rogue One, a multi-agent framework comprising three specialized agents: a Scientist (generating scientific hypotheses), an Extractor (constructing features), and a Tester (evaluating generalization). The framework combines retrieval-augmented generation (RAG), decentralized agent collaboration, and a "flooding-pruning" strategy to incorporate qualitative feedback and interpretable evaluation, enabling knowledge-guided, iterative feature exploration. Because the agents both propose hypotheses and test them against data, each candidate feature is empirically validated before it is kept. Rogue One significantly outperforms state-of-the-art methods across 19 classification and 9 regression benchmarks. It also surfaces a new potential biomarker in the myocardial dataset, which illustrates how interpretable feature extraction can support domain-specific scientific discovery.

📝 Abstract
The performance of machine learning models on tabular data is critically dependent on high-quality feature engineering. While Large Language Models (LLMs) have shown promise in automating feature extraction (AutoFE), existing methods are often limited by monolithic LLM architectures, simplistic quantitative feedback, and a failure to systematically integrate external domain knowledge. This paper introduces Rogue One, a novel, LLM-based multi-agent framework for knowledge-informed automatic feature extraction. Rogue One operationalizes a decentralized system of three specialized agents (Scientist, Extractor, and Tester) that collaborate iteratively to discover, generate, and validate predictive features. Crucially, the framework moves beyond primitive accuracy scores by introducing a rich, qualitative feedback mechanism and a "flooding-pruning" strategy, allowing it to dynamically balance feature exploration and exploitation. By actively incorporating external knowledge via an integrated retrieval-augmented generation (RAG) system, Rogue One generates features that are not only statistically powerful but also semantically meaningful and interpretable. We demonstrate that Rogue One significantly outperforms state-of-the-art methods on a comprehensive suite of 19 classification and 9 regression datasets. Furthermore, we show qualitatively that the system surfaces novel, testable hypotheses, such as identifying a new potential biomarker in the myocardial dataset, underscoring its utility as a tool for scientific discovery.
Problem

Research questions and friction points this paper is trying to address.

Automating feature extraction for tabular data using a multi-agent LLM framework
Integrating domain knowledge via RAG to create interpretable predictive features
Overcoming limitations of monolithic LLM architectures through collaborative agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework with three specialized LLM agents
Flooding-pruning strategy balancing feature exploration and exploitation
Retrieval-augmented system integrating external domain knowledge
Henrik Brådland
School of Computing and Information, University of Pittsburgh, Pittsburgh, USA
Morten Goodwin
Professor, Centre for Artificial Intelligence Research, University of Agder
machine learning, deep learning, neural networks, swarm intelligence
V. I. Zadorozhny
School of Computing and Information, University of Pittsburgh, Pittsburgh, USA
Per-Arne Andersen
Associate Professor at University of Agder
Reinforcement Learning, Machine Learning, Deep Learning, Cybersecurity, Tsetlin Machine