Knowledge-Informed Automatic Feature Extraction via Collaborative Large Language Model Agents

📅 2025-11-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing LLM-based feature engineering methods for tabular data suffer from monolithic model architectures, reliance on quantitative feedback alone, and insufficient integration of domain knowledge. To address these limitations, the authors propose Rogue One, a multi-agent framework comprising three specialized agents: a Scientist (generating scientific hypotheses), an Extractor (constructing features), and a Tester (evaluating generalization). The framework combines retrieval-augmented generation (RAG), decentralized agent collaboration, and a "flooding-pruning" strategy to incorporate qualitative feedback and interpretable evaluation, enabling knowledge-guided, iterative feature exploration. It supports both hypothesis generation and empirical validation. Empirically, Rogue One is reported to significantly outperform state-of-the-art methods across 19 classification and 9 regression benchmarks. It also surfaces a potential novel biomarker in the myocardial dataset, which illustrates how interpretable feature extraction can support domain-specific scientific discovery.

📝 Abstract

The performance of machine learning models on tabular data is critically dependent on high-quality feature engineering. While Large Language Models (LLMs) have shown promise in automating feature extraction (AutoFE), existing methods are often limited by monolithic LLM architectures, simplistic quantitative feedback, and a failure to systematically integrate external domain knowledge. This paper introduces Rogue One, a novel, LLM-based multi-agent framework for knowledge-informed automatic feature extraction. Rogue One operationalizes a decentralized system of three specialized agents (Scientist, Extractor, and Tester) that collaborate iteratively to discover, generate, and validate predictive features. Crucially, the framework moves beyond primitive accuracy scores by introducing a rich, qualitative feedback mechanism and a "flooding-pruning" strategy, allowing it to dynamically balance feature exploration and exploitation. By actively incorporating external knowledge via an integrated retrieval-augmented generation (RAG) system, Rogue One generates features that are not only statistically powerful but also semantically meaningful and interpretable. We demonstrate that Rogue One significantly outperforms state-of-the-art methods on a comprehensive suite of 19 classification and 9 regression datasets. Furthermore, we show qualitatively that the system surfaces novel, testable hypotheses, such as identifying a new potential biomarker in the myocardial dataset, underscoring its utility as a tool for scientific discovery.

Problem

Research questions and friction points this paper is trying to address.

Automating feature extraction for tabular data using a multi-agent LLM framework

Integrating domain knowledge via RAG to create interpretable predictive features

Overcoming limitations of monolithic LLM architectures through collaborative agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework with three specialized LLM agents

Flooding-pruning strategy balancing feature exploration and exploitation

Retrieval-augmented system integrating external domain knowledge
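
The flooding-pruning idea named above can be illustrated with a toy sketch. This is not the authors' implementation: the summary does not say how Rogue One generates or scores candidates, so the pairwise sum/product generator, the Pearson-correlation score, and every function name below (`flood`, `prune`, `flood_and_prune`) are assumptions chosen only to show the explore-then-exploit pattern. In the actual framework, LLM agents and qualitative feedback would presumably take the place of these simple heuristics.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def flood(columns):
    """Flooding (exploration): propose many candidates.

    Assumption: candidates are pairwise products and sums of existing columns.
    """
    candidates = dict(columns)
    for a, b in itertools.combinations(sorted(columns), 2):
        candidates[f"({a}*{b})"] = [x * y for x, y in zip(columns[a], columns[b])]
        candidates[f"({a}+{b})"] = [x + y for x, y in zip(columns[a], columns[b])]
    return candidates


def prune(candidates, target, keep):
    """Pruning (exploitation): keep the `keep` best-scoring candidates.

    Assumption: the score is the absolute Pearson correlation with the target.
    """
    ranked = sorted(
        candidates,
        key=lambda name: abs(pearson(candidates[name], target)),
        reverse=True,
    )
    return {name: candidates[name] for name in ranked[:keep]}


def flood_and_prune(columns, target, rounds=2, keep=3):
    """Alternate flooding and pruning for a fixed number of rounds."""
    features = dict(columns)
    for _ in range(rounds):
        features = prune(flood(features), target, keep)
    return features


# Synthetic data: the target is an interaction that no raw column captures alone.
rng = random.Random(0)
a = [rng.uniform(-1, 1) for _ in range(200)]
b = [rng.uniform(-1, 1) for _ in range(200)]
c = [rng.uniform(-1, 1) for _ in range(200)]
y = [ai * bi for ai, bi in zip(a, b)]

selected = flood_and_prune({"a": a, "b": b, "c": c}, y, rounds=1, keep=3)
print(sorted(selected))
```

On this synthetic data the product feature `(a*b)` survives pruning because it reproduces the target exactly, while the raw columns carry little signal individually. Raising `rounds` lets surviving features be recombined, at the cost of a larger candidate pool per round.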

🔎 Similar Papers

Retrieval-Augmented Feature Generation for Domain-Specific Classification