Statsformer: Validated Ensemble Learning with LLM-Derived Semantic Priors

📅 2026-01-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes Statsformer, a constrained ensemble framework that integrates semantic priors from large language models (LLMs) into a collection of linear and nonlinear base learners while mitigating hallucination risks. Existing approaches either apply LLM guidance as unvalidated heuristic rules, which leaves them exposed to hallucinations, or tie semantic information to a single fixed learner, which limits adaptability. Statsformer instead weights its base learners adaptively through cross-validation and embeds LLM-derived priors under explicit constraints. Theoretically, the framework guarantees performance no worse than any convex combination of base learners, up to statistical error, and it automatically downweights ineffective or erroneous priors. Empirically, informative LLM priors yield consistent performance gains, while misleading or uninformative priors are suppressed, which alleviates the adverse effects of hallucinations.
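
The guarantee described above has the general shape of an oracle inequality. A hedged way to write it is shown below; the symbols are assumptions introduced for illustration ($\hat f$ for the fitted ensemble, $f_1,\dots,f_K$ for the base learners in the library, $R$ for the prediction risk, $\Delta_K$ for the probability simplex, and $\varepsilon_n$ for the statistical error term). This page does not state the exact form of $\varepsilon_n$ or the conditions under which the bound holds.

```latex
R(\hat f) \;\le\; \min_{w \in \Delta_K} R\Big(\sum_{k=1}^{K} w_k f_k\Big) + \varepsilon_n,
\qquad
\Delta_K = \Big\{ w \in \mathbb{R}^K : w_k \ge 0,\ \sum_{k=1}^{K} w_k = 1 \Big\}
```

Because each vertex of $\Delta_K$ puts all weight on one learner, a bound of this form also covers every individual base learner, including those that ignore the LLM prior.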

📝 Abstract

We introduce Statsformer, a principled framework for integrating large language model (LLM)-derived knowledge into supervised statistical learning. Existing approaches are limited in adaptability and scope: they either inject LLM guidance as an unvalidated heuristic, which is sensitive to LLM hallucination, or embed semantic information within a single fixed learner. Statsformer overcomes both limitations through a guardrailed ensemble architecture. We embed LLM-derived feature priors within an ensemble of linear and nonlinear learners, adaptively calibrating their influence via cross-validation. This design yields a flexible system with an oracle-style guarantee that it performs no worse than any convex combination of its in-library base learners, up to statistical error. Empirically, informative priors yield consistent performance improvements, while uninformative or misspecified LLM guidance is automatically downweighted, mitigating the impact of hallucinations across a diverse range of prediction tasks. An open-source implementation of Statsformer is available at https://github.com/pilancilab/statsformer.
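
The idea of calibrating prior influence by cross-validation can be illustrated with a generic stacking sketch. This is not the Statsformer implementation or its API: the prior scores, the `lam / score` penalty mapping, the ridge base learners, and the Frank-Wolfe weight solver are all assumptions chosen to keep the example small and dependency-free. Three learners (no prior, a plausible prior, a deliberately wrong prior) produce out-of-fold predictions, and convex weights are then fit on those predictions.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_ridge(X, y, penalties):
    """Ridge regression with a separate penalty per feature (no intercept)."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (penalties[i] if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(A, b)

def predict(X, coef):
    return [sum(c * v for c, v in zip(coef, row)) for row in X]

def mse(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

def out_of_fold(X, y, penalty_sets, n_folds=5):
    """Return Z where Z[k][i] is learner k's prediction for held-out row i."""
    n = len(y)
    Z = [[0.0] * n for _ in penalty_sets]
    for f in range(n_folds):
        test = [i for i in range(n) if i % n_folds == f]
        train = [i for i in range(n) if i % n_folds != f]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        for k, pen in enumerate(penalty_sets):
            coef = fit_ridge(Xtr, ytr, pen)
            for i, p in zip(test, predict([X[i] for i in test], coef)):
                Z[k][i] = p
    return Z

def convex_weights(Z, y, steps=200):
    """Frank-Wolfe with exact line search: minimise CV squared error over the simplex."""
    K = len(Z)
    best = min(range(K), key=lambda k: mse(Z[k], y))  # start at the best single learner
    w = [1.0 if k == best else 0.0 for k in range(K)]
    pred = Z[best][:]
    for _ in range(steps):
        resid = [p - t for p, t in zip(pred, y)]
        grad = [sum(r * z for r, z in zip(resid, Z[k])) for k in range(K)]
        s = min(range(K), key=lambda k: grad[k])      # vertex with steepest descent
        direction = [z - p for z, p in zip(Z[s], pred)]
        denom = sum(v * v for v in direction)
        if denom == 0.0:
            break
        gamma = -sum(r * v for r, v in zip(resid, direction)) / denom
        gamma = min(1.0, max(0.0, gamma))             # keep the iterate on the simplex
        if gamma == 0.0:
            break
        w = [(1 - gamma) * wk + (gamma if k == s else 0.0) for k, wk in enumerate(w)]
        pred = [p + gamma * v for p, v in zip(pred, direction)]
    return w

# Synthetic data: only features 0 and 1 carry signal.
random.seed(0)
n, d = 120, 6
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + random.gauss(0, 1) for row in X]

lam = 20.0
good_prior = [1.0, 1.0, 0.1, 0.1, 0.1, 0.1]  # hypothetical LLM relevance scores
bad_prior = [0.1, 0.1, 1.0, 1.0, 1.0, 1.0]   # stand-in for a hallucinated prior
penalty_sets = [
    [lam] * d,                      # no prior: uniform penalty
    [lam / s for s in good_prior],  # assumed mapping: low score -> heavy penalty
    [lam / s for s in bad_prior],
]

Z = out_of_fold(X, y, penalty_sets)
w = convex_weights(Z, y)
ensemble = [sum(w[k] * Z[k][i] for k in range(len(Z))) for i in range(n)]
print("weights:", [round(v, 3) for v in w])
print("CV MSE per learner:", [round(mse(z, y), 3) for z in Z])
print("CV MSE ensemble:", round(mse(ensemble, y), 3))
```

Starting the solver at the best single learner and using an exact line search means the cross-validated loss never increases, so the fitted combination is at least as good, on the held-out predictions, as any one learner in the library. A learner built on a misleading prior can only receive weight if it lowers that held-out loss, which is the mechanism by which validation limits the damage from a bad prior in this sketch.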

Problem

Research questions and friction points this paper is trying to address.

LLM-derived knowledge

supervised statistical learning

LLM hallucination

semantic priors

ensemble learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

ensemble learning

LLM-derived priors

semantic priors