ARISE: Iterative Rule Induction and Synthetic Data Generation for Text Classification

📅 2025-02-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address weak rule interpretability and limited performance in few-shot and multilingual text classification, this paper proposes ARISE, a framework that iterates between rule induction and synthetic data generation. It automatically induces generalized, interpretable rules from syntactic n-grams and runs a rule-driven closed loop: the rules provide supervision signals that filter the synthetic data, and the filtered synthetic data in turn supports rule refinement. Rules and data are selected via bootstrapping, and both can be used in in-context learning (ICL) and fine-tuning (FT). The key contribution is this joint, closed-loop refinement of rule induction and data generation, which the summary describes as requiring no external annotations. Experiments span three full-shot, eight few-shot, and seven multilingual settings. The induced rules alone improve performance in both ICL and FT, and the synthetic data alone outperforms configurations that rely on more complex methods such as contrastive learning.

📝 Abstract

We propose ARISE, a framework that iteratively induces rules and generates synthetic data for text classification. We combine synthetic data generation and automatic rule induction, via bootstrapping, to iteratively filter the generated rules and data. We induce rules via inductive generalisation of syntactic n-grams, enabling us to capture a complementary source of supervision. These rules alone lead to performance gains in both in-context learning (ICL) and fine-tuning (FT) settings. Similarly, using the augmented data from ARISE alone improves model performance, outperforming configurations that rely on complex methods like contrastive learning. Further, our extensive experiments on various datasets, covering three full-shot, eight few-shot and seven multilingual variant settings, demonstrate that the rules and data we generate lead to performance improvements across these diverse domains and languages.
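The loop the abstract describes, in which rules filter generated data and the accepted data is used to re-induce rules, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes plain surface n-grams in place of syntactic n-grams, a fixed candidate list standing in for generated synthetic examples, and simple support and precision thresholds as the selection criterion. All function names and parameters (`induce_rules`, `min_support`, `min_precision`, `bootstrap`) are hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(text, n_max=2):
    """Surface word n-grams (assumption: the paper uses syntactic n-grams)."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def induce_rules(examples, min_support=2, min_precision=1.0):
    """Map each n-gram to a label if it is frequent and precise enough."""
    counts = defaultdict(Counter)
    for text, label in examples:
        for gram in ngrams(text):
            counts[gram][label] += 1
    rules = {}
    for gram, label_counts in counts.items():
        label, hits = label_counts.most_common(1)[0]
        total = sum(label_counts.values())
        if hits >= min_support and hits / total >= min_precision:
            rules[gram] = label
    return rules


def rule_vote(rules, text):
    """Majority label among the rules that fire on the text, or None."""
    votes = Counter(rules[g] for g in ngrams(text) if g in rules)
    return votes.most_common(1)[0][0] if votes else None


def filter_synthetic(rules, candidates):
    """Keep candidates whose proposed label agrees with the rule vote."""
    return [(t, y) for t, y in candidates if rule_vote(rules, t) == y]


def bootstrap(seed, candidates, rounds=2):
    """Alternate between filtering candidates and re-inducing rules."""
    data = list(seed)
    rules = induce_rules(data)
    for _ in range(rounds):
        accepted = filter_synthetic(rules, candidates)
        data = list(seed) + accepted
        rules = induce_rules(data)
    return rules, data


# Toy labelled seed set and toy "generated" candidates (both made up).
seed = [
    ("great movie loved it", "pos"),
    ("loved the acting", "pos"),
    ("terrible plot hated it", "neg"),
    ("hated the ending", "neg"),
]
candidates = [
    ("loved the soundtrack", "pos"),     # agrees with the rules, kept
    ("hated the soundtrack", "pos"),     # contradicts the rules, dropped
    ("the soundtrack was fine", "pos"),  # no rule fires, dropped
]

rules, data = bootstrap(seed, candidates)
print(rules)
print(len(data))
```

In this toy run the accepted candidate adds enough support for a new bigram rule (`loved the`), which shows how filtered data can feed back into rule induction. A real system would replace the fixed candidate list with a generator and the thresholds with whatever selection criterion the paper specifies.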

Problem

Research questions and friction points this paper is trying to address.

Iterative rule induction

Synthetic data generation

Text classification improvement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative rule induction

Synthetic data generation

Bootstrapping for filtering
