🤖 AI Summary
Existing language model (LM) self-improvement methods rely heavily on manually defined human preference principles and costly human annotations, limiting scalability and generalizability.
Method: We propose an annotation-free framework for automatic discovery of implicit preference principles. For the first time, we introduce posterior-regularized Monte Carlo EM into LM self-improvement, jointly leveraging latent-variable modeling and response trajectory clustering to extract, compress, and strategically invoke implicit behavioral principles—encoded in the model’s own reasoning paths—that align with human preferences. We further design a self-correcting training mechanism enabling multi-round, bootstrapped optimization of 7–8B LMs.
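The Monte Carlo EM loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `lm_sample` and `lm_score` are hypothetical stand-ins for the model's principle-conditioned generation and preference-scoring calls, and the hard best-principle selection approximates the regularized posterior.

```python
import random

def monte_carlo_em_round(lm_sample, lm_score, prompts, principles, n_samples=4):
    """One hedged sketch of a Monte Carlo EM round.

    E-step: for each prompt, sample candidate latent principles and generate a
    principle-conditioned response for each. M-step: keep the principles whose
    guided responses score highest, re-estimating the principle pool.
    """
    winners = []
    for prompt in prompts:
        # E-step (Monte Carlo): sample candidate latent principles.
        candidates = random.sample(principles, min(n_samples, len(principles)))
        scored = [(lm_score(prompt, lm_sample(prompt, p)), p) for p in candidates]
        # Approximate posterior regularization: a hard assignment to the
        # highest-scoring principle for this prompt.
        winners.append(max(scored)[1])
    # M-step input: the winning assignments, from which the pool is refit.
    return winners
```

In the actual framework, the winning principle-response trajectories would also serve as self-correction training data for the next bootstrapping round.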
Results: Our approach achieves +8–10% AlpacaEval win rate, +0.3 average score on MT-Bench, and +19–23% principle adherence accuracy on IFEval, while generating highly interpretable and diverse AI “constitutions”—formalized sets of preference-aligned behavioral constraints.
📝 Abstract
When users of a language model (LM) aim to improve the quality of its generations, it is crucial to specify concrete behavioral attributes that the model should strive to reflect. However, curating such principles across many domains, even non-exhaustively, requires a labor-intensive annotation process. To automate this process, we propose eliciting these latent attributes, which guide model reasoning towards human-preferred responses, by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements into an interpretable set via clustering. Specifically, we employ an approximation of posterior-regularized Monte Carlo Expectation-Maximization to both identify a condensed set of the most effective latent principles and teach the LM to strategically invoke them in order to intrinsically refine its responses. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models (7-8B parameters) to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval. We also show that clustering the principles yields interpretable and diverse model-generated constitutions while retaining model performance. The gains our method achieves highlight the potential of automated, principle-driven post-training recipes toward continual self-improvement.
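The compression step (mining many principles, then clustering them into a small interpretable set) can be illustrated with a toy sketch. This is an assumption-laden stand-in: a real pipeline would likely cluster sentence embeddings, whereas here word-level Jaccard similarity and a greedy merge stand in for illustration.

```python
def compress_principles(principles, threshold=0.5):
    """Hedged sketch of compressing mined principles into a condensed set:
    greedily merge principles whose word-overlap (Jaccard) similarity exceeds
    `threshold`, keeping one representative per cluster."""
    def jaccard(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    representatives = []
    for p in principles:
        # Absorb p into an existing cluster if similar enough; else it
        # founds a new cluster and becomes its representative.
        if not any(jaccard(p, r) >= threshold for r in representatives):
            representatives.append(p)
    return representatives
```

The surviving representatives would then form the model-generated "constitution" that the LM learns to invoke during self-correction.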