🤖 AI Summary
Existing LLM alignment methods overlook the human "satisficing" decision-making principle: optimizing a primary objective (e.g., helpfulness) while requiring secondary objectives (e.g., harmlessness) to meet a predefined acceptability threshold. Method: We propose SITAlign, the first framework to formally integrate satisficing from bounded-rationality research into LLM alignment. It is a constraint-aware, multi-objective alignment approach operating at inference time, combining constrained decoding, multi-objective reward modeling, and threshold-driven sampling. Contribution/Results: We derive a theoretical suboptimality bound for our method. On PKU-SafeRLHF, with helpfulness as the primary objective and harmlessness as a thresholded constraint, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by 22.3% in GPT-4 win-tie rate for helpfulness while adhering to the prescribed harmlessness threshold.
📝 Abstract
Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving suboptimality bounds for our satisficing-based inference-time alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset, with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for the helpfulness reward while adhering to the threshold on harmlessness.
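To make the satisficing rule concrete, here is a minimal sketch of threshold-constrained selection over candidate responses, written as a best-of-N filter. This is an illustration only: the function name, the stub reward scores, and the threshold value are hypothetical, and the paper's actual inference-time decoding procedure operates on reward models rather than lookup tables.

```python
# Hypothetical sketch of satisficing-based selection (not the paper's code).
# Primary objective: maximize helpfulness; secondary objective: harmlessness
# must meet an acceptability threshold.

def satisficing_select(candidates, primary_reward, secondary_reward, threshold):
    """Return the candidate maximizing the primary reward among those whose
    secondary reward meets the threshold; if none qualify, fall back to the
    candidate with the highest secondary reward."""
    feasible = [c for c in candidates if secondary_reward(c) >= threshold]
    if feasible:
        return max(feasible, key=primary_reward)
    return max(candidates, key=secondary_reward)

# Toy usage with stub reward functions (illustrative scores).
helpfulness = {"a": 0.9, "b": 0.6, "c": 0.3}.get    # primary objective
harmlessness = {"a": 0.2, "b": 0.8, "c": 0.95}.get  # secondary constraint

best = satisficing_select(["a", "b", "c"], helpfulness, harmlessness, threshold=0.7)
# "b": the most helpful candidate among those meeting the harmlessness threshold,
# even though "a" is more helpful overall and "c" is more harmless.
```

Note the contrast with weighted multi-objective decoding: no trade-off weight is tuned here; the secondary objective acts as a hard acceptability gate, which is the satisficing behavior the paper formalizes.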