🤖 AI Summary
Existing LLM alignment methods overlook the human "satisficing" decision-making principle: optimizing a primary objective (e.g., helpfulness) while requiring secondary objectives (e.g., harmlessness) to meet a predefined acceptability threshold. Method: We propose SITAlign, the first framework to formally integrate satisficing from bounded-rationality research into LLM alignment. It is a constraint-aware, multi-objective alignment approach operating at inference time, combining constrained decoding, multi-objective reward modeling, and threshold-driven sampling. Contribution/Results: We derive a theoretical suboptimality bound for our method. On PKU-SafeRLHF, with helpfulness as the primary objective and harmlessness as a thresholded constraint, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by 22.3% in GPT-4 win-tie rate for helpfulness while adhering to the prescribed harmlessness threshold.
📝 Abstract
Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving suboptimality bounds for our satisficing-based inference-time alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset, with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for the helpfulness reward while adhering to the threshold on harmlessness.
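To make the satisficing rule concrete, here is a minimal sketch of threshold-constrained selection over candidate responses, written as a best-of-N filter. This is an illustration only: the function name, the stub reward scores, and the threshold value are hypothetical, and the paper's actual inference-time decoding procedure operates on reward models rather than lookup tables.

```python
# Hypothetical sketch of satisficing-based selection (not the paper's code).
# Primary objective: maximize helpfulness; secondary objective: harmlessness
# must meet an acceptability threshold.

def satisficing_select(candidates, primary_reward, secondary_reward, threshold):
    """Return the candidate maximizing the primary reward among those whose
    secondary reward meets the threshold; if none qualify, fall back to the
    candidate with the highest secondary reward."""
    feasible = [c for c in candidates if secondary_reward(c) >= threshold]
    if feasible:
        return max(feasible, key=primary_reward)
    return max(candidates, key=secondary_reward)

# Toy usage with stub reward functions (illustrative scores).
helpfulness = {"a": 0.9, "b": 0.6, "c": 0.3}.get    # primary objective
harmlessness = {"a": 0.2, "b": 0.8, "c": 0.95}.get  # secondary constraint

best = satisficing_select(["a", "b", "c"], helpfulness, harmlessness, threshold=0.7)
# "b": the most helpful candidate among those meeting the harmlessness threshold,
# even though "a" is more helpful overall and "c" is more harmless.
```

Note the contrast with weighted multi-objective decoding: no trade-off weight is tuned here; the secondary objective acts as a hard acceptability gate, which is the satisficing behavior the paper formalizes.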