🤖 AI Summary
Large language models (LLMs) exhibit weak structured algorithmic reasoning in matching markets -- particularly in stable-matching generation, instability detection, and preference-based repair -- because they fail to identify blocking pairs and to execute combinatorial algorithms iteratively over large-scale ranked preferences. Method: We introduce the first benchmark for ranked-preference reasoning, covering a hierarchy of algorithmic tasks, and evaluate LLMs under both standard inference and LoRA fine-tuning. Contribution/Results: Top-tier LLMs consistently fail on large-market instances; LoRA improves performance only at small scale, exposing a fundamental bottleneck in algorithmic reasoning over long contexts. This work is the first to systematically reveal LLMs' structural limitations in preference-driven combinatorial reasoning, providing both a rigorous evaluation framework and critical implications for deploying trustworthy AI in resource allocation and other mission-critical domains.
📝 Abstract
The rise of Large Language Models (LLMs) has driven progress in reasoning tasks -- from program synthesis to scientific hypothesis generation -- yet their ability to handle ranked preferences and structured algorithms in combinatorial domains remains underexplored. We study matching markets, a core framework behind applications like resource allocation and ride-sharing, which require reconciling individual ranked preferences to ensure stable outcomes. We evaluate several state-of-the-art models on a hierarchy of preference-based reasoning tasks -- ranging from stable-matching generation to instability detection, instability resolution, and fine-grained preference queries -- to systematically expose their logical and algorithmic limitations in handling ranked inputs. Surprisingly, even top-performing models with advanced reasoning struggle to resolve instability in large markets, often failing to identify blocking pairs or to execute algorithms iteratively. We further show that parameter-efficient fine-tuning (LoRA) significantly improves performance in small markets but fails to yield similar gains on large instances, suggesting the need for more sophisticated strategies to improve LLMs' reasoning over longer-context inputs.
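For readers unfamiliar with the tasks involved, the two core operations the abstract refers to can be sketched in a few lines of Python. This is a minimal illustration of classical Gale-Shapley deferred acceptance and blocking-pair detection, not the paper's benchmark code; all names and the toy market below are invented for the example.

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Return a stable matching {proposer: receiver} via deferred acceptance."""
    # rank[r][p] = position of proposer p in receiver r's list (lower = better)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next index each p proposes to
    engaged = {}                                  # receiver -> current proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if r not in engaged:
            engaged[r] = p
        elif rank[r][p] < rank[r][engaged[r]]:    # r prefers p: bump the incumbent
            free.append(engaged[r])
            engaged[r] = p
        else:
            free.append(p)                        # r rejects p; p proposes again later
    return {p: r for r, p in engaged.items()}

def blocking_pairs(matching, proposer_prefs, receiver_prefs):
    """List every (p, r) pair that would jointly defect from `matching`."""
    partner_of = {r: p for p, r in matching.items()}
    pairs = []
    for p, prefs in proposer_prefs.items():
        for r in prefs[:prefs.index(matching[p])]:   # receivers p prefers to its match
            r_prefs = receiver_prefs[r]
            if r_prefs.index(p) < r_prefs.index(partner_of[r]):
                pairs.append((p, r))
    return pairs

# Toy 3x3 market (invented for illustration).
proposer_prefs = {"a": ["x", "y", "z"], "b": ["y", "x", "z"], "c": ["x", "z", "y"]}
receiver_prefs = {"x": ["b", "a", "c"], "y": ["a", "b", "c"], "z": ["c", "a", "b"]}
stable = gale_shapley(proposer_prefs, receiver_prefs)
```

A matching is stable exactly when `blocking_pairs` returns an empty list; the benchmark's "instability detection" and "instability resolution" tasks amount to finding such pairs and eliminating them, which is what the evaluated LLMs struggle to do at scale.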