🤖 AI Summary
This work addresses the challenge that large language models struggle to accurately determine the convexity of symbolic expressions in deeply composed functional settings, revealing a significant deficiency in compositional reasoning. The study presents the first systematic evaluation of this issue, identifying two primary failure modes: parsing failures and lazy reasoning. To overcome these limitations, the authors propose an agent-based divide-and-conquer reasoning framework that integrates abstract syntax tree parsing, external tool invocation, recursive subexpression reasoning, and a focused context mechanism. Using ConvexBench—a scalable, mechanically verifiable benchmark—the method demonstrates a dramatic improvement in performance, raising the F1 score from approximately 0.2 to 1.0 on composite functions with composition depths up to 100, thereby effectively mitigating the performance degradation caused by deep functional composition.
📝 Abstract
Convex analysis is a modern branch of mathematics with many applications. As Large Language Models (LLMs) start to automate research-level mathematics and science, it is important for LLMs to demonstrate the ability to understand and reason with convexity. We introduce \cb, a scalable and mechanically verifiable benchmark for testing \textit{whether LLMs can identify the convexity of a symbolic objective under deep functional composition.} Experiments on frontier LLMs reveal a sharp compositional reasoning gap: performance degrades rapidly with increasing depth, dropping from an F1-score of $1.0$ at depth $2$ to approximately $0.2$ at depth $100$. Inspection of models' reasoning traces indicates two failure modes: \textit{parsing failure} and \textit{lazy reasoning}. To address these limitations, we propose an agentic divide-and-conquer framework that (i) offloads parsing to an external tool to construct an abstract syntax tree (AST) and (ii) enforces recursive reasoning over each intermediate sub-expression with focused context. This framework reliably mitigates deep-composition failures, achieving substantial performance improvement at large depths (e.g., F1-score $= 1.0$ at depth $100$).
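The divide-and-conquer idea in the abstract, classifying each sub-expression of an AST recursively so that depth never overwhelms the reasoner, can be sketched in code. The following is a minimal illustration (not the paper's implementation), using simplified disciplined-convex-programming-style composition rules over a hypothetical toy expression tree with a handful of operators:

```python
# Toy sketch of recursive curvature classification over an expression tree.
# The operator set ("var", "sum", "neg", "exp", "square") and the rules below
# are illustrative assumptions, not the benchmark's actual grammar.
from dataclasses import dataclass

AFFINE, CONVEX, CONCAVE, UNKNOWN = "affine", "convex", "concave", "unknown"

@dataclass
class Node:
    op: str                 # one of: "var", "sum", "neg", "exp", "square"
    children: tuple = ()

def curvature(node: Node) -> str:
    """Classify one sub-expression at a time, divide-and-conquer style."""
    if node.op == "var":
        return AFFINE
    if node.op == "sum":
        # A sum preserves curvature if all non-affine terms agree.
        kinds = {curvature(c) for c in node.children}
        kinds.discard(AFFINE)
        if not kinds:
            return AFFINE
        return kinds.pop() if len(kinds) == 1 else UNKNOWN
    if node.op == "neg":
        flip = {AFFINE: AFFINE, CONVEX: CONCAVE,
                CONCAVE: CONVEX, UNKNOWN: UNKNOWN}
        return flip[curvature(node.children[0])]
    if node.op == "exp":
        # exp is convex and nondecreasing: convex-of-convex stays convex.
        return CONVEX if curvature(node.children[0]) in (AFFINE, CONVEX) else UNKNOWN
    if node.op == "square":
        # x^2 is convex but not monotone: only safe on an affine argument.
        return CONVEX if curvature(node.children[0]) == AFFINE else UNKNOWN
    return UNKNOWN
```

Because the rules apply locally at each node, the classification survives arbitrary nesting: `exp(exp(...exp(x)...))` at depth 100 is still labeled convex, which is exactly the regime where direct LLM reasoning degrades.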