🤖 AI Summary
Long-standing debates over whether large language models (LLMs) or large reasoning models (LRMs) possess genuine reasoning capabilities, beyond superficial pattern matching, are hindered by shifting evaluation goals, which undermines the validity and comparability of results.
Method: This paper introduces two analytically grounded benchmark postulates—the Bhatt Conjectures—based on logical tautologies (self-evident truths), formally defining necessary conditions for human-like reasoning. It integrates tools from analytic philosophy, formal logic, and externalized mental modeling, replacing empirical benchmarking with conceptual clarification and axiomatic modeling.
Contribution: The work proposes a theoretical framework for reasoning evaluation, intended as a conceptual anchor for AI reasoning assessment. By grounding evaluation in logical necessity rather than statistical correlation, it aims to counter goal drift and to shift the focus from associative, data-driven metrics toward principled evaluation of causal and deductive competence.
📝 Abstract
Debates about whether Large Language Models or Large Reasoning Models (LLMs/LRMs) truly reason or merely pattern-match suffer from shifting goalposts. In my personal opinion, two analytic (hence "tautological") benchmarks cut through that fog in my mental model. In this paper, I attempt to write down that mental model in concrete terms.