🤖 AI Summary
This work addresses a failure mode of large language models in multi-step reasoning over structured tables: planning and execution share no explicit alignment mechanism, and existing methods neglect both table permutation invariance and cell grounding. The authors propose TABALIGN, a framework in which a masked diffusion language model (DLM) planner emits each reasoning step as a binary cell mask. A lightweight verifier, TABATTN, trained on 1,600 human-verified attention standards, aligns planning with execution by scoring each step by the overlap between its attention and the plan-designated mask. Across eight table question answering and fact verification benchmarks, TABALIGN improves average accuracy by 15.76 percentage points over the strongest open-source baseline at comparable 8B-class scale. A matched-backbone ablation attributes 2.87 points of that gain to the DLM planner over an autoregressive planner, and the cleaner DLM plans speed up downstream reasoning execution by 44.64%.
📝 Abstract
Multi-step LLM reasoning over structured tables fails because planning and execution share no explicit cell-grounding contract. Existing methods constrain the planner to a left-to-right factorization at odds with table permutation invariance, and score intermediate states by generated content alone, overlooking cell grounding. We conduct a pilot study showing that diffusion language models (DLMs) produce more human-aligned and permutation-stable cell attention on tables than autoregressive (AR) models, with a 40.2% median reduction in attention-AUROC variability under row reordering. Motivated by this, we propose TABALIGN, a planned table reasoning framework that operationalizes the contract. TABALIGN pairs a masked DLM planner, whose bidirectional denoising emits plan steps as binary cell masks, with TABATTN, a lightweight verifier trained on 1,600 human-verified attention standards to score each step by its attention overlap with the plan-designated mask. Across eight benchmarks covering table question answering and fact verification, TABALIGN improves average accuracy by 15.76 percentage points over the strongest open-source baseline at comparable 8B-class scale, with a matched-backbone ablation attributing 2.87 percentage points of this gain to the DLM planner over an AR planner on a fixed reasoner. Cleaner DLM plans also accelerate downstream reasoning execution by 44.64%.
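To make the idea of scoring a step by its attention overlap with a binary cell mask concrete, here is a minimal Python sketch. It is not the TABATTN verifier, which the abstract describes as a trained model. It only illustrates one plausible overlap measure, the fraction of attention mass that falls on plan-designated cells. The function names, the toy table, and the choice of measure are all assumptions made for illustration.

```python
from typing import List

def attention_overlap(attention: List[List[float]],
                      plan_mask: List[List[int]]) -> float:
    """Fraction of total attention mass that lands on plan-designated cells.

    Hypothetical scoring rule for illustration only; the paper's verifier
    is a learned model and may score steps differently.
    """
    total = 0.0
    on_mask = 0.0
    for att_row, mask_row in zip(attention, plan_mask):
        for weight, keep in zip(att_row, mask_row):
            total += weight
            if keep:
                on_mask += weight
    return on_mask / total if total > 0 else 0.0

def reorder_rows(grid: list, order: List[int]) -> list:
    """Return the rows of a table-shaped grid in the given order."""
    return [grid[i] for i in order]

# Toy 3x2 table: unnormalized attention weights per cell (assumed values).
attention = [[1.0, 8.0],
             [1.0, 6.0],
             [2.0, 2.0]]
# Binary plan mask: this step is supposed to read column 1 of rows 0 and 1.
plan_mask = [[0, 1],
             [0, 1],
             [0, 0]]

score = attention_overlap(attention, plan_mask)

# Reordering the rows of the table (attention and mask together) should not
# change a cell-grounded score.
order = [2, 0, 1]
permuted_score = attention_overlap(reorder_rows(attention, order),
                                   reorder_rows(plan_mask, order))
```

Because the score is a sum over cells, applying the same row permutation to the attention grid and the mask leaves it unchanged. That is the kind of permutation stability the abstract argues a table planner and verifier should respect.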