🤖 AI Summary
Existing evaluation benchmarks struggle to disentangle whether multimodal large language models fail at table understanding because they cannot perceive visual markups—such as highlighting or bolding—or because they cannot reason logically over those markups, leaving a significant blind spot. To address this gap, this work introduces HighlightBench, a diagnostic benchmark that systematically decouples visual markup perception from symbolic reasoning. It decomposes markup-driven table understanding into five structured task families: markup grounding, constrained retrieval, local relations, aggregation and comparison, and consistency and missingness. The benchmark further provides an interpretable reference pipeline with explicit intermediate decisions, enabling errors to be attributed along the perception-to-execution chain. Experiments reveal that even state-of-the-art multimodal models are unstable when aligning visual cues with structured constraints, exposing critical limitations in markup-driven reasoning and filling a crucial void in current evaluation methodologies.
📝 Abstract
Visual markups such as highlights, underlines, and bold text are common in table-centric documents. Although multimodal large language models (MLLMs) have made substantial progress in document understanding, their ability to treat such cues as explicit logical directives remains under-explored. More importantly, existing evaluations cannot distinguish whether a model fails to see the markup or fails to reason with it. This creates a key blind spot in assessing markup-conditioned behavior over tables. To address this gap, we introduce HighlightBench, a diagnostic benchmark for markup-driven table understanding that decomposes evaluation into five task families: Markup Grounding, Constrained Retrieval, Local Relations, Aggregation & Comparison, and Consistency & Missingness. We further provide a reference pipeline that makes intermediate decisions explicit, enabling reproducible baselines and finer-grained attribution of errors along the perception-to-execution chain. Experiments show that even strong models remain unstable when visual cues must be consistently aligned with symbolic reasoning under structured output constraints.
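The idea of attributing errors along a perception-to-execution chain can be illustrated with a minimal sketch: if a pipeline exposes its intermediate decisions (which cells are marked up, which rows satisfy the constraint, and the final answer), a failure can be pinned to the first stage that diverges from the reference. The stage names, data structures, and toy example below are hypothetical illustrations, not the authors' actual pipeline.

```python
# A minimal sketch (not the authors' code) of first-failure attribution
# along a perception -> retrieval -> reasoning chain.
STAGES = ["markup_grounding", "constrained_retrieval", "reasoning"]

def attribute_error(predicted: dict, gold: dict):
    """Return the first stage whose intermediate decision diverges
    from the reference, or None if the full chain is correct."""
    for stage in STAGES:
        if predicted.get(stage) != gold.get(stage):
            return stage
    return None

# Hypothetical reference annotations for one table question.
gold = {
    "markup_grounding": {("r2", "c1"), ("r4", "c1")},  # highlighted cells
    "constrained_retrieval": ["Berlin", "Madrid"],     # rows matching the markup
    "reasoning": "Madrid",                             # final answer
}

# A model that perceived the markup and retrieved correctly
# but reasoned incorrectly is charged with a reasoning error:
pred = dict(gold, reasoning="Berlin")
print(attribute_error(pred, gold))  # -> reasoning
```

Because stages are checked in order, a grounding mistake is never mislabeled as a reasoning mistake, which is precisely the disentanglement the benchmark aims for.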