CSPO: Alleviating Reward Ambiguity for Structured Table-to-LaTeX Generation

📅 2026-04-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limitations of multimodal large language models in converting table images to LaTeX, where outputs often lack fidelity in structure, style, or content. Conventional reinforcement learning approaches rely on a single aggregated reward signal, which causes reward ambiguity and suboptimal optimization. To overcome this, the paper proposes Component-Specific Policy Optimization (CSPO), a framework that decomposes the reinforcement learning reward into three distinct signals (structure, style, and content), each aligned with the corresponding segments of the generated output. Each signal's gradient is backpropagated only through the tokens relevant to its component, enabling targeted policy updates. The reported experiments show that CSPO improves fidelity across all three dimensions in table-to-LaTeX generation, supporting the use of component-level rewards for structured text generation.

📝 Abstract
Tables contain rich structured information, yet when stored as images their contents remain "locked" within pixels. Converting table images into LaTeX code enables faithful digitization and reuse, but current multimodal large language models (MLLMs) often fail to preserve structural, stylistic, or content fidelity. Conventional post-training with reinforcement learning (RL) typically relies on a single aggregated reward, leading to reward ambiguity that conflates multiple behavioral aspects and hinders effective optimization. We propose Component-Specific Policy Optimization (CSPO), an RL framework that disentangles optimization across the components of a LaTeX table: structure, style, and content. In particular, CSPO assigns component-specific rewards and backpropagates each signal only through the tokens relevant to its component, alleviating reward ambiguity and enabling targeted component-wise optimization. To comprehensively assess performance, we introduce a set of hierarchical evaluation metrics. Extensive experiments demonstrate the effectiveness of CSPO, underscoring the importance of component-specific optimization for reliable structured generation.
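
The token-level routing described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-token component labels, the advantage values, and the helper names (`aggregated_weights`, `component_weights`, `surrogate_loss`) are all hypothetical, and the paper's actual reward definitions and token-to-component assignment are not given here. The sketch only contrasts how a single aggregated signal and component-specific signals weight the same tokens in a simple policy-gradient surrogate.

```python
from typing import Dict, List

# Hypothetical toy rollout: five LaTeX tokens, each tagged with the component
# it belongs to. How the paper assigns tokens to components is not stated
# here; these labels are an assumption for illustration.
TOKENS = ["\\begin{tabular}{c}", "\\textbf{", "Total", "}", "\\end{tabular}"]
COMPONENTS = ["structure", "style", "content", "style", "structure"]
LOG_PROBS = [-0.1, -0.5, -0.2, -0.5, -0.1]  # made-up log pi(token | prefix)

# Assumed per-component advantages (e.g. reward minus a baseline), made up.
ADVANTAGES = {"structure": 1.0, "style": -1.0, "content": 0.5}


def aggregated_weights(num_tokens: int, advantages: Dict[str, float]) -> List[float]:
    """Single aggregated signal: every token receives the summed advantage."""
    total = sum(advantages.values())
    return [total] * num_tokens


def component_weights(components: List[str], advantages: Dict[str, float]) -> List[float]:
    """Component-specific signal: each token gets only its own component's advantage."""
    return [advantages[c] for c in components]


def surrogate_loss(log_probs: List[float], weights: List[float]) -> float:
    """Policy-gradient surrogate: negative mean of weight * log-probability.

    Its gradient with respect to log_probs[t] is -weights[t] / len(log_probs),
    so a token's weight alone decides which way that token is pushed.
    """
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


if __name__ == "__main__":
    agg = aggregated_weights(len(TOKENS), ADVANTAGES)
    comp = component_weights(COMPONENTS, ADVANTAGES)
    for tok, a, c in zip(TOKENS, agg, comp):
        print(f"{tok!r:24} aggregated={a:+.1f} component-specific={c:+.1f}")
    print("aggregated loss:", surrogate_loss(LOG_PROBS, agg))
    print("component-specific loss:", surrogate_loss(LOG_PROBS, comp))
```

In this toy case the aggregated signal gives every token the same weight (+0.5), so the poorly scored style tokens are reinforced along with everything else. The component-specific weights instead push the style tokens down while still rewarding the structure and content tokens, which illustrates the kind of ambiguity the abstract describes.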

Problem

Research questions and friction points this paper is trying to address.

reward ambiguity
structured generation
table-to-LaTeX
multimodal large language models
reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Component-Specific Policy Optimization
Reward Disentanglement
Table-to-LaTeX Generation
Multimodal LLMs
Structured Output Generation