Thinking with Drafting: Optical Decompression via Logical Reconstruction

📅 2026-02-12
🤖 AI Summary
This work addresses the imprecision in visual outputs of multimodal large language models during complex visual reasoning, often stemming from their neglect of underlying logical structure. The authors propose a “parsing-as-reasoning” paradigm that reframes visual reasoning as an optical decompression process, leveraging a minimal domain-specific language (DSL) as an intermediate representation. This DSL guides the model to generate executable code sketches, which in turn drive a self-verification mechanism for deterministic visual proofs. By centering the reasoning loop around executable DSL constructs, the framework ensures that visual generation serves logical validation rather than creative expression. Evaluated on the newly introduced VisAlg benchmark for visual algebra, the method demonstrates significant improvements in both reasoning accuracy and logical rigor.

📝 Abstract
Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
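
The abstract does not show the DSL itself, so the following is only a minimal sketch of the draft, render, and verify loop it describes. The `bar <name> <units>` syntax, the `Bar` record, and the functions `parse_draft`, `render`, and `verify` are all hypothetical names invented for illustration; they are not the authors' language or implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bar:
    """One quantity in a bar-model style algebra problem (hypothetical)."""
    name: str
    units: int


def parse_draft(draft: str) -> list[Bar]:
    """Parse lines of the assumed form 'bar <name> <units>' into Bar records."""
    bars = []
    for line in draft.strip().splitlines():
        keyword, name, units = line.split()
        if keyword != "bar":
            raise ValueError(f"unknown statement: {line!r}")
        bars.append(Bar(name, int(units)))
    return bars


def render(bars: list[Bar]) -> str:
    """Render each bar as a row of '#' cells; same draft, same picture."""
    width = max(len(b.name) for b in bars)
    return "\n".join(f"{b.name.ljust(width)} |{'#' * b.units}" for b in bars)


def verify(bars: list[Bar], claimed_total: int) -> bool:
    """Check a claimed answer against the total recomputed from the draft."""
    return sum(b.units for b in bars) == claimed_total


# A draft that a model might emit for "3 apples and 5 pears: how many fruit?"
draft = """
bar apples 3
bar pears 5
"""
bars = parse_draft(draft)
print(render(bars))
print(verify(bars, 8))
```

The point of the sketch is the control flow: the answer is accepted only if it agrees with a structure that was parsed and rendered deterministically, rather than with free-form generated pixels.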
Problem

Research questions and friction points this paper is trying to address.

visual reasoning
logical topology
precision paradox
optical perception
visual artifacts
Innovation

Methods, ideas, or system contributions that make the work stand out.

optical decompression
logical reconstruction
Domain-Specific Language (DSL)
visual reasoning
deterministic visual proofs
Authors

Jingxuan Wei, University of Chinese Academy of Sciences (Natural Language Processing, Multimodal Learning)
Honghao He, Shenyang Institute of Computing Technology, Chinese Academy of Sciences
Caijun Jia, Shenyang Institute of Computing Technology, Chinese Academy of Sciences
Siyuan Li, ByteDance
Zheng Sun, Shenyang Institute of Computing Technology, Chinese Academy of Sciences
Yuhang Xu, Shenyang Institute of Computing Technology, Chinese Academy of Sciences
Yuanyuan Lin, The Chinese University of Hong Kong (Statistics)
Linzhuang Sun, University of Chinese Academy of Sciences (Multimodal Reasoning)
Yuchen Wu, ByteDance
Bihui Yu, Shenyang Institute of Computing Technology, Chinese Academy of Sciences
Xiangxiang Zhang, ByteDance
Cheng Tan, Westlake University