Canvas-of-Thought: Grounding Reasoning via Mutable Structured States

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of traditional Chain-of-Thought (CoT) reasoning in multimodal large language models, which relies on linear textual sequences and struggles with high-dimensional, complex tasks because of inefficient token usage and an inability to correct errors locally. To overcome these constraints, the authors propose Canvas-of-Thought (Canvas-CoT), a framework that introduces a mutable, structured reasoning state, using an HTML Canvas as an external reasoning substrate. Through atomic DOM-level CRUD operations, Canvas-CoT supports in-place state updates, and a rendering-driven critique loop supplies explicit visual feedback, moving beyond the static, linear nature of conventional CoT. The approach demonstrates significant performance gains over existing methods on the VCode, RBench-V, and MathVista benchmarks, particularly on high-dimensional multimodal tasks such as geometric reasoning and SVG design.

📝 Abstract
While Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), relying solely on linear text sequences remains a bottleneck for complex tasks. We observe that even when auxiliary visual elements are interleaved, they are often treated as static snapshots within a one-dimensional, unstructured reasoning chain. We argue that such approaches treat reasoning history as an immutable stream: correcting a local error necessitates either generating verbose downstream corrections or regenerating the entire context. This forces the model to implicitly maintain and track state updates, significantly increasing token consumption and cognitive load. This limitation is particularly acute in high-dimensional domains, such as geometry and SVG design, where the textual expression of CoT lacks explicit visual guidance, further constraining the model's reasoning precision. To bridge this gap, we introduce Canvas-of-Thought (Canvas-CoT). By leveraging an HTML Canvas as an external reasoning substrate, Canvas-CoT empowers the model to perform atomic, DOM-based CRUD operations. This architecture enables in-place state revisions without disrupting the surrounding context, allowing the model to explicitly maintain the "ground truth". Furthermore, we integrate a rendering-based critique loop that serves as a hard constraint validator, providing explicit visual feedback to resolve complex tasks that are difficult to articulate through text alone. Extensive experiments on VCode, RBench-V, and MathVista demonstrate that Canvas-CoT significantly outperforms existing baselines, establishing a new paradigm for context-efficient multimodal reasoning.
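
To make the idea of in-place revision concrete, here is a minimal Python sketch of a mutable state with create/read/update/delete operations followed by a validation pass. It is an illustration only, not the authors' implementation: the `Canvas` class, its method names, and the `critique` function are all hypothetical, and the toy bounds check stands in for the rendering-based critique, which in the paper works from visual feedback rather than from the raw state.

```python
from dataclasses import dataclass, field


@dataclass
class Canvas:
    """Hypothetical stand-in for the external canvas: elements keyed by id."""
    width: int
    height: int
    elements: dict = field(default_factory=dict)

    def create(self, elem_id, tag, **attrs):
        if elem_id in self.elements:
            raise KeyError(f"element {elem_id!r} already exists")
        self.elements[elem_id] = {"tag": tag, **attrs}

    def read(self, elem_id):
        # Return a copy so callers cannot mutate the state by accident.
        return dict(self.elements[elem_id])

    def update(self, elem_id, **attrs):
        # In-place revision: only the named attributes of one element change.
        self.elements[elem_id].update(attrs)

    def delete(self, elem_id):
        del self.elements[elem_id]

    def render(self):
        # Serialize the current state as SVG-like markup, one element per line.
        lines = []
        for elem_id, elem in self.elements.items():
            attrs = " ".join(f'{k}="{v}"' for k, v in elem.items() if k != "tag")
            lines.append(f'<{elem["tag"]} id="{elem_id}" {attrs}/>')
        return "\n".join(lines)


def critique(canvas):
    """Toy validator (assumption): list ids of circles that leave the canvas."""
    issues = []
    for elem_id, e in canvas.elements.items():
        if e["tag"] != "circle":
            continue
        inside_x = 0 <= e["cx"] - e["r"] and e["cx"] + e["r"] <= canvas.width
        inside_y = 0 <= e["cy"] - e["r"] and e["cy"] + e["r"] <= canvas.height
        if not (inside_x and inside_y):
            issues.append(elem_id)
    return issues


canvas = Canvas(width=100, height=100)
canvas.create("c1", "circle", cx=90, cy=50, r=20)  # overflows the right edge
canvas.create("c2", "circle", cx=30, cy=30, r=10)

issues = critique(canvas)        # the validator flags only "c1"
canvas.update("c1", cx=70)       # local fix; "c2" and the history are untouched
issues_after = critique(canvas)  # no remaining violations
print(canvas.render())
```

The point of the sketch is the contrast with a linear chain: the error in `c1` is repaired by a single `update` call on one element, rather than by appending a correction to the transcript or regenerating everything that came before it.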
Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought
Multimodal Reasoning
Mutable State
Visual Grounding
Context Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Canvas-of-Thought
mutable structured states
DOM-based CRUD operations
rendering-based critique
multimodal reasoning
Authors

Lingzhuang Sun
University of Chinese Academy of Sciences

Yuxia Zhu
Peking University

Ruitong Liu
Peking University

Hao Liang
Peking University
Data Centric Machine Learning, Large Language Models, Multimodal Large Language Models

Zheng Sun
University of Chinese Academy of Sciences

Caijun Jia
University of Chinese Academy of Sciences

Honghao He
University of Chinese Academy of Sciences

Yuchen Wu
New York University

Siyuan Li
Zhejiang University & Westlake University (Ph.D. Candidate)
AIGC, Network Architecture, Self-supervised Learning, Optimization

Jingxuan Wei
University of Chinese Academy of Sciences
Natural Language Processing, Multimodal Learning

Xiangxiang Zhang
University of Chinese Academy of Sciences

Bihui Yu
University of Chinese Academy of Sciences

Wentao Zhang
Institute of Physics, Chinese Academy of Sciences
photoemission, superconductivity, cuprate, htsc, time-resolved