DeepSeek-OCR 2: Visual Causal Flow

📅 2026-01-28

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

This work proposes DeepEncoder V2, a vision encoder for vision-language models. Conventional models feed image tokens to the language model in a fixed raster-scan order, which does not match the semantics-driven way people read an image, particularly in complex layouts where the order of perception matters. DeepEncoder V2 instead reorders visual tokens dynamically according to image semantics, using a causal-flow-driven mechanism with learnable sequence modeling. The framework cascades two one-dimensional causal structures with the aim of approximating genuine two-dimensional reasoning. By not relying on a fixed scan order and fixed positional encoding, the method is intended to bring the model's reading of complex layout images closer to human reading order. The authors release both code and model weights.

📝 Abstract

We present DeepSeek-OCR 2 to investigate the feasibility of a novel encoder, DeepEncoder V2, capable of dynamically reordering visual tokens based on image semantics. Conventional vision-language models (VLMs) invariably process visual tokens in a rigid raster-scan order (top-left to bottom-right) with fixed positional encoding when they are fed into LLMs. However, this contradicts human visual perception, which follows flexible yet semantically coherent scanning patterns driven by inherent logical structures. Particularly for images with complex layouts, human vision exhibits causally informed sequential processing. Inspired by this cognitive mechanism, DeepEncoder V2 is designed to endow the encoder with causal reasoning capabilities, enabling it to reorder visual tokens before the LLM interprets the content. This work explores a novel paradigm: whether 2D image understanding can be effectively achieved through two cascaded 1D causal reasoning structures, thereby offering a new architectural approach with the potential to achieve genuine 2D reasoning. Code and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR-2.
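
The contrast between raster-scan order and semantic reordering can be illustrated with a toy example. The sketch below is not DeepEncoder V2's mechanism: the function names are hypothetical, and a hand-written two-column rule stands in for whatever ordering the encoder actually learns. It only shows how a raster scan interleaves the two columns of a page, and how a per-token priority can restore the reading order.

```python
# Toy illustration only: a hand-written priority replaces the learned,
# semantics-driven ordering described in the abstract.

def raster_order(height, width):
    """Patch positions in top-left to bottom-right scan order."""
    return [(r, c) for r in range(height) for c in range(width)]

def two_column_priority(positions, width):
    """Hypothetical rule: read the left half of the page, then the right half."""
    half = width // 2
    return [(c >= half, r, c) for r, c in positions]

def reorder_tokens(tokens, priority):
    """Return the tokens sorted by their priority (stable sort)."""
    order = sorted(range(len(tokens)), key=lambda i: priority[i])
    return [tokens[i] for i in order]

# A 2 x 4 grid of patches representing a two-column page.
positions = raster_order(2, 4)
tokens = [f"p{r}{c}" for r, c in positions]
reordered = reorder_tokens(tokens, two_column_priority(positions, 4))

print(tokens)     # ['p00', 'p01', 'p02', 'p03', 'p10', 'p11', 'p12', 'p13']
print(reordered)  # ['p00', 'p01', 'p10', 'p11', 'p02', 'p03', 'p12', 'p13']
```

In raster order the second line of the left column (p10, p11) arrives only after the first line of the right column (p02, p03). The reordered sequence keeps each column contiguous, which is the kind of ordering a language model reading the tokens left to right would need.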

Problem

Research questions and friction points this paper is trying to address.

visual token ordering

causal reasoning

vision-language models

2D image understanding

human visual perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal reasoning

dynamic token reordering

vision-language models