From Pixels to Words -- Towards Native One-Vision Models at Scale

📅 2026-05-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work proposes NEO-ov, an end-to-end trainable, native vision-language foundation model that dispenses with external encoders, adapters, or late-fusion modules. Existing vision-language models often rely on modular designs, which fragment pixel-level information across frames and scatter early interactions between pixels and tokens, hindering fine-grained unified modeling. In contrast, NEO-ov leverages a unified Transformer architecture to directly model spatiotemporal relationships across video frames and establish precise pixel-token correspondences. The approach demonstrates, for the first time, the feasibility of an “all-in-one vision” architecture across multimodal tasks involving multiple images, video understanding, and spatial reasoning. It substantially narrows the performance gap with modular counterparts on fine-grained perception benchmarks. The authors release code and models to advance research in native multimodal learning.
πŸ“ Abstract
Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. In parallel, native VLMs, despite impressive performance on single images, remain largely unexplored in multi-image, video understanding, and spatial intelligence. Hence, we introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov enables fine-grained and unified spatiotemporal modeling to emerge natively inside the model. Notably, NEO-ov largely narrows the gap to modular counterparts while excelling at fine-grained visual perception, validating that native "one-vision" architectures are not only feasible but competitive at scale. Beyond empirical performance, we unveil systematic architectural analyses and detailed training recipes to facilitate subsequent native multimodal modeling. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.
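The idea of a single model consuming pixels and words together, with no separate image encoder, can be illustrated with a toy input pipeline. The sketch below is not NEO-ov's implementation; the function names (`patchify`, `build_sequence`), the grayscale nested-list frame format, and the tagging scheme are all assumptions made for illustration. It only shows how patches from several frames and word tokens could share one sequence, each tagged with its position, so that a single Transformer (not shown) could attend across frames and words from the first layer.

```python
from typing import List, Tuple

Frame = List[List[float]]  # hypothetical format: one grayscale frame, H rows of W pixels


def patchify(frame: Frame, patch: int) -> List[List[float]]:
    """Split an H x W frame into non-overlapping patch x patch tiles, flattened row-major."""
    height, width = len(frame), len(frame[0])
    if height % patch or width % patch:
        raise ValueError("frame size must be divisible by patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([frame[top + dy][left + dx]
                          for dy in range(patch) for dx in range(patch)])
    return tiles


def build_sequence(frames: List[Frame], words: List[str], patch: int) -> List[Tuple]:
    """Place pixel patches from every frame and the word tokens in one sequence.

    Pixel entries carry a (frame, row, col) position; word entries carry a
    (position,) index. No external encoder or adapter sits between them.
    """
    sequence = []
    for t, frame in enumerate(frames):
        per_row = len(frame[0]) // patch  # patches per row of this frame
        for i, tile in enumerate(patchify(frame, patch)):
            sequence.append(("pixel", (t, i // per_row, i % per_row), tile))
    for pos, word in enumerate(words):
        sequence.append(("word", (pos,), word))
    return sequence


# Two 4x4 frames with distinct pixel values, plus a three-token question.
frames = [[[float(r * 4 + c + 16 * t) for c in range(4)] for r in range(4)]
          for t in range(2)]
seq = build_sequence(frames, ["what", "moved", "?"], patch=2)
print(len(seq))  # 2 frames x 4 patches + 3 words
```

In a real model each tile and each word would be mapped to an embedding vector of the same width before entering the shared Transformer; that projection is omitted here to keep the sketch dependency-free.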
Problem

Research questions and friction points this paper is trying to address.

vision-language models
native multimodal modeling
pixel-word correspondence
spatiotemporal modeling
modular framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

native vision-language model
one-vision architecture
end-to-end pixel-word alignment
spatiotemporal modeling
foundation model