Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

📅 2026-04-27

🤖 AI Summary
This work proposes Tuna-2, a natively unified multimodal architecture that overcomes the limitations of conventional multimodal models, which typically rely on pretrained vision encoders and employ disparate visual representations for understanding and generation tasks, thereby hindering end-to-end optimization. Tuna-2 eliminates the need for variational autoencoders (VAEs) or task-specific vision encoders by directly modeling both visual understanding and generation from pixel-level embeddings. Using a simple image patch embedding layer, it achieves end-to-end pixel-level multimodal learning without any pretrained vision encoder, a first in the field. The model attains state-of-the-art performance across multiple multimodal benchmarks, excelling particularly in fine-grained visual perception tasks while also enabling high-quality image generation.

📝 Abstract
Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.
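To make the idea of a "simple patch embedding layer" concrete, the following is a minimal sketch in plain Python of how raw pixels can be turned into a token sequence without a pretrained encoder: the image is cut into non-overlapping patches, each patch is flattened, and a single linear projection maps it to an embedding. The function names (`patchify`, `make_projection`, `embed_patches`), the patch size of 4, and the embedding width of 16 are illustrative assumptions, not details from the paper; Tuna-2's actual layer, dimensions, and training setup are not specified here.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append this pixel's channel values
            patches.append(flat)
    return patches


def make_projection(in_dim, out_dim, seed=0):
    """Create a randomly initialised linear layer (hypothetical stand-in for a learned one)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weight = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def embed_patches(patches, weight, bias):
    """Apply the linear projection to every flattened patch, giving one token per patch."""
    return [
        [sum(w * x for w, x in zip(row, p)) + b for row, b in zip(weight, bias)]
        for p in patches
    ]


# Toy 8 x 8 RGB image; patch size and embedding width are assumed values.
image = [[[(r * 8 + c) / 64.0] * 3 for c in range(8)] for r in range(8)]
patches = patchify(image, patch=4)
weight, bias = make_projection(in_dim=4 * 4 * 3, out_dim=16)
tokens = embed_patches(patches, weight, bias)
print(len(tokens), len(tokens[0]))  # 4 16
```

In a trained model the projection weights would be learned jointly with the rest of the network, which is what allows optimization to run end to end from raw pixels. In a deep learning framework, the same operation is commonly written as a convolution whose stride equals its kernel size.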
Problem

Research questions and friction points this paper is trying to address.

multimodal understanding
vision encoders
pixel embeddings
end-to-end optimization
visual representation alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

pixel embeddings
unified multimodal model
encoder-free architecture
end-to-end learning
multimodal understanding and generation