🤖 AI Summary
To address two structural bottlenecks in vision-language models (energy-intensive transmission of RGB images and excessively long attention sequences), this paper proposes 2D Gaussian Splatting (2DGS) as a lightweight visual representation. It introduces Gaussian splats into vision-language pretraining, designing a splat-aware input stem and a perceiver resampler that achieve cross-modal alignment while fine-tuning only about 7% of the parameters. The method integrates structured initialization, luminance-aware pruning, and batched CUDA kernel optimization, enabling end-to-end contrastive training with the RGB Transformer backbone frozen. Evaluated on DataComp subsets, it achieves meaningful zero-shot ImageNet-1K classification accuracy, attains 3–20× input compression, accelerates fitting by over 90×, and sustains ~97% GPU utilization, demonstrating the efficiency and feasibility of the approach for edge-cloud collaborative learning.
📝 Abstract
Modern vision-language pipelines are driven by RGB vision encoders trained on massive image-text corpora. While these pipelines have enabled impressive zero-shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy intensive and costly, and (ii) patch-based tokenization explodes sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance-aware pruning, and batched CUDA kernels, achieving over 90× faster fitting and about 97% GPU utilization compared to prior implementations. We further adapt contrastive language-image pretraining (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat-aware input stem and a perceiver resampler, training only about 7% of the total parameters. On large DataComp subsets, GS encoders yield meaningful zero-shot ImageNet-1K performance while compressing inputs 3–20× relative to pixels. While accuracy currently trails RGB encoders, our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission efficient for edge-cloud learning.
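The core representation described above, an image parameterized by a set of colored anisotropic 2D Gaussians, can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's renderer: `render_gaussians` and `prune_by_luminance` are hypothetical names, and the additive compositing and Rec. 601 luma weighting are assumptions standing in for the paper's actual rasterization and luminance-aware pruning criteria.

```python
import numpy as np

def render_gaussians(means, covs, colors, H, W):
    """Render N colored anisotropic 2D Gaussians to an H x W RGB image.

    means:  (N, 2) Gaussian centers in pixel coordinates (x, y)
    covs:   (N, 2, 2) covariance matrices (encode scale, anisotropy, rotation)
    colors: (N, 3) RGB color carried by each Gaussian
    """
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs, ys], axis=-1).astype(np.float64)   # (H, W, 2)
    img = np.zeros((H, W, 3))
    for mu, cov, c in zip(means, covs, colors):
        d = grid - mu                                       # offsets to center
        inv = np.linalg.inv(cov)
        # Squared Mahalanobis distance under this Gaussian's covariance
        m = np.einsum('hwi,ij,hwj->hw', d, inv, d)
        w = np.exp(-0.5 * m)                                # Gaussian falloff
        img += w[..., None] * c                             # additive compositing (assumption)
    return np.clip(img, 0.0, 1.0)

def prune_by_luminance(means, covs, colors, thresh=0.05):
    """Hypothetical luminance-aware pruning: drop Gaussians whose color
    contributes less perceptual luminance than `thresh` (Rec. 601 luma)."""
    lum = colors @ np.array([0.299, 0.587, 0.114])
    keep = lum >= thresh
    return means[keep], covs[keep], colors[keep]
```

Because each Gaussian is only a handful of floats (center, covariance, color), a few hundred splats can stand in for tens of thousands of pixels, which is the source of the 3–20× input compression claimed in the abstract.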