🤖 AI Summary
To address two structural bottlenecks in vision-language models (energy-intensive transmission of RGB images and excessively long attention sequences), this paper proposes 2D Gaussian Splatting (2DGS) as a lightweight visual representation. It introduces Gaussian splats into vision-language pretraining, designing a splat-aware input stem and a perceiver resampler that achieve cross-modal alignment while fine-tuning only about 7% of the parameters. The method integrates structured initialization, luminance-aware pruning, and batched CUDA kernel optimization, enabling end-to-end contrastive training with the RGB Transformer backbone frozen. Evaluated on DataComp subsets, it achieves meaningful zero-shot ImageNet-1K classification accuracy, attains 3–20× input compression, accelerates fitting by over 90×, and sustains ~97% GPU utilization, demonstrating the efficiency and feasibility of the approach for edge-cloud collaborative learning.
📝 Abstract
Modern vision-language pipelines are driven by RGB vision encoders trained on massive image-text corpora. While these pipelines have enabled impressive zero-shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy intensive and costly, and (ii) patch-based tokenization explodes sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance-aware pruning, and batched CUDA kernels, achieving over 90× faster fitting and about 97% GPU utilization compared to prior implementations. We further adapt contrastive language-image pretraining (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat-aware input stem and a perceiver resampler, training only about 7% of the total parameters. On large DataComp subsets, GS encoders yield meaningful zero-shot ImageNet-1K performance while compressing inputs 3–20× relative to pixels. While accuracy currently trails RGB encoders, our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission efficient for edge-cloud learning.
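The core representation described above, an image parameterized by a set of colored anisotropic 2D Gaussians, can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's renderer: `render_gaussians` and `prune_by_luminance` are hypothetical names, and the additive compositing and Rec. 601 luma weighting are assumptions standing in for the paper's actual rasterization and luminance-aware pruning criteria.

```python
import numpy as np

def render_gaussians(means, covs, colors, H, W):
    """Render N colored anisotropic 2D Gaussians to an H x W RGB image.

    means:  (N, 2) Gaussian centers in pixel coordinates (x, y)
    covs:   (N, 2, 2) covariance matrices (encode scale, anisotropy, rotation)
    colors: (N, 3) RGB color carried by each Gaussian
    """
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs, ys], axis=-1).astype(np.float64)   # (H, W, 2)
    img = np.zeros((H, W, 3))
    for mu, cov, c in zip(means, covs, colors):
        d = grid - mu                                       # offsets to center
        inv = np.linalg.inv(cov)
        # Squared Mahalanobis distance under this Gaussian's covariance
        m = np.einsum('hwi,ij,hwj->hw', d, inv, d)
        w = np.exp(-0.5 * m)                                # Gaussian falloff
        img += w[..., None] * c                             # additive compositing (assumption)
    return np.clip(img, 0.0, 1.0)

def prune_by_luminance(means, covs, colors, thresh=0.05):
    """Hypothetical luminance-aware pruning: drop Gaussians whose color
    contributes less perceptual luminance than `thresh` (Rec. 601 luma)."""
    lum = colors @ np.array([0.299, 0.587, 0.114])
    keep = lum >= thresh
    return means[keep], covs[keep], colors[keep]
```

Because each Gaussian is only a handful of floats (center, covariance, color), a few hundred splats can stand in for tens of thousands of pixels, which is the source of the 3–20× input compression claimed in the abstract.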