Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

career value

202K/year

🤖 AI Summary

Existing approaches struggle to effectively learn pixel-level spatiotemporal consistent representations in dynamic visual scenes. This work proposes the LILA framework, which introduces linear in-context learning into unsupervised pixel-level representation learning for the first time. By leveraging geometric priors derived from temporal cues such as depth and optical flow, LILA learns feature descriptors that jointly capture semantic meaning and geometric consistency directly from video without requiring fine-grained annotations. The method demonstrates significant performance gains over current state-of-the-art approaches across multiple tasks—including video object segmentation, surface normal estimation, and semantic segmentation—thereby validating its strong representational capacity and generalization ability in complex dynamic environments.

📝 Abstract

One of the most exciting applications of vision models involve pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present a framework that learns pixel-accurate feature descriptors from videos, LILA. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps -- depth and motion -- estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation and semantic segmentation.

Problem

Research questions and friction points this paper is trying to address.

pixel-level reasoning

spatio-temporal representation

dynamic 3D scenes

vision foundation models

dense prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

linear in-context learning

pixel-level representation

spatio-temporal cues