Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics

📅 2025-10-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing self-supervised methods predominantly focus on either object recognition or motion understanding in isolation and do not learn a single visual representation that serves both. Method: This paper extends latent dynamics modeling, previously used in decision making for control and planning, to self-supervised visual learning from natural videos. Midway Network uses a midway top-down path to infer motion latents between video frames, together with a dense forward prediction objective and a hierarchical structure to handle complex, multi-object scenes. A separate analysis based on forward feature perturbation is used to show that the learned dynamics capture high-level correspondence. Contribution/Results: Pretrained on two large-scale natural video datasets, Midway Network achieves strong performance relative to prior self-supervised methods on both semantic segmentation and optical flow estimation.

📝 Abstract

Object recognition and motion understanding are key components of perception that complement each other. While self-supervised learning methods have shown promise in their ability to learn from unlabeled data, they have primarily focused on obtaining rich representations for either recognition or motion rather than both in tandem. On the other hand, latent dynamics modeling has been used in decision making to learn latent representations of observations and their transformations over time for control and planning tasks. In this work, we present Midway Network, a new self-supervised learning architecture that is the first to learn strong visual representations for both object recognition and motion understanding solely from natural videos, by extending latent dynamics modeling to this domain. Midway Network leverages a midway top-down path to infer motion latents between video frames, as well as a dense forward prediction objective and hierarchical structure to tackle the complex, multi-object scenes of natural videos. We demonstrate that after pretraining on two large-scale natural video datasets, Midway Network achieves strong performance on both semantic segmentation and optical flow tasks relative to prior self-supervised learning methods. We also show that Midway Network's learned dynamics can capture high-level correspondence via a novel analysis method based on forward feature perturbation.
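The abstract does not spell out how the forward feature perturbation analysis works, so the following is only a minimal sketch of one plausible reading: perturb the features at a single source location, run the forward dynamics, and see where the predicted next-frame features change. Everything here is an assumption for illustration, including `toy_forward_model` (a hypothetical stand-in that simply shifts features by a fixed offset) and `perturbation_response`; neither comes from the paper.

```python
import numpy as np

def toy_forward_model(feat_t, shift=(0, 2)):
    """Hypothetical stand-in for learned forward dynamics.

    Moves every feature vector by `shift` (rows, cols); a real model would
    be a trained network conditioned on inferred motion latents.
    """
    return np.roll(feat_t, shift=shift, axis=(0, 1))

def perturbation_response(forward_fn, feat_t, source_yx, eps=1e-2, seed=0):
    """Return an (H, W) map of how much each predicted location changes
    when the features at `source_yx` are nudged by small random noise."""
    rng = np.random.default_rng(seed)
    baseline = forward_fn(feat_t)
    perturbed = feat_t.copy()
    y, x = source_yx
    perturbed[y, x] += eps * rng.normal(size=feat_t.shape[-1])
    delta = forward_fn(perturbed) - baseline
    return np.linalg.norm(delta, axis=-1)

# Dense features for one frame: a 6x6 grid of 8-dimensional vectors.
feat_t = np.random.default_rng(1).normal(size=(6, 6, 8))

response = perturbation_response(toy_forward_model, feat_t, source_yx=(3, 1))

# The location with the largest response is read as the correspondence
# of the perturbed source location in the next frame.
target_yx = np.unravel_index(np.argmax(response), response.shape)
```

With the toy shift of two columns, the response concentrates at row 3, column 3, which is where the source location (3, 1) is moved. With a trained model, the same response map would be inspected to see whether the peak lands on the corresponding object or part in the next frame.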

Problem

Research questions and friction points this paper is trying to address.

Learning representations for recognition and motion from videos

Extending latent dynamics modeling to visual perception tasks

Tackling multi-object scenes in natural video data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised architecture that learns representations for both recognition and motion

Extends latent dynamics modeling to the natural video domain

Uses a midway top-down path, a dense forward prediction objective, and a hierarchical structure
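To make the latent dynamics idea above concrete, here is a minimal single-level sketch, assuming a generic inverse/forward dynamics setup: features from two frames are used to infer a per-location motion latent, and that latent conditions a prediction of the next frame's features, scored with a dense (per-location) loss. The function names, the linear toy layers, and the random untrained weights are all assumptions for illustration; the paper's hierarchical structure and actual network design are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frame, w_enc):
    """Toy per-pixel encoder: (H, W, 3) frame -> (H, W, D) dense features."""
    return np.tanh(frame @ w_enc)

def infer_motion_latent(feat_t, feat_t1, w_inv):
    """Inverse dynamics: infer a per-location latent from both frames."""
    paired = np.concatenate([feat_t, feat_t1], axis=-1)   # (H, W, 2D)
    return np.tanh(paired @ w_inv)                        # (H, W, Z)

def predict_forward(feat_t, latent, w_fwd):
    """Forward dynamics: predict next-frame features from features + latent."""
    conditioned = np.concatenate([feat_t, latent], axis=-1)  # (H, W, D + Z)
    return conditioned @ w_fwd                               # (H, W, D)

def dense_forward_loss(pred_t1, target_t1):
    """Dense objective: squared error averaged over every spatial location."""
    return float(np.mean(np.sum((pred_t1 - target_t1) ** 2, axis=-1)))

H, W, D, Z = 8, 8, 16, 4
w_enc = rng.normal(scale=0.5, size=(3, D))
w_inv = rng.normal(scale=0.5, size=(2 * D, Z))
w_fwd = rng.normal(scale=0.5, size=(D + Z, D))

frame_t = rng.random((H, W, 3))
# Synthetic "motion": the second frame is the first shifted right by one pixel.
frame_t1 = np.roll(frame_t, shift=1, axis=1)

feat_t = encode(frame_t, w_enc)
feat_t1 = encode(frame_t1, w_enc)
latent = infer_motion_latent(feat_t, feat_t1, w_inv)
pred_t1 = predict_forward(feat_t, latent, w_fwd)
loss = dense_forward_loss(pred_t1, feat_t1)
```

In training, the loss would be minimized with respect to all three weight sets, so that the latent is pushed to carry whatever information about the change between frames the forward predictor needs.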
