🤖 AI Summary
Many existing approaches to learning representations that fuse semantic and spatial information rely on implicit latent structures combined with dense feature maps or task-specific heads, which limits computational efficiency and flexibility. This work proposes WorldComp2D, a lightweight representation learning framework that explicitly structures the geometry of the latent space according to object identity and spatial proximity, using multiscale local receptive fields. The framework pairs a proximity-dependent encoder, which maps an observation into a unified spatio-semantic latent space, with a localizer that regresses object coordinates directly from this compact representation. Notably, the method eliminates the need for dense feature maps or dedicated prediction heads. On facial landmark localization, used as a proof of concept, it reduces the parameter count and FLOPs by up to 4.0× and 2.2×, respectively, relative to state-of-the-art lightweight models while maintaining real-time inference on CPU.
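To make the reported reduction factors concrete, the short sketch below converts an N-fold reduction into the fraction of the baseline cost that remains. The helper name `remaining_fraction` is illustrative and not part of the WorldComp2D code base; the factors 4.0 and 2.2 are the "up to" figures quoted in this summary, so the percentages are best-case values.

```python
def remaining_fraction(reduction_factor: float) -> float:
    """Fraction of the baseline cost left after an N-fold reduction."""
    if reduction_factor <= 0:
        raise ValueError("reduction factor must be positive")
    return 1.0 / reduction_factor


# Best-case factors quoted in the summary (illustrative use only).
for name, factor in [("parameters", 4.0), ("FLOPs", 2.2)]:
    kept = remaining_fraction(factor)
    print(f"{name}: {kept:.1%} of baseline, {1 - kept:.1%} saved")
# prints:
# parameters: 25.0% of baseline, 75.0% saved
# FLOPs: 45.5% of baseline, 54.5% saved
```

In other words, a 2.2× FLOP reduction leaves a little under half of the baseline compute, which is a smaller saving than the 4.0× figure for parameters.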
📝 Abstract
Learning latent representations that capture both semantic and spatial information is central to efficient spatio-semantic reasoning. However, many existing approaches rely on implicit latent structures combined with dense feature maps or task-specific heads, limiting computational efficiency and flexibility. We propose WorldComp2D, a novel lightweight representation learning framework that explicitly structures latent space geometry according to object identity and spatial proximity using multiscale local receptive fields. This framework consists of (i) a proximity-dependent encoder that maps a given observation into a spatio-semantic latent space and (ii) a localizer that infers the coordinates of objects in the input from the resulting spatio-semantic representation. Using facial landmark localization as a proof of concept, we show that, compared to state-of-the-art lightweight models, WorldComp2D reduces the parameter count and FLOPs by up to 4.0× and 2.2×, respectively, while maintaining real-time performance on CPU. These results demonstrate that explicitly structured latent spaces provide an efficient and general foundation for spatio-semantic reasoning. The framework is open-sourced at https://github.com/JinSeongmin/WorldComp2D.
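The two-stage data flow described above (observation to compact latent vector, then latent vector to coordinates, with no dense feature map in between) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `pool_local_fields`, `encode`, and `localize` are hypothetical, fixed average pooling stands in for the learned proximity-dependent encoder, and the linear localizer uses hand-picked placeholder weights rather than trained ones.

```python
from typing import List, Sequence, Tuple

Image = Sequence[Sequence[float]]  # H x W grayscale grid


def pool_local_fields(image: Image, window: int) -> List[float]:
    """Average non-overlapping window x window patches (one receptive-field scale)."""
    height, width = len(image), len(image[0])
    pooled = []
    for top in range(0, height - window + 1, window):
        for left in range(0, width - window + 1, window):
            patch = [
                image[r][c]
                for r in range(top, top + window)
                for c in range(left, left + window)
            ]
            pooled.append(sum(patch) / len(patch))
    return pooled


def encode(image: Image, windows: Sequence[int] = (2, 4)) -> List[float]:
    """Stand-in encoder: concatenate pooled responses from several scales."""
    latent: List[float] = []
    for window in windows:
        latent.extend(pool_local_fields(image, window))
    return latent


def localize(
    latent: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> Tuple[float, float]:
    """Stand-in localizer: linear regression from the latent vector to (x, y)."""
    x = sum(w * z for w, z in zip(weights[0], latent)) + bias[0]
    y = sum(w * z for w, z in zip(weights[1], latent)) + bias[1]
    return x, y


# A 4x4 image with a single bright pixel at row 1, column 2.
image = [[0.0] * 4 for _ in range(4)]
image[1][2] = 1.0

latent = encode(image)  # 4 values at scale 2, plus 1 value at scale 4
# Placeholder (untrained) weights, chosen only to keep the arithmetic simple.
weights = [[0.0, 4.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 16.0]]
print(len(latent), localize(latent, weights, bias=(0.0, 0.0)))
```

The point of the sketch is the interface: the localizer consumes only a short vector, so its cost is independent of the image resolution once the encoder has run. How WorldComp2D actually shapes the latent geometry by object identity and spatial proximity is described in the paper and repository, not here.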