🤖 AI Summary
This work addresses the challenge of enabling multiple agents to learn interoperable, unified representations from heterogeneous viewpoints without parameter sharing or explicit communication. The authors propose a predictive coding–based world model in which each agent trains its latent space independently; the resulting spaces spontaneously converge to geometries related by approximately linear isometric mappings, achieving cross-view representation alignment without explicit coordination. This approach reveals, for the first time, that decentralized agents can self-organize geometrically consistent latent structures, offering a novel paradigm for lightweight multi-view systems. Experiments demonstrate that classifiers trained on one agent's representation can be transferred directly to other agents without fine-tuning, maintaining high accuracy even under extreme viewpoint disparities and minimal pixel overlap, while significantly reducing computational overhead.
📝 Abstract
World models compress rich sensory streams into compact latent codes that anticipate future observations. We let separate agents acquire such models from distinct viewpoints of the same environment without any parameter sharing or coordination. After training, their internal representations exhibit a striking emergent property: the two latent spaces are related by an approximate linear isometry, enabling transparent translation between them. This geometric consensus survives large viewpoint shifts and scant overlap in raw pixels. Leveraging the learned alignment, a classifier trained on one agent can be ported to the other with no additional gradient steps, while distillation-like migration accelerates later learning and markedly reduces total compute. The findings reveal that predictive learning objectives impose strong regularities on representation geometry, suggesting a lightweight path to interoperability among decentralized vision systems. The code is available at https://anonymous.4open.science/r/Social-JEPA-5C57.
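The core geometric claim, that two independently learned latent spaces are related by an approximate linear isometry, can be illustrated with a small numerical sketch. The snippet below does not reproduce the paper's training procedure; it only shows, on synthetic data, how such an isometry could be estimated post hoc via orthogonal Procrustes (SVD) and how a linear classifier head then transfers across spaces with no gradient steps. The latent matrices, the toy classifier `w`, and the use of Procrustes for alignment are all illustrative assumptions, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for two agents' latent codes of the SAME states:
# latents_b is an unknown rotation of latents_a plus small noise, mimicking
# the claim that the spaces differ by an approximate linear isometry.
n, d = 500, 16
latents_a = rng.normal(size=(n, d))
q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # ground-truth isometry (orthogonal)
latents_b = latents_a @ q + 0.01 * rng.normal(size=(n, d))

# Orthogonal Procrustes: rotation R minimizing ||A R - B||_F is the
# polar factor of A^T B, obtained from its SVD.
u, _, vt = np.linalg.svd(latents_a.T @ latents_b)
r_hat = u @ vt

# Alignment quality of the recovered map.
err = np.linalg.norm(latents_a @ r_hat - latents_b) / np.linalg.norm(latents_b)
print(f"relative alignment error: {err:.4f}")

# Zero-shot transfer: a linear head defined in agent A's space is applied
# to agent B's codes after mapping them back through R^T.
w = rng.normal(size=(d,))                      # toy linear head for agent A
labels = latents_a @ w > 0
preds_b = latents_b @ r_hat.T @ w > 0          # B's codes, re-expressed in A's frame
agreement = (labels == preds_b).mean()
print(f"transfer agreement: {agreement:.2%}")
```

If the isometry assumption holds, the recovered rotation nearly inverts the viewpoint-induced change of basis, so the transferred head agrees with the original almost everywhere; this is the mechanism behind the "no additional gradient steps" porting described above.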