Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study asks how heterogeneous agents with non-shared visual representations can communicate through discrete symbol exchange without external supervision or shared perceptual grounding. The authors propose a decentralized learning framework, the Metropolis-Hastings Captioning Game (MHCG), in which two agents form shared image captions by exchanging proposed token sequences; the listener accepts or rejects each proposal using an MH-style criterion evaluated against its own visual features, and each agent updates its own model using only local perceptual evidence. Three pairings of frozen visual encoders are compared, with text modules randomly initialized. Experiments on MS-COCO show that MHCG outperforms a no-communication baseline in cross-agent alignment, visual-feature prediction, and image-text retrieval, with all cross-agent metrics declining as encoder mismatch increases. The results suggest that shared symbols can arise from local perceptual evaluation alone, and that the similarity between visual encoders shapes both the content and symmetry of the emergent language.

📝 Abstract

Symbols are shared, but perception is private. We study emergent communication between heterogeneous visual agents through decentralized learning, asking what visual information can become shareable when agents have different visual representations. Instead of optimizing messages through a shared external communicative objective, our agents exchange only discrete token sequences and update their own models using local perceptual evidence. This setting focuses on an underexplored aspect of emergent communication, examining whether common symbols can arise without shared perceptual access, and how the similarity between private visual spaces constrains the content and symmetry of the resulting language. We instantiate this setting in the Metropolis-Hastings Captioning Game (MHCG), where two agents collaboratively form shared captions by exchanging proposed token sequences that a listener accepts or rejects using an MH-style criterion evaluated against its own visual features. We compare three pairings of frozen visual encoders, with agents starting from randomly initialized text modules. Experiments on MS-COCO show that MHCG produces visually informative shared token sequences that outperform a no-communication baseline in cross-agent alignment, visual-feature prediction, and image-text retrieval; all cross-agent metrics decline as encoder mismatch increases. Moderate encoder heterogeneity reduces the number of shared sequences while preserving per-sequence visual specificity, whereas stronger encoder heterogeneity yields fewer, coarser, and more asymmetric sequences. Ablations show that listener-side MH acceptance is critical for avoiding degenerate token formation. These results suggest that shared symbols can arise from local perceptual evaluation alone, with visual representational similarity across encoders shaping both the content and symmetry of the resulting language.
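
The listener-side accept/reject step described above can be illustrated with a minimal Python sketch. The abstract only says the criterion is "MH-style" and evaluated against the listener's own visual features, so the specific form used here is an assumption: the acceptance probability is taken to be min(1, p_proposed / p_current), where each p is the likelihood the listener's own model assigns to its visual features under the proposed or current token sequence. The function names `acceptance_probability` and `mh_accept` and the log-likelihood inputs are hypothetical and are not taken from the paper.

```python
import math
import random


def acceptance_probability(loglik_proposed, loglik_current):
    """Assumed MH-style acceptance probability: min(1, p_proposed / p_current).

    Both arguments are log-likelihoods of the listener's OWN visual features,
    one under the speaker's proposed token sequence and one under the
    listener's current sequence (hypothetical inputs for illustration).
    """
    log_ratio = loglik_proposed - loglik_current
    # Clamp at 0 in log space so exp() never overflows and the result is <= 1.
    return math.exp(min(0.0, log_ratio))


def mh_accept(loglik_proposed, loglik_current, rng=random):
    """Return True if the listener accepts the speaker's proposal."""
    return rng.random() < acceptance_probability(loglik_proposed, loglik_current)


if __name__ == "__main__":
    rng = random.Random(0)
    # A proposal that explains the listener's features better is always accepted.
    print(mh_accept(-1.0, -2.0, rng))
    # A worse proposal is accepted only with probability exp(-1).
    print(acceptance_probability(-2.0, -1.0))
```

Under this assumed form, the decision uses only quantities the listener can compute locally, which matches the setting in which agents exchange nothing but discrete token sequences.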

Problem

Research questions and friction points this paper is trying to address.

emergent communication

heterogeneous agents

visual representation

decentralized learning

symbol emergence

Innovation

Methods, ideas, or system contributions that make the work stand out.

emergent communication

heterogeneous agents

decentralized learning