What if Othello-Playing Language Models Could See?

📅 2025-07-19
🤖 AI Summary
Can language models achieve effective symbol grounding solely from text? This work addresses this question by modeling the structured world of Othello. We propose VISOTHELLO, a multimodal model that jointly encodes move sequences and board images, trained via next-action prediction. Compared to text-only baselines, VISOTHELLO achieves significantly higher action prediction accuracy and demonstrates greater representational robustness under semantically irrelevant image perturbations. Our experiments show that visual input substantially improves the stability and generalization of language models’ internal representations of discrete, rule-governed world states. This provides an empirically testable multimodal pathway toward symbol grounding: visual grounding enables language models to construct more accurate and robust structured world models.
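The training setup pairs Othello move sequences with the board states they induce, so each image is fully determined by the move history under the game's rules. A minimal sketch of that rule-governed world (hypothetical helper names, not the authors' code) replays a move sequence into a board state:

```python
# Minimal Othello world model: replay a move sequence into a board state.
# 1 = black, -1 = white, 0 = empty; black moves first. Hypothetical sketch,
# not the paper's data pipeline.
DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def initial_board():
    b = [[0] * 8 for _ in range(8)]
    b[3][3] = b[4][4] = -1  # white center discs
    b[3][4] = b[4][3] = 1   # black center discs
    return b

def flips(board, r, c, player):
    """Discs flipped if `player` plays at (r, c); empty list means illegal."""
    if board[r][c] != 0:
        return []
    out = []
    for dr, dc in DIRS:
        line, rr, cc = [], r + dr, c + dc
        while 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == -player:
            line.append((rr, cc))
            rr += dr
            cc += dc
        if line and 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == player:
            out += line  # bounded run of opponent discs: all flip
    return out

def apply_moves(moves):
    """Replay moves in 'd3'-style notation from the starting position."""
    board, player = initial_board(), 1
    for m in moves:
        r, c = int(m[1]) - 1, ord(m[0]) - ord("a")
        f = flips(board, r, c, player)
        assert f, f"illegal move {m}"
        board[r][c] = player
        for rr, cc in f:
            board[rr][cc] = player
        player = -player
    return board
```

Because the mapping from move history to board state is deterministic, a model that predicts legal next moves well must have recovered something like this transition function internally.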

📝 Abstract
Language models are often said to face a symbol grounding problem. While some argue that world understanding can emerge from text alone, others suggest grounded learning is more efficient. We explore this through Othello, where the board state defines a simplified, rule-based world. Building on prior work, we introduce VISOTHELLO, a multi-modal model trained on move histories and board images. Using next-move prediction, we compare it to mono-modal baselines and test robustness to semantically irrelevant perturbations. We find that multi-modal training improves both performance and the robustness of internal representations. These results suggest that grounding language in visual input helps models infer structured world representations.
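The robustness test compares internal representations before and after perturbations that change the image but not the world state. A toy probe of that idea (hypothetical `perturb` and `drift` helpers, assuming pixel-level noise as the semantically irrelevant perturbation; not the paper's exact protocol) might look like:

```python
import numpy as np

def perturb(img, rng):
    """Semantically irrelevant perturbation: small pixel noise that leaves
    disc positions (the underlying world state) unchanged."""
    return np.clip(img + rng.normal(0.0, 0.05, img.shape), 0.0, 1.0)

def drift(encoder, img, rng, trials=10):
    """Mean cosine distance between representations of the clean and
    perturbed image; lower means more robust representations."""
    z = encoder(img)
    dists = []
    for _ in range(trials):
        zp = encoder(perturb(img, rng))
        cos = z @ zp / (np.linalg.norm(z) * np.linalg.norm(zp))
        dists.append(1.0 - cos)
    return float(np.mean(dists))
```

A more robust model, in this sense, is one whose `drift` stays low under perturbations that do not alter the board position.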
Problem

Research questions and friction points this paper is trying to address.

Exploring whether visual input improves language-model grounding in rule-based worlds
Comparing multi-modal and mono-modal training for next-move prediction accuracy and robustness
Investigating whether vision strengthens internal representations of structured environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

VISOTHELLO, a multi-modal model that jointly encodes move sequences and board images
Next-move prediction training on paired move histories and board images
Improved prediction accuracy and representation robustness over mono-modal baselines
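The joint text-and-vision encoding can be caricatured in a few lines: render the board as occupancy planes and fuse them with pooled move-token embeddings. This is a toy sketch with invented names (`board_planes`, `encode`), assuming a trivial concatenation fusion rather than the paper's actual architecture:

```python
import numpy as np

def board_planes(board):
    """Render a board (8x8 ints: 1 black, -1 white, 0 empty) as two binary
    occupancy planes, a stand-in for the board-image input."""
    b = np.array(board)
    return np.stack([(b == 1), (b == -1)]).astype(np.float32)

def encode(move_ids, board, d=16, rng=None):
    """Fuse move-token and board-image features by concatenation
    (hypothetical toy fusion, not VISOTHELLO's architecture)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    emb = rng.standard_normal((64, d))       # one embedding per board square
    text = emb[move_ids].mean(axis=0)        # pooled move-sequence embedding
    vision = board_planes(board).reshape(2, -1).mean(axis=1)  # per-plane density
    return np.concatenate([text, vision])    # shape (d + 2,)
```

In the real model the fused representation would feed a next-move prediction head; the point here is only that both modalities describe the same underlying board state.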