🤖 AI Summary
This work investigates the limited cross-modal generalization of large language models (LLMs) from symbolic representations, such as code and graphs, to semantically equivalent natural language tasks, particularly in procedural reasoning. The authors propose a two-stage curriculum: train first on symbolic data, then transfer to natural language. This approach both reveals and strengthens the model's capacity for generative analogical reasoning. Applied to a 1.5B-parameter Qwen model, the method approaches the zero-shot performance of GPT-4o on natural language planning tasks, and consistent, significant improvements across multiple model families and diverse procedural tasks underscore the robustness and generality of the proposed framework.
📝 Abstract
Large language models (LLMs) are trained and tested extensively on symbolic representations such as code and graphs, yet real-world user tasks are often specified in natural language. To what extent can LLMs generalize across these representations? Here, we approach this question by studying isomorphic tasks involving procedures represented in code, graphs, and natural language (e.g., scheduling steps in planning). We find that training LLMs with popular post-training methods on graphs or code data alone does not reliably generalize to corresponding natural language tasks, while training solely on natural language can lead to inefficient performance gains. To address this gap, we propose a two-stage data curriculum that first trains on symbolic, then natural language data. The curriculum substantially improves model performance across model families and tasks. Remarkably, a 1.5B Qwen model trained by our method can closely match zero-shot GPT-4o in naturalistic planning. Finally, our analysis suggests that successful cross-representation generalization can be interpreted as a form of generative analogy, which our curriculum effectively encourages.
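The two-stage curriculum described above can be sketched as a simple data-ordering scheme: the training stream presents all symbolic examples (code/graph representations of procedures) before any natural language examples. This is a minimal illustrative sketch, not the authors' implementation; the function and variable names are hypothetical, and a real fine-tuning pipeline would feed this stream into an actual training loop.

```python
# Hypothetical sketch of a two-stage data curriculum: stage 1 exposes the
# model only to symbolic data (code/graphs), stage 2 to semantically
# equivalent natural language data. All names here are illustrative.
from typing import Iterable, Iterator, Tuple


def two_stage_curriculum(
    symbolic: Iterable[str],
    natural: Iterable[str],
    stage1_epochs: int = 1,
    stage2_epochs: int = 1,
) -> Iterator[Tuple[str, str]]:
    """Yield (stage, example) pairs: every symbolic example is seen
    before any natural language example, per the curriculum's ordering."""
    symbolic = list(symbolic)
    natural = list(natural)
    for _ in range(stage1_epochs):
        for ex in symbolic:
            yield ("symbolic", ex)
    for _ in range(stage2_epochs):
        for ex in natural:
            yield ("natural", ex)


# Toy data standing in for isomorphic procedural tasks.
symbolic_data = ["graph: A -> B -> C", "code: order = topo_sort(deps)"]
natural_data = ["First complete step A, then step B, and finally step C."]

schedule = list(two_stage_curriculum(symbolic_data, natural_data))
```

A post-training run would consume `schedule` in order, so the model is adapted on symbolic structure before encountering the natural language phrasings of the same procedures.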