How Order-Sensitive Are LLMs? OrderProbe for Deterministic Structural Reconstruction

📅 2026-01-13
📈 Citations: 1
Influential: 0
🤖 AI Summary
Although large language models excel at semantic understanding, their ability to reconstruct deterministic structures from shuffled inputs remains unclear, largely due to the absence of an automatically evaluable benchmark. This work proposes OrderProbe, the first framework that leverages fixed four-character expressions from Chinese, Japanese, and Korean, each possessing a unique canonical order, to formulate a structure reconstruction task amenable to exact-match evaluation. Using a multidimensional diagnostic framework that assesses semantic fidelity, logical validity, consistency, robustness, and information density, experiments across twelve prominent models show that zero-shot structural recovery accuracy frequently falls below 35% even for state-of-the-art systems, demonstrating a significant decoupling between semantic comprehension and structural planning capabilities.

📝 Abstract
Large language models (LLMs) excel at semantic understanding, yet their ability to reconstruct internal structure from scrambled inputs remains underexplored. Sentence-level restoration is ill-posed for automated evaluation because multiple valid word orders often exist. We introduce OrderProbe, a deterministic benchmark for structural reconstruction using fixed four-character expressions in Chinese, Japanese, and Korean, which have a unique canonical order and thus support exact-match scoring. We further propose a diagnostic framework that evaluates models beyond recovery accuracy, including semantic fidelity, logical validity, consistency, robustness sensitivity, and information density. Experiments on twelve widely used LLMs show that structural reconstruction remains difficult even for frontier systems: zero-shot recovery frequently falls below 35%. We also observe a consistent dissociation between semantic recall and structural planning, suggesting that structural robustness is not an automatic byproduct of semantic competence.
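The evaluation protocol the abstract describes, scrambling a four-character expression with a unique canonical order and scoring the model's restoration by exact match, can be sketched as follows. This is an illustrative sketch, not the paper's actual harness: the sample idioms, the seeding scheme, and the `query_model` placeholder are assumptions for demonstration.

```python
import random

def make_probe(idiom: str, seed: int = 0) -> str:
    """Shuffle the characters of a four-character expression,
    re-sampling until the order actually differs from the canonical one."""
    chars = list(idiom)
    rng = random.Random(seed)
    shuffled = chars[:]
    while shuffled == chars:
        rng.shuffle(shuffled)
    return "".join(shuffled)

def exact_match(prediction: str, canonical: str) -> bool:
    """Deterministic scoring: the restored string must equal the unique
    canonical order character-for-character (whitespace stripped)."""
    return prediction.strip() == canonical

# Illustrative items only; `query_model` is a hypothetical stand-in
# for any LLM call that returns the model's restored string.
idioms = ["画蛇添足", "一石二鳥", "유비무환"]
scrambled = [make_probe(i, seed=n) for n, i in enumerate(idioms)]
# accuracy = sum(
#     exact_match(query_model(s), i) for s, i in zip(scrambled, idioms)
# ) / len(idioms)
```

Because each expression has exactly one valid order, accuracy needs no human or LLM judge, which is what makes the benchmark automatically evaluable.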
Problem

Research questions and friction points this paper is trying to address.

structural reconstruction
order sensitivity
large language models
word order
canonical order
Innovation

Methods, ideas, or system contributions that make the work stand out.

OrderProbe
structural reconstruction
deterministic benchmark
large language models
order sensitivity
👥 Authors
Yingjie He
Peking University
Zhaolu Kang
Peking University
Kehan Jiang
Peking University
Qianyuan Zhang
The Chinese University of Hong Kong, Shenzhen
Jiachen Qian
City University of Hong Kong
Chunlei Meng
Fudan University
Embodied AI, Multimodal, Multi-agent
Yujie Feng
The Hong Kong Polytechnic University
Natural Language Processing, Large Language Models
Yuan Wang
Zhejiang University
Medical MLLM
Jiabao Dou
Peking University
Aming Wu
Ph.D.
Deep Learning, Data Mining
Leqi Zheng
Tsinghua University
Pengxiang Zhao
Zhejiang University
LLM, AI Agent
Jiaxin Liu
University of Illinois Urbana-Champaign
Zeyu Zhang
Peking University
Lei Wang
Peking University
Guansu Wang
Peking University
Qishi Zhan
Peking University
Xiao-Tong He
Peking University
Meisheng Zhang
Peking University
Jianyuan Ni
Marquette University