NESTOR: A Nested MoE-based Neural Operator for Large-Scale PDE Pre-Training

📅 2026-02-25
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work proposes a nested mixture-of-experts (MoE) neural operator to address the limitations of existing architectures, which struggle to effectively model the heterogeneity and complex dependencies inherent in partial differential equation (PDE) systems due to their reliance on a single, fixed structure. The proposed approach introduces a novel two-level MoE framework—operating at both the image and token levels—to enable input-adaptive expert activation: the image-level MoE captures global dependencies, while the token-level MoE focuses on local dynamics. Trained jointly across twelve heterogeneous PDE datasets, the model demonstrates significantly enhanced generalization, computational efficiency, and cross-task transfer performance in multi-task settings, thereby advancing the scalability and effectiveness of large-scale pretraining for neural operators.

📝 Abstract
Neural operators have emerged as an efficient paradigm for solving PDEs, overcoming the limitations of traditional numerical methods and significantly improving computational efficiency. However, due to the diversity and complexity of PDE systems, existing neural operators typically rely on a single network architecture, which limits their capacity to fully capture heterogeneous features and complex system dependencies. This constraint poses a bottleneck for large-scale PDE pre-training based on neural operators. To address these challenges, we propose a large-scale PDE pre-trained neural operator based on a nested Mixture-of-Experts (MoE) framework. In particular, the image-level MoE is designed to capture global dependencies, while the token-level Sub-MoE focuses on local dependencies. Our model can selectively activate the most suitable expert networks for a given input, thereby enhancing generalization and transferability. We conduct large-scale pre-training on twelve PDE datasets from diverse sources and successfully transfer the model to downstream tasks. Extensive experiments demonstrate the effectiveness of our approach.
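The abstract describes a two-level routing scheme: an image-level gate that picks an expert from a global summary of the input, and a token-level Sub-MoE inside it that mixes sub-experts per token. The paper does not give implementation details, so the following is only a minimal numpy sketch of that idea, assuming top-1 routing at the image level and soft mixing at the token level; all class and parameter names (`NestedMoE`, `n_image_experts`, `n_sub_experts`) are illustrative, not the authors'.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class Expert:
    """A single feed-forward expert: Linear -> ReLU -> Linear."""
    def __init__(self, d, hidden):
        self.w1 = rng.standard_normal((d, hidden)) * 0.02
        self.w2 = rng.standard_normal((hidden, d)) * 0.02

    def __call__(self, x):
        return np.maximum(x @ self.w1, 0.0) @ self.w2

class NestedMoE:
    """Hypothetical nested MoE layer: the image-level gate routes the whole
    input (via mean-pooled tokens, i.e. global dependencies) to one coarse
    expert group; within that group, a token-level gate softly mixes
    sub-experts per token (local dependencies)."""
    def __init__(self, d, n_image_experts=2, n_sub_experts=2, hidden=32):
        self.image_gate = rng.standard_normal((d, n_image_experts)) * 0.02
        self.sub_gates = [rng.standard_normal((d, n_sub_experts)) * 0.02
                          for _ in range(n_image_experts)]
        self.sub_experts = [[Expert(d, hidden) for _ in range(n_sub_experts)]
                            for _ in range(n_image_experts)]

    def __call__(self, tokens):
        # tokens: (T, d) token embeddings for one discretized PDE field
        pooled = tokens.mean(axis=0)                     # global summary
        img_probs = softmax(pooled @ self.image_gate)    # image-level gate
        k = int(img_probs.argmax())                      # top-1 coarse expert
        tok_probs = softmax(tokens @ self.sub_gates[k])  # (T, n_sub) gate
        out = np.zeros_like(tokens)
        for j, expert in enumerate(self.sub_experts[k]):
            out += tok_probs[:, j:j + 1] * expert(tokens)  # weighted mix
        return tokens + out                              # residual connection

moe = NestedMoE(d=16)
y = moe(rng.standard_normal((64, 16)))
print(y.shape)  # (64, 16)
```

The key property the sketch illustrates is input-adaptive activation: different inputs pool to different global summaries, so they can activate different coarse experts, while the per-token gate lets local dynamics within one input reach different sub-experts.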
Problem

Research questions and friction points this paper is trying to address.

neural operators
PDE pre-training
Mixture-of-Experts
heterogeneous features
system dependencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural Operator
Mixture-of-Experts
Large-Scale Pre-training
Partial Differential Equations
Nested MoE
Dengdi Sun
Anhui University
Machine Learning · Computer Vision
Xiaoya Zhou
School of Artificial Intelligence, Anhui University, Hefei, China
Xiao Wang
School of Computer Science and Technology, Anhui University, Hefei, China
Hao Si
School of Computer Science and Technology, Anhui University, Hefei, China
Wanli Lyu
School of Computer Science and Technology, Anhui University, Hefei, China
Jin Tang
Anhui University
Computer Vision · Intelligent Video Analysis
Bin Luo
Anhui University, University of York
Pattern Recognition · Digital Image Processing