A Multimodal Vision Transformer-based Modeling Framework for Prediction of Fluid Flows in Energy Systems

📅 2026-04-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Computational fluid dynamics (CFD) simulations of complex flows in energy systems are computationally expensive because of strong nonlinearities and multiscale, multiphysics couplings. This work proposes a modeling framework built on a hierarchical Vision Transformer (SwinV2-UNet) and demonstrates how large vision transformer models can be adapted to complex flow prediction. The model is conditioned on auxiliary tokens that explicitly encode the data modality and the time increment, which supports two tasks: autoregressive spatiotemporal rollouts and reconstruction of unobserved flow fields from observed ones. Trained on multi-fidelity, multimodal CFD data spanning multiple grid resolutions, turbulence models, and equations of state, the models learn to generalize across resolutions and modalities. In simulations of high-pressure argon jet injection into nitrogen, they forecast the flow evolution and reconstruct missing flow-field information from limited views.

📝 Abstract

Computational fluid dynamics (CFD) simulations of complex fluid flows in energy systems are prohibitively expensive due to strong nonlinearities and multiscale-multiphysics interactions. In this work, we present a transformer-based modeling framework for prediction of fluid flows, and demonstrate it for high-pressure gas injection phenomena relevant to reciprocating engines. The approach employs a hierarchical Vision Transformer (SwinV2-UNet) architecture that processes multimodal flow datasets from multi-fidelity simulations. The model architecture is conditioned on auxiliary tokens explicitly encoding the data modality and time increment. Model performance is assessed on two different tasks: (1) spatiotemporal rollouts, where the model autoregressively predicts the flow state at future times; and (2) feature transformation, where the model infers unobserved fields/views from observed fields/views. We train separate models on multimodal datasets generated from in-house CFD simulations of argon jet injection into a nitrogen environment, encompassing multiple grid resolutions, turbulence models, and equations of state. The resulting data-driven models learn to generalize across resolutions and modalities, accurately forecasting the flow evolution and reconstructing missing flow-field information from limited views. This work demonstrates how large vision transformer-based models can be adapted to advance predictive modeling of complex fluid flow systems.
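
The spatiotemporal rollout task described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_model`, `rollout`, the flattened `Field` type, and the integer `modality_id` are all hypothetical names chosen for illustration, and the toy decay function merely stands in for a trained SwinV2-UNet. The sketch only shows the control flow, in which each prediction is fed back as the next input while the modality and time increment are passed as conditioning.

```python
import math
from typing import Callable, List, Sequence

# Toy stand-in for a flow-field snapshot: a flattened list of values.
Field = List[float]


def toy_model(state: Field, modality_id: int, dt: float) -> Field:
    """Hypothetical placeholder for a trained network.

    It applies an exponential decay whose rate depends on the modality,
    purely so that the conditioning inputs have a visible effect.
    """
    rate = 1.0 + 0.5 * modality_id
    factor = math.exp(-rate * dt)
    return [value * factor for value in state]


def rollout(
    model: Callable[[Field, int, float], Field],
    initial_state: Sequence[float],
    modality_id: int,
    dt: float,
    n_steps: int,
) -> List[Field]:
    """Autoregressive rollout: feed each prediction back in as the next input."""
    states: List[Field] = [list(initial_state)]
    for _ in range(n_steps):
        # The model sees only the latest state plus the conditioning values.
        states.append(model(states[-1], modality_id, dt))
    return states


# Three autoregressive steps from a three-value toy field.
trajectory = rollout(toy_model, [1.0, 2.0, 4.0], modality_id=0, dt=0.1, n_steps=3)
```

Because every step consumes the previous prediction, errors can accumulate over the rollout horizon, which is why rollout accuracy is usually assessed separately from single-step accuracy.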

Problem

Research questions and friction points this paper is trying to address.

fluid flow prediction

computational fluid dynamics

energy systems

multimodal data

nonlinear multiscale interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformer

multimodal modeling

fluid flow prediction