CosmoBench: A Multiscale, Multiview, Multitask Cosmology Benchmark for Geometric Deep Learning

📅 2025-07-04

🤖 AI Summary

This work addresses the challenge of extracting information from large-scale cosmological simulation data, specifically point clouds (dark matter halos and galaxies) and directed trees (halo merger histories). It introduces CosmoBench, a geometric deep learning benchmark for cosmology and the largest dataset of its kind, comprising 34,000 point clouds at three length scales and 25,000 merger trees at two time scales. The benchmark supports multiple tasks: inferring cosmological parameters from point clouds and merger trees, predicting the velocities of individual halos and galaxies from their collective positions, and reconstructing merger trees on finer time scales from those on coarser time scales. Baselines range from established cosmological modeling approaches to machine learning methods, including simple symmetry-constrained linear models and graph neural networks. In the baseline experiments, least-squares fits on a handful of invariant features sometimes outperform deep architectures with many more parameters and far longer training times, which leaves substantial room to improve on these baselines. CosmoBench thus lays groundwork for connecting cosmology and geometric deep learning at scale.

📝 Abstract

Cosmological simulations provide a wealth of data in the form of point clouds and directed trees. A crucial goal is to extract insights from this data that shed light on the nature and composition of the Universe. In this paper we introduce CosmoBench, a benchmark dataset curated from state-of-the-art cosmological simulations whose runs required more than 41 million core-hours and generated over two petabytes of data. CosmoBench is the largest dataset of its kind: it contains 34 thousand point clouds from simulations of dark matter halos and galaxies at three different length scales, as well as 25 thousand directed trees that record the formation history of halos on two different time scales. The data in CosmoBench can be used for multiple tasks -- to predict cosmological parameters from point clouds and merger trees, to predict the velocities of individual halos and galaxies from their collective positions, and to reconstruct merger trees on finer time scales from those on coarser time scales. We provide several baselines on these tasks, some based on established approaches from cosmological modeling and others rooted in machine learning. For the latter, we study different approaches -- from simple linear models that are minimally constrained by symmetries to much larger and more computationally demanding models in deep learning, such as graph neural networks. We find that least-squares fits with a handful of invariant features sometimes outperform deep architectures with many more parameters and far longer training times. Still, there remains tremendous potential to improve these baselines by combining machine learning and cosmology to fully exploit the data. CosmoBench sets the stage for bridging cosmology and geometric deep learning at scale. We invite the community to push the frontier of scientific discovery by engaging with this dataset, available at https://cosmobench.streamlit.app
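
To make the idea of a least-squares fit on a handful of invariant features concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the paper's baseline. The three features (mean and standard deviation of pairwise distances, mean nearest-neighbor distance), the function names, and the toy data are all hypothetical choices. The hidden parameter `theta` only sets the spatial spread of synthetic Gaussian clouds and does not stand for any real cosmological parameter.

```python
import numpy as np

def invariant_features(points):
    """Three summary statistics of one (N, 3) point cloud.

    All of them depend only on pairwise distances, so they are unchanged
    by rotations, translations, and reordering of the points.
    """
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    pair = dists[np.triu_indices(len(points), k=1)]   # each pair once
    np.fill_diagonal(dists, np.inf)                   # ignore self-distances
    nearest = dists.min(axis=1)
    return np.array([pair.mean(), pair.std(), nearest.mean()])

def fit_least_squares(features, targets):
    """Ordinary least squares with an intercept column."""
    design = np.column_stack([features, np.ones(len(features))])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights

def predict(features, weights):
    """Apply fitted weights (last entry is the intercept)."""
    design = np.column_stack([features, np.ones(len(features))])
    return design @ weights

# Toy data (hypothetical): theta controls the spread of each cloud.
rng = np.random.default_rng(0)
thetas = rng.uniform(0.5, 1.5, size=200)
clouds = [rng.normal(scale=t, size=(64, 3)) for t in thetas]
X = np.stack([invariant_features(c) for c in clouds])

# Fit on the first 150 clouds, evaluate on the remaining 50.
weights = fit_least_squares(X[:150], thetas[:150])
test_predictions = predict(X[150:], weights)
print("mean absolute error:", np.abs(test_predictions - thetas[150:]).mean())
```

The model has four fitted numbers and trains in a single linear solve, which is the kind of lightweight baseline the abstract contrasts with graph neural networks. Real point clouds from the benchmark would need features that capture clustering far better than these three statistics do.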

Problem

Research questions and friction points this paper is trying to address.

Extract insights from cosmological simulation data

Predict cosmological parameters from point clouds and trees

Reconstruct merger trees on finer time scales
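
The relationship between fine and coarse time scales in a merger tree can be sketched with a small data structure. The example below is a hypothetical illustration: the `Halo` record, the `coarsen` function, and the rule of keeping every `stride`-th snapshot are assumptions made here, not the procedure used to build CosmoBench. It shows the forward direction (fine to coarse); the benchmark task is the harder inverse, recovering the fine tree from the coarse one.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class Halo:
    halo_id: int
    snapshot: int               # time index; larger means later
    descendant: Optional[int]   # halo_id this halo flows into; None for the root

def coarsen(tree: Dict[int, Halo], stride: int) -> Dict[int, Halo]:
    """Keep halos whose snapshot is a multiple of `stride`.

    Each kept halo is rewired to its nearest kept descendant, skipping
    over any halos that were dropped in between.
    """
    kept = {h.halo_id for h in tree.values() if h.snapshot % stride == 0}
    coarse = {}
    for hid in kept:
        nxt = tree[hid].descendant
        while nxt is not None and nxt not in kept:
            nxt = tree[nxt].descendant
        coarse[hid] = Halo(hid, tree[hid].snapshot, nxt)
    return coarse

# A small fine-grained tree: two branches that merge into halo 1, then the root 0.
fine = {
    0: Halo(0, 4, None),
    1: Halo(1, 3, 0),
    2: Halo(2, 2, 1),
    3: Halo(3, 2, 1),
    4: Halo(4, 1, 2),
    7: Halo(7, 1, 3),
    5: Halo(5, 0, 4),
    6: Halo(6, 0, 7),
}

coarse = coarsen(fine, stride=2)
for hid in sorted(coarse):
    print(hid, coarse[hid].snapshot, coarse[hid].descendant)
```

In the coarse tree the merger of halos 2 and 3 appears as a direct merger into the root, because the intermediate halo 1 sits on a dropped snapshot. Information of this kind is what a reconstruction model would have to infer.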

Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest multiscale cosmology dataset for deep learning

Combines point-cloud and directed-tree tasks in one benchmark

Benchmarks machine learning vs traditional cosmology methods
