ParquetDB: A Lightweight Python Parquet-Based Database

📅 2025-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional databases suffer from architectural bloat, heavy external dependencies, and poor adaptability in rapidly evolving, nested-data-intensive, research-oriented settings. To address these limitations, the authors propose ParquetDB, a lightweight, file-based, Python-native database built on PyArrow and Parquet's columnar storage format. ParquetDB introduces a novel index-free predicate pushdown mechanism and supports schema-aware serialization, memory-mapped I/O, and vectorized query execution. Its file-based design avoids heavyweight external services while achieving high performance and cross-platform portability. Evaluated on the Alexandria 3D materials database (4.8 million deeply nested records), ParquetDB demonstrates significantly higher query throughput than SQLite and MongoDB and achieves 3.2× faster serialization. This work establishes an efficient, minimalist, and reproducible paradigm for scientific data management.

📝 Abstract
Traditional data storage formats and databases often introduce complexities and inefficiencies that hinder rapid iteration and adaptability. To address these challenges, we introduce ParquetDB, a Python-based database framework that leverages the Parquet file format's optimized columnar storage. ParquetDB offers efficient serialization and deserialization, native support for complex and nested data types, reduced dependency on indexing through predicate pushdown filtering, and enhanced portability due to its file-based storage system. Benchmarks show that ParquetDB outperforms traditional databases like SQLite and MongoDB in managing large volumes of data, especially when using data formats compatible with PyArrow. We validate ParquetDB's practical utility by applying it to the Alexandria 3D Materials Database, efficiently handling approximately 4.8 million complex and nested records. By addressing the inherent limitations of existing data storage systems and continuously evolving to meet future demands, ParquetDB has the potential to significantly streamline data management processes and accelerate research development in data-driven fields.
Problem

Research questions and friction points this paper is trying to address.

Efficient data storage
Complex data handling
Enhanced data portability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Python-native database built on the Parquet file format
Efficient serialization and deserialization
Predicate pushdown filtering
Logan Lang
Department of Physics, West Virginia University, Morgantown, WV 26506, United States
Eduardo Hernandez
Instituto de Ciencia de Materiales de Madrid, Campus de Cantoblanco, C. Sor Juana Inés de la Cruz, 3, Fuencarral-El Pardo, Madrid 28049, Spain
Kamal Choudhary
Johns Hopkins University
Computational Material Science · Machine learning · Quantum simulations · Materials design · Materials
Aldo H. Romero
Department of Physics, West Virginia University, Morgantown, WV 26506, United States