Convert Existing Databases to HDF5 Files #160
base: dev-gsoc
Conversation
1. Updated run module for numeric dataset. 2. Created customized HDF5 file creator for numeric. 3. Migrated all old files from msgpack to HDF5. Handle wildcard case while loading the element.
1. Updated run module for nist dataset. 2. Created customized HDF5 file creator for nist. 3. Migrated all old files from msgpack to HDF5.
Force-pushed from d97826c to f0eb59b.
@gabrielasd @marco-2023 I think one of you set up the Python dependencies in the GitHub Action; are you able to see why the tests here are failing?
Hi @msricher, could it be that the load method in the species module is invoking the datasets' run.py files? I was scrolling through the CI run outputs and noticed lines like:
My understanding of how our pytest CI workflow works is partial, but I think it only installs the direct dependencies of atomdb and leaves out IOData, Grid, etc. (i.e., the dependencies we need during development to compile the datasets). But if the run.py files are being called, these modules get imported, so my guess is that this is what's making the tests fail. @marco-2023 what do you think?
Hi @gabrielasd, I tried to remove run imports from |
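One common way to decouple `import atomdb` from the development-only dependencies is to defer those imports into the function that actually compiles the dataset. A minimal sketch, assuming a hypothetical `run.py` layout (`load_one` is IOData's loader, but the file path scheme and function body here are illustrative, not AtomDB's actual code):

```python
# datasets/<dataset>/run.py -- illustrative sketch, not the actual file.

def run(elem, charge, mult, datapath):
    """Compile one dataset entry; executed only at development time."""
    # Deferred imports: importing atomdb (and the pytest CI job, which
    # installs only atomdb's direct dependencies) never reaches these.
    from iodata import load_one

    # Hypothetical input-file naming scheme.
    mol = load_one(f"{datapath}/{elem}_{charge:03d}_{mult:03d}.fchk")
    # ... compute properties from `mol` and write them to the HDF5 file ...
    return mol
```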
To address Issues #150 and #151, the storage of atomic species data in AtomDB has been refactored to replace the previous MessagePack-based system with a structured HDF5 format.
Changes Made for Each Dataset
- Refactored the run module.
- Added `h5file_creator.py` as the core module for generating the HDF5 structure. It creates organized groups for any atomic species with defined properties in the `datasets_data.h5` file (see the sketch after this list).
- Migrated existing data into the new HDF5 file.
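A minimal sketch of what such a creator could look like with PyTables; the group hierarchy, property names, and function signature below are illustrative assumptions, not AtomDB's actual API:

```python
import numpy as np
import tables as tb

def create_species_group(h5path, dataset, elem, charge, mult, properties):
    """Store one species' properties under a nested group hierarchy."""
    with tb.open_file(h5path, mode="a") as f:
        # e.g. /nist/H/q000/m002 -- one group chain per species.
        parent = f"/{dataset}/{elem}/q{charge:03d}"
        group = f.create_group(parent, f"m{mult:03d}", createparents=True)
        for name, values in properties.items():
            f.create_array(group, name, np.asarray(values))

# Example: neutral doublet hydrogen with two illustrative properties.
create_species_group(
    "datasets_data.h5", "nist", "H", charge=0, mult=2,
    properties={"energy": [-0.5], "mass": [1.008]},
)
```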
Storage and Compression
Applied PyTables compression methods to minimize storage.
Benchmarked several methods: `Blosc2:LZ4` achieved the best results in both speed and compression ratio, outperforming plain `Blosc2`, `Zlib`, and `LZO` (filter setup sketched below).
Average dataset size is now reduced to 400 MB–1 GB.
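For reference, the compressor choice in PyTables comes down to a `tables.Filters` object attached to a chunked array. A brief sketch; the compression level and array name are assumed values, not necessarily what this PR uses:

```python
import numpy as np
import tables as tb

# Blosc2 with the LZ4 codec (requires PyTables >= 3.8); complevel=9
# is an assumed setting.
filters = tb.Filters(complevel=9, complib="blosc2:lz4")

with tb.open_file("datasets_data.h5", mode="a") as f:
    data = np.random.rand(100_000)
    # Compression applies to chunked leaves such as CArray/EArray.
    f.create_carray("/", "example_density", obj=data, filters=filters)
```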