Convert Existing Databases to HDF5 Files #160
base: dev-gsoc
Conversation
1. Updated run module for numeric dataset. 2. Created customized HDF5 file creator for numeric. 3. Migrated all old files from msgpack to HDF5. Handle wildcard case while loading the element.
1. Updated run module for nist dataset. 2. Created customized HDF5 file creator for nist. 3. Migrated all old files from msgpack to HDF5.
Force-pushed from d97826c to f0eb59b.
@gabrielasd @marco-2023 I think one of you set up the Python dependencies in the GitHub Action; are you able to see why the tests here are failing?
Hi @msricher, could it be that the load method in the species module is invoking the datasets' run.py files? I was scrolling through the CI run outputs and noticed lines like:
My understanding of how our pytest CI workflow works is partial, but I think it only installs the direct dependencies of atomdb and leaves out IOData, Grid, etc. (i.e., the dependencies we need during development to compile the datasets). But if the run.py files are being called, these modules get imported, so my guess is that this is what's making the tests fail. @marco-2023 what do you think?
Hi @gabrielasd, I tried to remove run imports from |
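One common way to decouple `import atomdb` from the development-only dependencies is to defer those imports into the function that actually compiles the dataset. A minimal sketch, assuming a hypothetical `run.py` layout (`load_one` is IOData's loader, but the file path scheme and function body here are illustrative, not AtomDB's actual code):

```python
# datasets/<dataset>/run.py -- illustrative sketch, not the actual file.

def run(elem, charge, mult, datapath):
    """Compile one dataset entry; executed only at development time."""
    # Deferred imports: importing atomdb (and the pytest CI job, which
    # installs only atomdb's direct dependencies) never reaches these.
    from iodata import load_one

    # Hypothetical input-file naming scheme.
    mol = load_one(f"{datapath}/{elem}_{charge:03d}_{mult:03d}.fchk")
    # ... compute properties from `mol` and write them to the HDF5 file ...
    return mol
```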
To address Issues #150 and #151, the storage of atomic species data in AtomDB has been refactored to replace the previous MessagePack-based system with a structured HDF5 format.
Changes Made for Each Dataset
- Refactored the run module.
- Added `h5file_creator.py` as the core module for generating the HDF5 structure. It creates organized groups for any atomic species with defined properties in the `datasets_data.h5` file (see the sketch after this list).
- Migrated existing data into the new HDF5 file.
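A minimal sketch of what such a creator could look like with PyTables; the group hierarchy, property names, and function signature below are illustrative assumptions, not AtomDB's actual API:

```python
import numpy as np
import tables as tb

def create_species_group(h5path, dataset, elem, charge, mult, properties):
    """Store one species' properties under a nested group hierarchy."""
    with tb.open_file(h5path, mode="a") as f:
        # e.g. /nist/H/q000/m002 -- one group chain per species.
        parent = f"/{dataset}/{elem}/q{charge:03d}"
        group = f.create_group(parent, f"m{mult:03d}", createparents=True)
        for name, values in properties.items():
            f.create_array(group, name, np.asarray(values))

# Example: neutral doublet hydrogen with two illustrative properties.
create_species_group(
    "datasets_data.h5", "nist", "H", charge=0, mult=2,
    properties={"energy": [-0.5], "mass": [1.008]},
)
```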
Storage and Compression
Applied PyTables compression methods to minimize storage.
Benchmarked several methods: `Blosc2:LZ4` achieved the best results in both speed and compression ratio, outperforming plain `Blosc2`, `Zlib`, and `LZO` (filter setup sketched below).
Average dataset size is now reduced to 400 MB–1 GB.
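For reference, the compressor choice in PyTables comes down to a `tables.Filters` object attached to a chunked array. A brief sketch; the compression level and array name are assumed values, not necessarily what this PR uses:

```python
import numpy as np
import tables as tb

# Blosc2 with the LZ4 codec (requires PyTables >= 3.8); complevel=9
# is an assumed setting.
filters = tb.Filters(complevel=9, complib="blosc2:lz4")

with tb.open_file("datasets_data.h5", mode="a") as f:
    data = np.random.rand(100_000)
    # Compression applies to chunked leaves such as CArray/EArray.
    f.create_carray("/", "example_density", obj=data, filters=filters)
```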