Fastest way to track large datasets #1494
- It does compute the hash, and that hash is stored in the metadata (…)
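A minimal sketch of how one could inspect those stored hashes, assuming `tar_meta()` and the default metadata store at `_targets/meta/meta`; the column selection here is illustrative:

```r
# The file hashes live in the pipeline metadata (_targets/meta/meta),
# not as new objects under _targets/objects/.
library(targets)
meta <- tar_meta(targets_only = TRUE)
meta[, c("name", "data", "bytes", "time")]  # "data" holds the stored hash
```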
Description
I'm using `targets` 1.11.3. I'm trying to figure out the best way to track many large datasets. I have administrative data saved as Parquet files, partitioned by dataset, then year, then state. Some individual year/state files can reach 60+ GB, and my largest dataset totals nearly 2 TB.
I'll explain the setup for just one dataset, which I track in a dynamically branched target called `dataset1`. We have 6 years × 51 states = 306 branches. I would expect this target to run almost instantaneously, but it still takes some time. Here's the setup, as a minimal sketch (the directory layout and `tar_files()` call below are representative, with illustrative paths rather than my exact code):
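```r
# _targets.R -- a minimal sketch; paths and names are illustrative.
library(targets)
library(tarchetypes)

list(
  # tar_files() creates one dynamic branch per file path, each tracked
  # with format = "file": 6 years x 51 states = 306 Parquet files here.
  tar_files(
    dataset1,
    list.files(
      "data/dataset1",          # layout: dataset1/<year>/<state>.parquet
      pattern = "\\.parquet$",
      recursive = TRUE,
      full.names = TRUE
    )
  )
)
```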
Here's the gist of the result printed in the console: the target itself reports 2.7 seconds to run, but the entire pipeline (which contains only that target) takes 29 seconds, and I have no idea where the extra time goes. It's not a problem here, where the 306 branches track a combined 16 GB of data, but it does become a problem when the total size across branches for a target reaches 2 TB. (Checking whether the branches are outdated also takes an exorbitant amount of time.)
I assumed that file tracking based solely on timestamps would take the same amount of time regardless of file size, since calling `du -hs` or similar doesn't seem to be slower for larger files. But that's not true in `targets`.
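To illustrate the assumption: reading size or modification time is a filesystem metadata lookup, so (with made-up paths) it should cost the same for a tiny file and a 60 GB file:

```r
# Illustrative paths: file.info() reads filesystem metadata only,
# so it is equally fast for small and huge files.
info <- file.info(c("data/dataset1/2019/AK.parquet",
                    "data/dataset1/2019/CA.parquet"))
info[, c("size", "mtime")]
```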
After reading the manual and many of the posts here, I decided to run `profvis::profvis(targets::tar_make(dataset1, callr_function = NULL))`. If I click through the "data" view, I see that an entry titled `file_hash` takes more than 99% of the total time. I don't think it's actually hashing my dataset, because I don't see anything saved in the `_targets/` folder.

Do you have any suggestions for speeding up the tracking here? I don't expect the files to change, but I do want to build a robust pipeline in case I re-ingest the raw data. I also want to avoid using cues if possible.
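For reference, the cue-based workaround I'd rather avoid would look something like this, if I understand `tar_cue()` correctly (the command is a hypothetical placeholder):

```r
# Sketch of the workaround I want to avoid: tar_cue(file = FALSE)
# skips checking the tracked files when deciding whether the
# target is up to date.
tar_target(
  dataset1,
  track_dataset1_files(),  # hypothetical command
  format = "file",
  cue = tar_cue(file = FALSE)
)
```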
I would also be eager to learn more about how `targets` works. How does the file tracking work when I set `trust_timestamps = TRUE`? Where does the extra time go? What does `file_hash` do in this case, and why does it account for the majority of the runtime? Is this a problem with working on an HPC? Thank you!
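P.S. For context, a sketch of how I set that option in my `_targets.R` (the real file sets more options than this):

```r
# Sketch: enabling timestamp trust globally via tar_option_set().
library(targets)
tar_option_set(trust_timestamps = TRUE)
```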