Fastest way to track large datasets #1494

prayaggordy · 2025-05-27T02:43:40Z

prayaggordy
May 27, 2025

Help

I understand and agree to https://books.ropensci.org/targets/help.html.

Description

I'm using targets 1.11.3.

I'm trying to figure out the best way to track many large datasets. I have administrative data saved as Parquet files, partitioned by dataset then year then state. Some individual year/state files can reach 60+ GB, with my largest dataset totaling nearly 2 TB.

I'll explain the setup for just one dataset, which I track in a dynamically branched target called dataset1. We have 6 years X 51 states = 306 branches. I would expect this target to run almost instantaneously, but it still takes some time. Here's the setup:

targets::tar_option_set(controller = crew::crew_controller_group(controller_primary,
                                                                 controller_max),
                        trust_timestamps = TRUE)

list(
  targets::tar_target(years,
                      c("2016", "2017", "2018", "2019", "2020", "2021")),
  targets::tar_target(states,
                      sort(c(state.abb, "DC"))),
  targets::tar_target(dataset1, 
                      glue::glue(data_path, 
                                 year = years, state = states),
                      pattern = cross(years, states),
                      format = "file",
                      resources = targets::tar_resources(
                        crew = targets::tar_resources_crew(controller = "my_controller_max")
                      ),
                      deployment = "worker")
)

Here's the result printed in the console:

+ dataset1 declared [306 branches]                                                
✔ dataset1 completed [2.7s, 16.30 GB]                         
✔ ended pipeline [29s, 306 completed, 2 skipped]

I have no idea where the extra time goes — the target itself says it takes 2.7 seconds to run, but the entire pipeline (which only contains that target) takes 29 seconds. It's not a problem here when the 306 branches track a combined 16 GB of data, but it does become a problem when the total size across branches for a target reaches 2 TB. (Checking if the branches are outdated also takes an exorbitant amount of time.)

I assumed that file tracking based solely on timestamp would take the same amount of time regardless of file size, since calling du -hs/similar doesn't seem to be slower for larger files. But that's not true in targets.

After reading the manual and many of the posts here, I decided to run profvis::profvis(targets::tar_make(dataset1, callr_function = NULL)). If I click through the "data" view, I see that an entry titled file_hash takes more than 99% of the total time. I don't think it's actually hashing my dataset because I don't see anything saved in the _targets/ folder.

Do you have any suggestions for speeding up the tracking here? I don't expect the files to change, but I do want to build a robust pipeline in case I re-ingest the raw data. I also want to avoid using cues if possible.

I would also be eager to learn more about how targets works. How does the file tracking work when I set trust_timestamps = TRUE? Where does the extra time go? What does file_hash do in this case and why does it account for the majority of the runtime? Is this a problem with working on a HPC? Thank you!

wlandau · 2025-05-27T17:22:59Z

wlandau
May 27, 2025
Maintainer

targets relies on hashes for tracking changes to all its files. It does try to fall back on file sizes and modification timestamps to detect changes to those hashes (if trust_timestamps = TRUE or if your file system supports high-precision timestamps). But regardless, every target needs a hash. The first time you run the pipeline,targets may spend a lot of time hashing, as you observed. But if the file and its size/timestamp do not change, subsequent runs of the pipeline should be much quicker.

I don't think it's actually hashing my dataset because I don't see anything saved in the _targets/ folder.

It does compute the hash, and it stores that hash is stored in the metadata (_targets/meta/meta).

2 replies

prayaggordy May 28, 2025
Author

Thanks for your quick reply. If I understand you correctly, the first time I run a file target, it will hash the file. If trust_timestamps = TRUE, future runs of the pipeline will first check timestamp and size, and if those don't align, then recompute and check the hash?

I had assumed hashing would be a much more expensive operation. I have 36 cores and about 180GB of memory at my disposal. Would you suggest I fully parallelize my file targets, even though some are large? I've heard that trying to do file I/O in parallel can get a bit funky.

wlandau May 29, 2025
Maintainer

Thanks for your quick reply. If I understand you correctly, the first time I run a file target, it will hash the file. If trust_timestamps = TRUE, future runs of the pipeline will first check timestamp and size, and if those don't align, then recompute and check the hash?

Yes, exactly.

I had assumed hashing would be a much more expensive operation. I have 36 cores and about 180GB of memory at my disposal. Would you suggest I fully parallelize my file targets, even though some are large? I've heard that trying to do file I/O in parallel can get a bit funky.

targets lets you do this. 36 cores and 180 GB memory may help, but it may also depend on the type of file system you are using, and who you are sharing it with.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fastest way to track large datasets #1494

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Fastest way to track large datasets #1494

Uh oh!

prayaggordy May 27, 2025

Help

Description

Replies: 1 comment · 2 replies

Uh oh!

wlandau May 27, 2025 Maintainer

Uh oh!

prayaggordy May 28, 2025 Author

Uh oh!

wlandau May 29, 2025 Maintainer

prayaggordy
May 27, 2025

Replies: 1 comment 2 replies

wlandau
May 27, 2025
Maintainer

prayaggordy May 28, 2025
Author

wlandau May 29, 2025
Maintainer