Skip to content

Conversation

@tikkss
Copy link
Contributor

@tikkss tikkss commented Aug 11, 2025

GitHub: fix GH-242

Red Datasets sometimes handle large files. Downloading the same data multiple times is not suitable for this use case. We should not download per worker.

Add a cache check not only before acquiring the lock, but also after acquiring the lock.

While checking only after acquiring the lock would be enough, it would create a lock file unnecessarily, so we kept the pre-lock cache check as well.

This patch will avoid duplicate downloads in the following case:

sequenceDiagram
  participant P1 as Process 1
  participant P2 as Process 2
  participant FS as File System

  P1->>FS: check output_path.exist? => false
  P2->>FS: check output_path.exist? => false
  P1->>FS: create lock => success
  P2->>FS: create lock => failure (sleep 1~10s)
  P1->>FS: download
  P1->>FS: delete lock
  P2->>FS: create lock => success
  Note over P2: No re-check after lock
  P2->>FS: download (duplicate)
  P2->>FS: delete lock
Loading

@tikkss
Copy link
Contributor Author

tikkss commented Aug 11, 2025

CI has failing. It may relate GH-203.
After invalidate GitHub Actions Cache, the tests may have passed.

I will open a separated PR tomorrow.

@tikkss
Copy link
Contributor Author

tikkss commented Aug 12, 2025

I opened a separated PR GH-245. After invalidating GitHub Actions Cache, CI has been successful.

Thanks for your suggestions. I'll rebase and add a commit if GH-245 is merged.

tikkss and others added 2 commits August 14, 2025 06:15
GitHub: fix red-data-toolsGH-242

Red Datasets sometimes handle large files. Downloading the same data
multiple times is not suitable for this use case. We should not download
per worker.

Add a cache check not only before acquiring the lock, but also after
acquiring the lock.

While checking only after acquiring the lock would be enough, it would
create a lock file unnecessarily, so we kept the pre-lock cache check
as well.

This patch will avoid duplicate downloads in the following case:

```mermaid
sequenceDiagram
  participant P1 as Process 1
  participant P2 as Process 2
  participant FS as File System

  P1->>FS: check output_path.exist? => false
  P2->>FS: check output_path.exist? => false
  P1->>FS: create lock => success
  P2->>FS: create lock => failure (sleep 1~10s)
  P1->>FS: download
  P1->>FS: delete lock
  P2->>FS: create lock => success
  Note over P2: No re-check after lock
  P2->>FS: download (duplicate)
  P2->>FS: delete lock
```
@tikkss tikkss force-pushed the downloader-avoid-duplicate-downloads branch from 74bf4ba to 73e6ec0 Compare August 13, 2025 21:20
@tikkss
Copy link
Contributor Author

tikkss commented Aug 13, 2025

I rebased and added a commit. CI has been successful!

@kou kou merged commit f8c1baa into red-data-tools:main Aug 14, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

downloader: duplicate downloads in parallel case

3 participants