Skip to content
This repository was archived by the owner on Jan 15, 2024. It is now read-only.

Conversation

eric-haibin-lin
Copy link
Member

@eric-haibin-lin eric-haibin-lin commented Dec 22, 2019

Description

Acquire a file lock before downloading the dataset. This helps when using multi-processing for multi-GPU training, or multi-machine training with a shared file system

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@eric-haibin-lin eric-haibin-lin requested a review from a team as a code owner December 22, 2019 22:51
@codecov
Copy link

codecov bot commented Dec 22, 2019

Codecov Report

Merging #1078 into master will increase coverage by 0.02%.
The diff coverage is 64%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1078      +/-   ##
==========================================
+ Coverage   88.25%   88.28%   +0.02%     
==========================================
  Files          67       67              
  Lines        6274     6280       +6     
==========================================
+ Hits         5537     5544       +7     
+ Misses        737      736       -1
Impacted Files Coverage Δ
src/gluonnlp/data/utils.py 79.36% <100%> (+5.31%) ⬆️
src/gluonnlp/utils/files.py 45.9% <18.18%> (-3.12%) ⬇️
src/gluonnlp/model/sampled_block.py 90.74% <0%> (+0.08%) ⬆️

@eric-haibin-lin
Copy link
Member Author

#726

@eric-haibin-lin
Copy link
Member Author

Is there any recent change in numpy version on CI? The error does not seem to be relevant to the code changes.. @leezu

szha
szha previously requested changes Dec 23, 2019
Copy link
Member

@szha szha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it's worth having a file lock for such use case. Nor am I sure about taking on a dependency for this. How does it compare to an approach where files are downloaded and extracted in a unique temporary folder, and then moved to the target location? The goal of such approach is that the download action would be idempotent so that race condition doesn't matter.

@leezu
Copy link
Contributor

leezu commented Dec 24, 2019

Yes. Numpy 1.18 is released and one of our dependencies is incompatible.

@leezu
Copy link
Contributor

leezu commented Dec 24, 2019

What kind of distributed filesystem are you using? It may be more efficient to cache the downloads on each machine locally. In my experience S3 is faster than for example EFS.
For multiprocessing, a workaround is to set the MXNET_HOME variable differently for each process?

I don't think MXNet upstream does locking / supports this? So introducing it only on the GluonNLP side may not be sufficient?

Copy link
Member Author

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you signal other processes when the move is done so that others don't override the file, or don't read a corrupted file?

The goal is to have atomic file level operations so that multiple processes can read/write a complete file. Here are some candidates I previously think of:

  • shutil.copyfile: it's implementation is not atomic
  • os.rename(): atomic but does not support across file system. We do have lots of users using EFS/FSX.
  • shutil.move: not atomic

Having a file locker is just a lot simpler.

@szha szha dismissed their stale review December 25, 2019 06:05

concern addressed and the approach is reasonable

Copy link
Contributor

@leezu leezu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.rename(): atomic but does not support across file system.

I'm not sure how this is related to our use-case. Files in the same directory will always be in the same filesystem. Here you are interested in downloading a file and making it atomically available once the download is complete. This operation does not involve cross-filesystem renames.

We do have lots of users using EFS/FSX.

It's a bad practice to keep the MXNET_HOME on EFS. EFS is slower than just using S3 or EBS. Also it is much more expensive. As a cheap and better performant alternative I suggest you look into https://github.com/kahing/goofys/

But in either way, this should be unrelated to atomic rename and locking.

Also how about the upstream support in MXNet (#1078 (comment))?

Finally, why not https://docs.python.org/3/library/os.html#os.lockf ?

@eric-haibin-lin
Copy link
Member Author

@leezu os.lockf is only available for Unix. I assume we don't want to stop supporting windows users. The advantage is that portalocker is portable across different operating systems.

Yes, the support should be upstream'd to both gluon-nlp and mxnet

@leezu
Copy link
Contributor

leezu commented Dec 27, 2019

Yes, if file locking is supported on Windows and we want to use file-locking, we should also support Windows.

So how about we first use idempotent operations in GluonNLP until support is upstreamed? Then we can reuse the upstream implementation.

@sxjscience
Copy link
Member

I think we can use the filelock. It's also used in sacrebleu: https://github.com/mjpost/sacreBLEU/blob/069b0c88fceb29f3e24c3c19ba25342a3e7f96cb/sacrebleu.py#L1154

@eric-haibin-lin
Copy link
Member Author

So how about we first use idempotent operations in GluonNLP until support is upstreamed? Then we can reuse the upstream implementation.

What do you recommend as the way to implement idempotent operations and avoid race condition?

@leezu
Copy link
Contributor

leezu commented Jan 6, 2020

File-renames within the same directory are idempotent and atomic. You mentioned above that rename is not supported across filesystems, but we don't need to cross-filesystem rename. The temporary file with randomized name should be in the same directory as the "target" file.

@mli
Copy link
Member

mli commented Jan 14, 2020

Job PR-1078/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1078/3/index.html

@mli
Copy link
Member

mli commented Jan 14, 2020

Job PR-1078/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1078/4/index.html

@mli
Copy link
Member

mli commented Jan 15, 2020

Job PR-1078/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1078/5/index.html

Copy link
Member Author

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leezu I removed portalocker from the dependency. Could you check my current implementation?

@eric-haibin-lin eric-haibin-lin added the release focus Progress focus for release label Jan 15, 2020
@eric-haibin-lin eric-haibin-lin merged commit 29c9572 into dmlc:master Jan 15, 2020
leezu added a commit to haven-jeon/gluon-nlp that referenced this pull request Jan 28, 2020
@eric-haibin-lin eric-haibin-lin deleted the port branch February 2, 2020 06:21
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
release focus Progress focus for release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants