feat: Prefetch safetensors files before loading them #4140
Conversation
/bot run
PR_Github #4469 [ run ] triggered by Bot
PR_Github #4469 [ run ] completed with state
Force-pushed from 8e825a1 to fdeee42
/bot run
PR_Github #4525 [ run ] triggered by Bot
PR_Github #4525 [ run ] completed with state
Force-pushed from fdeee42 to 0616016
/bot run
PR_Github #4611 [ run ] triggered by Bot
PR_Github #4611 [ run ] completed with state
/bot reuse-build
GitHub Bot Help
Provide a user-friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.

run
Launch build/test pipelines. All previously running jobs will be killed.

kill
Kill all running builds associated with pull request.

skip
Skip testing for latest commit on pull request.

reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
/bot reuse-pipeline
PR_Github #4688 [ reuse-pipeline ] triggered by Bot
/bot kill
Force-pushed from c670ae9 to 3b80ccb
/bot run
PR_Github #4689 [ kill ] triggered by Bot
PR_Github #4688 [ reuse-pipeline ] completed with state
PR_Github #4689 [ kill ] completed with state
PR_Github #4690 [ run ] triggered by Bot
PR_Github #4690 [ run ] completed with state
/bot run
PR_Github #4802 [ run ] triggered by Bot
PR_Github #4802 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #4837 [ run ] triggered by Bot
PR_Github #4837 [ run ] completed with state
Prefetch safetensors files so that they are stored in the system file cache. This significantly speeds up model weight loading on the very first run after entering the Docker container.
This helps because model weights are loaded layer by layer, which means reading the safetensors files chunk by chunk; small chunked reads cannot utilize the network bandwidth well when the files live on a network drive. Reading the whole files in bulk instead achieves much higher bandwidth utilization.
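A minimal sketch of the bulk-read idea (illustrative only, not the code in this PR; the function name and block size are made up):

```python
import glob
import os


def prefetch_safetensors(checkpoint_dir: str, block_size: int = 16 * 1024 * 1024) -> None:
    """Read every *.safetensors file front to back in large blocks.

    The bytes are discarded; the goal is only that the sequential bulk
    reads populate the OS file cache, so the later chunk-by-chunk reads
    during layer-by-layer weight loading hit the cache instead of the
    network drive.
    """
    for path in sorted(glob.glob(os.path.join(checkpoint_dir, "*.safetensors"))):
        with open(path, "rb") as f:
            while f.read(block_size):
                pass
```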
When running with world_size > 1, all ranks prefetch these files collaboratively.
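One plausible way to split that work (a hypothetical sketch; the PR's actual coordination scheme may differ) is to stripe the files across ranks round-robin:

```python
def prefetch_collaboratively(paths: list[str], rank: int, world_size: int,
                             block_size: int = 16 * 1024 * 1024) -> None:
    """Stripe files across ranks round-robin: rank r prefetches files
    r, r + world_size, r + 2 * world_size, ... Each rank reads roughly
    1/world_size of the total bytes, and since ranks on the same node
    share the OS file cache, every rank benefits from the pages the
    others pulled in.
    """
    for i, path in enumerate(sorted(paths)):
        if i % world_size == rank:
            with open(path, "rb") as f:
                while f.read(block_size):
                    pass
```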
In theory, a heuristic should decide whether to prefetch the files at all, but that is beyond the scope of this commit. For example, when CPU memory is small, prefetching may cause file-cache thrashing and actually slow down weight loading.
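Such a heuristic might compare the checkpoint size against available physical memory (a hypothetical sketch; the helper name and headroom factor are made up, and the sysconf keys are Linux-only):

```python
import os


def should_prefetch(paths: list[str]) -> bool:
    """Only prefetch when the checkpoint fits comfortably in free RAM,
    to avoid evicting useful pages (file-cache thrashing)."""
    total_bytes = sum(os.path.getsize(p) for p in paths)
    # Available physical memory via POSIX sysconf (Linux-only keys).
    avail_bytes = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    # 0.5 is an arbitrary headroom factor, leaving room for the model itself.
    return total_bytes < 0.5 * avail_bytes
```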