[Misc] Make `download_weights_from_hf` more reliable #23863
Signed-off-by: Harry Mellor <[email protected]>
Code Review
This pull request aims to improve the reliability of downloading weights from the Hugging Face Hub by handling potential timeout errors during file listing and iterating through different file patterns. My review focuses on enhancing the robustness of these changes. I've identified a potential `UnboundLocalError` that could lead to a crash and an overly broad exception handler that could mask other issues. The feedback provided addresses these points to ensure the code is more resilient and maintainable.
Thanks @hmellor!
I'm still wondering about whether we should reorg the logic to skip the remote calls altogether in some cases, for example when a revision is provided and already resides locally (with one of the allowed patterns). Or even if rev isn't provided and any of the allowed patterns reside locally.
But it would be good to get this change in first regardless; hopefully it will help a lot with the CI failures.
We could do this one in a followup.
I still think this would make reliably updating a checkpoint extremely difficult.
Yeah, I've actioned both of your comments.
Thanks @hmellor
Make `download_weights_from_hf` more reliable
Issue:
- Timeouts in `HfFileSystem.ls` have been causing issues with vLLM's CI
- These calls are used to reduce `allow_patterns` to 1 pattern

Fix:
- If `HfFileSystem.ls` fails, leave `allow_patterns` unchanged (i.e. when `load_format == "auto"` this is `["*.safetensors", "*.bin", "*.pt"]`)

So what happens?
- If `HfFileSystem.ls` fails, vLLM will still call `snapshot_download` for each `allow_pattern` individually
- `snapshot_download` will fall back to the local cache in the event of `HTTPError`, `ConnectionError`, `Timeout`, making this approach significantly more robust
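The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `list_files` and `download` are hypothetical callables standing in for `HfFileSystem.ls` and a per-pattern `snapshot_download` call, the exception types caught are assumptions, and stopping at the first pattern that yields files is a choice made for this sketch.

```python
import fnmatch
from typing import Callable, Iterable, List, Sequence

# Hypothetical stand-ins: `ListFiles` plays the role of HfFileSystem.ls and
# `Download` plays the role of snapshot_download restricted to one pattern.
ListFiles = Callable[[], Iterable[str]]
Download = Callable[[str], List[str]]


def narrow_patterns(allow_patterns: Sequence[str], list_files: ListFiles) -> List[str]:
    """Reduce allow_patterns to the first pattern that matches a listed file.

    If the listing fails, return allow_patterns unchanged.
    """
    try:
        files = list(list_files())
    except (TimeoutError, ConnectionError):
        # Assumed error types; catch only what the listing call can raise
        # so that unrelated bugs are not masked.
        return list(allow_patterns)
    for pattern in allow_patterns:
        if fnmatch.filter(files, pattern):
            return [pattern]
    return list(allow_patterns)


def download_weights(
    allow_patterns: Sequence[str], list_files: ListFiles, download: Download
) -> List[str]:
    """Try each remaining pattern in turn, stopping at the first that yields files."""
    downloaded: List[str] = []  # bound up front so it is defined even for empty input
    for pattern in narrow_patterns(allow_patterns, list_files):
        downloaded = download(pattern)
        if downloaded:
            break
    return downloaded


# Usage: the listing times out, so every pattern is tried against a fake cache.
def flaky_listing() -> Iterable[str]:
    raise TimeoutError("listing timed out")


def fake_download(pattern: str) -> List[str]:
    return fnmatch.filter(["model.bin"], pattern)


print(download_weights(["*.safetensors", "*.bin", "*.pt"], flaky_listing, fake_download))
```

Passing the listing and download steps in as callables keeps the sketch runnable without network access; real code would call the Hugging Face Hub functions directly.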