Skip to content

Conversation

hmellor
Copy link
Member

@hmellor hmellor commented Aug 28, 2025

Issue:

  • Uncaught timeout errors in HfFileSystem.ls have been causing issues with vLLM's CI
  • This method is being used to determine the file format of the weights in the HF repo and reduce the length of allow_patterns to 1

Fix:

  • Allow this check to fail, leaving allow_patterns unchanged (i.e. when load_format == "auto" this is ["*.safetensors", "*.bin", "*.pt"])
  • Then, check each pattern one at a time and break the loop as soon as something has been downloaded

So what happens?

  • The vast majority of the time, this will behave exactly the same as before this PR
  • In the event that HfFileSystem.ls fails, vLLM will still call snapshot_download for each allow_pattern individually
  • snapshot_download will fall back to the local cache in the event of HTTPError, ConnectionError, Timeout, making this approach significantly more robust

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to improve the reliability of downloading weights from the Hugging Face Hub by handling potential timeout errors during file listing and iterating through different file patterns. My review focuses on enhancing the robustness of these changes. I've identified a potential UnboundLocalError that could lead to a crash and an overly broad exception handler that could mask other issues. The feedback provided addresses these points to ensure the code is more resilient and maintainable.

Signed-off-by: Harry Mellor <[email protected]>
Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hmellor!

I'm still wondering about whether we should reorg the logic to skip the remote calls altogether in some cases, for example when a revision is provided and already resides locally (with one of the allowed patterns). Or even if rev isn't provided and any of the allowed patterns reside locally.

But would be good to get this change in first regardless, hopefully it will help a lot with the CI failures

Signed-off-by: Harry Mellor <[email protected]>
@hmellor
Copy link
Member Author

hmellor commented Aug 29, 2025

I'm still wondering about whether we should reorg the logic to skip the remote calls altogether in some cases, for example when a revision is provided and already resides locally (with one of the allowed patterns).

We could do this one in a followup.

Or even if rev isn't provided and any of the allowed patterns reside locally.

I still think this would make reliably updating a checkpoint extremely difficult.

But would be good to get this change in first regardless, hopefully it will help a lot with the CI failures

Yeah, I've actioned both of your comments.

@hmellor hmellor requested a review from njhill August 29, 2025 12:03
Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hmellor

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 29, 2025
@njhill njhill changed the title Make download_weights_from_hf more reliable [Misc] Make download_weights_from_hf more reliable Aug 29, 2025
@njhill njhill merged commit 5674a40 into vllm-project:main Aug 29, 2025
48 checks passed
@hmellor hmellor deleted the hf-download-reliable branch August 29, 2025 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants