
Conversation

ucalyptus (Contributor) commented:

Overview

This pull request introduces a new optional flag, enforce_eager, to the vLLM serving script. It gives users control over the model's execution mode, letting them choose between pure eager execution and the default hybrid CUDA graph approach depending on their needs. See https://docs.vllm.ai/en/stable/api/offline_inference/llm.html#vllm.LLM

Changes Made

  • Added a new boolean parameter enforce_eager_flag to the ScriptArguments class in trl/scripts/vllm_serve.py
  • Updated the main function to pass this flag as enforce_eager to the model configuration (see the sketch after this list)
  • Set the default value to False so existing behavior is unchanged
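
For reference, a minimal sketch of this wiring, assuming a dataclass-based argument parser; every name except vllm.LLM's documented enforce_eager parameter is illustrative, not the exact diff:

```python
# Minimal sketch of the wiring; everything except vllm.LLM's documented
# enforce_eager parameter is illustrative, not the exact diff.
from dataclasses import dataclass, field

from vllm import LLM


@dataclass
class ScriptArguments:
    model: str = field(metadata={"help": "Model name or path to serve."})
    enforce_eager_flag: bool = field(
        default=False,  # False keeps the default hybrid CUDA graph behavior
        metadata={"help": "Use eager-mode PyTorch instead of building CUDA graphs."},
    )


def main(script_args: ScriptArguments) -> None:
    llm = LLM(
        model=script_args.model,
        enforce_eager=script_args.enforce_eager_flag,  # the new pass-through
    )
    # ... hand `llm` to the serving loop ...
```

With a parser that exposes dataclass fields as CLI options, the flag would then be set with something like --enforce_eager_flag True when launching the script.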

Benefits

  • Memory Optimization: Eager execution skips CUDA graph capture, which can reduce GPU memory usage
  • Performance Flexibility: Lets users trade the potential speed of CUDA graphs for lower memory overhead
  • Better Control: Gives explicit control over the execution strategy without modifying source code

Technical Details

The changes primarily affect:

  • trl/scripts/vllm_serve.py - Added the new flag to the argument parser and passed it to the model configuration

Testing

Verified the flag works correctly in both states:

  • When enforce_eager_flag=True: Model uses pure eager execution
  • When enforce_eager_flag=False (default): Model uses hybrid CUDA graph execution
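
A hedged sanity check along these lines (the model name is illustrative, and each mode is best run in a separate process, since vLLM pre-allocates most of the GPU memory):

```python
# Hedged verification sketch; the model name is illustrative, and each mode
# is best run in a separate process because vLLM pre-allocates GPU memory.
from vllm import LLM, SamplingParams


def smoke_test(enforce_eager: bool) -> None:
    llm = LLM(model="facebook/opt-125m", enforce_eager=enforce_eager)
    out = llm.generate(["Hello, world"], SamplingParams(max_tokens=8))
    print(f"enforce_eager={enforce_eager}: {out[0].outputs[0].text!r}")


if __name__ == "__main__":
    smoke_test(enforce_eager=True)  # pure eager; False exercises CUDA graphs
```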

Related Issues

Fixes #XXXX


Who Can Review?

Anyone in the community is welcome to review once tests have passed. Team members with expertise in model serving or CUDA optimization would be particularly valuable reviewers.

Enable the user to set eager execution instead of building CUDA graphs, to save memory.
ucalyptus (Contributor, Author) commented:

@qgallouedec

ucalyptus and others added 3 commits April 21, 2025 16:38
@qgallouedec (Member) left a comment


LGTM! Thanks

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec changed the title from "Add Optional Eager Execution Mode for VLLM Serving" to "🙋 Add Optional Eager Execution Mode for vLLM Serving" on Apr 21, 2025
@qgallouedec merged commit b4ffda7 into huggingface:main on Apr 21, 2025 (9 checks passed)