
Conversation


@HuanzhiMao commented on Aug 7, 2024

This PR introduces multi-threading to parallelize API calls to the hosted model endpoints, significantly speeding up model response generation.

Users can specify the number of threads to use for parallel inference by setting the --num-threads flag. The default is 1, meaning no parallel inference.
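For context, here is a minimal sketch of what threaded inference against a hosted endpoint typically looks like. The names call_endpoint and generate_responses are hypothetical stand-ins for illustration, not the actual functions changed in this PR:

```python
from concurrent.futures import ThreadPoolExecutor

def call_endpoint(prompt):
    # Placeholder for one blocking HTTP request to a hosted model API.
    return f"response to {prompt!r}"

def generate_responses(prompts, num_threads=1):
    # num_threads=1 preserves the original sequential behavior.
    if num_threads == 1:
        return [call_endpoint(p) for p in prompts]
    # The calls are I/O-bound, so a thread pool overlaps the network waits.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(call_endpoint, prompts))
```

With, say, --num-threads 8, requests fan out across eight worker threads, so wall-clock time is bounded by the slowest responses rather than the sum of all of them.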


@CharlieJCJ left a comment


Tested on hosted models (e.g. GPT, Claude, Cohere, Gorilla Openfunctions models). LGTM


@ShishirPatil left a comment


Tested by @CharlieJCJ

@ShishirPatil merged commit de8307b into ShishirPatil:main on Aug 12, 2024
@HuanzhiMao deleted the parallel-inference branch on August 12, 2024 at 06:01
@ShishirPatil commented:

Hey @HuanzhiMao, can you also update the README with this flag?
