[BFCL] Support Dynamic max_tokens for Locally-Hosted Models #712
Merged
Conversation
Fanjia-Yan approved these changes on Nov 2, 2024
LGTM
CharlieJCJ approved these changes on Nov 5, 2024
LGTM
Repository owner deleted a comment from yutongxie58 on Nov 7, 2024
VishnuSuresh27 pushed a commit to VishnuSuresh27/gorilla that referenced this pull request on Nov 11, 2024
…atil#712)
HuanzhiMao added a commit that referenced this pull request on Nov 19, 2024
This is an optimization that can sometimes improve the performance of models with low context windows. According to OpenAI's spec, the `max_tokens` parameter for the completions endpoint is the maximum number of tokens that can be generated; it does not include the input token count. For example, if a model has a context length of 4096, our input message takes 1000 tokens, and you set `max_tokens` to 4096, the request would error because the total number of tokens (1000 in the input and 4096 requested for the output) exceeds the model's context window. So, before calling the completions endpoint, we use the model's tokenizer to count how many tokens the input message `formatted_prompt` has used, subtract that amount from the model's maximum context length, and supply the result as the `max_tokens` argument, so that we can avoid the maximum-length-exceeded error. In short, we want to allow the model to generate as many tokens as possible, up to the limit of its context length.

PS: even for the multi-turn long-context category, the maximum number of tokens used by the ground-truth function call in a single turn is far less than 4096 (calculated using tiktoken), so this should be a safe threshold.

> max: 1299 min: 24 average: 125.96
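For illustration, here is a minimal sketch of the idea described above, assuming a Hugging Face tokenizer is available for the locally-hosted model. The model ID, the `CONTEXT_LENGTH` constant, and the `compute_max_tokens` helper are hypothetical names for this sketch, not the PR's actual API.

```python
# Sketch only: dynamically budget max_tokens from the prompt's token count.
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model, not from the PR
CONTEXT_LENGTH = 4096  # assumed maximum context window for the model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)


def compute_max_tokens(formatted_prompt: str, context_length: int = CONTEXT_LENGTH) -> int:
    """Return the largest completion budget that still fits in the context window."""
    # Count how many tokens the fully formatted prompt already consumes.
    input_token_count = len(tokenizer.encode(formatted_prompt))
    # Leave the remainder of the context window for generation (at least 1 token).
    return max(context_length - input_token_count, 1)


# The resulting value would then be passed as `max_tokens` to the completions
# endpoint instead of a hard-coded limit, e.g.:
# client.completions.create(model=MODEL_ID, prompt=formatted_prompt,
#                           max_tokens=compute_max_tokens(formatted_prompt))
```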