Conversation

@mohitpalsingh (Contributor) commented Jul 12, 2025

max-model-len Feature Implementation

Overview

This implementation adds support for the max-model-len parameter, which defines the model's context window - the maximum number of tokens in a single request including both input and output tokens.

Feature Details

Configuration

  • Parameter: --max-model-len
  • Default Value: 1024 tokens
  • Type: Integer
  • Description: Defines the model's context window size

Validation Logic

When a request is received, the simulator validates:

prompt_tokens + max_completion_tokens <= max_model_len

If this condition is violated, the request is rejected with an HTTP 400 status code.
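
For clarity, here is a condensed sketch of the check. The helper signature follows the utils.go snippet reviewed later in this thread; the merged code was further refined during review, so treat this as illustrative:

// Sketch only; requires: import "fmt"
func validateContextWindow(promptTokens int, maxCompletionTokens *int64, maxModelLen int) error {
	completionTokens := int64(0)
	if maxCompletionTokens != nil {
		completionTokens = *maxCompletionTokens
	}
	totalTokens := int64(promptTokens) + completionTokens
	if totalTokens > int64(maxModelLen) {
		return fmt.Errorf("this model's maximum context length is %d tokens. However, you requested %d tokens (%d in the messages, %d in the completion). Please reduce the length of the messages or completion",
			maxModelLen, totalTokens, promptTokens, completionTokens)
	}
	return nil
}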

Error Response Format

When the context window limit is exceeded, the following error response is returned:

{
  "object": "error",
  "message": "This model's maximum context length is <Z> tokens. However, you requested <Y> tokens (<X> in the messages, <Y> in the completion). Please reduce the length of the messages or completion.",
  "type": "BadRequestError",
  "param": null,
  "code": 400
}

Where:

  • <Z> = max_model_len value
  • <X> = number of tokens in the prompt/messages
  • <Y> = max_tokens (or max_completion_tokens) requested for the completion

Implementation Details

Files Modified

  1. config.go

    • Added MaxModelLen int field to configuration struct
    • Added default value (1024) in newConfig()
    • Added validation in validate() method
  2. simulator.go

    • Added command line flag --max-model-len
    • Added context window validation in handleCompletions()
    • Fixed sendCompletionError() to use correct HTTP status code
  3. utils.go

    • Added validateContextWindow() function for validation logic
  4. request.go

    • Added getMaxCompletionTokens() method to the completionRequest interface
    • Implemented the method for both chat and text completion requests (a sketch follows this list)
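
A rough sketch of what getMaxCompletionTokens() could look like (the struct and field names below are illustrative assumptions, not the simulator's actual types):

// Hypothetical request types for illustration; the real definitions live in request.go.
type chatCompletionRequest struct {
	MaxTokens           *int64 `json:"max_tokens,omitempty"`
	MaxCompletionTokens *int64 `json:"max_completion_tokens,omitempty"`
}

type textCompletionRequest struct {
	MaxTokens *int64 `json:"max_tokens,omitempty"`
}

// Chat completions accept both fields; prefer max_completion_tokens when set.
func (r *chatCompletionRequest) getMaxCompletionTokens() *int64 {
	if r.MaxCompletionTokens != nil {
		return r.MaxCompletionTokens
	}
	return r.MaxTokens
}

// Text completions only carry max_tokens.
func (r *textCompletionRequest) getMaxCompletionTokens() *int64 {
	return r.MaxTokens
}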

Test Coverage

Added comprehensive tests covering:

  • Configuration validation for invalid max-model-len values
  • Context window validation logic (unit tests; a sketch of one such spec follows this list)
  • Integration tests for both chat and text completion APIs
  • HTTP response format validation
  • Successful requests within context window limits
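
For illustration, one of the unit specs could look roughly like this (Ginkgo style, matching the snippets reviewed below; exact values and assertions in the merged tests may differ):

// Sketch of a spec inside the existing Ginkgo suite.
It("should reject a request that exceeds the context window", func() {
	promptTokens := 1000
	maxCompletionTokens := int64(100)
	maxModelLen := 1024
	err := validateContextWindow(promptTokens, &maxCompletionTokens, maxModelLen)
	Expect(err).To(HaveOccurred())
})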

Usage Examples

Command Line

# Start simulator with 2048 token context window
./llm-d-inference-sim --model my-model --max-model-len 2048

# Default 1024 token context window
./llm-d-inference-sim --model my-model

Configuration File

max-model-len: 2048

API Requests

Request that exceeds context window:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "This is a very long prompt..."}],
    "max_tokens": 100
  }'

Response (HTTP 400):

{
  "object": "error",
  "message": "This model's maximum context length is 50 tokens. However, you requested 120 tokens (20 in the messages, 100 in the completion). Please reduce the length of the messages or completion.",
  "type": "BadRequestError",
  "param": null,
  "code": 400
}

Compatibility

  • Works with both chat completion (/v1/chat/completions) and text completion (/v1/completions) endpoints
  • Supports both max_tokens and max_completion_tokens parameters
  • Compatible with streaming and non-streaming requests
  • Maintains backward compatibility (feature is optional with sensible default)

Testing

Run the test suite to verify the implementation:

go test ./pkg/llm-d-inference-sim/ -v

All tests pass (87 of 89 specs ran), including the new context window validation tests.

@irar2 (Collaborator) left a comment

Thanks a lot for the PR! Very thorough. Please see a minor comment

func validateContextWindow(promptTokens int, maxCompletionTokens *int64, maxModelLen int) error {
	if maxModelLen <= 0 {
		return nil // no limit configured
	}
Collaborator:

Is it possible that maxModelLen is <= 0? You added a check in configuration validate()
if c.MaxModelLen < 1 { return errors.New("max model len cannot be less than 1") }

Contributor Author:

Hey @irar2, you're right. This is redundant. Removing this.

Contributor Author:

Done, can you please check now?

It("should pass when no max model length is configured", func() {
promptTokens := 1000
maxCompletionTokens := int64(1000)
maxModelLen := 0
Collaborator:

Likewise

Contributor Author:

Removing this.

Entry(tests[9].name, tests[9].args),
Entry(tests[10].name, tests[10].args),
Entry(tests[11].name, tests[11].args),
Entry(tests[11].name, tests[11].args),
Collaborator:

The merge is incorrect, there are 11 tests in main, and you added another one, so there should be an additional line with
Entry(tests[12].name, tests[12].args)

Contributor Author:

Missed this one, thanks.

Contributor Author:

resolved

@mohitpalsingh requested a review from irar2 on July 13, 2025 at 07:51
@mohitpalsingh (Contributor Author):

Fixed static lint errors


	totalTokens := int64(promptTokens) + completionTokens
	if totalTokens > int64(maxModelLen) {
		return fmt.Errorf("this model's maximum context length is %d tokens. However, you requested %d tokens (%d in the messages, %d in the completion). Please reduce the length of the messages or completion",
Collaborator:

We want to be consistent with vLLM's error messages, which start with a capital letter.
Please return the error message to be sent to the client and use it when creating the HTTP response.

Contributor Author:

I changed that to lowercase to fix a static lint error, but I guess that was only for error messages ending with punctuation. Nevertheless, I fixed it now.

Let me know if there is any other issue.

@mohitpalsingh requested a review from mayabar on July 13, 2025 at 11:13

	totalTokens := int64(promptTokens) + completionTokens
	if totalTokens > int64(maxModelLen) {
		return fmt.Errorf("This model's maximum context length is %d tokens. However, you requested %d tokens (%d in the messages, %d in the completion). Please reduce the length of the messages or completion",
Collaborator:

Hi @mohitpalsingh, the error message cannot be capitalized (the lint error is "error strings should not be capitalized"), and there is actually no need for an error here; we only need the message that is returned in the HTTP response. This is why I suggested returning only the error message and using it in the HTTP response payload.
You can run "make lint" locally to check for lint issues.

Contributor Author:

@mayabar I got the issue now. I've refactored the validation function to return a bool instead and build the error message inside the caller function. This solves both the lint and coherency issues.

Let me know if this approach works.
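
Roughly, the refactored shape is the following (a sketch only; exact names and return values in the merged code may differ):

// Sketch: the helper now just answers "does the request fit?",
// and the caller builds the error message for the HTTP 400 response.
func validateContextWindow(promptTokens int, maxCompletionTokens *int64, maxModelLen int) (int64, int64, bool) {
	completionTokens := int64(0)
	if maxCompletionTokens != nil {
		completionTokens = *maxCompletionTokens
	}
	totalTokens := int64(promptTokens) + completionTokens
	return completionTokens, totalTokens, totalTokens <= int64(maxModelLen)
}

The caller in handleCompletions() then formats the capitalized message and sends it in the HTTP 400 payload, so the lint rule about error strings no longer applies.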

@mayabar (Collaborator) commented Jul 14, 2025

@mohitpalsingh can you please update the readme file with the new command line option, thanks ;)

@mohitpalsingh (Contributor Author):

README is updated with the new flag 🐼

@mohitpalsingh requested a review from mayabar on July 14, 2025 at 08:30
@irar2 merged commit 9f3d093 into llm-d:main on Jul 14, 2025
2 checks passed