
Commit 44b630d

merge with main

Signed-off-by: Sophie du Couédic <[email protected]>

2 parents 46cb6ab + 2d65d56

File tree

15 files changed: +1212 -35 lines changed

.github/workflows/lint_docs.yml

Lines changed: 1 addition & 1 deletion

@@ -29,4 +29,4 @@ jobs:
       - name: Install dependencies
         run: uv sync --frozen --only-group lint
       - name: Lint docs
-        run: tools/lint_docs.sh
+        run: pymarkdownlnt scan docs -r

docker/Dockerfile.amd64

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 # This is a reference dockerfile for vLLM Spyre support on an x86 host
 ARG BASE_IMAGE_URL="quay.io/ibm-aiu/spyre-base"
-ARG BASE_IMAGE_TAG="2025_06_23-amd64"
+ARG BASE_IMAGE_TAG="2025_07_18-amd64"

 ##############################################
 # Base

docs/.nav.yml

Lines changed: 2 additions & 0 deletions

@@ -18,6 +18,7 @@ nav:
   - Developer Guide:
     - Contributing: contributing/README.md
     - Continuous Batching:
+      - Overview: contributing/continuous_batching/overview.md
       - Tests:
         - Output Tests: contributing/continuous_batching/tests/output_tests.md
         - Scheduler Steps Tests: contributing/continuous_batching/tests/scheduler_steps_tests.md
@@ -37,6 +38,7 @@ nav:
   - Developer Guide:
     - Contributing: contributing/README.md
     - Continuous Batching:
+      - Overview: contributing/continuous_batching/overview.md
       - Tests:
         - Output Tests: contributing/continuous_batching/tests/output_tests.md
         - Scheduler Steps Tests: contributing/continuous_batching/tests/scheduler_steps_tests.md
docs/contributing/continuous_batching/overview.md

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@

# Continuous Batching tests / inference scripts in vLLM

A brief overview of what has been implemented so far in vLLM to test and debug continuous batching.
## Inference script

* **File paths:**
    * `examples/offline_inference/cb_spyre_inference.py`
    * `examples/offline_inference/long_context.py`
* **Purpose:** debugging (i.e., manual execution)
### Description

* Runs inference on a set of prompts with continuous batching enabled (the number of prompts is parametrizable).
* Prints the generated text for each sequence.
* All requested sequences are defined at the beginning; no requests join the waiting queue while decoding of other requests is already in progress.
* The exact sequence of prefill and decode steps depends on the values of the parameters `max_num_seqs`, `num-prompts`, and `max-tokens`.
* If `--compare-with-CPU` is set, the output text is compared to that of Hugging Face running on CPU. Note that only the tokens are compared here, not the logprobs.
### Parametrization

For `cb_spyre_inference.py` (a sample invocation follows these lists):

* `--model`: the model to run
* `--max_model_len`: maximum length of a sequence (padded prompt plus decoded tokens); cannot exceed the model's maximum context length
* `--max_num_seqs`: maximum number of sequences processed in a single iteration (decode batch size)
* `--tp`: tensor parallelism degree (number of Spyre cards)
* `--num-prompts`: total number of requested prompts
* `--max-tokens`: number of tokens generated for each requested sequence
* `--compare-with-CPU`: if set, compare the text output with a CPU version running Hugging Face instead of vLLM

For `long_context.py`: the same parameters, but with some differences:

* `--max_prompt_len`: maximum prompt length (prompts will have lengths up to `max_prompt_len`)
* `--max-tokens` cannot be specified: the number of generated tokens is set automatically given `max_model_len` and the prompt lengths
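As a quick illustration, a debugging run might look like the sketch below; the flag values are arbitrary examples, and the model shown is simply the default test model mentioned later on this page, not a required choice:

```bash
# Illustrative invocation only: flag values are examples, not recommendations.
python examples/offline_inference/cb_spyre_inference.py \
    --model ibm-ai-platform/micro-g3.3-8b-instruct-1b \
    --max_model_len 2048 \
    --max_num_seqs 2 \
    --num-prompts 8 \
    --max-tokens 20 \
    --compare-with-CPU
```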
## CB tests through unit tests

!!! abstract "In Short"
    See the detailed description of the individual unit tests for continuous batching directly in their respective files.

* [Output Tests](tests/output_tests.md): check the correctness of the final output logits/tokens of sequences run with continuous batching enabled
* [Scheduler Steps Tests](tests/scheduler_steps_tests.md): check the correctness of the step-by-step execution of continuous batching for different scenarios of prompt lengths and requested tokens
* [Other Tests](tests/other_tests.md): other tests verifying various behaviours of vLLM when running with continuous batching enabled

* **Purpose:** automated execution to verify that a specific behaviour acts as expected (passing/failing)

* **File paths:**
    * Output Tests: `vllm-spyre/tests/e2e/test_spyre_basic.py`
    * Scheduler Steps Tests: `vllm-spyre/tests/e2e/test_spyre_cb_scheduler_steps.py`
    * Other Tests: various files, including `vllm-spyre/tests/e2e/test_spyre_cb.py`

<!-- markdownlint-disable MD031 MD046 -->

### Usage (when running locally)
#### Commands

    # Runs all the tests
    python -m pytest -svx -m "spyre and cb" --forked tests

    # Runs a specific test file
    python -m pytest -svx -m "spyre and cb" --forked tests/e2e/test_spyre_cb_scheduler_steps.py

    # Runs a specific test function
    python -m pytest -svx -m "spyre and cb" --forked tests/e2e/test_spyre_basic.py::test_output

<!-- markdownlint-enable MD031 MD046 -->
#### Parameters description

* `-x` option: stops the execution as soon as a test fails
* `-s` option: shows all the print statements in the code
* `-v` option: verbose mode; makes the test output more detailed by showing the name of each test function and whether it passed, failed, or was skipped
* `--forked` option: isolates the tests, so that one crashing test does not impact the others
* `-m "spyre and cb"`: runs only the tests with configurations marked as "spyre" and "cb"

!!! tip
    To run a test with a different model than the default `ibm-ai-platform/micro-g3.3-8b-instruct-1b`, run the test with the `VLLM_SPYRE_TEST_MODEL_LIST` environment variable set to the target model, for example:

    ```bash
    VLLM_SPYRE_TEST_MODEL_LIST='tiny-granite-3.2-8b' python -m pytest -svx -m "spyre and cb" --forked tests/e2e/test_spyre_cb.py
    ```
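To home in on a single failing combination of a parametrized test, pytest's standard `-k` expression filter can be combined with the commands above; the test-id fragment in this sketch is hypothetical and depends on the actual parametrization:

```bash
# Hypothetical example: run only the parametrized cases whose generated test id
# matches the -k expression. "max_num_seqs2" is an illustrative id; use
# `pytest --collect-only` to list the real test ids.
python -m pytest -svx -m "spyre and cb" --forked \
    tests/e2e/test_spyre_cb_scheduler_steps.py -k "max_num_seqs2"
```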
### Description

Unit tests are designed for automated and systematic execution to verify that CB behaves as expected in different scenarios. For each scenario (i.e. configuration of parameters), the test either passes or fails. When a test suite fails, identifying which specific test case failed is often more informative than the failure message itself. Below is a brief description of the different unit tests targeting CB. The descriptions can also be found in the docstrings of the test functions:

!!! caution
    When adding new parametrization to a test, the parameters typically combine combinatorially, so the number of executed tests can grow very fast. For example, the following test function runs 2 x 2 = 4 different scenarios in total:

    ```python
    @pytest.mark.parametrize("model", ["micro-g3.3-8b-instruct-1b", "granite-3.3-8b-instruct"])
    @pytest.mark.parametrize("max_tokens", [[10, 20], [60, 78]])
    def test_function(model: str, max_tokens: list[int]):
        ...
    ```
#### Output Tests

See [Output Tests](tests/output_tests.md)

Output tests check the correctness of the output of CB on a set of prompts. For now, the number of prompts and the prompts themselves are hardcoded, as is the maximum number of requested tokens per prompt (constant and set to 20). The output from vLLM is compared to that of Hugging Face on CPU.

!!! note inline end
    This applies to the sendnn backend; on CPU, the tokens additionally need to be exactly the same for the test to pass.

* Pass criterion: the logprobs of HF on CPU and vLLM (on Spyre or CPU, depending on the backend) are compared, and the test passes only if the pairwise relative differences of the values are all below a threshold: `math.isclose(hf_logprob, vllm_logprob, rel_tol=0.35)`. Otherwise it fails. There is no logic accounting for the fact that the tokens might diverge at some point, making the logits diverge as well. A sketch of this comparison follows below.
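As a rough illustration of the pass criterion, the comparison could be structured like the sketch below; the function name and the flat list-of-logprobs layout are assumptions, and only the `math.isclose(..., rel_tol=0.35)` check is taken from the description above:

```python
import math

# Sketch only: names and data layout are hypothetical;
# the rel_tol=0.35 tolerance check is the documented pass criterion.
def outputs_match(hf_logprobs: list[float], vllm_logprobs: list[float]) -> bool:
    """Return True if all pairwise relative differences are within tolerance."""
    return all(
        math.isclose(hf_lp, vllm_lp, rel_tol=0.35)
        for hf_lp, vllm_lp in zip(hf_logprobs, vllm_logprobs)
    )
```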
#### Scheduler Steps Tests

See [Scheduler Steps Tests](tests/scheduler_steps_tests.md)

!!! Question
    For these tests, the final output is not checked, only the correctness of the step-by-step execution. Would it make sense to have output validation though?

Checking the final output correctness alone is not enough to ensure that CB is correctly implemented (otherwise, how could we differentiate it from static batching, for example?). The scheduler steps tests are therefore meant to check the correctness of the step-by-step execution of continuous batching. They do so by comparing a set of attributes at every engine step (i.e. prefill or decode iteration). This allows finer-grained testing of the padding and scheduling implementation; a sketch of the checking pattern follows the list below.

* **Checked attributes at each step:**
    * `tkv`: after each step, the tkv is compared against the expected tkv value for that step
    * `waiting`, `running`, `request_outputs`, `finished_requests`: not really relevant from a compiler point of view, but after each iteration we check that the lists of running, waiting, and finished requests are correct; this tests the scheduler's correctness
    * (waiting to be merged, PR #261): `n_reserved_blocks` and `n_used_blocks`
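As an illustration of this pattern, a per-step check could look like the following sketch; the `expected_steps` layout, the stand-in engine-step helper, and the attribute access are all hypothetical, and only the checked attribute names come from the list above:

```python
from types import SimpleNamespace

# Hypothetical sketch: run_one_engine_step() and expected_steps are invented
# for illustration; tkv, waiting, running, and finished_requests are the
# attribute names taken from the list above.
def run_one_engine_step() -> SimpleNamespace:
    """Stand-in for one prefill/decode iteration of the real engine."""
    return SimpleNamespace(tkv=64, waiting=[], running=["seq0"], finished_requests=[])

expected_steps = [
    # One entry per engine step (prefill or decode iteration).
    {"tkv": 64, "waiting": [], "running": ["seq0"], "finished_requests": []},
]

for step_idx, expected in enumerate(expected_steps):
    engine_state = run_one_engine_step()
    for attr, expected_value in expected.items():
        actual = getattr(engine_state, attr)
        assert actual == expected_value, (
            f"step {step_idx}: {attr}={actual!r}, expected {expected_value!r}"
        )
```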
#### Other Tests

See [Other Tests](tests/other_tests.md)

Most of the other tests primarily verify the correctness of various behaviours of the vLLM Spyre plugin, such as launching the online server or enforcing scheduler constraints. While they don't always directly target the correctness of continuous batching, they ensure that the system functions as expected when continuous batching is enabled.
