# Continuous Batching tests / inference scripts in vLLM

A brief overview of what has been implemented so far in vLLM to test and debug continuous batching.

## Inference scripts

* **File paths:**
    * `examples/offline_inference/cb_spyre_inference.py`
    * `examples/offline_inference/long_context.py`
* **Purpose:** Debugging (i.e. manual execution)

### Description

* Runs inference on a set of prompts with continuous batching enabled (the number of prompts is parametrizable).
* Prints the generated text for each sequence.
* All the requested sequences are defined at the beginning; no requests join the waiting queue while decoding of other requests is already in progress.
* The exact sequence of prefill and decode steps depends on the values of `max_num_seqs`, `num-prompts`, and `max-tokens`.
* If `--compare-with-CPU` is set, the output text is compared to that of Hugging Face running on CPU. Note that only tokens are compared here, not logprobs.

### Parametrization

For `cb_spyre_inference.py` (an example invocation follows the list):

* `--model`: The model to run
* `--max_model_len`: Maximum length of a sequence (padded prompt plus decoded tokens); cannot exceed the model's maximum context length
* `--max_num_seqs`: Maximum number of sequences processed in a single iteration (decode batch size)
* `--tp`: Tensor parallelism (number of Spyre cards)
* `--num-prompts`: Total number of requested prompts
* `--max-tokens`: Number of tokens generated for each requested sequence
* `--compare-with-CPU`: If set, compare the text output with a CPU version running with Hugging Face instead of vLLM
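
For example, a typical run might look like the following (the parameter values here are purely illustrative, not prescribed defaults):

```bash
python examples/offline_inference/cb_spyre_inference.py \
    --model ibm-ai-platform/micro-g3.3-8b-instruct-1b \
    --max_model_len 2048 \
    --max_num_seqs 4 \
    --tp 1 \
    --num-prompts 8 \
    --max-tokens 20 \
    --compare-with-CPU
```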

For `long_context.py`: the same parameters, with the following differences (an illustrative invocation follows the list):

* `--max_prompt_len`: Maximum prompt length (prompts will have lengths up to `max_prompt_len`)
* `--max-tokens` cannot be specified: the number of generated tokens is set automatically based on `max_model_len` and the prompt lengths
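
Similarly, an illustrative invocation of `long_context.py` (again with made-up values; note the absence of `--max-tokens`):

```bash
python examples/offline_inference/long_context.py \
    --model ibm-ai-platform/micro-g3.3-8b-instruct-1b \
    --max_model_len 8192 \
    --max_prompt_len 6144 \
    --max_num_seqs 4 \
    --tp 1 \
    --num-prompts 8
```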

## CB tests through unit tests

!!! abstract "In Short"
    See the detailed description of the individual continuous batching unit tests directly in their respective files.

    * [Output Tests](tests/output_tests.md): Check the correctness of the final output logits/tokens of sequences run with continuous batching enabled
    * [Scheduler Steps Tests](tests/scheduler_steps_tests.md): Check the correctness of the step-by-step execution of continuous batching for different scenarios of prompt lengths and requested tokens
    * [Other Tests](tests/other_tests.md): Other tests verifying various behaviours of vLLM when running with continuous batching enabled

* **Purpose:** Automated execution to verify that a specific behaviour works as expected (passing/failing)

* **File paths:**
    * Output Tests: `vllm-spyre/tests/e2e/test_spyre_basic.py`
    * Scheduler Steps Tests: `vllm-spyre/tests/e2e/test_spyre_cb_scheduler_steps.py`
    * Other Tests: various files, including `vllm-spyre/tests/e2e/test_spyre_cb.py`

<!-- markdownlint-disable MD031 MD046 -->

### Usage (when running locally)

#### Commands

    # Runs all the tests
    python -m pytest -svx -m "spyre and cb" --forked tests

    # Runs a specific test file
    python -m pytest -svx -m "spyre and cb" --forked tests/e2e/test_spyre_cb_scheduler_steps.py

    # Runs a specific test function
    python -m pytest -svx -m "spyre and cb" --forked tests/e2e/test_spyre_basic.py::test_output

<!-- markdownlint-enable MD031 MD046 -->

#### Parameter descriptions

* `-x` option: stops the execution as soon as a test fails
* `-s` option: shows all the print statements in the code
* `-v` option: verbose mode, makes the test output more detailed: shows the name of each test function and whether it passed, failed, or was skipped
* `--forked` option: runs each test in a separate process, so a crash in one test does not impact the others
* `-m "spyre and cb"`: runs only the tests with configurations marked as "spyre" and "cb"

!!! tip
    To run a test with a different model than the default `ibm-ai-platform/micro-g3.3-8b-instruct-1b`, run the test with the `VLLM_SPYRE_TEST_MODEL_LIST` environment variable set to the target model, for example:
    ```bash
    VLLM_SPYRE_TEST_MODEL_LIST='tiny-granite-3.2-8b' python -m pytest -svx -m "spyre and cb" --forked tests/e2e/test_spyre_cb.py
    ```

### Description

Unit tests are designed for automated and systematic execution to verify that CB behaves as expected in different scenarios. For each scenario (i.e. configuration of parameters), the test either passes or fails. When a test suite fails, identifying which specific test case failed is often more informative than the failure message itself. Below is a brief description of the different unit tests targeting CB; the description can also be found in the docstrings of the test functions.

!!! caution
    When adding new parametrization to a test, the parameters typically combine combinatorially and the number of executed tests can grow very quickly. For example, the following test function runs 2 x 2 = 4 different scenarios in total:
    ```python
    @pytest.mark.parametrize("model", ["micro-g3.3-8b-instruct-1b", "granite-3.3-8b-instruct"])
    @pytest.mark.parametrize("max_tokens", [[10, 20], [60, 78]])
    def test_function(model: str, max_tokens: list[int]):
        ...
    ```

#### Output Tests

See [Output Tests](tests/output_tests.md)

Output tests check the correctness of the output of CB on a set of prompts. For now, the prompts themselves and their number are hardcoded, as is the maximum number of requested tokens per prompt (constant, set to 20). The output from vLLM is compared to that of Hugging Face on CPU.

!!! note inline end
    This applies to the sendnn backend; on CPU the tokens must additionally match exactly for the test to pass.
* The logprobs of HF on CPU and vLLM (on Spyre or CPU, depending on the backend) are compared, and the test passes only if all pairwise relative differences are below a threshold: `math.isclose(hf_logprob, vllm_logprob, rel_tol=0.35)`; otherwise it fails (see the sketch below). There is no logic accounting for the possibility that the tokens diverge at some point, which would in turn make the logits diverge.
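
As a rough sketch of that pass criterion, assuming lists of per-token logprobs from both runs (the function and variable names below are illustrative, not the actual test code):

```python
import math

def logprobs_match(hf_logprobs: list[float], vllm_logprobs: list[float],
                   rel_tol: float = 0.35) -> bool:
    """Pairwise comparison of HF (CPU) and vLLM logprobs, as described above."""
    return all(
        math.isclose(hf, vllm, rel_tol=rel_tol)
        for hf, vllm in zip(hf_logprobs, vllm_logprobs)
    )
```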

#### Scheduler Steps Tests

See [Scheduler Steps Tests](tests/scheduler_steps_tests.md)

!!! Question
    For these tests, the final output is not checked, only the correctness of the step-by-step execution. Would it make sense to have output validation though?

Checking the correctness of the final output alone is not enough to ensure that CB is correctly implemented (otherwise, how could we differentiate it from static batching, for example?). The scheduler steps tests therefore check the correctness of the step-by-step execution of continuous batching. They do so by comparing a set of attributes at every engine step (i.e. prefill or decode iteration), which allows finer-grained testing of the padding and scheduling implementation.

* **Checked attributes at each step** (a hypothetical per-step expectation is sketched after this list):
    * `tkv`: after each step, the tkv is compared against the expected tkv value for that step
    * `waiting`, `running`, `request_outputs`, `finished_requests`: not really relevant from a compiler point of view, but after each iteration we check that the lists of running, waiting, and finished requests are correct; this tests the scheduler's correctness
    * (waiting to be merged, PR #261): `n_reserved_blocks` and `n_used_blocks`
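
For illustration only, a per-step expectation could be written down roughly as follows, using the attribute names above (the actual structure used in the test file may differ):

```python
# Hypothetical per-step expectation; the field names mirror the checked
# attributes listed above, but the real structure lives in
# test_spyre_cb_scheduler_steps.py.
expected_step = {
    "step": 3,                      # engine step index (prefill or decode)
    "tkv": 128,                     # expected tkv value after this step
    "waiting": ["seq-2"],           # requests still in the waiting queue
    "running": ["seq-0", "seq-1"],  # requests currently in the decode batch
    "request_outputs": ["seq-0"],   # requests producing an output this step
    "finished_requests": [],        # requests that finished at this step
}
```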

#### Other Tests

See [Other Tests](tests/other_tests.md)

Most of the other tests primarily verify the correctness of various behaviours of the vLLM Spyre plugin, such as launching the online server or enforcing scheduler constraints. While they don't always directly target the correctness of continuous batching, they ensure that the system functions as expected when continuous batching is enabled.