Description
Your current environment
The output of `python collect_env.py` is not included; the bug reproduces in any CUDA environment.
🐛 Describe the bug
I built a KV connector based on the v1 KV connector API. The connector starts a background thread in each worker process. After the main thread calls `save_kv_layer`, the background thread moves data from GPU memory to system memory on a dedicated CUDA stream (`swap_out_stream`).
Here is a simplified version of the logic:
```python
async def run_in_background(self, blk_ids, kv_cache_layer):
    with torch.cuda.stream(self.swap_out_stream):  # dedicated CUDA stream for swap-out
        host_memory = get_available_system_memory()  # find space in system memory
        ops.swaps_out(kv_cache_layer, host_memory, blk_ids)
        event = torch.cuda.Event()
        event.record()
    # Yield to the event loop until the device-to-host copy on swap_out_stream finishes.
    while not event.query():
        await asyncio.sleep(0)
```
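For context, this is roughly how the coroutine is driven from the worker side. The class, loop, and hook names below are illustrative only, not the actual connector code:

```python
import asyncio
import threading

class SwapOutWorker:
    """Illustrative sketch: a per-worker background event loop that runs the
    swap-out coroutine after the main thread calls save_kv_layer."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        # Background thread that owns the event loop.
        self._bg_thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._bg_thread.start()

    def on_save_kv_layer(self, blk_ids, kv_cache_layer):
        # Hand the copy off to the background thread without blocking the
        # main (model-execution) thread. run_in_background is the coroutine
        # shown above.
        asyncio.run_coroutine_threadsafe(
            self.run_in_background(blk_ids, kv_cache_layer), self.loop
        )
```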
When running vLLM with tensor parallelism (tp_size=4), the model's intermediate states sometimes contain invalid values (NaN), which leads to incorrect outputs.
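One thing I am unsure about is the ordering between the compute stream and `swap_out_stream`: if the copy on the side stream is not explicitly ordered after the compute stream's KV writes (and the compute stream can reuse the blocks before the copy completes), the two streams can race on the same blocks. Below is a minimal sketch of that kind of cross-stream synchronization; the function name and the plain `copy_` stand-in for `ops.swaps_out` are assumptions, not the actual connector code, and I have not confirmed this is the cause of the NaNs:

```python
import torch

def swap_out_with_sync(swap_out_stream: torch.cuda.Stream,
                       kv_cache_layer: torch.Tensor,
                       host_memory: torch.Tensor,
                       blk_ids: torch.Tensor) -> torch.cuda.Event:
    """Device-to-host copy on a side stream with explicit ordering against
    the current (compute) stream. Illustrative only."""
    compute_stream = torch.cuda.current_stream()
    with torch.cuda.stream(swap_out_stream):
        # The copy must not start before the compute stream has finished
        # writing this layer's KV blocks.
        swap_out_stream.wait_stream(compute_stream)
        # Stand-in for ops.swaps_out(kv_cache_layer, host_memory, blk_ids):
        # gather the selected blocks and copy them to (ideally pinned) host memory.
        host_memory.copy_(kv_cache_layer[blk_ids], non_blocking=True)
        done = torch.cuda.Event()
        done.record(swap_out_stream)
    # If the compute stream may overwrite these blocks, it should also wait
    # for the copy to finish before reusing them.
    compute_stream.wait_event(done)
    return done
```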