Add DeepSpeed MII backend to benchmark script #1649

Merged · 5 commits into main on Nov 14, 2023

Conversation

WoosukKwon (Collaborator) commented:

This PR adds the DeepSpeed-MII backend to benchmark_throughput.py. The script uses MII's non-persistent pipeline API.
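
For reference, a minimal sketch of how MII's non-persistent pipeline API is typically driven; the model name and generation settings below are placeholders, not the exact values used in benchmark_throughput.py:

```python
# Minimal sketch of DeepSpeed-MII's non-persistent pipeline API.
# The model name and max_new_tokens are placeholders, not the benchmark's settings.
import mii

pipe = mii.pipeline("facebook/opt-1.3b")
responses = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(responses)
```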

WoosukKwon requested review from zhuohan123, simon-mo, and LiuXiaoxuanPKU and removed the review request for simon-mo on Nov 14, 2023 at 00:02
zhuohan123 (Member) left a comment:

LGTM! Left some small comments.

default="vllm")
parser.add_argument("--dataset",
type=str,
required=True,
default=None,
zhuohan123 (Member) commented on the diff:

Why was this line changed?

WoosukKwon (Collaborator, Author) replied:

Oh, it's because users can set fixed input and output lengths instead of providing a dataset.
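
To make the resulting argument layout concrete, here is a rough sketch; the flag names and the validation check are assumptions based on this thread, not the PR's exact code:

```python
# Sketch of the argument layout implied above: --dataset is optional, and fixed
# --input-len / --output-len can be used instead. Names and checks are assumptions.
import argparse

parser = argparse.ArgumentParser(description="Benchmark throughput (sketch).")
parser.add_argument("--backend", type=str, choices=["vllm", "hf", "mii"],
                    default="vllm")
parser.add_argument("--dataset", type=str, default=None,
                    help="Path to the dataset; if omitted, synthetic prompts are used.")
parser.add_argument("--input-len", type=int, default=None,
                    help="Fixed input length for synthetic prompts.")
parser.add_argument("--output-len", type=int, default=None,
                    help="Fixed output length for each request.")
args = parser.parse_args()

if args.dataset is None:
    # Without a dataset, both fixed lengths must be provided.
    assert args.input_len is not None and args.output_len is not None
```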

WoosukKwon merged commit 660a7fc into main on Nov 14, 2023
WoosukKwon deleted the mii branch on Nov 14, 2023 at 20:35
     args.tokenizer, trust_remote_code=args.trust_remote_code)
 if args.dataset is None:
     # Synthesize a prompt with the given input length.
     prompt = "hi" * (args.input_len - 1)
zhuohan123 (Member) commented on the diff:

Should this line be " ".join(["hi"] * args.input_len)? In general, how can you make sure the prompt you generate has the number of tokens you specified with a bunch of "hi"s?

WoosukKwon (Collaborator, Author) replied on Nov 14, 2023:

Yeah, I agree it's a bit hacky. However, I found that this works for LLaMA and OPT because "hi" is a single token in their tokenizers, so "hi" * n is split into n "hi" tokens.
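
A quick way to sanity-check that assumption for a given tokenizer; the model name below is just an example:

```python
# Sketch: check how many tokens "hi" * n actually produces for a given tokenizer.
# The model name is an example; swap in whichever model you benchmark.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
input_len = 128
prompt = "hi" * (input_len - 1)
num_tokens = len(tokenizer(prompt).input_ids)
# Expect roughly input_len; if the tokenizer prepends a BOS token,
# that likely accounts for the "- 1" above.
print(num_tokens)
```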

RezaYazdaniAminabadi commented:

Hi @WoosukKwon

I am trying to reproduce your results, and I run into the following error when running with the vLLM backend:

  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                            
    return forward_call(*args, **kwargs)                                                                                                                                  
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/vllm-0.2.2+cu117-py3.8-linux-x86_64.egg/vllm/model_executor/models/llama.py", line 205, in forward               
    hidden_states = self.self_attn(                                                                                                                                       
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                            
    return forward_call(*args, **kwargs)                                                                                                                                  
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/vllm-0.2.2+cu117-py3.8-linux-x86_64.egg/vllm/model_executor/models/llama.py", line 150, in forward               
    attn_output = self.attn(positions, q, k, v, k_cache, v_cache,                                                                                                         
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                            
    return forward_call(*args, **kwargs)                                                                                                                                  
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/vllm-0.2.2+cu117-py3.8-linux-x86_64.egg/vllm/model_executor/layers/attention.py", line 359, in forward           
    return super().forward(                                                                                                                                               
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/vllm-0.2.2+cu117-py3.8-linux-x86_64.egg/vllm/model_executor/layers/attention.py", line 254, in forward           
    self.multi_query_kv_attention(                                                                                                                                        
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/vllm-0.2.2+cu117-py3.8-linux-x86_64.egg/vllm/model_executor/layers/attention.py", line 109, in multi_query_kv_att
ention                                                                                                                                                                    
    out = xops.memory_efficient_attention_forward(                                                                                                                        
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/xformers/ops/fmha/__init__.py", line 244, in memory_efficient_attention_forward                                  
    return _memory_efficient_attention_forward(                                                                                                                           
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/xformers/ops/fmha/__init__.py", line 337, in _memory_efficient_attention_forward                                 
    op = _dispatch_fw(inp, False)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/xformers/ops/fmha/dispatch.py", line 120, in _dispatch_fw
    return _run_priority_list(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/xformers/ops/fmha/dispatch.py", line 63, in _run_priority_list
    raise NotImplementedError(msg)
NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
     query       : shape=(1, 4096, 32, 128) (torch.float16)
     key         : shape=(1, 4096, 32, 128) (torch.float16)
     value       : shape=(1, 4096, 32, 128) (torch.float16)
     attn_bias   : <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
     p           : 0.0

Do you have any idea how I can resolve this?
Thanks,
Reza

yxl pushed a commit to yxl/vllm that referenced this pull request Nov 29, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024