Add DeepSpeed MII backend to benchmark script #1649

Merged · 5 commits into main on Nov 14, 2023

Conversation

WoosukKwon (Collaborator) commented:

This PR adds the DeepSpeed-MII backend to benchmark_throughput.py. The script uses MII's non-persistent pipeline API.
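
For reference, a minimal sketch of how MII's non-persistent pipeline API is typically driven; the model name and generation settings below are placeholders, not the exact values used in benchmark_throughput.py:

```python
# Minimal sketch of DeepSpeed-MII's non-persistent pipeline API.
# The model name and max_new_tokens are placeholders, not the benchmark's settings.
import mii

pipe = mii.pipeline("facebook/opt-1.3b")
responses = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(responses)
```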

WoosukKwon requested review from zhuohan123, simon-mo, and LiuXiaoxuanPKU and removed the review request for simon-mo on Nov 14, 2023 at 00:02
zhuohan123 (Member) left a comment:

LGTM! Left some small comments.

default="vllm")
parser.add_argument("--dataset",
type=str,
required=True,
default=None,
zhuohan123 (Member) commented on the diff:

Why was this line changed?

WoosukKwon (Collaborator, Author) replied:

Oh, it's because users can set fixed input and output lengths instead of providing a dataset.
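
To make the resulting argument layout concrete, here is a rough sketch; the flag names and the validation check are assumptions based on this thread, not the PR's exact code:

```python
# Sketch of the argument layout implied above: --dataset is optional, and fixed
# --input-len / --output-len can be used instead. Names and checks are assumptions.
import argparse

parser = argparse.ArgumentParser(description="Benchmark throughput (sketch).")
parser.add_argument("--backend", type=str, choices=["vllm", "hf", "mii"],
                    default="vllm")
parser.add_argument("--dataset", type=str, default=None,
                    help="Path to the dataset; if omitted, synthetic prompts are used.")
parser.add_argument("--input-len", type=int, default=None,
                    help="Fixed input length for synthetic prompts.")
parser.add_argument("--output-len", type=int, default=None,
                    help="Fixed output length for each request.")
args = parser.parse_args()

if args.dataset is None:
    # Without a dataset, both fixed lengths must be provided.
    assert args.input_len is not None and args.output_len is not None
```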

WoosukKwon merged commit 660a7fc into main on Nov 14, 2023
WoosukKwon deleted the mii branch on Nov 14, 2023 at 20:35
     args.tokenizer, trust_remote_code=args.trust_remote_code)
 if args.dataset is None:
     # Synthesize a prompt with the given input length.
     prompt = "hi" * (args.input_len - 1)
zhuohan123 (Member) commented on the diff:

Should this line be " ".join(["hi"] * args.input_len)? In general, how can you make sure the prompt you generate has the number of tokens you specified with a bunch of "hi"s?

WoosukKwon (Collaborator, Author) replied on Nov 14, 2023:

Yeah, I agree it's a bit hacky. However, I found that this works for LLaMA and OPT because "hi" is a single token in their tokenizers, so "hi" * n is split into n "hi" tokens.
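
A quick way to sanity-check that assumption for a given tokenizer; the model name below is just an example:

```python
# Sketch: check how many tokens "hi" * n actually produces for a given tokenizer.
# The model name is an example; swap in whichever model you benchmark.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
input_len = 128
prompt = "hi" * (input_len - 1)
num_tokens = len(tokenizer(prompt).input_ids)
# Expect roughly input_len; if the tokenizer prepends a BOS token,
# that likely accounts for the "- 1" above.
print(num_tokens)
```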

RezaYazdaniAminabadi commented:

Hi @WoosukKwon

I am trying to reproduce your results, and I run into the following error when running with the vLLM backend:

  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                            
    return forward_call(*args, **kwargs)                                                                                                                                  
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/vllm-0.2.2+cu117-py3.8-linux-x86_64.egg/vllm/model_executor/models/llama.py", line 205, in forward               
    hidden_states = self.self_attn(                                                                                                                                       
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                            
    return forward_call(*args, **kwargs)                                                                                                                                  
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/vllm-0.2.2+cu117-py3.8-linux-x86_64.egg/vllm/model_executor/models/llama.py", line 150, in forward               
    attn_output = self.attn(positions, q, k, v, k_cache, v_cache,                                                                                                         
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                            
    return forward_call(*args, **kwargs)                                                                                                                                  
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/vllm-0.2.2+cu117-py3.8-linux-x86_64.egg/vllm/model_executor/layers/attention.py", line 359, in forward           
    return super().forward(                                                                                                                                               
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/vllm-0.2.2+cu117-py3.8-linux-x86_64.egg/vllm/model_executor/layers/attention.py", line 254, in forward           
    self.multi_query_kv_attention(                                                                                                                                        
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/vllm-0.2.2+cu117-py3.8-linux-x86_64.egg/vllm/model_executor/layers/attention.py", line 109, in multi_query_kv_att
ention                                                                                                                                                                    
    out = xops.memory_efficient_attention_forward(                                                                                                                        
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/xformers/ops/fmha/__init__.py", line 244, in memory_efficient_attention_forward                                  
    return _memory_efficient_attention_forward(                                                                                                                           
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/xformers/ops/fmha/__init__.py", line 337, in _memory_efficient_attention_forward                                 
    op = _dispatch_fw(inp, False)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/xformers/ops/fmha/dispatch.py", line 120, in _dispatch_fw
    return _run_priority_list(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/xformers/ops/fmha/dispatch.py", line 63, in _run_priority_list
    raise NotImplementedError(msg)
NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
     query       : shape=(1, 4096, 32, 128) (torch.float16)
     key         : shape=(1, 4096, 32, 128) (torch.float16)
     value       : shape=(1, 4096, 32, 128) (torch.float16)
     attn_bias   : <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
     p           : 0.0

Do you have any idea how I can resolve this?
Thanks,
Reza

yxl pushed a commit to yxl/vllm that referenced this pull request Nov 29, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024