
Conversation

@yuanlehome
Collaborator

@yuanlehome yuanlehome commented Aug 6, 2024

PR types

Others

PR changes

Others

Description

Refactor some code and fix several bugs.

One important note: if src_length and max_length are specified for static-graph inference, the same values must also be specified when exporting the model (dynamic-to-static conversion).
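To make the constraint concrete, here is an illustrative sketch (the argument names mirror the ones used in the test changes further down; the dicts simply stand in for the CLI flags of the export and inference scripts):

```python
# Illustration only: the lengths used at static-graph inference time must equal the
# ones used when the model was exported (dynamic-to-static), since the exported
# program is built around those values.
export_args = {"inference_model": True, "block_attn": True, "src_length": 1024, "max_length": 48}
infer_args = {"inference_model": True, "block_attn": True, "src_length": 1024, "max_length": 48}

assert (export_args["src_length"], export_args["max_length"]) == (
    infer_args["src_length"],
    infer_args["max_length"],
), "src_length/max_length at static-graph inference must match the export values"
```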

@paddle-bot

paddle-bot bot commented Aug 6, 2024

Thanks for your contribution!

@yuanlehome yuanlehome mentioned this pull request Aug 6, 2024
@codecov

codecov bot commented Aug 6, 2024

Codecov Report

Attention: Patch coverage is 5.55556% with 17 lines in your changes missing coverage. Please review.

Project coverage is 55.40%. Comparing base (678843e) to head (88ea827).
Report is 246 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...dlenlp/experimental/transformers/llama/modeling.py | 0.00% | 11 Missing ⚠️ |
| paddlenlp/experimental/model_utils.py | 0.00% | 6 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8879      +/-   ##
===========================================
+ Coverage    55.29%   55.40%   +0.10%     
===========================================
  Files          631      632       +1     
  Lines        98888    99762     +874     
===========================================
+ Hits         54681    55271     +590     
- Misses       44207    44491     +284     

☔ View full report in Codecov by Sentry.

@yuanlehome yuanlehome changed the title from TEMP to [LLM Inference] Refactor BlockInferencePredictor Aug 8, 2024
)
predictor.model.config.save_pretrained(export_args.output_path)
predictor.model.generation_config.save_pretrained(export_args.output_path)
if predictor.generation_config is not None:
Collaborator Author

Fix a bug where generation_config.json was not saved correctly.
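A hedged sketch of the corrected behaviour (the predictor/export_args names come from the hunk above; the fallback branch is my assumption, not copied from the diff):

```python
def save_inference_configs(predictor, output_path):
    # Save the model config as before, then prefer the generation config the predictor
    # actually loaded; only fall back to the model's own generation config if none exists.
    predictor.model.config.save_pretrained(output_path)
    if predictor.generation_config is not None:
        predictor.generation_config.save_pretrained(output_path)
    else:
        predictor.model.generation_config.save_pretrained(output_path)
```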


 PD_BUILD_OP(get_padding_offset_v2)
-    .Inputs({"input_ids", "token_num", "cum_offsets", "seq_len"})
+    .Inputs({"input_ids", "cum_offsets", "token_num", "seq_len"})
Collaborator Author

Fix the operator's declared inputs being in the wrong order.

Contributor

Curious how this managed to run correctly before.

Collaborator Author

It only affected the order of the names; the tensor order was correct.
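A tiny pure-Python analogy (not the real custom-op machinery) of why this was harmless at runtime: the arguments are bound by position, so only the labels were wrong:

```python
# The caller passes (input_ids, cum_offsets, token_num, seq_len) positionally.
# With the old declaration the two middle *names* were swapped, but each parameter
# still receives the value passed at its position, so the computation was unaffected.
def op_with_swapped_names(input_ids, token_num, cum_offsets, seq_len):
    return input_ids, token_num, cum_offsets, seq_len

print(op_with_swapped_names("ids", "cum_offsets_data", "token_num_data", "seq_len_data"))
# -> ('ids', 'cum_offsets_data', 'token_num_data', 'seq_len_data'): data order unchanged,
#    even though the second and third parameter names no longer describe their contents.
```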

     seq_len * rotary_emb_dims,
     last_dim);
-    NeoXRotaryKernel<<<grid, BlockSize, 0, cu_stream>>>(
+    NeoXRotaryKernel<<<grid_k, BlockSize, 0, cu_stream>>>(
Collaborator Author

Fix a computation bug in the GQA kernel.
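My reading of the one-line change (grid -> grid_k), with assumed numbers: under grouped-query attention K/V have fewer heads than Q, so the rotary kernel launched for K needs a grid derived from the kv-head count:

```python
# Assumed head counts for illustration only; the real grid layout lives in the CUDA kernel.
num_q_heads, num_kv_heads = 32, 8   # GQA: several query heads share each kv head
batch_size, seq_len = 2, 128

grid = batch_size * num_q_heads * seq_len     # sized for rotating Q
grid_k = batch_size * num_kv_heads * seq_len  # what the K launch should use instead
print(grid, grid_k)  # 8192 2048: a grid sized for Q does not match the smaller K head dimension
```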

Comment on lines -34 to -35
-    get_default_max_decoding_length,
-    get_default_max_encoding_length,
Collaborator Author

Removed the logic that auto-configures src_length and max_length and gave them explicit default values instead. The auto-configuration was not robust: it is not compatible with all models and raised errors in some cases.

tokenizer.init_chat_template(chat_template_file)


def get_model_max_position_embeddings(config: PretrainedConfig) -> Optional[int]:
Contributor

Please don't delete these functions; they are still called elsewhere in the codebase, for example by the PPO code.

Collaborator Author

OK.

-class InferencePredictorMixin:
+class InferencePredictorMixin(BasePredictor):
     def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer):
+        BasePredictor.__init__(self, config, tokenizer)
Contributor

So generation_config is now produced through BasePredictor, right?

Collaborator Author

Yes.
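In other words, loading the generation config now lives in the shared base class. A rough sketch of the idea (the exact loading logic is an assumption, not copied from the diff):

```python
from paddlenlp.generation import GenerationConfig

class BasePredictor:
    def __init__(self, config, tokenizer):
        self.config = config
        self.tokenizer = tokenizer
        # Assumed behaviour: try to read generation_config.json from the model directory,
        # leaving None when it is absent so callers can test `generation_config is not None`.
        try:
            self.generation_config = GenerationConfig.from_pretrained(config.model_name_or_path)
        except Exception:
            self.generation_config = None
```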

self.input_ids = paddle.full(
shape=[config.batch_size, config.total_max_length], fill_value=self.tokenizer.pad_token_id, dtype="int64"
)
self.model_inputs = {}
Contributor

The variable was renamed from self.inputs to self.model_inputs; does that affect any logic?

Collaborator Author

No, it doesn't.

length = len(input_ids)
self.inputs["input_ids"][i : i + 1, :length] = input_ids
self.inputs["penalty_score"][i : i + 1] = self.config.repetition_penalty
self.inputs["frequency_score"][i : i + 1] = 0.0
Contributor

Don't these need to be kept?

Contributor

Or is it that init_model_inputs already sets these, so the duplicates are simply removed here?

Collaborator Author

Right, they don't need to be kept.
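So the per-request preprocessing loop no longer rewrites these fields; they are allocated once up front. A hedged sketch of what init_model_inputs presumably does for them (field names taken from the deleted lines; shapes are an assumption):

```python
import paddle

def init_model_inputs(self, config):
    # Assumed sketch: sampling scalars are created once per predictor instead of being
    # filled again for every sample in the preprocessing loop.
    bs = config.batch_size
    self.model_inputs["penalty_score"] = paddle.full([bs, 1], config.repetition_penalty, dtype="float32")
    self.model_inputs["frequency_score"] = paddle.full([bs, 1], 0.0, dtype="float32")
```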

ffn1_weight_tensor = paddle.to_tensor(concated_ffn1_weight)

-qkv_weight_tensor = paddle.to_tensor(concated_qkv_weight)
+qkv_weight_tensor = paddle.to_tensor(concated_qkv_weight).cast(paddle.get_default_dtype())
Contributor

Why cast to the default dtype?

Collaborator Author

So that the same code path can run both bf16 and fp16.
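A self-contained illustration of the point (stand-in array; the real code casts the fused checkpoint weights):

```python
import numpy as np
import paddle

# Checkpoint weights may be stored as float32/float16; casting to the current default
# dtype lets the same weight-fusion code serve both fp16 and bf16 inference runs.
paddle.set_default_dtype("float16")  # the same pattern applies when the default is bfloat16
concated_qkv_weight = np.random.rand(8, 8).astype("float32")  # stand-in for the fused weight
qkv_weight_tensor = paddle.to_tensor(concated_qkv_weight).cast(paddle.get_default_dtype())
print(qkv_weight_tensor.dtype)  # paddle.float16
```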

-linear_weight_tensor = paddle.to_tensor(state_dict["llama.layers.{}.self_attn.o_proj.weight".format(idx)])
+linear_weight_tensor = paddle.to_tensor(
+    state_dict["llama.layers.{}.self_attn.o_proj.weight".format(idx)]
+).cast(paddle.get_default_dtype())
Contributor

Same question here.

Collaborator Author

Same answer as above.

-src_length: int = field(default=None, metadata={"help": "The max length of source text."})
-max_length: int = field(default=None, metadata={"help": "the max length for decoding."})
+src_length: int = field(default=4096, metadata={"help": "The max length of source text."})
+min_length: int = field(default=1, metadata={"help": "the min length for decoding."})
Contributor

Please rename min_length and max_length to min_decode_length and max_decode_length; the current names are easy to misread.

Collaborator Author

Can't easily change them; these names are used in too many places.


def test_blha(self):
-    self.run_predictor({"inference_model": True, "block_attn": True})
+    self.run_predictor({"inference_model": True, "block_attn": True, "src_length": 1024, "max_length": 48})
Contributor

Change max_length to max_decode_length.

Collaborator Author

Same as above; can't easily change it.

shape=[config.batch_size, 1], fill_value=config.temperature, dtype="float32"
)
self.model_inputs["eos_token_id"] = paddle.to_tensor(
np.array(get_eos_token_id(self.tokenizer, self.generation_config)).reshape(-1, 1).astype("int64")
Contributor

Won't this be a problem when the batch size is greater than 1?

Collaborator Author

No. The kernel does not handle this per batch; it is applied uniformly when processing.
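For reference, a small shape sketch (token ids are made up): the eos list becomes a [num_eos, 1] tensor shared by every sequence, so no per-batch copy is needed:

```python
import numpy as np
import paddle

eos_ids = [2, 100273]  # hypothetical ids returned by get_eos_token_id(...)
eos_token_id = paddle.to_tensor(np.array(eos_ids).reshape(-1, 1).astype("int64"))
print(eos_token_id.shape)  # [2, 1]: one column of eos ids, no batch dimension
```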

Contributor

@wawltor wawltor left a comment

LGTM

@wawltor wawltor merged commit 5bc040a into PaddlePaddle:develop Aug 12, 2024