Conversation

@tzielinski-habana tzielinski-habana commented Nov 29, 2024

Moving sin/cos buffer preparation for RoPE outside the model forward boosts performance by getting rid
of unnecessary gather and memcopy ops before RoPE.
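
For context, a minimal sketch of the idea (illustrative only, not the actual vLLM HPU code; names like `prepare_cos_sin` and `apply_rope` are assumptions): compute the cos/sin tables once per batch from the position ids, outside the model forward, and let every decoder layer reuse the same buffers instead of gathering them again per layer.

```python
# Illustrative sketch of preparing RoPE cos/sin buffers once per batch,
# outside the per-layer forward. Names here are assumptions, not vLLM API.
import torch

def build_inv_freq(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def prepare_cos_sin(positions: torch.Tensor, inv_freq: torch.Tensor):
    # Single gather over positions for the whole forward pass.
    freqs = positions.float().unsqueeze(-1) * inv_freq   # [num_tokens, head_dim/2]
    emb = torch.cat((freqs, freqs), dim=-1)               # [num_tokens, head_dim]
    return emb.cos(), emb.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos.unsqueeze(1) + rotated * sin.unsqueeze(1)

# Buffers are prepared once here and reused by every layer's attention,
# instead of each layer gathering/copying them inside the model forward.
positions = torch.arange(16)                  # token positions for this batch
inv_freq = build_inv_freq(head_dim=64)
cos, sin = prepare_cos_sin(positions, inv_freq)
q = torch.randn(16, 8, 64)                    # [num_tokens, num_heads, head_dim]
q_rot = apply_rope(q, cos, sin)
```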

@tzielinski-habana (Author)

Let's not merge yet; I've found some problems with TP=2 with fp8.

@tzielinski-habana tzielinski-habana changed the title Prepare sin/cos buffers for rope outside model forward [DO NOT MERGE] Prepare sin/cos buffers for rope outside model forward Nov 29, 2024
@tzielinski-habana (Author)

Actually, it looks like this problem also occurs on habana_main

@tzielinski-habana tzielinski-habana changed the title [DO NOT MERGE] Prepare sin/cos buffers for rope outside model forward Prepare sin/cos buffers for rope outside model forward Dec 2, 2024
@tzielinski-habana tzielinski-habana changed the title Prepare sin/cos buffers for rope outside model forward [DO NOT MERGE] Prepare sin/cos buffers for rope outside model forward Dec 2, 2024
@tzielinski-habana tzielinski-habana force-pushed the rope_improvements branch 2 times, most recently from aeeb21d to fd9c667 December 3, 2024 16:03
@tzielinski-habana tzielinski-habana changed the title [DO NOT MERGE] Prepare sin/cos buffers for rope outside model forward Prepare sin/cos buffers for rope outside model forward Dec 4, 2024
@tzielinski-habana tzielinski-habana merged commit 8c76728 into habana_main Dec 4, 2024
10 checks passed
@tzielinski-habana tzielinski-habana deleted the rope_improvements branch December 4, 2024 12:44
adobrzyn added a commit that referenced this pull request Dec 11, 2024
vivekgoe pushed a commit that referenced this pull request Jan 2, 2025
#566 breaks the long-context + LoRA flow.

It assumes that caching the sin-cos buffer for the first decoder layer is
sufficient to handle all cases, which does not hold for
long-context + LoRA.

This PR skips the `_prepare_cos_sin` call prior to the HpuModelAdapter forward
in the long-context + LoRA flow.
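
A minimal sketch of such a guard, under assumed names (`maybe_prepare_cos_sin`, the `long_context_lora` flag, and the injected `prepare_fn` are illustrative, not the actual HpuModelAdapter code):

```python
# Illustrative guard: skip the pre-forward cos/sin preparation on the
# long-context + LoRA path, where buffers cached for the first decoder layer
# do not cover every case, and let the forward pass recompute them instead.
from typing import Callable
import torch

def maybe_prepare_cos_sin(prepare_fn: Callable[[torch.Tensor], None],
                          positions: torch.Tensor,
                          long_context_lora: bool) -> None:
    if long_context_lora:
        return  # defer to in-forward computation of cos/sin
    prepare_fn(positions)
```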
michalkuligowski pushed a commit that referenced this pull request Jan 7, 2025
Error reported in https://jira.habana-labs.com/browse/SW-212516

Two recently merged PRs break Spec Decode functionality:

1. #491 overrides the existing
WorkerWrapperBase design for speculative decoding.
```
if model_runner_cls is not None:
    ModelRunnerClass = model_runner_cls
```
is no longer needed, since we now use the code below to initialize model_runner_cls,
following the upstream design.
```
if model_runner_cls is not None:
    self.model_runner = model_runner_cls(self.model_runner)
```

2. #566 does not work in Spec
Decode Eagle mode,
because the input tensors now differ from the assumption that
decode_fwd provides only one token per sequence; Spec Decode passes multiple
candidate tokens as q.
To fix that, a new environment variable, "**VLLM_COS_SIN_RECOMPUTE**=true", was added;
set it to trigger recomputation of cos and sin for spec decode (see the sketch below).
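
A rough sketch of how such a switch could work (assumed helper names and caching scheme; only the VLLM_COS_SIN_RECOMPUTE variable name comes from this PR):

```python
# Illustrative only: with VLLM_COS_SIN_RECOMPUTE=true, cos/sin are recomputed
# from the current positions on every decode forward instead of reusing cached
# buffers, since spec decode passes multiple candidate tokens per sequence as q.
import os
import torch

RECOMPUTE_COS_SIN = os.environ.get("VLLM_COS_SIN_RECOMPUTE", "false").lower() in ("1", "true")

def get_cos_sin(positions: torch.Tensor, inv_freq: torch.Tensor, cache: dict):
    if RECOMPUTE_COS_SIN or "cos" not in cache:
        freqs = positions.float().unsqueeze(-1) * inv_freq
        emb = torch.cat((freqs, freqs), dim=-1)
        cache["cos"], cache["sin"] = emb.cos(), emb.sin()
    return cache["cos"], cache["sin"]
```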

---------

Signed-off-by: Chendi.Xue <[email protected]>