
Conversation

@ekagra-ranjan (Contributor) commented Sep 12, 2022

What does this PR do?

This PR does the following:

  1. Fixes #18976: Top_P sampling samples an extra token when the cumulative sum of probabilities is exactly equal to top_p.
  2. Optimizes the Top P sampler PyTorch implementation by removing the need to clone an intermediate tensor and shift values to the right (a sketch of the idea follows this list).
  3. Adds edge case tests to PT, TF, and FLAX.
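
To make the behavior change concrete, here is a minimal sketch of the idea behind the new PT masking (an illustration written for this description, not the exact transformers code):

import torch

def top_p_filter(scores, top_p, filter_value=-float("inf")):
    # Sort ascending and accumulate probability mass from the low-probability tail.
    sorted_logits, sorted_indices = torch.sort(scores, descending=False)
    cumulative_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
    # Drop every token whose cumulative mass still fits inside 1 - top_p.
    # The "<=" is what removes the extra token when the kept mass hits top_p exactly.
    sorted_indices_to_remove = cumulative_probs <= (1 - top_p)
    # Map the mask back from sorted order to the original token order.
    indices_to_remove = sorted_indices_to_remove.scatter(-1, sorted_indices, sorted_indices_to_remove)
    return scores.masked_fill(indices_to_remove, filter_value)

# With probabilities [0.5, 0.3, 0.1, 0.1] and top_p=0.8, only the first two tokens
# survive (up to floating point precision), instead of three as before.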

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@gante @patrickvonplaten

@ekagra-ranjan changed the title from "Optimize Top P Sampler" to "Optimize Top P Sampler and fix edge case" on Sep 12, 2022
@HuggingFaceDocBuilderDev commented Sep 12, 2022

The documentation is not available anymore as the PR was closed or merged.

@ekagra-ranjan changed the title from "Optimize Top P Sampler and fix edge case" to "Optimize Top P Sampler and fix edge case (pytorch)" on Sep 12, 2022
@gante (Member) left a comment

LGTM, will get my approval when the test change also gets added to TF and FLAX 👍

@ekagra-ranjan (Contributor, Author) commented Sep 12, 2022

@gante the proposed PT implementation passes the edge case. I also added the edge case locally and verified that the existing FLAX implementation passes it with no change required.

However, the TF implementation passes the edge case when use_xla is True but fails when it is False on my local machine. Hence, I reverted the addition of the edge case to TF and FLAX in my PR. It seems the behavior changes when using XLA with TF.

Can you please confirm whether just replacing 0.7 with 0.8 in this test succeeds on your local machine?

top_p_warp = TFTopPLogitsWarper(0.7)

@ekagra-ranjan (Contributor, Author) commented Sep 12, 2022

I was investigating TF's behavior and found the following:

This is the input distribution to the test:

dist = np.log(np.array([[0.3, 0.1, 0.1, 0.5], [0.15, 0.3, 0.3, 0.25]], dtype=np.float32))

The above goes to TFTopPLogitsWarper, which takes a cumsum here:

cumulative_probs = tf.math.cumsum(stable_softmax(topk_scores, axis=-1), axis=-1)

This cumulative_probs takes different values depending on whether use_xla is True or False in the unit test:

  1. When use_xla is True, cumulative_probs is [[0.5, 0.8, 0.90000004, 1.],
    [0.29999998, 0.59999996, 0.8499999, 0.99999994]]
  2. When use_xla is False, cumulative_probs is [[0.5, 0.79999995, 0.9, 1.],
    [0.3, 0.6, 0.85, 1.]]

This causes an extra token to be sampled in the first batch when use_xla is False, since 0.79999995 < 0.8.

How should we proceed? This behavior difference is not present in PT and FLAX, so should we go ahead with just PT and FLAX in this PR and raise the TF discrepancy as a separate issue in the transformers repo?
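
For reference, the comparison above can be reproduced with something like the following (a hedged sketch: tf.nn.softmax stands in for stable_softmax, and tf.function(..., jit_compile=True) stands in for the use_xla=True path):

import numpy as np
import tensorflow as tf

dist = np.log(np.array([[0.3, 0.1, 0.1, 0.5], [0.15, 0.3, 0.3, 0.25]], dtype=np.float32))

def cum_probs(logits):
    # Mimic the warper: sort via top_k, softmax, then cumulative sum.
    topk_scores = tf.math.top_k(logits, k=logits.shape[-1]).values
    return tf.math.cumsum(tf.nn.softmax(topk_scores, axis=-1), axis=-1)

print(cum_probs(tf.constant(dist)))                                 # eager path (use_xla=False)
print(tf.function(cum_probs, jit_compile=True)(tf.constant(dist)))  # XLA path (use_xla=True)

Whether the two paths differ exactly as shown above may depend on the TF version and hardware.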

@gante (Member) commented Sep 12, 2022

@ekagra-ranjan we could add an if/else depending on whether use_xla is True or not, and set top_p to 0.8 or 0.79999995 accordingly.

However, since this edge case has such low impact in practice, it's okay if we take the simpler path and simply set top_p to 0.79999995. It won't test the edge case with XLA, but at least it is tested once (with eager execution, i.e. with use_xla=False).
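
In the test, that if/else would look roughly like this (a sketch; use_xla is assumed to be the existing test parameter):

top_p = 0.8 if use_xla else 0.79999995
top_p_warp = TFTopPLogitsWarper(top_p)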

P.S.: TF's softmax is known to have these minor numerical instabilities.

@ekagra-ranjan changed the title from "Optimize Top P Sampler and fix edge case (pytorch)" to "Optimize Top P Sampler and fix edge case" on Sep 12, 2022
@ekagra-ranjan changed the title from "Optimize Top P Sampler and fix edge case" to "Optimize Top P Sampler and fix edge case for Pytorch" on Sep 12, 2022
@ekagra-ranjan (Contributor, Author)

@gante Thank you for your reviews! Edge case tests for FLAX and TF have been added and are passing.

@ekagra-ranjan changed the title from "Optimize Top P Sampler and fix edge case for Pytorch" to "Optimize Top P Sampler and fix edge case" on Sep 12, 2022
@gante (Member) left a comment

LGTM 👍 Thanks for addressing all the comments!

A Contributor left a comment

Isn't <= always a bit dangerous with float values? I'm not sure we can ensure 100% backward compatibility here.

A Contributor left a comment

Slightly worried that we'll silently break someone's PyTorch generation code that uses top_p by default here.

A Member left a comment

@patrickvonplaten there is indeed a change at the edge case -- before, if top_p was 0.8 and the input was [0.5, 0.3, 0.1, 0.1], the first three tokens would pass this filter, despite the first two summing up to 0.8 (and thus satisfying the top P conditions, according to the original paper and our docstrings).

The behavior in TF and FLAX satisfies the edge case above, while PT does not. In practice, the impact will be negligible (this change filters out one additional token when the cumulative probability is exactly top_p), although it can change seeded test cases.
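
To spell out the example (a worked illustration, not library output):

probs (sorted)                      = [0.5, 0.3, 0.1, 0.1]
cumulative probs                    = [0.5, 0.8, 0.9, 1.0]
top_p                               = 0.8
smallest prefix with mass >= top_p -> {0.5, 0.3} (2 tokens)
old PT behavior                    -> {0.5, 0.3, 0.1} (3 tokens)
new PT / TF / FLAX behavior        -> {0.5, 0.3} (2 tokens)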

Alternatively, we can change our docstrings (and TF+FLAX's implementation) to ignore this edge case :D

@ekagra-ranjan (Contributor, Author) commented Sep 14, 2022

@patrickvonplaten I believe you are referring to floating point precision in the context of <= being dangerous with float values. The Top P sampler intends to pick the minimal set of elements whose cumulative probability is >= top_p. So we either use the equality while selecting the mask, or ignore it and then shift the mask to the right/left.

The proposed PT implementation uses <=, but it could be implemented in the same manner as TF and FLAX, which do not use the equality operator explicitly but instead clone a tensor and shift values to the right/left. That, however, would not prevent the floating point precision issue.

E.g., if we take the input as [0.5, 0.3, 0.1, 0.1] and top_p as 0.8, then according to this:

score_mask = cumulative_probs < self.top_p
# include the token that is higher than top_p as well
score_mask = jnp.roll(score_mask, 1)
score_mask |= score_mask.at[:, 0].set(True)

the cumulative_probs could come out as [0.5, 0.79999995, 0.9, 1.0] due to floating point precision, which will lead to the Top P sampler picking the first three elements instead of the first two even though there is no equality operator involved.
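
A small trace of that scenario (the cumulative values are hardcoded to mimic the float32 rounding described above, not taken from an actual run):

import jax.numpy as jnp

top_p = 0.8
cumulative_probs = jnp.array([0.5, 0.79999995, 0.9, 1.0])  # hypothetical float32 result
score_mask = cumulative_probs < top_p                  # [True, True, False, False]
score_mask = jnp.roll(score_mask, 1)                   # [False, True, True, False]
score_mask = score_mask | score_mask.at[0].set(True)   # [True, True, True, False] -> three tokens kept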

A Contributor left a comment

Thanks for the explanations @gante and @ekagra-ranjan - this makes sense to me!

Given the very high usage of generate and top_p, we need to clearly mark this as a "breaking behavior bug fix" with 🚨🚨🚨 in the PR description and also make sure it's mentioned in our release notes (cc @LysandreJik).

But then it's good for merge from my side.

@patrickvonplaten (Contributor) left a comment

Let's just make sure that users whose generation pipelines start giving different results can find this PR as the explanation. To me it's a backwards-breaking bug fix, but one that might affect quite a few generation pipelines. IMO it's ok to merge with a big warning - wdyt @LysandreJik @sgugger?

@sgugger (Collaborator) left a comment

I don't have a problem with fixing the behavior to match the documentation.

Before merging the PR, could you:

  1. run make style to fix the quality issue
  2. rebase on main, which should take care of the test failures we see

@ekagra-ranjan force-pushed the fix/optimize-top-p-warper branch from 6f25001 to 8854977 on September 14, 2022 20:52
@ekagra-ranjan changed the title from "Optimize Top P Sampler and fix edge case" to "🚨🚨🚨 Optimize Top P Sampler and fix edge case" on Sep 14, 2022
@ekagra-ranjan changed the title from "🚨🚨🚨 Optimize Top P Sampler and fix edge case" to "Optimize Top P Sampler and fix edge case" on Sep 14, 2022
@ekagra-ranjan changed the title from "Optimize Top P Sampler and fix edge case" to "🚨🚨🚨 Optimize Top P Sampler and fix edge case" on Sep 14, 2022
@ekagra-ranjan (Contributor, Author)

@sgugger Sure, done.

@LysandreJik (Member) left a comment

Looks good to me!

Will make sure this is very visible in the release notes, thank you for the 🚨

@LysandreJik merged commit 578e18e into huggingface:main on Sep 15, 2022