
Conversation

@leezu (Contributor) commented on Aug 28, 2019

Description

The `enforce_max_size` function incorrectly overwrote an internal `TokenEmbedding` data structure, breaking the use of the `unknown_token` embedding for `'<pad>'`. When this script was written, modifying `TokenEmbedding`'s internals was required; based on #750, we can now use the proper API to make the required changes.

In general, however, `vocab = nlp.Vocab(nlp.data.count_tokens(tokens))` can be replaced with `vocab = nlp.Vocab(nlp.data.count_tokens(tokens), unknown_token=token_embedding_.unknown_token, padding_token=None, bos_token=None, eos_token=None)`. I'm opening a PR for these two changes.
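
For illustration, a minimal sketch of this replacement, assuming GluonNLP's standard `Vocab` and `TokenEmbedding` APIs; `tokens` and `token_embedding_` here are hypothetical stand-ins for the script's variables:

```python
import gluonnlp as nlp

# Hypothetical stand-ins for the script's token list and pretrained embedding.
tokens = ['hello', 'world', 'hello']
token_embedding_ = nlp.embedding.create('glove', source='glove.6B.50d')

# Before: Vocab adds its own '<unk>', '<pad>', '<bos>' and '<eos>' tokens,
# which the pretrained embedding may have no vectors for.
# vocab = nlp.Vocab(nlp.data.count_tokens(tokens))

# After: reuse the embedding's own unknown token and disable the special
# tokens that the embedding does not cover.
vocab = nlp.Vocab(nlp.data.count_tokens(tokens),
                  unknown_token=token_embedding_.unknown_token,
                  padding_token=None, bos_token=None, eos_token=None)
vocab.set_embedding(token_embedding_)
```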

Unfortunately, this error slipped through the tests, so I'm also extending the test case.

Fixes #905

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc.)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Fix the `word_embeddings/evaluate_pretrained.py` script when `--analogy-max-vocab-size` is used

@leezu leezu requested a review from szha as a code owner August 28, 2019 11:32
@codecov (bot) commented on Aug 28, 2019

Codecov Report

❗ No coverage uploaded for pull request head (fixevaluatepretrained@6224bcd).
The diff coverage is n/a.

@codecov (bot) commented on Aug 28, 2019

Codecov Report

Merging #906 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #906      +/-   ##
==========================================
+ Coverage   90.48%   90.48%   +<.01%     
==========================================
  Files          66       66              
  Lines        6400     6401       +1     
==========================================
+ Hits         5791     5792       +1     
  Misses        609      609
| Impacted Files | Coverage Δ |
|---|---|
| src/gluonnlp/vocab/vocab.py | 97.32% <100%> (+0.01%) ⬆️ |

@mli (Member) commented on Aug 28, 2019

Job PR-906/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-906/1/index.html

@mli (Member) commented on Aug 28, 2019

Job PR-906/2 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-906/2/index.html

@mli (Member) commented on Sep 2, 2019

Job PR-906/13 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-906/13/index.html

@szha szha requested a review from a team September 2, 2019 16:59
@szha szha merged commit b810d10 into dmlc:master Sep 11, 2019
Development

Successfully merging this pull request may close these issues.

KeyError: '<pad>' when running scripts/word_embeddings/evaluate_pretrained.py with the --analogy-max-vocab-size flag