This repository was archived by the owner on Jan 15, 2024. It is now read-only.

Conversation

zburning
Contributor

@zburning zburning commented Dec 2, 2019

Description

Refactor the existing data preprocessing scripts (GLUE and SQuAD) to make them:

  1. appropriate for both BERT and XLNet;
  2. more scalable for future models;
  3. more efficient for SQuAD preprocessing.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is backward incompatible, explain why it must be made.
  • Interesting edge cases to note here

cc @dmlc/gluon-nlp-team

@zburning zburning requested a review from a team as a code owner December 2, 2019 09:03
@codecov

codecov bot commented Dec 2, 2019

Codecov Report

Merging #1031 into master will decrease coverage by 0.16%.
The diff coverage is 82.35%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1031      +/-   ##
==========================================
- Coverage   88.34%   88.18%   -0.17%     
==========================================
  Files          66       66              
  Lines        6290     6306      +16     
==========================================
+ Hits         5557     5561       +4     
- Misses        733      745      +12
Impacted Files Coverage Δ
src/gluonnlp/optimizer/__init__.py 100% <100%> (ø) ⬆️
src/gluonnlp/utils/version.py 100% <100%> (ø) ⬆️
src/gluonnlp/utils/files.py 42.62% <18.18%> (ø) ⬆️
src/gluonnlp/optimizer/bert_adam.py 87.32% <81.57%> (ø) ⬆️
src/gluonnlp/data/utils.py 86.39% <95.34%> (ø) ⬆️
src/gluonnlp/model/train/language_model.py 88.51% <0%> (-5.27%) ⬇️

Contributor

@leezu leezu left a comment

Thank you! Two preliminary comments below

Contributor

@leezu leezu left a comment

Thank you. Please see below comments on ConcatSeqTransform

Contributor

@leezu leezu left a comment

Thank you! Comments regarding BertTStyleSentenceTransform

Contributor

@leezu leezu left a comment

Thank you!
Btw, in general please follow the https://www.python.org/dev/peps/pep-0257/ convention. In particular you typically have

class X:
    """
    Content
    Detail
    """

which is against the convention and should be

class X:
    """Content

    Detail
    """

instead
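To make the convention concrete, here is a minimal sketch of a PEP 257-conforming docstring (the class and its contents are invented for illustration, not code from this PR):

```python
class TokenCounter:
    """Count token occurrences in a sequence.

    The one-line summary sits directly after the opening quotes,
    followed by a blank line and then the longer description.
    """

    def count(self, tokens):
        """Return a dict mapping each token to its frequency."""
        freq = {}
        for tok in tokens:
            freq[tok] = freq.get(tok, 0) + 1
        return freq
```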

return concat, segment_ids, p_mask


class BertStyleGlueTransform:
Contributor

Let's keep this in the scripts folder until tackling the GlueTasks. There's a dependency between both.

span_text = all_doc_tokens[doc_span.start:doc_span.start +
                           doc_span.length]

# Insert [sep]
Contributor

Why not do this in one call to the ConcatSeqTransform instead of two calls? You can use functools.partial to fix the separator argument.
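A rough sketch of the suggestion, with a hypothetical stand-in for ConcatSeqTransform (the function below is illustrative, not the actual API):

```python
from functools import partial

# Hypothetical stand-in for ConcatSeqTransform: joins token sequences,
# inserting `separator` between consecutive sequences.
def concat_seq_transform(seqs, separator):
    out = []
    for i, seq in enumerate(seqs):
        out.extend(seq)
        if i < len(seqs) - 1:
            out.append(separator)
    return out

# Fix the separator once; the resulting callable then joins any number
# of segments in a single call.
concat_with_sep = partial(concat_seq_transform, separator='[SEP]')

tokens = concat_with_sep([['what', 'is', 'this'], ['some', 'doc', 'span']])
```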

return doc_spans


class SimpleQAPreparation:
Contributor

Given this is only applicable to BERT-style models, move it to a model-specific file (e.g. gluonnlp/data/bert.py)?
Alternatively, if it's a generic operation, let's better describe what is done. It seems the purpose is to zip query and doc spans with the separators and update the positions accordingly? The name SimpleQAPreparation does not provide any clue about this.
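If that reading is right, the operation could be sketched roughly like this (the function name and signature are invented for illustration; the [CLS]/[SEP] defaults mirror BERT conventions):

```python
# Hypothetical sketch: join the query and one doc span with special
# tokens, and shift the answer positions by the length of the prefix.
def prepare_qa_input(query_tokens, doc_tokens, start, end,
                     cls_token='[CLS]', sep_token='[SEP]'):
    tokens = ([cls_token] + query_tokens + [sep_token]
              + doc_tokens + [sep_token])
    offset = len(query_tokens) + 2  # [CLS] + query + [SEP]
    return tokens, start + offset, end + offset
```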

self.is_training = is_training
self._cls_token = cls_token
self._sep_token = sep_token
self._vocab = vocab
Contributor

Why do the mapping as part of this class? Each class here should do one thing and do it well. If it does many unrelated things, it becomes hard to understand and to reuse.

all_doc_tokens,
max_tokens_for_doc=None,
query_tokens_length=None):
_DocSpan = collections.namedtuple( # pylint: disable=invalid-name
Contributor

You shouldn't create a new _DocSpan in every call. You can do it outside.
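A minimal sketch of the suggestion: define the namedtuple once at module level and reuse it across calls (the helper function below is illustrative, not the PR's actual code):

```python
import collections

# Defined once at import time, instead of rebuilding the type object
# on every call. Field names mirror the snippet above.
_DocSpan = collections.namedtuple('DocSpan', ['start', 'length'])

def get_doc_spans(num_tokens, max_len, stride):
    """Return sliding windows over a token sequence as (start, length) spans."""
    spans = []
    start = 0
    while start < num_tokens:
        length = min(max_len, num_tokens - start)
        spans.append(_DocSpan(start=start, length=length))
        if start + length == num_tokens:
            break
        start += min(length, stride)
    return spans
```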

return tok_start_position, tok_end_position, all_doc_tokens, query_tokens


class DocSpanTransform:
Contributor

I'm not convinced that having DocSpans to denote windows over a list of tokens is helpful. I think the code is simpler when you materialize each sliding window: Consider a document with 1000 tokens, doc = [0] * 1000. We can get the windows explicitly as window1 = doc[0:100]; window2 = doc[50:150] etc. Of course you need to calculate the window-specific start and end positions when creating the window. But this should allow removing a lot of the code where you currently have to handle DocSpan objects. Instead each window will be completely self-contained after the transform.

As each window is just a list of integers, there is very little overhead in materializing the windows. Lists of integers are highly optimized in Python and we can optimizer more with numpy if needed. I think the introduction of DocSpan here is a sign of premature optimization.
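The materialized-window approach described above could be sketched like this (window size and stride are the illustrative numbers from the comment):

```python
# Each window is a self-contained list of tokens; no DocSpan objects
# need to be carried around after the transform.
def sliding_windows(doc, window_size, stride):
    windows = []
    start = 0
    while start < len(doc):
        windows.append(doc[start:start + window_size])
        if start + window_size >= len(doc):
            break
        start += stride
    return windows

doc = list(range(1000))
windows = sliding_windows(doc, window_size=100, stride=50)
```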

@mli
Member

mli commented Jan 15, 2020

Job PR-1031/25 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/25/index.html

@mli
Member

mli commented Jan 15, 2020

Job PR-1031/26 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/26/index.html

@mli
Member

mli commented Jan 15, 2020

Job PR-1031/27 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/27/index.html

@leezu leezu changed the title [REFACTOR] refactor the glue&squad data preprocessing pipeline and bert&xlnet scripts [REFACTOR] Refactor the Glue data preprocessing pipeline and bert&xlnet scripts Jan 16, 2020
@mli
Member

mli commented Jan 16, 2020

Job PR-1031/28 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/28/index.html

@mli
Member

mli commented Jan 16, 2020

Job PR-1031/29 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/29/index.html

@mli
Member

mli commented Jan 16, 2020

Job PR-1031/30 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/30/index.html

@mli
Member

mli commented Jan 16, 2020

Job PR-1031/31 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/31/index.html

@mli
Member

mli commented Jan 17, 2020

Job PR-1031/32 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/32/index.html


nlp.utils.check_version('0.8.1', warning_only=True)
#nlp.utils.check_version('0.8.1', warning_only=True)
Contributor

Change this to nlp.utils.check_version('0.9', warning_only=True)?

Contributor

@leezu leezu left a comment

Thank you! A few comments but mostly fine

@mli
Member

mli commented Jan 20, 2020

Job PR-1031/33 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/33/index.html

@mli
Member

mli commented Jan 21, 2020

Job PR-1031/34 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/34/index.html

@leezu leezu merged commit 5fe8d7b into dmlc:master Jan 22, 2020
@@ -330,8 +330,9 @@ def test_export(task):
 @pytest.mark.integration
 @pytest.mark.parametrize('sentencepiece', [False, True])
 def test_finetune_squad(sentencepiece):
-    arguments = ['--optimizer', 'adam', '--batch_size', '12',
-                 '--gpu', '0', '--epochs', '2', '--debug']
+    arguments = ['--optimizer', 'adam', '--batch_size', '32',
Contributor

@zburning it seems this test now runs 1195.36s and is the longest-running test. Would it be reasonable to reduce the runtime?
