This repository was archived by the owner on Jan 15, 2024. It is now read-only.

Conversation

zburning
Contributor

@zburning zburning commented Dec 2, 2019

Description

Refactor the existing data preprocessing scripts (GLUE and SQuAD) to make them:

  1. appropriate for both BERT and XLNet;
  2. more scalable for future models;
  3. more efficient for SQuAD preprocessing.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is backward incompatible, explain why it must be made.
  • Interesting edge cases to note here

cc @dmlc/gluon-nlp-team

@zburning zburning requested a review from a team as a code owner December 2, 2019 09:03
@codecov

codecov bot commented Dec 2, 2019

Codecov Report

Merging #1031 into master will decrease coverage by 0.16%.
The diff coverage is 82.35%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1031      +/-   ##
==========================================
- Coverage   88.34%   88.18%   -0.17%     
==========================================
  Files          66       66              
  Lines        6290     6306      +16     
==========================================
+ Hits         5557     5561       +4     
- Misses        733      745      +12
Impacted Files Coverage Δ
src/gluonnlp/optimizer/__init__.py 100% <100%> (ø) ⬆️
src/gluonnlp/utils/version.py 100% <100%> (ø) ⬆️
src/gluonnlp/utils/files.py 42.62% <18.18%> (ø) ⬆️
src/gluonnlp/optimizer/bert_adam.py 87.32% <81.57%> (ø) ⬆️
src/gluonnlp/data/utils.py 86.39% <95.34%> (ø) ⬆️
src/gluonnlp/model/train/language_model.py 88.51% <0%> (-5.27%) ⬇️

Contributor

@leezu leezu left a comment

Thank you! Two preliminary comments below

Contributor

@leezu leezu left a comment

Thank you. Please see below comments on ConcatSeqTransform

Contributor

@leezu leezu left a comment

Thank you! Comments regarding BertTStyleSentenceTransform

Contributor

@leezu leezu left a comment

Thank you!
Btw, in general please follow the https://www.python.org/dev/peps/pep-0257/ convention. In particular you typically have

class X:
    """
    Content
    Detail
    """

which is against the convention and should be

class X:
    """Content

    Detail
    """

instead
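To make the convention concrete, here is a minimal sketch of a PEP 257-conforming docstring (the class and its contents are invented for illustration, not code from this PR):

```python
class TokenCounter:
    """Count token occurrences in a sequence.

    The one-line summary sits directly after the opening quotes,
    followed by a blank line and then the longer description.
    """

    def count(self, tokens):
        """Return a dict mapping each token to its frequency."""
        freq = {}
        for tok in tokens:
            freq[tok] = freq.get(tok, 0) + 1
        return freq
```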

return concat, segment_ids, p_mask


class BertStyleGlueTransform:
Contributor

Let's keep this in the scripts folder until tackling the GlueTasks. There's a dependency between both.

span_text = all_doc_tokens[doc_span.start:doc_span.start +
                           doc_span.length]

# Insert [sep]
Contributor

Why not do this in one call to the ConcatSeqTransform instead of two calls? You can use functools.partial to fix the separator argument.
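A rough sketch of the suggestion, with a hypothetical stand-in for ConcatSeqTransform (the function below is illustrative, not the actual API):

```python
from functools import partial

# Hypothetical stand-in for ConcatSeqTransform: joins token sequences,
# inserting `separator` between consecutive sequences.
def concat_seq_transform(seqs, separator):
    out = []
    for i, seq in enumerate(seqs):
        out.extend(seq)
        if i < len(seqs) - 1:
            out.append(separator)
    return out

# Fix the separator once; the resulting callable then joins any number
# of segments in a single call.
concat_with_sep = partial(concat_seq_transform, separator='[SEP]')

tokens = concat_with_sep([['what', 'is', 'this'], ['some', 'doc', 'span']])
```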

return doc_spans


class SimpleQAPreparation:
Contributor

Given this is only applicable to BERT-style models, move it to a model-specific file (e.g. gluonnlp/data/bert.py)?
Alternatively, if it's a generic operation, let's better describe what is done. It seems the purpose is to zip query and doc spans with the separators and update the positions accordingly? The name SimpleQAPreparation does not provide any clue about this.
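If that reading is right, the operation could be sketched roughly like this (the function name and signature are invented for illustration; the [CLS]/[SEP] defaults mirror BERT conventions):

```python
# Hypothetical sketch: join the query and one doc span with special
# tokens, and shift the answer positions by the length of the prefix.
def prepare_qa_input(query_tokens, doc_tokens, start, end,
                     cls_token='[CLS]', sep_token='[SEP]'):
    tokens = ([cls_token] + query_tokens + [sep_token]
              + doc_tokens + [sep_token])
    offset = len(query_tokens) + 2  # [CLS] + query + [SEP]
    return tokens, start + offset, end + offset
```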

self.is_training = is_training
self._cls_token = cls_token
self._sep_token = sep_token
self._vocab = vocab
Contributor

Why do the mapping as part of this class? Each class here should do one thing and do it well. If it does many unrelated things, it becomes hard to understand and to reuse.

all_doc_tokens,
max_tokens_for_doc=None,
query_tokens_length=None):
_DocSpan = collections.namedtuple( # pylint: disable=invalid-name
Contributor

You shouldn't create a new _DocSpan in every call. You can do it outside.
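A minimal sketch of the suggestion: define the namedtuple once at module level and reuse it across calls (the helper function below is illustrative, not the PR's actual code):

```python
import collections

# Defined once at import time, instead of rebuilding the type object
# on every call. Field names mirror the snippet above.
_DocSpan = collections.namedtuple('DocSpan', ['start', 'length'])

def get_doc_spans(num_tokens, max_len, stride):
    """Return sliding windows over a token sequence as (start, length) spans."""
    spans = []
    start = 0
    while start < num_tokens:
        length = min(max_len, num_tokens - start)
        spans.append(_DocSpan(start=start, length=length))
        if start + length == num_tokens:
            break
        start += min(length, stride)
    return spans
```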

return tok_start_position, tok_end_position, all_doc_tokens, query_tokens


class DocSpanTransform:
Contributor

I'm not convinced that having DocSpans to denote windows over a list of tokens is helpful. I think the code is simpler when you materialize each sliding window: Consider a document with 1000 tokens, doc = [0] * 1000. We can get the windows explicitly as window1 = doc[0:100]; window2 = doc[50:150] etc. Of course you need to calculate the window-specific start and end positions when creating the window. But this should allow removing a lot of the code where you currently have to handle DocSpan objects. Instead each window will be completely self-contained after the transform.

As each window is just a list of integers, there is very little overhead in materializing the windows. Lists of integers are highly optimized in Python and we can optimizer more with numpy if needed. I think the introduction of DocSpan here is a sign of premature optimization.
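The materialized-window approach described above could be sketched like this (window size and stride are the illustrative numbers from the comment):

```python
# Each window is a self-contained list of tokens; no DocSpan objects
# need to be carried around after the transform.
def sliding_windows(doc, window_size, stride):
    windows = []
    start = 0
    while start < len(doc):
        windows.append(doc[start:start + window_size])
        if start + window_size >= len(doc):
            break
        start += stride
    return windows

doc = list(range(1000))
windows = sliding_windows(doc, window_size=100, stride=50)
```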

@mli
Member

mli commented Jan 15, 2020

Job PR-1031/25 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/25/index.html

@mli
Member

mli commented Jan 15, 2020

Job PR-1031/26 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/26/index.html

@mli
Member

mli commented Jan 15, 2020

Job PR-1031/27 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/27/index.html

@leezu leezu changed the title [REFACTOR] refactor the glue&squad data preprocessing pipeline and bert&xlnet scripts [REFACTOR] Refactor the Glue data preprocessing pipeline and bert&xlnet scripts Jan 16, 2020
@mli
Member

mli commented Jan 16, 2020

Job PR-1031/28 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/28/index.html

@mli
Member

mli commented Jan 16, 2020

Job PR-1031/29 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/29/index.html

@mli
Member

mli commented Jan 16, 2020

Job PR-1031/30 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/30/index.html

@mli
Member

mli commented Jan 16, 2020

Job PR-1031/31 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/31/index.html

@mli
Member

mli commented Jan 17, 2020

Job PR-1031/32 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/32/index.html


nlp.utils.check_version('0.8.1', warning_only=True)
#nlp.utils.check_version('0.8.1', warning_only=True)
Contributor

Change this to nlp.utils.check_version('0.9', warning_only=True)?

Contributor

@leezu leezu left a comment

Thank you! A few comments but mostly fine

@mli
Member

mli commented Jan 20, 2020

Job PR-1031/33 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/33/index.html

@mli
Member

mli commented Jan 21, 2020

Job PR-1031/34 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1031/34/index.html

@leezu leezu merged commit 5fe8d7b into dmlc:master Jan 22, 2020
@@ -330,8 +330,9 @@ def test_export(task):
 @pytest.mark.integration
 @pytest.mark.parametrize('sentencepiece', [False, True])
 def test_finetune_squad(sentencepiece):
-    arguments = ['--optimizer', 'adam', '--batch_size', '12',
-                 '--gpu', '0', '--epochs', '2', '--debug']
+    arguments = ['--optimizer', 'adam', '--batch_size', '32',
Contributor

@zburning it seems this test now runs 1195.36s and is the longest-running test. Would it be reasonable to reduce the runtime?
