[REFACTOR] Refactor the Glue data preprocessing pipeline and bert&xlnet scripts #1031
Conversation
deleting trailing white space
merge from master
merge from master
Codecov Report

    @@            Coverage Diff             @@
    ##           master    #1031      +/-   ##
    ==========================================
    - Coverage   88.34%   88.18%   -0.17%
    ==========================================
      Files          66       66
      Lines        6290     6306      +16
    ==========================================
    + Hits         5557     5561       +4
    - Misses        733      745      +12
Thank you! Two preliminary comments below
Thank you. Please see below comments on ConcatSeqTransform
Thank you! Comments regarding BertTStyleSentenceTransform
Thank you!
Btw, in general please follow the https://www.python.org/dev/peps/pep-0257/ convention. In particular you typically have

    class X:
        """
        Content

        Detail
        """

which is against the convention and should be

    class X:
        """Content

        Detail
        """

instead.
    return concat, segment_ids, p_mask


class BertStyleGlueTransform:
Let's keep this in the scripts folder until tackling the GlueTasks. There's a dependency between both.
    span_text = all_doc_tokens[doc_span.start:doc_span.start +
                               doc_span.length]

    # Insert [sep]
Why not do this in one call to the ConcatSeqTransform instead of two calls? You can use functools.partial to fix the separator argument.
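The suggestion above can be sketched as follows. The transform and its argument names here are illustrative stand-ins, not the PR's actual ConcatSeqTransform API:

```python
from functools import partial

# Hypothetical concatenation transform: joins several token sequences,
# inserting a separator token between consecutive sequences.
def concat_sequences(seqs, separator):
    out = []
    for i, seq in enumerate(seqs):
        out.extend(seq)
        if i < len(seqs) - 1:
            out.append(separator)
    return out

# functools.partial fixes the separator once, so the caller makes a
# single call instead of repeating the separator at every call site.
concat_with_sep = partial(concat_sequences, separator='[SEP]')

print(concat_with_sep([['a', 'b'], ['c']]))  # ['a', 'b', '[SEP]', 'c']
```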
    return doc_spans


class SimpleQAPreparation:
Given this is only applicable to BERT-style models, move it to a model-specific file (e.g. gluonnlp/data/bert.py)? Alternatively, if it's a generic operation, let's better describe what is done. It seems the purpose is to zip query and doc spans with the separators and update the positions accordingly? The name SimpleQAPreparation does not provide any clue on this.
    self.is_training = is_training
    self._cls_token = cls_token
    self._sep_token = sep_token
    self._vocab = vocab
Why do the mapping as part of this class? Each class here should do one thing and do it well. If it does many unrelated things, it becomes hard to understand and to reuse.
                 all_doc_tokens,
                 max_tokens_for_doc=None,
                 query_tokens_length=None):
        _DocSpan = collections.namedtuple(  # pylint: disable=invalid-name
You shouldn't create a new _DocSpan in every call. You can do it outside.
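A minimal sketch of the fix: define the namedtuple once at module scope rather than rebuilding the class on every invocation. The span-enumeration logic below is illustrative, not the PR's exact code:

```python
import collections

# Defined once at import time; calling collections.namedtuple inside the
# function would create a brand-new class object on every call.
DocSpan = collections.namedtuple('DocSpan', ['start', 'length'])

def make_doc_spans(num_tokens, max_len, stride):
    """Enumerate sliding spans over a token list (illustrative)."""
    spans, start = [], 0
    while start < num_tokens:
        length = min(max_len, num_tokens - start)
        spans.append(DocSpan(start=start, length=length))
        if start + length >= num_tokens:
            break
        start += stride
    return spans

print(make_doc_spans(10, 4, 2)[0])  # DocSpan(start=0, length=4)
```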
    return tok_start_position, tok_end_position, all_doc_tokens, query_tokens


class DocSpanTransform:
I'm not convinced that having DocSpans to denote windows over a list of tokens is helpful. I think the code is simpler when you materialize each sliding window: consider a document with 1000 tokens, doc = [0] * 1000. We can get the windows explicitly as window1 = doc[0:100]; window2 = doc[50:150]; etc. Of course you need to calculate the window-specific start and end positions when creating the window, but this should allow removing a lot of the code where you currently have to handle DocSpan objects. Instead each window will be completely self-contained after the transform.

As each window is just a list of integers, there is very little overhead in materializing the windows. Lists of integers are highly optimized in Python and we can optimize more with numpy if needed. I think the introduction of DocSpan here is a sign of premature optimization.
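The materialization idea above can be sketched like this. Function and parameter names are hypothetical; the point is that each window carries its own token slice plus window-local answer positions, so no span objects survive the transform:

```python
# Materialize each sliding window as a self-contained (tokens, positions)
# pair, shifting the global answer positions into window-local coordinates.
def materialize_windows(doc_tokens, window_len, stride, start_pos, end_pos):
    windows = []
    for win_start in range(0, len(doc_tokens), stride):
        window = doc_tokens[win_start:win_start + window_len]
        # Window-local answer positions; None when the answer does not
        # fall entirely inside this window.
        if win_start <= start_pos and end_pos < win_start + len(window):
            local = (start_pos - win_start, end_pos - win_start)
        else:
            local = None
        windows.append((window, local))
        if win_start + window_len >= len(doc_tokens):
            break
    return windows

doc = list(range(1000))
wins = materialize_windows(doc, 100, 50, start_pos=120, end_pos=130)
# The window starting at token 100 holds the answer at local positions (20, 30).
```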
scripts/bert/finetune_classifier.py
Outdated
    nlp.utils.check_version('0.8.1', warning_only=True)
    #nlp.utils.check_version('0.8.1', warning_only=True)
Change this to nlp.utils.check_version('0.9', warning_only=True)?
Thank you! A few comments but mostly fine
    @@ -330,8 +330,9 @@ def test_export(task):
     @pytest.mark.integration
     @pytest.mark.parametrize('sentencepiece', [False, True])
     def test_finetune_squad(sentencepiece):
    -    arguments = ['--optimizer', 'adam', '--batch_size', '12',
    -                 '--gpu', '0', '--epochs', '2', '--debug']
    +    arguments = ['--optimizer', 'adam', '--batch_size', '32',
@zburning it seems this test now runs for 1195.36s and is the longest-running test. Would it be reasonable to reduce its runtime?
Description
Refactor the existing data preprocessing scripts (GLUE and SQuAD) to make it
Checklist
Essentials
Changes
Comments
cc @dmlc/gluon-nlp-team