Support megatron dataset for T5 #6659
Thanks for your contribution!
Codecov Report
@@            Coverage Diff             @@
##           develop    #6659     +/-   ##
===========================================
- Coverage    60.06%   59.90%   -0.17%
===========================================
  Files          552      554      +2
  Lines        81755    81975    +220
===========================================
  Hits         49105    49105
- Misses       32650    32870    +220
a47f2dc to 3bed5b1
e843a53 to 37ce41e
examples/language_model/t5/README.md
Outdated
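In the data ID-conversion step, we need to configure tokenizer_name and select the tokenizer that matches the t5 model. The script below converts the data into processed pretraining data: the token ids in baike_sample_ids.npy and the article index information in baike_sample_idx.npz. (A processed pretraining dataset is provided here and can be downloaded via the link.)
We need to provide a sample dataset for this part.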
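- There are currently merge conflicts that need to be resolved.
- Have the old and new data versions been aligned? For example, take the old-format npy and the new-format bin produced from the same data, set a seed, run the first few steps, and check whether both versions yield the same data.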
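The alignment check suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: `first_steps_match` is a hypothetical helper, and the two in-memory lists stand in for the old npy-based and new bin-based datasets. Neither is taken from PaddleNLP.

```python
import numpy as np


def first_steps_match(old_dataset, new_dataset, seed=1234, num_steps=5, batch_size=4):
    """Check that two dataset versions yield identical samples for the
    first few batches drawn with the same seed.

    Hypothetical helper: both arguments only need len() and integer indexing.
    """
    if len(old_dataset) != len(new_dataset):
        return False
    # Same seed, same shuffled sample order for both versions.
    order = np.random.RandomState(seed).permutation(len(old_dataset))
    for step in range(num_steps):
        batch = order[step * batch_size:(step + 1) * batch_size]
        for i in batch:
            if not np.array_equal(np.asarray(old_dataset[i]), np.asarray(new_dataset[i])):
                return False
    return True


# Toy stand-ins for the two formats (assumption: real code would load the
# npy-based and bin-based datasets here instead).
samples = [np.arange(n, dtype=np.int32) for n in range(1, 33)]
old_format = list(samples)
new_format = [s.copy() for s in samples]
shifted = [s + 1 for s in samples]  # every sample differs from the original

aligned = first_steps_match(old_format, new_format)
misaligned = first_steps_match(old_format, shifted)
```

Only the first `num_steps * batch_size` shuffled samples are compared, which matches the "run the first few steps" suggestion rather than a full comparison of both datasets.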
examples/language_model/t5/README.md
Outdated
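Please add the difference between mmap and lazy: the "mmap" format builds a memory map when the data is read, while the "lazy" format reads directly from the file.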
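Changed to: Specifies the data format type of the input file. The default is mmap; mmap or lazy can be specified. The "mmap" format builds a memory map when the data is read, while the "lazy" format reads directly from the file.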
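To make the distinction concrete, here is a minimal sketch of the two read strategies using NumPy and plain file I/O. It does not reproduce the PaddleNLP implementation; the function names, the `toy_ids.bin` file, and the flat int32 layout are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


def read_with_mmap(path, start, count, dtype=np.int32):
    """'mmap' style: map the whole file into memory, then slice the mapping."""
    tokens = np.memmap(path, dtype=dtype, mode="r")
    # Copy the slice so the result no longer depends on the mapping.
    return np.array(tokens[start:start + count])


def read_lazily(path, start, count, dtype=np.int32):
    """'lazy' style: seek to the offset and read only the requested bytes."""
    itemsize = np.dtype(dtype).itemsize
    with open(path, "rb") as f:
        f.seek(start * itemsize)
        return np.frombuffer(f.read(count * itemsize), dtype=dtype)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical toy token file: 100 consecutive int32 ids.
    path = os.path.join(tmp, "toy_ids.bin")
    np.arange(100, dtype=np.int32).tofile(path)
    via_mmap = read_with_mmap(path, 10, 5)
    via_lazy = read_lazily(path, 10, 5)
```

Both calls return the same tokens; what differs is whether access goes through a memory mapping of the file or through an explicit seek and read on each request.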
Suggest keeping this consistent with llama: help="mmap/lazy format converted from preprocessed data."
Changed to: "help": "mmap/lazy format converted from preprocessed data."
model_zoo/ernie-1.0/args.py
Outdated
It would be best to keep this consistent with llama: "mmap/lazy format converted from preprocessed data"
Changed to: help="mmap/lazy format converted from preprocessed data."
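For reference, an argument carrying the agreed help text might be declared along these lines with argparse. The flag name `--data_impl` is an assumption (the thread does not name the argument); the default and the allowed values follow the wording quoted earlier in this review.

```python
import argparse


def build_parser():
    """Build a parser with a single data-format option (illustrative only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data_impl",  # assumed flag name, not confirmed by this thread
        type=str,
        default="mmap",
        choices=["mmap", "lazy"],
        help="mmap/lazy format converted from preprocessed data.",
    )
    return parser


default_args = build_parser().parse_args([])
lazy_args = build_parser().parse_args(["--data_impl", "lazy"])
```

Restricting `choices` makes the parser reject any value other than the two formats discussed here.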
examples/language_model/t5/README.md
Outdated
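In the data preparation section, the preset samples (token ids baike_sample_ids.npy and article index information baike_sample_idx.npz) should be changed to the bin and idx formats. For how to build the data, see here; pay attention to the parameter configuration, see here.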
e7ab798 to 9cd67e7
9cd67e7 to 08ce7df
I don't see any changes in paddlenlp/data/indexed_dataset.py, so it doesn't need to be included in the PR.
examples/language_model/t5/README.md
Outdated
Why does the resulting dataset here have a gpt prefix?
In the end T5 uses GPT's openwebtext dataset, so the prefix is gpt. Should it be changed to t5?
What is the dataset called when a user follows your steps? Keep it consistent; it needs to be reproducible.
Updated.
ok
06d6b9a to 673d356
1c9e3c4 to d225406
Support megatron dataset for T5
d225406 to 6a41b11
PR types
New features
PR changes
APIs
Description
Support megatron dataset for T5