This repository was archived by the owner on Jan 15, 2024. It is now read-only.

leezu
Contributor

@leezu leezu commented Jun 24, 2019

Description

Our CI currently takes more than 1 hour, even though it runs multiple Jenkinsfiles in parallel. The longest-running one is GluonNLP-py3-master-gpu-doc, and its runtime is expected to gradually worsen as we add more examples.

This PR speeds up the GluonNLP-py3-master-gpu-doc run by executing our example notebooks in parallel on AWS Batch, scheduling a separate job for each notebook. As a result, GluonNLP-py3-master-gpu-doc now takes only 30 minutes, of which 10 minutes are spent waiting for AWS Batch to begin execution. As the AWS Batch service is expected to improve, we can hope for an execution time of ~20 minutes in the future.
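A minimal sketch of the "one AWS Batch job per notebook" idea, for illustration only. It is not the code in this PR: the helper names, the job queue and job definition names, and the `jupyter nbconvert` command are assumptions. The only AWS call relied on is the `submit_job` method of the boto3 Batch client; the client is passed in as an argument so the sketch can be exercised without AWS access.

```python
import re
from pathlib import Path


def job_name_for(notebook):
    """Derive a Batch-safe job name from a notebook path (hypothetical helper).

    AWS Batch job names may contain letters, digits, hyphens and underscores,
    up to 128 characters, so every other character is replaced by a hyphen.
    """
    stem = Path(notebook).with_suffix("").as_posix()
    return re.sub(r"[^A-Za-z0-9_-]", "-", stem)[:128]


def build_job_requests(notebooks, job_queue, job_definition):
    """Build one submit_job request per notebook.

    job_queue and job_definition are placeholders for resources that would
    have to exist in the AWS account; they are not taken from this PR.
    """
    return [
        {
            "jobName": job_name_for(nb),
            "jobQueue": job_queue,
            "jobDefinition": job_definition,
            "containerOverrides": {
                # Assumed command: execute the notebook inside the container.
                "command": ["jupyter", "nbconvert", "--to", "notebook",
                            "--execute", nb],
            },
        }
        for nb in notebooks
    ]


def submit_all(batch_client, requests):
    """Submit every request and return the job ids reported by the client."""
    return [batch_client.submit_job(**req)["jobId"] for req in requests]
```

With real credentials, `batch_client` would be `boto3.client("batch")`. Because each notebook becomes an independent job, the wall-clock time is bounded by the slowest notebook plus the time AWS Batch needs to start the jobs, rather than by the sum of all notebook runtimes.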

With this change, GluonNLP-py3-master-gpu-integration and GluonNLP-py3-gpu-integration are now our longest-running Jenkinsfiles.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Serverless CI Pipeline based on AWS Batch

@leezu leezu requested a review from szha as a code owner June 24, 2019 12:23
@codecov

codecov bot commented Jun 24, 2019

Codecov Report

Merging #791 into master will decrease coverage by 9%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #791      +/-   ##
==========================================
- Coverage    90.5%   81.49%   -9.01%     
==========================================
  Files          64       64              
  Lines        6295     6295              
==========================================
- Hits         5697     5130     -567     
- Misses        598     1165     +567

Impacted Files                                   Coverage Δ
src/gluonnlp/model/train/cache.py                26.19% <0%> (-71.43%) ⬇️
src/gluonnlp/model/train/language_model.py       42.04% <0%> (-55.12%) ⬇️
src/gluonnlp/embedding/evaluation.py             41.8% <0%> (-54.1%) ⬇️
src/gluonnlp/data/batchify/language_model.py     44.03% <0%> (-52.3%) ⬇️
src/gluonnlp/model/translation.py                20.63% <0%> (-50.8%) ⬇️
src/gluonnlp/model/language_model.py             50.38% <0%> (-49.62%) ⬇️
src/gluonnlp/model/bert.py                       70.28% <0%> (-28.99%) ⬇️
src/gluonnlp/data/translation.py                 73.64% <0%> (-26.36%) ⬇️
src/gluonnlp/model/train/__init__.py             75% <0%> (-25%) ⬇️
src/gluonnlp/model/elmo.py                       77.55% <0%> (-20.41%) ⬇️
... and 13 more

@leezu leezu force-pushed the batch branch 28 times, most recently from 3531aff to eef2b36 Compare June 24, 2019 13:34
@leezu leezu force-pushed the batch branch 5 times, most recently from 0196a01 to 41a868e Compare June 26, 2019 09:23
@mli
Member

mli commented Jun 26, 2019

Job PR-791/113 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-791/113/index.html

@leezu leezu force-pushed the batch branch 2 times, most recently from 85c3420 to 31f8f02 Compare June 26, 2019 10:21
Sheng Zha and others added 4 commits June 26, 2019 10:30
- Automatically set correct working directory
- Manually set encoding; System environment may not be set up correctly for
  inferring encoding
@leezu leezu changed the title [WIP] [CI] Serverless CI Pipeline based on AWS Batch [CI] AWS Batch serverless CI Pipeline for parallel notebook execution during website build step Jun 26, 2019
@leezu leezu force-pushed the batch branch 4 times, most recently from 1c840b4 to aae0145 Compare June 26, 2019 11:57
@mli
Member

mli commented Jun 26, 2019

Job PR-791/123 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-791/123/index.html

@leezu leezu requested a review from eric-haibin-lin June 26, 2019 13:05
@leezu
Contributor Author

leezu commented Jun 26, 2019

@szha @eric-haibin-lin please review and merge if everything looks good to you

Member

@szha szha left a comment

This is awesome! Thanks.
Integration is taking ~55 minutes, which means we should probably find another way of testing the scripts.

@eric-haibin-lin eric-haibin-lin merged commit 0eaee2f into dmlc:master Jun 27, 2019
@leezu leezu deleted the batch branch June 27, 2019 08:06

4 participants