@andreyvelich
Member

andreyvelich commented Apr 6, 2021

Description of your changes:
Fixes: #5306.
Blocked by: #5287, #5676
I refactored the E2E MNIST Kubeflow example. I named it kubeflow-e2e-mnist to be more precise, and I added an OWNERS file.
Please let me know if I need to add anyone else to OWNERS.

For Katib, TFJob, and KFServing, I am using the upstream launchers to run the KFP tasks.
This example uses a namespaced Pipeline so that it can be run from a Kubeflow Notebook, so we should merge #5287 first to update the KFP SDK.

Please take a look.

/assign @Bobgy @zijianjoy @Tomcli @chinhuang007
/cc @gaocegege @johnugeorge @elikatsis

Checklist:

@google-oss-robot

@andreyvelich: GitHub didn't allow me to assign the following users: chinhuang007.

Note that only kubeflow members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


@Bobgy
Contributor

Bobgy commented Apr 9, 2021

Thank you for these efforts!
/cc @zijianjoy
This sounds like a good example to verify Kubeflow E2E! Maybe we can point distributions to it for testing.

@Bobgy
Contributor

Bobgy commented Apr 9, 2021

I'll report back when there's progress on the KFP PR

@zijianjoy
Contributor

> Thank you for these efforts!
> /cc @zijianjoy
> This sounds like a good example to verify Kubeflow E2E! Maybe we can point distributions to it for testing.

Thank you, Yuan! I added an item to use this E2E example to validate Kubeflow deployments in https://github.com/kubeflow/pipelines/projects/12

@zijianjoy
Contributor

Hello @andreyvelich, I tested the E2E MNIST example using this PR, but I ran into an issue that I can't explain: GoogleCloudPlatform/kubeflow-distribution#271. Could you help with debugging?

@andreyvelich andreyvelich changed the title from "[WIP] fix(components): Refactor Kubeflow E2E mnist example" to "fix(components): Refactor Kubeflow E2E mnist example" on Jul 21, 2021
@andreyvelich
Member Author

Hi @Bobgy @zijianjoy, I was able to run this example in Multi-User mode using the latest KFP SDK version, 1.6.5.
As far as I can see, this version includes the SDK change from #5676.
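
As a rough illustration (not part of this PR), a notebook could check the installed KFP SDK version before running the example. The helper names below are hypothetical, and the sketch assumes that 1.6.5, the version reported as working above, is a reasonable minimum; it only uses the standard library to read the installed `kfp` package version.

```python
import re
from importlib import metadata

# Version reported as working in this thread; treating it as a minimum is an assumption.
MIN_KFP_VERSION = "1.6.5"


def version_tuple(version):
    """Turn a version string such as '1.6.5' or '1.7.0rc1' into a 3-int tuple."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits of each component
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad short versions such as '1.6'
        parts.append(0)
    return tuple(parts)


def kfp_sdk_is_recent_enough(installed, minimum=MIN_KFP_VERSION):
    """Return True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_kfp_version():
    """Return the installed kfp version string, or None if kfp is not installed."""
    try:
        return metadata.version("kfp")
    except metadata.PackageNotFoundError:
        return None
```

In a notebook, `installed_kfp_version()` could be passed to `kfp_sdk_is_recent_enough` to decide whether to upgrade the SDK before compiling the pipeline.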

If you are fine with this example, I think we can finally merge it.

cc @elikatsis @johnugeorge

@zijianjoy
Contributor

Thank you @andreyvelich for the change!

I redeployed a cluster to validate the E2E workflow and encountered the following issue in the first step:

[Image: mnist_firststep]

How can I get more detailed logs to debug this further?

@andreyvelich
Member Author

Thank you for testing, @zijianjoy.
Which GKE cluster version are you using?
If your GKE nodes use the Container-Optimized OS with containerd image type, do you hit the problem mentioned here: GoogleCloudPlatform/kubeflow-distribution#271 (comment)?

@zijianjoy
Contributor

/lgtm
/approve

I debugged with Andrey offline and am now able to run the E2E example:

  1. We shouldn't restrict the resource quota in the user profile; that restriction is why the pipeline failed for me in fix(components): Refactor Kubeflow E2E mnist example #5433 (comment).
  2. We are able to reach the prediction endpoint at http://{}-predictor-default.{}.svc.cluster.local/v1/models/{}:predict, but not at mnist-e2e.namespace1.svc.cluster.local, which is supposed to be the right way to call the prediction service.
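
For reference, here is a minimal sketch of how the working URL from item 2 could be assembled. The function names are hypothetical, and the sketch assumes the three `{}` placeholders are the InferenceService name, the namespace, and the model name; the `{"instances": [...]}` body follows the V1 predict request shape.

```python
import json

# URL template quoted in item 2 above. The meaning of the three placeholders
# (service name, namespace, model name) is an assumption.
PREDICT_URL_TEMPLATE = (
    "http://{}-predictor-default.{}.svc.cluster.local/v1/models/{}:predict"
)


def build_predict_url(service_name, namespace, model_name):
    """Fill the in-cluster predictor URL template with concrete names."""
    return PREDICT_URL_TEMPLATE.format(service_name, namespace, model_name)


def build_predict_body(instances):
    """Serialize model inputs as a V1-style {"instances": [...]} JSON body."""
    return json.dumps({"instances": instances})


# Example values taken from the hostname mentioned in item 2.
url = build_predict_url("mnist-e2e", "namespace1", "mnist-e2e")
body = build_predict_body([[0.0, 1.0]])
```

The resulting `url` and `body` could then be sent with any HTTP client from inside the cluster, for example from the Kubeflow Notebook.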

Approving this PR. Thank you so much again, Andrey, for helping to refresh the E2E workflow example!

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: zijianjoy


@google-oss-robot google-oss-robot merged commit 89c36d3 into kubeflow:master Jul 27, 2021

Development

Successfully merging this pull request may close these issues.

E2E-mnist sample not up-to-date with Katib launcher