
Conversation

juliusvonkohout
Member

@juliusvonkohout commented Jun 10, 2025

Description of your changes:

closes kubeflow/manifests#3119

Thank you everyone!
I have been pushing this for around 4 years and even had Google and Red Hat employees involved, and back then even Amazon. It is fundamental for CVEs, maintainability (MinIO is now stuck for around 5 years) and hard multi-tenancy as a basic requirement for an enterprise platform. We also had approaches there with MinIO for several years. It all started in 2020 with #4649 and went via #7725 (2022) and kubeflow/manifests#2826 (October 2024) to kubeflow/manifests#3051 (2025). Without the experimental and extended tests it would have been very hard to pull off and coordinate. I want to especially highlight @pschoen-itsc, who spent his effort here for the public health sector in Germany, where many insurers need hard multi-tenancy to process data.

We evaluated many alternatives, and now we have something S3- and IAM-policy-compatible, scalable, and with hard multi-tenancy.

@akagami-harsh you can create branches against this PR.

  • Use SeaweedFS
  • Replace the old sync.py etc.
  • Refactor our tests to be usable within KFP
  • Remove MinIO (including /env/azure and other legacy stuff)
  • Explain how to use the SeaweedFS gateway for AWS/GCP/Azure S3


Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@juliusvonkohout
Member Author

closes #7725

@HumairAK added this to the KFP 2.6.0 milestone Jun 17, 2025
@HumairAK moved this to In Review in KFP Project Tracker Jun 17, 2025
@juliusvonkohout marked this pull request as ready for review July 3, 2025 14:27
@juliusvonkohout
Member Author

juliusvonkohout commented Jul 3, 2025

We are still missing:

  • Explain how to use the SeaweedFS gateway for AWS/GCP/Azure S3 (Harshvir)
  • Add an architectural diagram here for MinIO and in general for kubeflow/manifests, as in my KubeCon presentations and blogs (Julius)

but it is at least ready for a first review @HumairAK

@juliusvonkohout
Member Author

/retest

@droctothorpe
Collaborator

This is incredibly comprehensive and impressive, @juliusvonkohout and @akagami-harsh. What obstacles do you anticipate for end users upgrading from MinIO to SeaweedFS? Does it make sense to provide migration documentation or automation?

@juliusvonkohout
Member Author

This is incredibly comprehensive and impressive, @juliusvonkohout and @akagami-harsh. What obstacles do you anticipate for end users upgrading from MinIO to SeaweedFS? Does it make sense to provide migration documentation or automation?

If users really need the old data then the cluster administrator needs to handle it anyway by copying via boto3 from MinIO to SeaweedFS, so that is something we could deal with in follow-up PRs. We could provide a Job/CronJob that does this automatically; I think even LLMs can write this.
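
For illustration, a minimal sketch of such a copy job with boto3 (the endpoints, credentials, and bucket name below are placeholders for whatever the in-cluster MinIO and SeaweedFS services actually expose; this is not the official migration tooling):

```python
"""Hedged sketch: copy all objects from the legacy MinIO bucket to SeaweedFS.

Endpoints, credentials, and the bucket name are placeholders and must be
replaced with the values of the actual in-cluster services.
"""
import boto3

BUCKET = "mlpipeline"  # default KFP artifact bucket, assumed to exist on both sides

# Source: the legacy MinIO service (placeholder endpoint and credentials).
src = boto3.resource(
    "s3",
    endpoint_url="http://minio-service.kubeflow:9000",
    aws_access_key_id="<minio-access-key>",
    aws_secret_access_key="<minio-secret-key>",
)

# Destination: the SeaweedFS S3 endpoint (placeholder endpoint and credentials).
dst = boto3.client(
    "s3",
    endpoint_url="http://seaweedfs.kubeflow:8333",
    aws_access_key_id="<seaweedfs-access-key>",
    aws_secret_access_key="<seaweedfs-secret-key>",
)


def migrate(bucket: str) -> None:
    """Stream every object from the source bucket to the destination bucket."""
    for obj in src.Bucket(bucket).objects.all():
        body = obj.get()["Body"]  # StreamingBody behaves like a file object
        dst.upload_fileobj(body, bucket, obj.key)
        print(f"copied {obj.key}")


if __name__ == "__main__":
    migrate(BUCKET)
```

Wrapped in a Kubernetes Job or CronJob, this is roughly the automation mentioned above.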

More interesting is probably the SeaweedFS gateway documentation link that shows how users can connect SeaweedFS to AWS/GCP/Azure/S3-compatible object storage. But also there I prefer a follow-up PR. Let's /approve and merge what we have @HumairAK @hbelmiro @droctothorpe and continue in follow-up PRs.

@HumairAK removed the request for review from rimolive July 9, 2025 16:57
@juliusvonkohout
Member Author

/lgtm
@HumairAK for approval

@kubeflow deleted a comment from google-oss-prow bot Aug 18, 2025
@akagami-harsh
Contributor

akagami-harsh commented Aug 18, 2025

/lgtm
@HumairAK for approval

@akagami-harsh
Contributor

akagami-harsh commented Aug 18, 2025

@akagami-harsh I noticed this flakiness occur in this CI failure https://github.com/kubeflow/pipelines/actions/runs/16993993913/job/48180413365?pr=11965

I noticed the SeaweedFS pod was filled with:

2025-08-15T16:40:38.8534567Z I0815 16:39:12.797939 master_grpc_server_volume.go:141 volume grow &{Option:{"collection":"mlpipeline","replication":{},"ttl":{"Count":0,"Unit":0},"preallocate":1073741824,"version":3} Count:0 Force:false Reason:grpc assign}
2025-08-15T16:40:38.8535444Z E0815 16:39:12.801327 volume_grpc_admin.go:59 assign volume volume_id:82  collection:"mlpipeline"  preallocate:1073741824  replication:"000"  version:3: No more free space left
2025-08-15T16:40:38.8536338Z W0815 16:39:12.802934 volume_growth.go:273 Failed to assign volume 82 on topo:DefaultDataCenter:DefaultRack:10.244.0.26:8080: rpc error: code = Unknown desc = No more free space left
2025-08-15T16:40:38.8537464Z I0815 16:39:12.803731 volume_growth.go:120 create 7 volume, created 0: failed to assign volume 82 on topo:DefaultDataCenter:DefaultRack:10.244.0.26:8080: rpc error: code = Unknown desc = No more free space left
2025-08-15T16:40:38.8538277Z I0815 16:39:12.815178 master_grpc_server_volume.go:141 volume grow &{Option:{"collection":"mlpipeline","replication":{},"ttl":{"Count":0,"Unit":0},"preallocate":1073741824,"version":3} Count:0 Force:false Reason:grpc assign}

I also noticed that the "Free up Disk Space" step took 8 min 39 s to complete, which is odd given that the others complete much faster. Still, the final disk usage reports the same, so I'm not sure it's related, but it was the only thing that stood out.

Not sure if this is SeaweedFS-specific, but looking at the history of this workflow, I couldn't find us encountering this behavior with MinIO in recent history. Thoughts?

I think lots of files were deleted in the free-disk-space step, creating fragmented free space, and SeaweedFS has a preallocation feature. Preallocation needs large contiguous blocks, not just total free space. The logs show preallocate:1073741824 (1 GB), so it needs 1 GB of contiguous space:

2025-08-15T16:40:38.8534567Z I0815 16:39:12.797939 master_grpc_server_volume.go:141 volume grow &{Option:{"collection":"mlpipeline","replication":{},"ttl":{"Count":0,"Unit":0},"preallocate":1073741824,"version":3} Count:0 Force:false Reason:grpc assign}
2025-08-15T16:40:38.8535444Z E0815 16:39:12.801327 volume_grpc_admin.go:59 assign volume volume_id:82  collection:"mlpipeline"  preallocate:1073741824  replication:"000"  version:3: No more free space left
2025-08-15T16:40:38.8536338Z W0815 16:39:12.802934 volume_growth.go:273 Failed to assign volume 82 on topo:DefaultDataCenter:DefaultRack:10.244.0.26:8080: rpc error: code = Unknown desc = No more free space left
2025-08-15T16:40:38.8537464Z I0815 16:39:12.803731 volume_growth.go:120 create 7 volume, created 0: failed to assign volume 82 on topo:DefaultDataCenter:DefaultRack:10.244.0.26:8080: rpc error: code = Unknown desc = No more free space left
2025-08-15T16:40:38.8538277Z I0815 16:39:12.815178 master_grpc_server_volume.go:141 volume grow &{Option:{"collection":"mlpipeline","replication":{},"ttl":{"Count":0,"Unit":0},"preallocate":1073741824,"version":3} Count:0 Force:false Reason:grpc assign}

After disk cleanup, the disk likely has plenty of total free space, but it's fragmented; I think this would explain why MinIO doesn't hit this issue.
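
If we want to rule preallocation in or out, the SeaweedFS master has knobs for it; below is a hedged sketch, with flag names as I recall them from `weed master -h` (please verify against the deployed version). The preallocate:1073741824 in the logs lines up with a 1024 MB per-volume size limit, and since the growth step tries to create 7 volumes at once, that is roughly 7 GB requested in one go.

```sh
# Hedged sketch, not the manifests' actual configuration: disable up-front
# disk preallocation so a newly grown volume no longer requests ~1 GB at
# creation time, while keeping the per-volume size limit explicit.
# Double-check both flag names against `weed master -h`.
weed master -volumePreallocate=false -volumeSizeLimitMB=1024
```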

@HumairAK
Collaborator

@akagami-harsh I'm skeptical that this is due to fragmentation, and it's likely due to the storage size we are setting in the PVC for SeaweedFS, see here. Can we set it to the same size as the MinIO PVC at 20Gi here?
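
For concreteness, a hedged sketch of the suggested change (resource name and namespace are illustrative placeholders, not copied from the actual manifest):

```yaml
# Hedged sketch: size the SeaweedFS PVC like the existing MinIO one (20Gi).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: seaweedfs-pvc   # illustrative name
  namespace: kubeflow
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi      # match the MinIO PVC size
```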

@akagami-harsh
Contributor

@akagami-harsh I'm skeptical that this is due to fragmentation, and it's likely due to the storage size we are setting in the PVC for SeaweedFS, see here. Can we set it to the same size as the MinIO PVC at 20Gi here?

@HumairAK, opened a PR to update the PVC volume size: #12156

@google-oss-prow bot removed the lgtm label Aug 19, 2025

New changes are detected. LGTM label has been removed.

@HumairAK
Collaborator

HumairAK commented Aug 20, 2025

thanks @akagami-harsh, this lgtm, can you squash your commits and add a meaningful clean commit message?

@juliusvonkohout
Member Author

juliusvonkohout commented Aug 20, 2025

thanks @akagami-harsh, this lgtm, can you squash your commits and add a meaningful clean commit message?

The commits will be automatically squashed on merge with the PR title as the commit message, so I would like to avoid any further changes.


Also, given the multiple authors, it is a bit more complicated.

@HumairAK merged commit 25af89c into master Aug 20, 2025
70 of 71 checks passed
@HumairAK deleted the seaweedfs branch August 20, 2025 14:47
@github-project-automation bot moved this from In Review to Done in KFP Project Tracker Aug 20, 2025
@HumairAK
Collaborator

HumairAK commented Aug 20, 2025

I updated the commit message to something more meaningful - in general I prefer to have the PR authors curate their message, as they have more domain knowledge.

Thank you everyone for all your hard work on this - a much-needed change, well done all around!

@droctothorpe
Collaborator

Congrats, everyone! Phenomenal (and overdue) enhancement! You should consider submitting a talk on this to the next Kubeflow Summit in Europe.

aniketpati1121 pushed a commit to aniketpati1121/Kubeflow-pipelines that referenced this pull request Aug 23, 2025
add seaweedFS to KFP as default object store

This change switches KFP's default objectstore deployment to Seaweedfs
instead of Minio. Minio is still kept as an optional deployment to 
help users with migrating. CI is updated to accommodate testing for 
Minio and SeaweedFS. Multi-User testing is also introduced, which also
includes namespace authorization testing in Seaweedfs.

Some more work is needed to completely rid the KFP backend and frontend
code of Minio specific code and labeling, which we accrue as tech debt
as part of this change. 


Signed-off-by: juliusvonkohout <[email protected]>
Signed-off-by: Julius von Kohout <[email protected]>
Signed-off-by: Harshvir Potpose <[email protected]>
Co-authored-by: Harshvir Potpose <[email protected]>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: pschoen-itsc <[email protected]>
aniketpati1121 pushed a commit to aniketpati1121/Kubeflow-pipelines that referenced this pull request Aug 27, 2025

Successfully merging this pull request may close these issues.

Finish and upstream the minio replacement
8 participants