feat: Add deduplicate pushdown to clickhouse - improve materialize performance #5709
Conversation
@HaoXuAI Let me know what you think. I really wouldn't like to make it more complicated than the above, for now 😁.
I guess I'm curious who wouldn't want this behavior natively? Based on the other thread, it sounds like we degraded functionality, so it would probably make sense to remove the pandas deduplication, or at least make it optional. What do you think @HaoXuAI?
Thanks for contributing, but I think this will make it inconsistent with the other offline stores.
Instead of changing it at the store level, I would suggest making it a materialize option such as pull_latest=true; that would explicitly let the user choose between pulling all rows or only the latest (see the sketch below).
@franciscojavierarceo The reason we changed to pull_all is that pull_latest doesn't work with transformations such as aggregation, and in some cases feature_transformation as well.
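A rough sketch of what such an option could look like. The `pull_latest` parameter is hypothetical and not part of the current Feast API; it only illustrates the suggestion above:

```python
# Hypothetical sketch: expose the pull strategy as a materialize() option.
# `pull_latest` is NOT an existing Feast parameter; it illustrates the idea only.
from datetime import datetime, timedelta

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# pull_latest=True would let the offline store return only the newest row per
# entity key (safe when no aggregations/transformations run in the compute
# engine); pull_latest=False would keep today's pull-all behavior.
store.materialize(
    start_date=datetime.utcnow() - timedelta(days=7),
    end_date=datetime.utcnow(),
    # pull_latest=True,  # hypothetical flag discussed in this thread
)
```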
Will adjust, makes sense.
Closing #5713 in favor of this PR.
What this PR does / why we need it:
See #5707. We have observed a significant slowdown in heavy materialization jobs, where "heavy" means pulling a large {start_ts, end_ts} window, so that many rows end up being deduplicated by Feast's compute engine. We also observed roughly double the memory usage. Unfortunately, the Polars local engine did not show any significant speed-up, and Spark is out of the question due to the sheer complexity of adding yet another cluster engine.
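For context, the deduplication step is conceptually a "keep the latest row per entity key" operation. A minimal pandas illustration (not the actual engine code) shows why it gets expensive as the pulled window grows:

```python
# Illustrative only: roughly what "latest row per entity" deduplication costs
# in pandas when a large {start_ts, end_ts} window is pulled.
import pandas as pd

def dedupe_latest(df: pd.DataFrame, entity_keys: list[str]) -> pd.DataFrame:
    # Sort so the newest event per entity comes first, then keep that row.
    # Both the sort and the full in-memory copy scale with the window size,
    # which is where the slowdown and ~2x memory usage come from.
    return (
        df.sort_values("event_timestamp", ascending=False)
          .drop_duplicates(subset=entity_keys, keep="first")
    )
```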
Not being a fan of altering the core, I would like to introduce a local change to the ClickHouse provider that pushes the deduplication logic down to the offline store. This assumes you don't have any Feast compute transformations or aggregations. For now I would like to gate it behind a simple flag rather than overcomplicate things by, for example, trying to infer whether the engine is Pandas or whether compute engine transformations are present. Keeping it simple for now, and disabled by default; a sketch of the pushed-down query is below.
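As a sketch of the pushdown idea, the ClickHouse offline store could wrap its pull query so the database returns only the latest row per entity, e.g. via ClickHouse's LIMIT 1 BY. The flag name (`deduplicate_in_store`) and query shape below are illustrative assumptions, not the exact code in this PR:

```python
# Sketch only: how the ClickHouse offline store could push deduplication down
# to the database instead of deduplicating in the compute engine. The flag
# name (deduplicate_in_store) and query shape are illustrative.
DEDUP_QUERY_TEMPLATE = """
SELECT *
FROM {table}
WHERE event_timestamp BETWEEN %(start_ts)s AND %(end_ts)s
ORDER BY event_timestamp DESC
LIMIT 1 BY {entity_key_columns}  -- ClickHouse keeps the newest row per entity
"""

def build_pull_query(table: str, entity_key_columns: list[str],
                     deduplicate_in_store: bool = False) -> str:
    if not deduplicate_in_store:
        # Default: pull everything and let the compute engine deduplicate,
        # which stays correct when transformations/aggregations are defined.
        return (
            f"SELECT * FROM {table} "
            "WHERE event_timestamp BETWEEN %(start_ts)s AND %(end_ts)s"
        )
    return DEDUP_QUERY_TEMPLATE.format(
        table=table,
        entity_key_columns=", ".join(entity_key_columns),
    )
```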
Which issue(s) this PR fixes:
#5707
Misc