Skip to content

Conversation

@astronautas
Copy link

@astronautas astronautas commented Nov 3, 2025

What this PR does / why we need it:

#5707. We have observed significant slowdown in heavy materialization jobs. Heavy = pulling in significant period of {start_ts, end_ts}, when essentially you end up with many values to be deduplicated by Feast's compute engine. Unfortunately, Polars local engine did not show any significant speed-up. Spark is out of question due to sheer complexity of adding yet another cluster engine. We also observed nearly twice increased memory usage...

Not being a fan of altering the core, I would like to introduce a local change to Clickhouse provider to pushdown the deduplication logic to offline store. This assumes you don't have any Feast compute transformations or aggregation. I would like to just use a flag for now and not make it overcomplicated i.e. try to infer it's Pandas, if it has compute engine transformations. Keeping it simple, for now, and disabled by default.

Which issue(s) this PR fixes:

#5707

Misc

@astronautas astronautas requested a review from a team as a code owner November 3, 2025 12:56
@astronautas
Copy link
Author

astronautas commented Nov 3, 2025

@HaoXuAI Let me know what you think. I really wouldn't like to make it more complicated than above, for now 😁 .

@astronautas astronautas force-pushed the feat/add-deduplicate-pushdown-to-clickhouse branch from cbcaab6 to 7579962 Compare November 3, 2025 13:02
@astronautas astronautas force-pushed the feat/add-deduplicate-pushdown-to-clickhouse branch from 1a442cd to cfaf9a6 Compare November 3, 2025 13:15
@astronautas astronautas changed the title feat: Add deduplicate pushdown to clickhouse feat: Add deduplicate pushdown to clickhouse - improve materialize performance Nov 3, 2025
@astronautas
Copy link
Author

@franciscojavierarceo cc

@franciscojavierarceo
Copy link
Member

I guess I'm curious who wouldn't want this behavior natively? Based on the other thread, it sounds like we degraded functionality and probably it would make sense to remove the pandas deduplication and make that optional.

What do you think @HaoXuAI?

Copy link
Collaborator

@HaoXuAI HaoXuAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing, but I think this will make it inconsistent with other offline store.
Instead of changing at the store level, I would suggest to make the materialize an option config pull_latest=true, this will explicitly let the user to be able to either pull all or pull latest.
@franciscojavierarceo The reason we change to pull_all is because pull_latest doesn't work with the transformation such as aggregation and in some cases feature_transformation as well.

@astronautas
Copy link
Author

Thanks for contributing, but I think this will make it inconsistent with other offline store. Instead of changing at the store level, I would suggest to make the materialize an option config pull_latest=true, this will explicitly let the user to be able to either pull all or pull latest. @franciscojavierarceo The reason we change to pull_all is because pull_latest doesn't work with the transformation such as aggregation and in some cases feature_transformation as well.

Will adjust, makes sense.

@astronautas
Copy link
Author

#5713 closing in favor of this

@astronautas astronautas closed this Nov 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants