feat: Add deduplicate pushdown to clickhouse - improve materialize performance #5709
Conversation
@HaoXuAI Let me know what you think. I really wouldn't like to make it more complicated than the above, for now 😁.
I guess I'm curious who wouldn't want this behavior natively? Based on the other thread, it sounds like we degraded functionality, so it would probably make sense to remove the pandas deduplication, or at least make it optional. What do you think @HaoXuAI?
Thanks for contributing, but I think this will make it inconsistent with the other offline stores.
Instead of changing it at the store level, I would suggest making it a materialize option such as pull_latest=true; that would explicitly let the user choose between pulling all rows or only the latest (see the sketch below).
@franciscojavierarceo The reason we changed to pull_all is that pull_latest doesn't work with transformations such as aggregation, and in some cases feature_transformation as well.
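A rough sketch of what such an option could look like. The `pull_latest` parameter is hypothetical and not part of the current Feast API; it only illustrates the suggestion above:

```python
# Hypothetical sketch: expose the pull strategy as a materialize() option.
# `pull_latest` is NOT an existing Feast parameter; it illustrates the idea only.
from datetime import datetime, timedelta

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# pull_latest=True would let the offline store return only the newest row per
# entity key (safe when no aggregations/transformations run in the compute
# engine); pull_latest=False would keep today's pull-all behavior.
store.materialize(
    start_date=datetime.utcnow() - timedelta(days=7),
    end_date=datetime.utcnow(),
    # pull_latest=True,  # hypothetical flag discussed in this thread
)
```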
Will adjust, makes sense.
Closing #5713 in favor of this PR.
What this PR does / why we need it:
See #5707. We have observed a significant slowdown in heavy materialization jobs, where "heavy" means pulling a large {start_ts, end_ts} window, so that many rows end up being deduplicated by Feast's compute engine. We also observed roughly double the memory usage. Unfortunately, the Polars local engine did not show any significant speed-up, and Spark is out of the question due to the sheer complexity of adding yet another cluster engine.
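For context, the deduplication step is conceptually a "keep the latest row per entity key" operation. A minimal pandas illustration (not the actual engine code) shows why it gets expensive as the pulled window grows:

```python
# Illustrative only: roughly what "latest row per entity" deduplication costs
# in pandas when a large {start_ts, end_ts} window is pulled.
import pandas as pd

def dedupe_latest(df: pd.DataFrame, entity_keys: list[str]) -> pd.DataFrame:
    # Sort so the newest event per entity comes first, then keep that row.
    # Both the sort and the full in-memory copy scale with the window size,
    # which is where the slowdown and ~2x memory usage come from.
    return (
        df.sort_values("event_timestamp", ascending=False)
          .drop_duplicates(subset=entity_keys, keep="first")
    )
```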
Not being a fan of altering the core, I would like to introduce a local change to the ClickHouse provider that pushes the deduplication logic down to the offline store. This assumes you don't have any Feast compute transformations or aggregations. For now I would like to gate it behind a simple flag rather than overcomplicate things by, for example, trying to infer whether the engine is Pandas or whether compute engine transformations are present. Keeping it simple for now, and disabled by default; a sketch of the pushed-down query is below.
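As a sketch of the pushdown idea, the ClickHouse offline store could wrap its pull query so the database returns only the latest row per entity, e.g. via ClickHouse's LIMIT 1 BY. The flag name (`deduplicate_in_store`) and query shape below are illustrative assumptions, not the exact code in this PR:

```python
# Sketch only: how the ClickHouse offline store could push deduplication down
# to the database instead of deduplicating in the compute engine. The flag
# name (deduplicate_in_store) and query shape are illustrative.
DEDUP_QUERY_TEMPLATE = """
SELECT *
FROM {table}
WHERE event_timestamp BETWEEN %(start_ts)s AND %(end_ts)s
ORDER BY event_timestamp DESC
LIMIT 1 BY {entity_key_columns}  -- ClickHouse keeps the newest row per entity
"""

def build_pull_query(table: str, entity_key_columns: list[str],
                     deduplicate_in_store: bool = False) -> str:
    if not deduplicate_in_store:
        # Default: pull everything and let the compute engine deduplicate,
        # which stays correct when transformations/aggregations are defined.
        return (
            f"SELECT * FROM {table} "
            "WHERE event_timestamp BETWEEN %(start_ts)s AND %(end_ts)s"
        )
    return DEDUP_QUERY_TEMPLATE.format(
        table=table,
        entity_key_columns=", ".join(entity_key_columns),
    )
```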
Which issue(s) this PR fixes:
#5707
Misc