feat: Add deduplicate pushdown to clickhouse - improve materialize performance#5709
feat: Add deduplicate pushdown to clickhouse - improve materialize performance#5709astronautas wants to merge 7 commits intofeast-dev:masterfrom
Conversation
…terialization logic (calling it) Signed-off-by: lukas.valatka <lukas.valatka@cast.ai>
Signed-off-by: lukas.valatka <lukas.valatka@cast.ai>
|
@HaoXuAI Let me know what you think. I really wouldn't like to make it more complicated than above, for now 😁 . |
Signed-off-by: lukas.valatka <lukas.valatka@cast.ai>
cbcaab6 to
7579962
Compare
Signed-off-by: lukas.valatka <lukas.valatka@cast.ai>
Signed-off-by: lukas.valatka <lukas.valatka@cast.ai>
1a442cd to
cfaf9a6
Compare
|
I guess I'm curious who wouldn't want this behavior natively? Based on the other thread, it sounds like we degraded functionality and probably it would make sense to remove the pandas deduplication and make that optional. What do you think @HaoXuAI? |
HaoXuAI
left a comment
There was a problem hiding this comment.
Thanks for contributing, but I think this will make it inconsistent with other offline store.
Instead of changing at the store level, I would suggest to make the materialize an option config pull_latest=true, this will explicitly let the user to be able to either pull all or pull latest.
@franciscojavierarceo The reason we change to pull_all is because pull_latest doesn't work with the transformation such as aggregation and in some cases feature_transformation as well.
Will adjust, makes sense. |
|
#5713 closing in favor of this |
What this PR does / why we need it:
#5707. We have observed significant slowdown in heavy materialization jobs. Heavy = pulling in significant period of {start_ts, end_ts}, when essentially you end up with many values to be deduplicated by Feast's compute engine. Unfortunately, Polars local engine did not show any significant speed-up. Spark is out of question due to sheer complexity of adding yet another cluster engine. We also observed nearly twice increased memory usage...
Not being a fan of altering the core, I would like to introduce a local change to Clickhouse provider to pushdown the deduplication logic to offline store. This assumes you don't have any Feast compute transformations or aggregation. I would like to just use a flag for now and not make it overcomplicated i.e. try to infer it's Pandas, if it has compute engine transformations. Keeping it simple, for now, and disabled by default.
Which issue(s) this PR fixes:
#5707
Misc