-
Notifications
You must be signed in to change notification settings - Fork 262
Description
Is your feature request related to a problem? Please describe.
At the moment there are two internal joins in Pregel:
- edges LEFT JOIN states ON id = src -> triplets
- triplets LEFT JOIN states ON id = dst -> full-triplets
It is a right way to generate full triplets and I still think that it is better than GraphX model (materialize and update triplets) because memory is more important that cpu.
But. For some algorithms (LPA, PageRank, K-Core, Rocha-Thatte, etc.) the destination state is not required at all!
Describe the solution you would like
A new optional flag requireDstState with true by default (current behavior). If it is false, skip the second join at all.
Provide this only in Scala, cause it is a kind of "developer API" that require from people to understand what do they do. For example, if user requested sendMsgToDst or any dst columns in message generation and at the same time turn of the described above flag, there will be tricky and complicated runtime errors. I would like to keep it in public JVM API only.
Also I want to update all the standard library (LPA, K-Core, Rocha-Thatte) with this optimization.
Component
- Scala Core Internal
- Scala API
- Spark Connect Plugin
- Infrastructure
- PySpark Classic
- PySpark Connect
Additional context
Are you planning on creating a PR?
- I'm willing to make a pull-request