Skip to content

feat: add a way to skip second join for triplets generation in Pregel #790

@SemyonSinchenko

Description

@SemyonSinchenko

Is your feature request related to a problem? Please describe.
At the moment there are two internal joins in Pregel:

  1. edges LEFT JOIN states ON id = src -> triplets
  2. triplets LEFT JOIN states ON id = dst -> full-triplets

It is a right way to generate full triplets and I still think that it is better than GraphX model (materialize and update triplets) because memory is more important that cpu.

But. For some algorithms (LPA, PageRank, K-Core, Rocha-Thatte, etc.) the destination state is not required at all!

Describe the solution you would like
A new optional flag requireDstState with true by default (current behavior). If it is false, skip the second join at all.

Provide this only in Scala, cause it is a kind of "developer API" that require from people to understand what do they do. For example, if user requested sendMsgToDst or any dst columns in message generation and at the same time turn of the described above flag, there will be tricky and complicated runtime errors. I would like to keep it in public JVM API only.

Also I want to update all the standard library (LPA, K-Core, Rocha-Thatte) with this optimization.

Component

  • Scala Core Internal
  • Scala API
  • Spark Connect Plugin
  • Infrastructure
  • PySpark Classic
  • PySpark Connect

Additional context

Are you planning on creating a PR?

  • I'm willing to make a pull-request

Metadata

Metadata

Assignees

Labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions