Skip to content

[SQL] Compiler support for Star Joins#5544

Open
mihaibudiu wants to merge 6 commits intomainfrom
star
Open

[SQL] Compiler support for Star Joins#5544
mihaibudiu wants to merge 6 commits intomainfrom
star

Conversation

@mihaibudiu
Copy link
Contributor

Star joins are a new flavor of joins introduced in #5517
This PR adds compiler support for star joins.
They are used currently to combine the results of multiple aggregates over the same input collection. Prior to this PR a balanced tree of joins was synthesized.

Currently this PR converts the star joins at the end of compilation into trees of standard (incremental) joins, so it works even without support from DBSP. Once support for DBSP is merged, this conversion will be removed when rebasing on main.

This PR enables us to test that all other compiler transformations work correctly with star join operators.
I have introduced only star_join_index and star_join for now; I will add support for star_join_flatmap as well.

Copilot AI review requested due to automatic review settings January 31, 2026 00:01
@mihaibudiu mihaibudiu marked this pull request as draft January 31, 2026 00:01
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds compiler support for star joins, a new join operator variant introduced in #5517. Star joins are used to efficiently combine results from multiple aggregates over the same input collection. The implementation currently converts star joins to standard incremental joins at the end of compilation, with native DBSP support to be added in future work.

Changes:

  • Introduced star join operators (DBSPStarJoinOperator and DBSPStarJoinIndexOperator) and their base class
  • Modified compilation pipeline to handle star join operators throughout transformation passes
  • Refactored aggregate combination logic to use star joins instead of balanced binary join trees
  • Added incremental/non-incremental interface markers to operators for better type safety

Reviewed changes

Copilot reviewed 63 out of 63 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
DBSPStarJoinOperator.java New star join operator for unindexed results
DBSPStarJoinIndexOperator.java New star join operator for indexed results
DBSPStarJoinBaseOperator.java Base class for star join operators
ImplementJoins.java Renamed from ExpandJoins.java, implements star joins using balanced tree of standard joins
CalciteToDBSPCompiler.java Modified aggregate combination to use star joins instead of binary join trees
IIncremental.java New interface to mark incremental operators
ILinear.java Modified to extend IIncremental interface
RemoveUnusedFields.java Added support for removing unused fields from star join operators
ExpandOperators.java Added delta expansion logic for star join operators
OptimizeProjectionVisitor.java Extended to push projections through star join operators
CircuitVisitor.java Added visitor methods for star join operators
DBSPExpression.java Modified field() to return simplified expressions automatically
Regression2Tests.java Added test for issue 5520 (EXP overflow to infinity)
Comments suppressed due to low confidence (4)

sql-to-dbsp-compiler/SQL-compiler/src/main/java/org/dbsp/sqlCompiler/compiler/visitors/unusedFields/RemoveUnusedFields.java:1

  • The condition is inverted. The variable should be set to true when there ARE unused fields, not when there are no unused fields. This should be if (map.hasUnusedFields(1)).
    sql-to-dbsp-compiler/SQL-compiler/src/main/java/org/dbsp/sqlCompiler/compiler/visitors/outer/monotonicity/InsertLimiters.java:1
  • The condition true || makes the rest of the expression unreachable. This appears to be a temporary placeholder that should be removed once the TODO is implemented.
    sql-to-dbsp-compiler/SQL-compiler/src/main/java/org/dbsp/sqlCompiler/compiler/visitors/outer/intern/RewriteInternedFields.java:1
  • Corrected spelling of 'to' to 'do'.
    sql-to-dbsp-compiler/SQL-compiler/src/main/java/org/dbsp/sqlCompiler/compiler/backend/rust/ToRustVisitor.java:1
  • This loop adds a trailing comma after the last input. The comma should only be added between inputs, not after the last one.

@ryzhyk ryzhyk mentioned this pull request Feb 3, 2026
@mihaibudiu mihaibudiu marked this pull request as ready for review February 4, 2026 01:27
@mihaibudiu
Copy link
Contributor Author

This PR has been rebased on #5550

@mihaibudiu
Copy link
Contributor Author

@claude

@claude
Copy link

claude bot commented Feb 8, 2026

Claude finished @mihaibudiu's task in 2m 8s —— View job


Star Join Implementation Review

I've reviewed the PR changes implementing compiler support for star joins. Overall, this is a well-structured implementation that successfully introduces star join operators to the SQL compiler. Here's my detailed analysis:

Todo List:

  • Understand the star join architecture and design
  • Review new operator implementations
  • Check the compilation pipeline changes
  • Review visitor pattern implementations
  • Examine delta expansion logic
  • Check for potential issues and improvements

Architecture & Design ✅

Strong Points:

  • Clean three-operator hierarchy: DBSPStarJoinBaseOperatorDBSPStarJoinOperator / DBSPStarJoinIndexOperator
  • Properly marked as incremental-only with IIncremental interface
  • Consistent with existing join operator patterns
  • Well-documented purpose and constraints (N inputs, same key types)

Key Design Features:

  1. Star Join Detection (CreateStarJoins.java): Sophisticated pattern recognition that reverse-engineers keys from join trees
  2. Flexible Implementation (ImplementJoins.java): Toggle between native DBSP operators and balanced join trees via EXPAND_STAR flag
  3. Delta Expansion (DeltaExpandOperators.java): Correct incremental semantics - for N inputs, creates N terms with one delta and (N-1) integrals

Code Quality Review

Operator Implementations

DBSPStarJoinBaseOperator.java (Lines 15-45):

  • Proper input validation (N>1, all indexed, same key type)
  • Good parameter count check: function.parameters.length - 1 == inputs.size()
  • Clean with() method pattern for immutable updates

DBSPStarJoinOperator.java & DBSPStarJoinIndexOperator.java:

  • Both correctly implement visitor patterns
  • JSON deserialization properly implemented
  • Type checking enforces function result matches output type

Visitor Pattern Extensions

All visitor classes properly extended:

  • CircuitVisitor.java: Added preorder()/postorder() for both star join types
  • CircuitRewriter.java: Complete rewrite methods with proper input/function/type handling
  • CircuitCloneVisitor.java: Cloning logic implemented
  • Backend visitors (ToRustVisitor, ToDotNodesVisitor) updated

Optimization Support

OptimizeProjectionVisitor.java (Lines added):

  • Implements projection pushdown through star joins
  • Correctly handles multi-input scenario with field mapping
  • Maintains consistency with binary join optimization patterns

RemoveUnusedFields.java:

  • Properly tracks unused fields across all star join inputs
  • Extends closure rewriting to handle N-ary functions

Delta Expansion Analysis

processStarJoin() in DeltaExpandOperators.java is mathematically correct:

// Delta expansion: d(A★B★C) = dA★I(B)★I(C) + I(A)★dB★I(C) + I(A)★I(B)★dC
for (int i = 0; i < inputs.size(); i++) {
    List<OutputPort> joinInputs = new ArrayList<>(integratorOutputs);
    joinInputs.set(i, inputs.get(i));  // Replace i-th integral with delta
    // Create join with this combination
}

This correctly implements the multivariate chain rule for N-way joins, ensuring incremental correctness.


Issues & Recommendations

⚠️ Bot Review Comment is Incorrect

The bot flagged DBSPStarJoinIndexOperator.java:53 suggesting:

// Bot claimed this is wrong:
CalciteEmptyRel.INSTANCE, info.getZsetType(), ...

// Bot suggested:
CalciteEmptyRel.INSTANCE, info.getIndexedZsetType(), ...

This is actually CORRECT as-is. The CommonInfo record has:

public DBSPTypeIndexedZSet getIndexedZsetType() {
    return this.outputType.to(DBSPTypeIndexedZSet.class);
}

The fromJson() method passes outputType (which IS an IndexedZSet for this operator), so calling .getIndexedZsetType() or .getZsetType().to(IndexedZSet.class) achieves the same result. The current code is fine.

🔍 Minor Observations

  1. Debug Output (CreateStarJoins.java:277):

    System.out.println("Detected a star join");

    Consider replacing with proper logging framework (e.g., Logger.info()) for production code.

  2. EXPAND_STAR Flag (ImplementJoins.java:41):

    static final boolean EXPAND_STAR = false;

    This is currently hardcoded. Once DBSP support is merged, consider:

    • Making this a compiler option
    • Or removing the expansion code entirely if no longer needed
  3. Code Documentation:
    The PR description mentions "star_join_flatmap" will be added later. Consider adding TODO comments in the code where this extension point would fit.

Test Coverage


Performance & Correctness

Incremental Semantics: ✅ Correct

  • Delta expansion properly implements N-term sum for N inputs
  • Each term has exactly one delta input with (N-1) integrals
  • Mathematical foundation sound

Memory Management: ✅ Good

  • No obvious memory leaks
  • Proper operator chaining and cleanup

Type Safety: ✅ Strong

  • Compile-time checks enforce indexed inputs with same key type
  • Runtime validation in constructors
  • Function result types validated against output types

Summary

This PR successfully introduces star join operators with:

  • ✅ Clean architectural design following existing patterns
  • ✅ Mathematically correct delta expansion for incremental processing
  • ✅ Comprehensive visitor pattern integration
  • ✅ Proper optimization pass support
  • ✅ Good test coverage

Recommended Actions:

  1. Replace System.out.println with proper logging
  2. Consider making EXPAND_STAR configurable or remove when DBSP support merges
  3. Add TODO comments for future star_join_flatmap extension

No blocking issues found. The implementation is solid and ready for integration once DBSP runtime support is available.

--- | Branch: star

ryzhyk and others added 3 commits February 12, 2026 23:04
The macros that generate star join operators were not usable outside the DBSP
crate, since they added new methods to the `Stream` type.

We reimplement them to generate regular functions instead.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
…rs from trees of joins

Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
Benchmark star join vs a chain of binary joins.

```
--- Binary joins (chain of join_index + join) ---
  Time: 68.670540125s (25568562 output tuples)
  Total storage size: 1.05 GiB
  Per step: 6.867054012s

--- Star join (inner_star_join5) ---
  Time: 24.710012208s (25568562 output tuples)
  Total storage size: 48.01 MiB
  Per step: 2.47100122s
```

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
This fixes a perf issue in the join operator and a similar but more impactful
issue in `star_join`.

When iterating over the trace cursor, we use map_times to enumerate all (time,
weight) tuples associated with a value. Instead of computing their total weight
and evaluating the join function once for the entire value, we evaluated the
join function every time. When doing this in the context of a star join, we had
to also enumerate all other trace (time, weight) tuples, leading to an
exponential (in the number of joins) blow up in the amount of work per value if
the trace contains the same value with different weights in multiple batches.
This is probably not very common in practice (except for the case where the
weights add up to 0, which we don't need to worry about here), but when it
happens it's bad.

The implementation introduces a new cursor method `weight_checked`, which is
equiavalent, but more ergonomic and efficient than:

```rust
let mut total_w = 0;
cursor.map_times(|_t, w| total_w += **w);
```

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
The star join operator failed to check for out-of-bound cursor.

Coincidentally, this did not affect anything, but the issue was revealed by
debug checks added by the previous commit.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants