Conversation
One of the things we do to speed up compilation of pipelines is use a dynamically typed implementation of the vector sorting algorithm. Instead of having to compile a monomorphized implementation of sorting for every type in the pipeline, we have a single implementation that can be parameterized at runtime by a comparison function and the length of the value (the latter is used to swap values in the process of sorting). Unsurprisingly, this implementation is less efficient, especially when sorting simple types like (u64, u64). @gz and I have measured the overhead, and it seems to range between 4x (when sorting simple types like u64) and 1.5x for a 10-tuple (u64,...,u64). The main source of the overhead is that the dynamically typed implementation cannot rely on compiler optimizations to efficiently copy values (especially small values) and must use std::ptr::copy_nonoverlapping instead.

Sorting is used in two places in the DBSP code:

1. In the MergeBatcher, when building sorted batches out of unsorted vectors. This one is only instantiated for weighted tuples.
2. In `trait Vector`, which is instantiated for almost all types, but is only used in a few random places and is not performance critical.

This PR switches the former to using the statically typed sort_unstable_by from the standard library. It appears that the impact on compilation time is <=5% (I've only done very limited testing though). The impact on executable size is even smaller. As an additional benefit, we now take advantage of unstable sorting (I did not manage to quickly implement unstable sorting in DBSP and gave up on it back in the day, so we were using stable sorting everywhere).

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
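To make the tradeoff concrete, here is a minimal, hypothetical sketch (not the actual DBSP code; `sort_erased` and its signature are made up for illustration) of the two approaches: a single type-erased sort driven by a runtime comparator and element size, moving bytes with std::ptr::copy_nonoverlapping, versus the monomorphized sort_unstable_by call that this PR switches the MergeBatcher to.

```rust
use std::cmp::Ordering;
use std::ptr;

/// Hypothetical sketch of a type-erased sort: one non-generic function,
/// parameterized at runtime by the element size and a comparator over raw
/// pointers. Element moves go through `ptr::copy_nonoverlapping`, which the
/// compiler cannot specialize for small types.
unsafe fn sort_erased(
    data: *mut u8,
    len: usize,
    elem_size: usize,
    cmp: &dyn Fn(*const u8, *const u8) -> Ordering,
) {
    // Insertion sort for brevity; a real implementation would do better.
    let mut tmp = vec![0u8; elem_size];
    for i in 1..len {
        let mut j = i;
        while j > 0 {
            let prev = data.add((j - 1) * elem_size);
            let cur = data.add(j * elem_size);
            if cmp(prev as *const u8, cur as *const u8) != Ordering::Greater {
                break;
            }
            // Swap elements j-1 and j byte-by-byte through a temporary buffer.
            ptr::copy_nonoverlapping(cur as *const u8, tmp.as_mut_ptr(), elem_size);
            ptr::copy_nonoverlapping(prev as *const u8, cur, elem_size);
            ptr::copy_nonoverlapping(tmp.as_ptr(), prev, elem_size);
            j -= 1;
        }
    }
}

fn main() {
    let mut v: Vec<(u64, u64)> = vec![(3, 30), (1, 10), (2, 20)];

    // Comparator over untyped pointers, supplied at runtime.
    let cmp = |a: *const u8, b: *const u8| unsafe {
        (*(a as *const (u64, u64))).cmp(&*(b as *const (u64, u64)))
    };

    // Type-erased path.
    unsafe {
        sort_erased(
            v.as_mut_ptr() as *mut u8,
            v.len(),
            std::mem::size_of::<(u64, u64)>(),
            &cmp,
        );
    }
    assert_eq!(v, vec![(1, 10), (2, 20), (3, 30)]);

    // Monomorphized path: the standard library generates a sort
    // specialized for the element type, including its copies.
    v.sort_unstable_by(|a, b| a.cmp(b));
}
```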
Pull request overview
This PR optimizes sorting performance in DBSP's batch merger by replacing a custom dynamically-typed sorting implementation with Rust's standard library sort_unstable_by. The change addresses a performance overhead (ranging from 1.5x to 4x) introduced by the previous implementation while keeping compilation time impact minimal (≤5%).
Changes:
- Replaced custom `unstable_sort_by` function calls with `std::slice::sort_unstable_by` in consolidation functions
- Removed import of `unstable_sort_by` from utils module
- Maintained identical comparison logic for all sorting operations
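As a rough illustration of the call-site change, here is a hypothetical `consolidate` function (the real DBSP consolidation functions have different signatures and additional logic, e.g. dropping entries whose weights cancel; only the sort call is the point here):

```rust
// Hypothetical consolidation step over (key, weight) pairs: sort by key, then
// merge adjacent entries with equal keys by adding their weights.
fn consolidate<K: Ord, W: std::ops::AddAssign + Copy>(pairs: &mut Vec<(K, W)>) {
    // Previously (conceptually): a custom `unstable_sort_by` helper from the
    // utils module, which turned out to alias the stable sort.
    // Now: the standard library's monomorphized unstable sort.
    pairs.sort_unstable_by(|(k1, _), (k2, _)| k1.cmp(k2));

    // Combine runs of equal keys (dedup_by passes the later element first).
    pairs.dedup_by(|(k_later, w_later), (k_kept, w_kept)| {
        if k_kept == k_later {
            *w_kept += *w_later;
            true // drop the later duplicate after folding its weight in
        } else {
            false
        }
    });
}

fn main() {
    let mut pairs = vec![("b", 1i64), ("a", 2), ("b", 3)];
    consolidate(&mut pairs);
    assert_eq!(pairs, vec![("a", 2), ("b", 4)]);
}
```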
will unstable sort impact determinism?
// This line right here is literally the hottest code within the entirety of the
// program. It makes up 90% of the work done while joining or merging anything
I've never seen that line come up in a profile :-)
yeah, this comment is not correct. I'll remove it.
you can replace it with "should be, but it isn't for some unknown reason"
mihaibudiu left a comment
Actually I notice that all of them were already unstable
It shouldn't. In fact, the function we used here was supposed to implement unstable sort, but it ended up being an alias to the stable sort function.
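For reference, a minimal stand-alone illustration (made-up data, not DBSP code) of what stability affects: sort_by keeps elements that compare equal in their original relative order, while sort_unstable_by is free to reorder them. For plain data types where elements that compare equal are identical, the sorted output is the same either way.

```rust
fn main() {
    // Two entries share the key 1 but carry different payloads.
    let mut stable = vec![(1, "first"), (2, "x"), (1, "second")];
    let mut unstable = stable.clone();

    // Stable: (1, "first") is guaranteed to stay ahead of (1, "second").
    stable.sort_by(|a, b| a.0.cmp(&b.0));

    // Unstable: the relative order of the two key-1 entries is unspecified.
    unstable.sort_unstable_by(|a, b| a.0.cmp(&b.0));

    println!("stable:   {stable:?}");
    println!("unstable: {unstable:?}");
}
```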