Skip to content

[dbsp] Use std::unstable_sort in merger.#5571

Merged
gz merged 2 commits intomainfrom
std_sort
Feb 6, 2026
Merged

[dbsp] Use std::unstable_sort in merger.#5571
gz merged 2 commits intomainfrom
std_sort

Conversation

@ryzhyk
Copy link
Contributor

@ryzhyk ryzhyk commented Feb 5, 2026

One of the things we do to speed up compilation of pilelines is use a dynamically typed implementation of the vector sorting algorithm. Instead of having to compile a monomorphized implementation of sorting for every time in the pipeline, we have a single implementation that can be parameterized at runtime by a comparison function and the length of the value (the latter is used to swap values in the process of sorting). Unsurprisingly this implementation is less efficient, especially when sorting simple types like (u64, u64). @gz and I have measured the overhead and it seems to range between 4x (when sorting simple types like u64) and 1.5x for a 10-tuple (u64,...,u64).

The main source of the overhead is that the monomorphic implementation cannot rely on compiler optimizations to efficiently copy values (esp small values) and must use std::ptr::copy_nonoverlapping instead.

Sorting is used in two places in the DBSP code:

  1. In the MergeBatcher, when building sorted batches out of unsorted vectors. This one is only instantiated for weighted tuples. It's primarily used in arranging input data into sorted batches and by the join operator, which produces unsorted outputs.
  2. In trait Vector, which is instantiated for almost all types, but is only used in a few random places and is not performance critical.

This PR switches the former to using statically typed sort_unstable_by from the standard library. It appears that the impact on compilation time is <=5% (I've only done very limited testing though). The impact on executable size is even smaller.

As an additional benefit, we now take advantage of unstable sorting (I did not manage to quickly implement unstable sorting in DBSP and gave up on it back in the day, so we were using stable sorting everywhere).

One of the things we do to speed up compilation of pilelines is use a
dynamically typed implementation of the vector sorting algorithm. Instead of
having to compile a monomorphized implementation of sorting for every time in
the pipeline, we have a single implementation that can be parameterized at
runtime by a comparison function and the length of the value (the latter is
used to swap values in the process of sorting). Unsurprisingly this
implementation is less efficient, especially when sorting simple types like
(u64, u64). @gz and I have measured the overhead and it seems to range between
4x (when sorting simple types like u64) and 1.5x for a 10-tuple (u64,...,u64).

The main source of the overhead is that the monomorphic implementation cannot
rely on compiler optimizations to efficiently copy values (esp small values)
and must use std::ptr::copy_nonoverlapping instead.

Sorting is used in two places in the DBSP code:

1. In the MergeBatcher, when building sorted batches out of unsorted vectors.
   This one is only instantiated for weighted tuples.
2. In `trait Vector`, which is instantiated for almost all types, but is
   only used in a few random places and is not performance critical.

This PR switches the former to using statically typed sort_unstable_by from
the standard library. It appears that the impact on compilation time is <=5%
(I've only done very limited testing though).

As an additional benefit, we now take advantage of unstable sorting (I did
not manage to quickly implement unstable sorting in DBSP and gave up on it
back in the day, so we were using stable sorting everywhere).

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
@ryzhyk ryzhyk requested review from Copilot and gz February 5, 2026 22:20
@ryzhyk ryzhyk added DBSP core Related to the core DBSP library performance labels Feb 5, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes sorting performance in DBSP's batch merger by replacing a custom dynamically-typed sorting implementation with Rust's standard library sort_unstable_by. The change addresses a performance overhead (ranging from 1.5x to 4x) introduced by the previous implementation while keeping compilation time impact minimal (≤5%).

Changes:

  • Replaced unstable_sort_by custom function calls with std::slice::sort_unstable_by in consolidation functions
  • Removed import of unstable_sort_by from utils module
  • Maintained identical comparison logic for all sorting operations

@mihaibudiu
Copy link
Contributor

will unstable sort impact determinism?
or is the merger already non-deterministic enough?

Comment on lines 132 to 133
// This line right here is literally the hottest code within the entirety of the
// program. It makes up 90% of the work done while joining or merging anything
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never seen that line come up in a profile :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this comment is not correct. I'll remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can replace it with "should be, but it isn't for some unknown reason"

Copy link
Contributor

@mihaibudiu mihaibudiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I notice that all of them were already unstable

@ryzhyk
Copy link
Contributor Author

ryzhyk commented Feb 5, 2026

will unstable sort impact determinism? or is the merger already non-deterministic enough?

It shouldn't. In fact the function we used here was supposed to implement unstable sort, but it ended up being an alias to the stable sort function.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
@ryzhyk ryzhyk changed the title [dbsp] Use std::unstable sort in merger. [dbsp] Use std::unstable_sort in merger. Feb 5, 2026
@gz gz added this pull request to the merge queue Feb 6, 2026
Merged via the queue into main with commit 8ab9921 Feb 6, 2026
1 check passed
@gz gz deleted the std_sort branch February 6, 2026 03:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

DBSP core Related to the core DBSP library performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants