
Concurrent encoders #5361

Merged
abhizer merged 7 commits into main from issue5340 on Feb 9, 2026

Conversation

@ryzhyk ryzhyk commented Jan 2, 2026

@ryzhyk ryzhyk added the connectors (Issues related to the adapters/connectors crate) label Jan 2, 2026
@gz gz added the marketing (Relevant for marketing content) label Jan 20, 2026
In preparation for adding support for parallel encoding, wrap the batch passed
to `Encoder::encode` in `Arc`, so it can later be cloned and sent to multiple
workers.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
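
For context, a minimal sketch of the sharing pattern an `Arc`-wrapped batch enables; the `Batch` stand-in below is illustrative, not the real `SerBatchReader` type:

```rust
use std::sync::Arc;
use std::thread;

// Stand-in for a read-only batch; the real type would be `dyn SerBatchReader`.
struct Batch {
    rows: Vec<(String, i64)>,
}

fn main() {
    let batch = Arc::new(Batch {
        rows: vec![("a".into(), 1), ("b".into(), -1)],
    });

    // Each worker gets a cheap clone of the Arc; the batch itself is shared, not copied.
    let handles: Vec<_> = (0..4)
        .map(|worker| {
            let batch = Arc::clone(&batch);
            thread::spawn(move || {
                // A real encoder would serialize its share of `batch` here.
                println!("worker {worker} sees {} rows", batch.rows.len());
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```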
Extend trait SerBatchReader to support parallel encoders:
- SerBatchReader::partition_keys - partitions the batch into approximately equal-sized chunks.
- SerCursor::seek_key_exact - can be used to move the cursor to the start of a partition.
- SerCursor::key() - can be used to compare the current key under the cursor with the end
  of the partition to stop iteration.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
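
A rough, self-contained illustration of how these three methods are meant to combine, using simplified stand-in types rather than the real `SerCursor` API:

```rust
// Illustrative only: a simplified stand-in for the SerCursor-style API,
// showing how partition bounds drive per-worker iteration.
struct Cursor<'a> {
    keys: &'a [u64],
    pos: usize,
}

impl<'a> Cursor<'a> {
    // Simplified analogue of seek_key_exact: position at the partition's first key.
    fn seek_key(&mut self, key: u64) {
        self.pos = self.keys.partition_point(|k| *k < key);
    }
    // Analogue of key(): the key currently under the cursor, if any.
    fn key(&self) -> Option<u64> {
        self.keys.get(self.pos).copied()
    }
    fn step(&mut self) {
        self.pos += 1;
    }
}

fn main() {
    let keys: Vec<u64> = (0..20).collect();
    // Analogue of partition_keys(): boundaries of roughly equal-sized chunks.
    let bounds = [0u64, 7, 14];

    for (i, &start) in bounds.iter().enumerate() {
        let end = bounds.get(i + 1).copied();
        let mut cursor = Cursor { keys: &keys, pos: 0 };
        cursor.seek_key(start);
        let mut count = 0;
        // Stop once the current key reaches the next partition's bound.
        while let Some(k) = cursor.key() {
            if end.is_some_and(|e| k >= e) {
                break;
            }
            count += 1;
            cursor.step();
        }
        println!("partition {i}: {count} keys");
    }
}
```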
All batches are now Send and Sync, so we eliminate the SyncSerBatchReader trait
and add Send + Sync bounds to SerBatchReader.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
@abhizer abhizer force-pushed the issue5340 branch 2 times, most recently from 135be2f to 2e36d67 Compare February 2, 2026 16:05
@ryzhyk ryzhyk left a comment


  • I feel that this framework is still too opinionated. Please consider removing workerpool and using the threadpool crate or something similar.
  • Transports like Kafka support concurrent senders. I think we need to support that in the consumer API. I'll try to come up with something.
  • We need to figure out how to add a benchmark to this or the next PR.

}
}

pub struct SplitCursorBuilder {
Contributor Author:
Why do we need the builder type? Can't next_split return a cursor?

pub struct BatchSplitter {
    batch: Arc<dyn SerBatchReader>,
    bounds: Box<DynVec<DynData>>,
    position: AtomicUsize,
Contributor Author:

Why does this need to be an atomic? I don't think BatchSplitter can be used from multiple threads.

In fact, I'm not sure why we need the BatchSplitter type at all. I haven't read the rest of the PR yet, but the simplest API would be just a function that creates a list of cursors based on a bounds array. Alternatively, if you need more flexibility, you may want to have a function that creates a cursor given start and end bounds. The advantage of that is you don't need a stateful object like BatchSplitter.
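
Sketching the stateless shape suggested here, with placeholder types standing in for the real batch and cursor:

```rust
use std::ops::Range;
use std::sync::Arc;

// Illustrative stand-ins; the real types would be SerBatchReader / SerCursor.
struct Batch {
    keys: Vec<u64>,
}

struct RangeCursor {
    batch: Arc<Batch>,
    range: Range<usize>,
}

// The stateless shape suggested above: no splitter object, just a function
// from (batch, start bound, optional end bound) to a per-partition cursor.
fn cursor_for_bounds(batch: Arc<Batch>, start: u64, end: Option<u64>) -> RangeCursor {
    let lo = batch.keys.partition_point(|k| *k < start);
    let hi = end.map_or(batch.keys.len(), |e| batch.keys.partition_point(|k| *k < e));
    RangeCursor { batch, range: lo..hi }
}

fn main() {
    let batch = Arc::new(Batch { keys: (0..10).collect() });
    let cursor = cursor_for_bounds(Arc::clone(&batch), 3, Some(7));
    println!("partition covers {} of {} keys", cursor.range.len(), cursor.batch.keys.len());
}
```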

Contributor:

Some of this is a remnant of the previous version.
Yes, we don't need the position to be atomic anymore.

Contributor Author:

In that case, let's keep it simple and eliminate this BatchSplitter thing, unless there's something I'm missing.

Implements parallel encoders for the Avro output encoder.
Defines cursor types `SplitCursor<'_>` and `SplitCursorBuilder`.

`SplitCursorBuilder` can be built from a list of bounds for a partition
(typically created by the `partition_keys` method) and the index of the
partition we want a cursor for. This builder type is required to be able
to send cursors safely between threads. The `SplitCursor<'_>` type
requires a reference to the batch, and therefore is not thread safe.
To create a builder, use the method
`SplitCursorBuilder::from_bounds(batch, bounds, idx, format)`; it returns
None if the `idx` partition cannot be created from the given bounds.

`SplitCursor<'_>` can only be created by calling
`SplitCursorBuilder::build()`. The resulting cursor is only valid for its
partition.

Additionally, we create an `AvroParallelEncoder` for each
threadpool worker we want to run. This parallel encoder then sends the
encoded batches back to the main thread via a channel.

All encoding errors, if any, are gathered and returned when the
`AvroEncoder::encode_indexed` method returns.

Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
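
A self-contained sketch of this pattern: the builder owns an `Arc` plus bounds and can cross threads, the cursor is built only inside the worker, and results flow back over a channel. All names below are illustrative placeholders, not the actual `SplitCursorBuilder`/`AvroParallelEncoder` API.

```rust
use std::sync::{mpsc, Arc};
use std::thread;

// Illustrative stand-in for the batch being partitioned.
struct Batch {
    keys: Vec<u64>,
}

// Owns only an Arc plus bounds, so it is Send and can cross threads.
struct PartitionBuilder {
    batch: Arc<Batch>,
    start: usize,
    end: usize,
}

impl PartitionBuilder {
    // The "cursor" (here just a slice) is built only inside the worker thread.
    fn build(&self) -> &[u64] {
        &self.batch.keys[self.start..self.end]
    }
}

fn main() {
    let batch = Arc::new(Batch { keys: (0..100).collect() });
    let (tx, rx) = mpsc::channel();

    for worker in 0..4usize {
        let builder = PartitionBuilder {
            batch: Arc::clone(&batch),
            start: worker * 25,
            end: (worker + 1) * 25,
        };
        let tx = tx.clone();
        thread::spawn(move || {
            // "Encode" the partition and send the result back over the channel,
            // as the parallel encoder is described to do.
            let encoded: Vec<u8> = builder.build().iter().map(|k| *k as u8).collect();
            tx.send((worker, encoded)).unwrap();
        });
    }
    drop(tx);

    for (worker, encoded) in rx {
        println!("worker {worker} produced {} bytes", encoded.len());
    }
}
```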
Adds a benchmark for parallel encoders with the Avro format. Seems like the
sweet spot is 4 workers.

All the code here was vibe-coded with Claude Code.

Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
@abhizer abhizer marked this pull request as ready for review February 8, 2026 20:44
abhizer commented Feb 8, 2026

The sweet spot seems to be 4 workers.

[Screenshot: benchmark results, 2026-02-09 02:30]

Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
@ryzhyk ryzhyk changed the title WIP: Concurrent encoders Concurrent encoders Feb 9, 2026
@ryzhyk ryzhyk enabled auto-merge February 9, 2026 18:36

@gz gz left a comment


can we name the threads in the threadpool?

overall still seems better to just take them 10k at a time and farm it out to the tokio runtime
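
On the thread-naming question: if the `threadpool` crate is used, its builder can name the pool's threads. A hedged sketch; the crate choice and names are assumptions, not what this PR does:

```rust
// Assumes `threadpool = "1.8"` in Cargo.toml.
use threadpool::Builder;

fn main() {
    let pool = Builder::new()
        .num_threads(4)
        .thread_name("avro-encoder".into())
        .build();

    for i in 0..4 {
        pool.execute(move || {
            // The name shows up in debuggers and thread dumps.
            println!("{:?} handling task {i}", std::thread::current().name());
        });
    }
    pool.join();
}
```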

IsNone,
)]
#[archive_attr(derive(Ord, Eq, PartialEq, PartialOrd))]
struct BenchTestStruct {
Contributor:

we have like 10 of these. can we just have one or two structs in the codebase for testing, but actually put all sqllib types in them?

}

while cursor.val_valid() {
    let w = cursor.weight();
Contributor:

it looks like we silently ignore things when weights are not -1 or 1

panic would seem appropriate here?

Contributor Author:

probably. Other weights should be filtered out by the indexed_operation_type() above
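
A tiny illustration of the stricter handling being suggested, panicking on any weight other than +1/-1 (purely illustrative, not the PR's code):

```rust
// Treat any weight other than +1/-1 as a bug instead of silently skipping it.
fn classify(weight: i64) -> &'static str {
    match weight {
        1 => "insert",
        -1 => "delete",
        w => panic!("unexpected weight {w}: expected +1 or -1"),
    }
}

fn main() {
    assert_eq!(classify(1), "insert");
    assert_eq!(classify(-1), "delete");
}
```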

};

match (operation_type, self.update_format.clone()) {
    (None, _) => (),
Contributor:

what does it mean when operation_type is none?

Contributor Author:

Means this is a no-op (nothing's changed).

&mut self.value_buffer,
)?;
}
_ => (),
Contributor:

would this indicate a problem in our code somewhere? does it need some logging?

Contributor Author:

It indicates that this record will be handled below in the w == -1 branch.

This code is not new, it was just moved around in this PR.

///
/// * `num_partitions` - number of partitions to create.
/// * `bounds` - output vector to store the partition boundaries.
fn partition_keys(&self, num_partitions: usize, bounds: &mut DynVec<Self::Key>)
Contributor:

I don't understand why we need to sample keys for this; are there situations where an output view has many values per key?

}

fn data_factory(&self) -> &'static dyn Factory<DynData> {
    self.batch.inner().factories().key_factory()
Contributor:

this is called data_factory but it returns key_factory. is it the right factory?

Contributor:

Yes, it does return the correct factory type, but it might make more sense to rename it.

/// the number of workers to run in parallel.
/// Default: 4
#[serde(default = "default_encoder_workers")]
pub workers: usize,
Contributor:

can we just call this threads instead of workers? it's confusing because we already have workers for dbsp workers

@ryzhyk ryzhyk added this pull request to the merge queue Feb 9, 2026
@abhizer abhizer removed this pull request from the merge queue due to a manual request Feb 9, 2026
Adds documentation for the `workers` parameter on the Avro format
configuration. Default: 4.

Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
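
An illustrative sketch of how a serde default like this behaves; the struct name below is assumed, and only the `workers` field and `default_encoder_workers` come from the snippet above. Assumes serde (with derive) and serde_json as dependencies:

```rust
use serde::Deserialize;

fn default_encoder_workers() -> usize {
    4
}

#[derive(Deserialize, Debug)]
struct AvroEncoderConfig {
    // Falls back to the documented default of 4 when the field is omitted.
    #[serde(default = "default_encoder_workers")]
    workers: usize,
}

fn main() {
    let explicit: AvroEncoderConfig = serde_json::from_str(r#"{ "workers": 8 }"#).unwrap();
    let defaulted: AvroEncoderConfig = serde_json::from_str("{}").unwrap();
    assert_eq!(explicit.workers, 8);
    assert_eq!(defaulted.workers, 4);
    println!("{explicit:?} {defaulted:?}");
}
```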
@abhizer abhizer enabled auto-merge February 9, 2026 20:21
@abhizer abhizer added this pull request to the merge queue Feb 9, 2026
Merged via the queue into main with commit 03b5144 Feb 9, 2026
1 check passed
@abhizer abhizer deleted the issue5340 branch February 9, 2026 21:39
