[dbsp] Wait up to an hour for inter-host exchange to complete. #5622
It can take an arbitrary amount of time for exchange to complete, given that steps have arbitrary size and exchange doesn't necessarily run in the same order in every worker. This relaxes the deadline from 10 seconds to 1 hour.

Possibly this is a solution to the DeadlineExceeded errors that have been occasionally reported to me in multihost (it will at least eliminate a too-short deadline as the problem).

Signed-off-by: Ben Pfaff <blp@feldera.com>
mihaibudiu
left a comment
I don't see the 10 seconds in the previous code.
Is there a legit case where even 1 hour is not enough?
It's the default for tarpc: https://docs.rs/tarpc/latest/src/tarpc/context.rs.html#109-121
In theory, I suppose.

10 seconds sounds like too little even for intra-host exchange.
@blp if there is legitimately a network issue between the hosts, will it take an hour to detect the failure and restart the step?

TCP should find the problem long before that. We need a strategy for detecting network partitions, but this probably isn't it.
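
For readers following along, here is a minimal std-only sketch of the deadline change discussed above. It is not the code in this PR and does not use tarpc itself: `RpcContext`, `with_timeout`, `is_expired`, `exchange_context`, and both constants are hypothetical names made up for illustration. Only the two values (the 10-second default and the 1-hour replacement) come from the discussion.

```rust
use std::time::{Duration, Instant};

/// Default per-request deadline mentioned in the thread (10 seconds).
const DEFAULT_RPC_TIMEOUT: Duration = Duration::from_secs(10);

/// Relaxed deadline for inter-host exchange (1 hour).
const EXCHANGE_TIMEOUT: Duration = Duration::from_secs(60 * 60);

/// Hypothetical stand-in for an RPC request context that carries a deadline.
/// This is an assumption for illustration, not tarpc's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RpcContext {
    deadline: Instant,
}

impl RpcContext {
    /// Builds a context whose deadline is `timeout` after `now`.
    fn with_timeout(now: Instant, timeout: Duration) -> Self {
        Self { deadline: now + timeout }
    }

    /// Returns true once `now` has reached or passed the deadline.
    fn is_expired(&self, now: Instant) -> bool {
        now >= self.deadline
    }
}

/// Context for an exchange request: same shape as the default one,
/// but with the deadline pushed out to one hour.
fn exchange_context(now: Instant) -> RpcContext {
    RpcContext::with_timeout(now, EXCHANGE_TIMEOUT)
}
```

With these definitions, a step whose exchange takes 90 seconds would outlive a context built with `DEFAULT_RPC_TIMEOUT` but not one built by `exchange_context`. The sketch only models when the deadline fires; it says nothing about how quickly a broken connection is noticed, which the thread treats as a separate problem.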