[checkpoint-sync] start_from_checkpoint ignores existing local checkpoints #5609

@abhizer

Description

Today, when start_from_checkpoint is set to latest, the pipeline always pulls the latest checkpoint from S3, even if a newer local checkpoint exists. This was done to keep the pipeline state deterministic, but for current customer use cases it is the wrong behavior, because a restart discards any progress recorded only in the newer local checkpoint.

Potential solutions
  1. Compare the progress recorded in the local and remote checkpoints and pick the one that is further ahead, for example by comparing the number of steps completed and the number of records ingested. (Preferred)

  2. Allow the user to specify a preference:

    1. prefer_remote: always use the remote checkpoint. This may be suitable for standby pipelines.
    2. prefer_local: always use the local checkpoint, if one is available.
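
To make the two proposals concrete, here is a minimal sketch of the selection logic. It is not the pipeline's actual implementation: `CheckpointInfo`, `Preference`, `pick_checkpoint`, and the `most_progress` name are hypothetical, and only `prefer_remote` and `prefer_local` come from the proposal above. The sketch assumes that progress is compared by steps first and records ingested second, that ties go to the remote checkpoint (which matches today's behavior), and that whichever checkpoint exists is used when the other is missing.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


@dataclass(frozen=True)
class CheckpointInfo:
    """Hypothetical checkpoint summary; field names are illustrative only."""
    source: str            # "local" or "remote"
    steps: int             # steps the pipeline had completed at checkpoint time
    records_ingested: int  # input records ingested at checkpoint time


class Preference(Enum):
    MOST_PROGRESS = "most_progress"  # solution 1 (name is an assumption)
    PREFER_REMOTE = "prefer_remote"  # solution 2.1
    PREFER_LOCAL = "prefer_local"    # solution 2.2


def pick_checkpoint(
    local: Optional[CheckpointInfo],
    remote: Optional[CheckpointInfo],
    preference: Preference = Preference.MOST_PROGRESS,
) -> Optional[CheckpointInfo]:
    """Choose which checkpoint to start from; None if neither exists."""
    # Assumption: if only one checkpoint exists, use it regardless of preference.
    if local is None or remote is None:
        return local if local is not None else remote

    if preference is Preference.PREFER_REMOTE:
        return remote
    if preference is Preference.PREFER_LOCAL:
        return local

    # Solution 1: compare (steps, records) lexicographically.
    # Assumption: on a tie, keep the remote checkpoint, as the pipeline does today.
    local_key = (local.steps, local.records_ingested)
    remote_key = (remote.steps, remote.records_ingested)
    return local if local_key > remote_key else remote


if __name__ == "__main__":
    local = CheckpointInfo("local", steps=120, records_ingested=5000)
    remote = CheckpointInfo("remote", steps=100, records_ingested=4000)
    print(pick_checkpoint(local, remote).source)
    print(pick_checkpoint(local, remote, Preference.PREFER_REMOTE).source)
```

One open design question the sketch glosses over: the two metrics can disagree (more steps but fewer records ingested), so a real implementation would need to define which one wins or use a single monotonic progress marker.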

Metadata

Labels

bug: Something isn't working
high priority: Task should be tackled first, added in the current sprint if necessary
