[feat] provide save/load checkpoint interfaces#124
Open
dodatboii wants to merge 1 commit into
Open
Conversation
CLA Signature Passdodatboii, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
0oshowero0
reviewed
Jun 15, 2026
0oshowero0
reviewed
Jun 15, 2026
0oshowero0
reviewed
Jun 15, 2026
0oshowero0
reviewed
Jun 15, 2026
0oshowero0
reviewed
Jun 15, 2026
0oshowero0
reviewed
Jun 15, 2026
0oshowero0
reviewed
Jun 15, 2026
0oshowero0
reviewed
Jun 15, 2026
0oshowero0
reviewed
Jun 15, 2026
0oshowero0
reviewed
Jun 15, 2026
0oshowero0
reviewed
Jun 15, 2026
0oshowero0
reviewed
Jun 15, 2026
0oshowero0
reviewed
Jun 15, 2026
0oshowero0
reviewed
Jun 15, 2026
0oshowero0
reviewed
Jun 15, 2026
CLA Signature Passdodatboii, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
1 similar comment
CLA Signature Passdodatboii, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
0oshowero0
reviewed
Jun 16, 2026
| CHECKPOINT_DUMP_RESPONSE = "CHECKPOINT_DUMP_RESPONSE" | ||
| CHECKPOINT_RESTORE = "CHECKPOINT_RESTORE" | ||
| CHECKPOINT_RESTORE_RESPONSE = "CHECKPOINT_RESTORE_RESPONSE" | ||
| SAVE_LOAD_CKPT_ERROR = "SAVE_LOAD_CKPT_ERROR" |
0oshowero0
reviewed
Jun 16, 2026
Comment on lines
+105
to
+108
| CHECKPOINT_DUMP = "CHECKPOINT_DUMP" | ||
| CHECKPOINT_DUMP_RESPONSE = "CHECKPOINT_DUMP_RESPONSE" | ||
| CHECKPOINT_RESTORE = "CHECKPOINT_RESTORE" | ||
| CHECKPOINT_RESTORE_RESPONSE = "CHECKPOINT_RESTORE_RESPONSE" |
Collaborator
There was a problem hiding this comment.
Distinguish between controller zmq event & simple storage zmq event
0oshowero0
reviewed
Jun 16, 2026
Comment on lines
+355
to
+366
| @property | ||
| def storage_checkpoint_required(self) -> bool: | ||
| """Whether storage contents must be checkpointed for correct restore. | ||
|
|
||
| Returns True for in-memory backends (e.g. SimpleStorage) where data | ||
| is lost on restart and must be serialized. Returns False for persistent | ||
| KV backends (e.g. MooncakeStore, Yuanrong) where data survives restarts | ||
| and only controller metadata needs to be saved. | ||
|
|
||
| Subclasses should override this to reflect their actual persistence model. | ||
| """ | ||
| return False |
Collaborator
There was a problem hiding this comment.
This property is strange
0oshowero0
reviewed
Jun 16, 2026
Comment on lines
+368
to
+385
| async def dump_checkpoint(self, output_dir: str) -> list[dict]: | ||
| """Dump all storage units to files in output_dir. | ||
|
|
||
| Returns: | ||
| List of dicts, each with keys: position, storage_unit_id. | ||
|
|
||
| Raises: | ||
| NotImplementedError: If this storage backend does not support checkpoint. | ||
| """ | ||
| raise NotImplementedError(f"{self.__class__.__name__} does not support checkpoint") | ||
|
|
||
| async def restore_checkpoint(self, checkpoint_dir: str, su_info_list: list[dict]) -> None: | ||
| """Restore all storage units from files in checkpoint_dir. | ||
|
|
||
| Args: | ||
| checkpoint_dir: Path to the checkpoint directory containing storage unit files. | ||
| su_info_list: Ordered list of storage unit info dicts from metadata.json. | ||
|
|
Collaborator
There was a problem hiding this comment.
These interfaces are not universal to other backends
0oshowero0
reviewed
Jun 16, 2026
| logger.error(f"[{self.controller_id}]: dump_to_file failed: {e}") | ||
| return False | ||
|
|
||
| def restore_from_file(self, path: str) -> bool: |
Collaborator
There was a problem hiding this comment.
Unify the interface name
Signed-off-by: yxstev <zhangyixiang9@huawei.com>
CLA Signature Passdodatboii, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
tq.save_checkpoint(checkpoint_dir, *, include_storage, metadata)andtq.load_checkpoint(checkpoint_dir)as top-level public APIs.tmpdirectory that is renamed on success and deleted on failure, ensuring no partial checkpoint is left on diskglobal_idx % num_units) so storage unit IDs regenerated across restarts are handled correctlyCheckpoint layout:
Test plan
pytest tests/e2e/test_checkpoint_e2e.py -vinclude_storage=Falsesaves only controller state