Parquet Incremental Sync #768
… into the parquet table
…ds, interfacing with ConversionSource
I can do the first review for this @the-other-tim-brown @vinishjail97
@vinishjail97 I added some comments on the functions so that the approach is clearer. All of the suggestions above were also addressed in my last commit.
XTable shouldn't be writing any new data or Parquet files; it operates at the metadata level. Can you see this comment for reference? I had written up a few approaches for doing incremental Parquet sync.
@sapienza88 I'm adding a more detailed design and a class-level structure to unblock this PR.
Design Principle
Architecture
Use the file modification time as the commit identifier; with it you can identify which files have been synced and which haven't. Files that have not been synced need to have metadata generated. Future functionality, such as optimizing the sync or handling Parquet files deleted from storage, can be added incrementally; I'm hoping to keep the scope small for this PR.
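To illustrate the idea above, here is a minimal sketch of selecting unsynced files by modification time. It is not XTable's actual code: `ModTimeSyncSketch`, `FileInfo`, `filesToSync`, and `nextCommitIdentifier` are hypothetical names, and the listing is assumed to be already materialized as (path, modification time) pairs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; none of these names come from the XTable codebase.
class ModTimeSyncSketch {

    /** Stand-in for one listed Parquet file: its path and modification time (epoch millis). */
    static final class FileInfo {
        public final String path;
        public final long modificationTime;

        public FileInfo(String path, long modificationTime) {
            this.path = path;
            this.modificationTime = modificationTime;
        }
    }

    /**
     * Returns the files whose modification time is strictly after the last synced
     * commit identifier, ordered oldest first. These are the files that still need
     * metadata generated; no data files are written.
     */
    public static List<FileInfo> filesToSync(List<FileInfo> listing, long lastSyncedModTime) {
        List<FileInfo> pending = new ArrayList<>();
        for (FileInfo file : listing) {
            if (file.modificationTime > lastSyncedModTime) {
                pending.add(file);
            }
        }
        pending.sort(Comparator.comparingLong(f -> f.modificationTime));
        return pending;
    }

    /** The next commit identifier is the newest modification time that was synced. */
    public static long nextCommitIdentifier(List<FileInfo> pending, long lastSyncedModTime) {
        long latest = lastSyncedModTime;
        for (FileInfo file : pending) {
            latest = Math.max(latest, file.modificationTime);
        }
        return latest;
    }
}
```

In this sketch an empty pending list leaves the identifier unchanged, so repeated syncs with no new files do nothing. Deleted files are deliberately not handled, in line with the reduced scope described above.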
…ds using the FileStatus' modifTime attribute
…ificationTime selector
…ppend and 2) filter for sync
| @Test |
Line 184 needs to be updated to include INCREMENTAL as well
… ParquetDataManager and adjusted operator for mostRecentFile + spotless
… ParquetDataManager and adjusted operator for mostRecentFile + spotless + import error fixed
…uetDataManager methods
- ParquetConversionSource: materialize the file listing once per operation in getTableChangeForCommit and getCurrentSnapshot instead of re-listing the filesystem 2-3 times (addresses the 'fetch the list once' review comment).
- ParquetDataManager: remove getParquetDataFileAt and getParquetFilesMetadataAfterTime; both had no production caller (only tests).
- TestParquetDataManager: drop tests for the removed methods and repoint the real-filesystem tests to getCurrentFilesInfo.
getTableChangeForCommit computed filesAdded and the table from one materialized listing but derived the committed sourceIdentifier from a separate parquetDataManager.getMostRecentParquetFile() listing. A file landing between the two listings could advance the committed identifier past a file that was never included in filesAdded, permanently skipping it on the next incremental sync. Use the most recent file from the same snapshot for both the table and the source identifier. Add an assertion in ITParquetConversionSource that the committed identifier matches that snapshot's latest mod time.
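The fix described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `SingleSnapshotSketch`, `FileInfo`, `Change`, and `changeFromSnapshot` are hypothetical names. The point is that the added files and the committed identifier are both derived from the same materialized listing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real classes in the PR are ParquetConversionSource and ParquetDataManager.
class SingleSnapshotSketch {

    /** Stand-in for one listed Parquet file: its path and modification time (epoch millis). */
    static final class FileInfo {
        public final String path;
        public final long modificationTime;

        public FileInfo(String path, long modificationTime) {
            this.path = path;
            this.modificationTime = modificationTime;
        }
    }

    /** Result of one incremental step: the files added and the identifier to commit. */
    static final class Change {
        public final List<FileInfo> filesAdded;
        public final long sourceIdentifier;

        public Change(List<FileInfo> filesAdded, long sourceIdentifier) {
            this.filesAdded = filesAdded;
            this.sourceIdentifier = sourceIdentifier;
        }
    }

    /**
     * Computes both the added files and the committed identifier from ONE listing.
     * Because the identifier is the newest modification time in this same snapshot,
     * it can never move past a file that was left out of filesAdded.
     */
    public static Change changeFromSnapshot(List<FileInfo> snapshot, long lastSyncedModTime) {
        List<FileInfo> added = new ArrayList<>();
        long latest = lastSyncedModTime;
        for (FileInfo file : snapshot) {
            if (file.modificationTime > lastSyncedModTime) {
                added.add(file);
            }
            latest = Math.max(latest, file.modificationTime);
        }
        return new Change(added, latest);
    }
}
```

A file that lands after the snapshot was taken is absent from both `filesAdded` and the identifier calculation, so it is picked up on the next incremental sync rather than skipped.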
What is the purpose of the pull request
Adds incremental syncing ability to the ParquetSource
Brief change log
Verify this pull request