Parquet Incremental Sync by sapienza88 · Pull Request #768 · apache/incubator-xtable

sapienza88 · 2025-12-10T19:54:49Z

What is the purpose of the pull request

Adds incremental syncing ability to the ParquetSource

Brief change log

Adds a new class ParquetDataManager.java for handling the fetching of data files for Parquet Source
Updates IT to include incremental source

Verify this pull request

new tests added to ITParquetConversionSource

rahil-c · 2025-12-15T16:19:52Z

I can do the first review for this @the-other-tim-brown @vinishjail97

sapienza88 · 2025-12-17T19:55:39Z

@vinishjail97 I added some comments on the functions so that the approach is clearer. All above suggestions were also taken into account in my last commit.

vinishjail97 · 2025-12-22T19:46:35Z

XTable shouldn't be writing any new data or Parquet files; it operates at a metadata level. Can you see this comment for reference? I had written up a few approaches for incremental Parquet sync.
#550 (comment)

vinishjail97 · 2025-12-29T07:46:07Z

@sapienza88 I'm adding a more detailed design and a class-level structure to unblock this PR.

Design Principle
XTable operates at a metadata level only. The current PR approach of writing new Parquet files with filtered data is incorrect. XTable should:

Discover existing Parquet files from storage
Generate table format metadata (Hudi, Iceberg, Delta) for those files
NEVER write new Parquet files or transform data.

Architecture

  ┌────────────────────────────────────────────────────────────┐
  │                  ParquetConversionSource                   │
  │  - Uses ParquetFileDiscovery to find files                 │
  │  - Converts file metadata to InternalDataFile              │
  │  - Returns snapshots and table changes                     │
  └────────────────────────────────────────────────────────────┘
                              │
                              ▼
  ┌────────────────────────────────────────────────────────────┐
  │              ParquetFileDiscovery (new class)              │
  │  - Lists all .parquet files from filesystem                │
  │  - Filters files by modification time                      │
  │  - Returns lightweight file metadata                       │
  └────────────────────────────────────────────────────────────┘
                              │
                              ▼
  ┌────────────────────────────────────────────────────────────┐
  │            FileSystem (HDFS/S3/GCS/Azure)                  │
  │  - fs.listFiles(basePath, recursive=true)                  │
  └────────────────────────────────────────────────────────────┘

Use the file modification time as the commit identifier; this lets you identify which files have been synced and which haven't. Files that have not been synced need to have metadata generated. Future functionality, such as optimizing the listing and handling Parquet files deleted from storage, can be added incrementally; I'm hoping to keep the scope small for this PR.
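As a rough illustration of the discovery layer described above, here is a minimal sketch that lists `.parquet` files recursively and keeps only those modified after the last synced timestamp. It is not the PR's code: the class `ParquetFileDiscoverySketch`, the `FileInfo` holder, and both method names are hypothetical, and `java.nio.file` stands in for the Hadoop `FileSystem` API so the example stays self-contained. The strict `>` comparison is also an assumption; a file sharing the exact timestamp of the last synced file would be skipped by it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch: discovers Parquet files and filters them by modification time. */
final class ParquetFileDiscoverySketch {

  /** Lightweight file metadata only; the file contents are never read or rewritten. */
  static final class FileInfo {
    final Path path;
    final long sizeBytes;
    final long modificationTimeMs;

    FileInfo(Path path, long sizeBytes, long modificationTimeMs) {
      this.path = path;
      this.sizeBytes = sizeBytes;
      this.modificationTimeMs = modificationTimeMs;
    }
  }

  /** Recursively lists every .parquet file under basePath, oldest first. */
  static List<FileInfo> listParquetFiles(Path basePath) throws IOException {
    try (Stream<Path> paths = Files.walk(basePath)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString().endsWith(".parquet"))
          .map(ParquetFileDiscoverySketch::toFileInfo)
          .sorted(Comparator.comparingLong((FileInfo f) -> f.modificationTimeMs))
          .collect(Collectors.toList());
    }
  }

  /** Keeps only files modified strictly after the last synced commit identifier. */
  static List<FileInfo> filesModifiedAfter(List<FileInfo> files, long lastSyncedMs) {
    return files.stream()
        .filter(f -> f.modificationTimeMs > lastSyncedMs)
        .collect(Collectors.toList());
  }

  private static FileInfo toFileInfo(Path p) {
    try {
      return new FileInfo(p, Files.size(p), Files.getLastModifiedTime(p).toMillis());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch the caller lists once, then passes the same list to `filesModifiedAfter` with the previously committed modification time, so only metadata for unsynced files needs to be generated.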

the-other-tim-brown · 2026-03-15T18:00:23Z

    }
  }

+  @Test


Line 184 needs to be updated to include INCREMENTAL as well

…uetDataManager methods
- ParquetConversionSource: materialize the file listing once per operation in getTableChangeForCommit and getCurrentSnapshot instead of re-listing the filesystem 2-3 times (addresses the 'fetch the list once' review comment).
- ParquetDataManager: remove getParquetDataFileAt and getParquetFilesMetadataAfterTime; both had no production caller (only tests).
- TestParquetDataManager: drop tests for the removed methods and repoint the real-filesystem tests to getCurrentFilesInfo.

getTableChangeForCommit computed filesAdded and the table from one materialized listing but derived the committed sourceIdentifier from a separate parquetDataManager.getMostRecentParquetFile() listing. A file landing between the two listings could advance the committed identifier past a file that was never included in filesAdded, permanently skipping it on the next incremental sync. Use the most recent file from the same snapshot for both the table and the source identifier. Add an assertion in ITParquetConversionSource that the committed identifier matches that snapshot's latest mod time.
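The race described in this commit message can be illustrated with a small sketch in which the added files and the committed identifier both come from one materialized listing. This is a simplified stand-in, not the merged code: `SingleListingSyncSketch`, `FileInfo`, `TableChange`, and `changeSince` are hypothetical names, and the identifier is modeled as a plain modification time in milliseconds.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one listing drives both filesAdded and the source identifier. */
final class SingleListingSyncSketch {

  static final class FileInfo {
    final String path;
    final long modificationTimeMs;

    FileInfo(String path, long modificationTimeMs) {
      this.path = path;
      this.modificationTimeMs = modificationTimeMs;
    }
  }

  static final class TableChange {
    final List<FileInfo> filesAdded;
    final long sourceIdentifier;

    TableChange(List<FileInfo> filesAdded, long sourceIdentifier) {
      this.filesAdded = filesAdded;
      this.sourceIdentifier = sourceIdentifier;
    }
  }

  /**
   * Computes the added files and the new identifier from the same listing. Because the
   * identifier is the largest modification time among files in this listing, it cannot
   * advance past a file that was left out of filesAdded.
   */
  static TableChange changeSince(List<FileInfo> listing, long lastSyncedMs) {
    List<FileInfo> added = new ArrayList<>();
    long latest = lastSyncedMs; // unchanged when nothing new has landed
    for (FileInfo f : listing) {
      if (f.modificationTimeMs > lastSyncedMs) {
        added.add(f);
        latest = Math.max(latest, f.modificationTimeMs);
      }
    }
    return new TableChange(added, latest);
  }
}
```

A file that lands after the listing was taken is simply absent from both `filesAdded` and the identifier computation, so it is picked up by the next incremental sync rather than skipped.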

given a parquet file return data from a certain modification time

e541a71

sapienza88 changed the title ~~Parquet Incremental Sync: Given a parquet file return data from a certain modification time~~ Parquet Incremental Sync Dec 10, 2025

Selim Soufargi added 3 commits December 13, 2025 18:20

create the path based on the partition then inject the file to append…

15e282a

… into the parquet table

Handle case of path construction with file partitioned over many fiel…

2ee71c9

…ds, interfacing with ConversionSource

test append Parquet file into table init

6032e5f

add function to test schema equivalence before appending

f6fdc72

vinishjail97 self-requested a review December 16, 2025 08:31

Selim Soufargi added 2 commits December 16, 2025 12:59

construct path to inject to based on partitions

a94c3f3

fix imports

f8bdbfe

vinishjail97 requested changes Dec 17, 2025

refactoring (lombok, logs, javadocs and function and approach comment…

c04a983

…ing)

Selim Soufargi added 15 commits January 1, 2026 18:03

use appendFile to append a file into a table while tracking the appen…

5f2541e

…ds using the FileStatus' modifTime attribute

find the files that satisfy to the time condition

47e7076

treat appends as separate files to add in the target partition folder

fbb09ec

update approach: selective block compaction

fe19a60

update approach: added a basic test to check data selection using mod…

da7f300

…ificationTime selector

fix append based on partition value

a8730b7

fix test with basic example where partitions are not considered

d19ccbf

fix test with basic example where partitions are not considered2

aecb204

fix test with basic example where partitions are not considered3

0ec8cbb

test with time of last append is now

9cb75df

test appendFile with Parquet: TODO test with multiple partitions 1) a…

9e125f2

…ppend and 2) filter for sync

merge recursively one partition files

233ca77

fix paths for files to append

b4cba5a

fix bug of appending file path

a564b29

fix bug of schema

d1ceafb

Selim Soufargi added 3 commits March 15, 2026 17:09

spotless

5364315

spotless imports

9e27f44

spotless imports

9c0ca1d

the-other-tim-brown reviewed Mar 15, 2026

Selim Soufargi and others added 6 commits March 15, 2026 20:15

add syncMode Incr

51e8dd9

revert changes

f46305d

add syncMode Incr

5733dbe

spotless:apply

d69d944

update naming

2e826e1

minimize diff with main

7975980

vinishjail97 reviewed Apr 20, 2026

Selim Soufargi and others added 14 commits April 20, 2026 22:42

added test for tableChangeAddedFiles and changed path construction of…

292bfbe

… ParquetDataManager and adjusted operator for mostRecentFile + spotless

added test for tableChangeAddedFiles and changed path construction of…

d7076f6

… ParquetDataManager and adjusted operator for mostRecentFile + spotless + import error fixed

tests fixed

daae5f2

tests fixed

df448cf

tests fixed

ac3f534

log for actual changes

47d0d82

log for actual changes

1d70041

tests fixed

1a18297

spotless

a8f713d

removed comment

350b74f

spotless

75fcdde

address PR comments

87c946d

fix path

f0a5623

add safety check around handling of out of sync targets with parquet …

83df9da

…source

vinishjail97 reviewed Jun 5, 2026

vinishjail97 added 2 commits June 5, 2026 12:05

vinishjail97 approved these changes Jun 5, 2026

vinishjail97 merged commit 754fd27 into apache:main Jun 5, 2026
2 checks passed

sapienza88 commented Dec 10, 2025 • edited by the-other-tim-brown

4 participants