For very large datasets, profiling currently reads the whole file through polars. Backing the numeric profile with a t-digest and categoricals with a count-min/HLL sketch would let dsdiff build a profile in a single streaming pass and diff files that do not fit in memory, without changing the PSI semantics.
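
As a rough illustration of the categorical half of that idea, the sketch below builds a count-min sketch in one pass over a stream of values and then computes PSI from the estimated frequencies. It is not dsdiff code: `CountMinSketch`, `sketch_stream`, `psi`, and the `width`, `depth`, and `eps` parameters are all hypothetical names chosen for this example. It also assumes the list of categories to compare is tracked separately, because a count-min sketch cannot enumerate the keys it has seen.

```python
import hashlib
import math
from typing import Iterable


class CountMinSketch:
    """Fixed-size frequency sketch (hypothetical helper, not part of dsdiff).

    Estimates can overcount because of hash collisions, but never undercount.
    """

    def __init__(self, width: int = 2048, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.total = 0
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, value: str):
        # One independent hash per row, derived by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                value.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, value: str) -> None:
        self.total += 1
        for row, col in self._slots(value):
            self.rows[row][col] += 1

    def estimate(self, value: str) -> int:
        # The smallest counter across rows is the least collision-inflated one.
        return min(self.rows[row][col] for row, col in self._slots(value))


def sketch_stream(values: Iterable[str]) -> CountMinSketch:
    """Consume the values once; memory use is fixed by width * depth."""
    sketch = CountMinSketch()
    for value in values:
        sketch.add(value)
    return sketch


def psi(expected: CountMinSketch, actual: CountMinSketch,
        categories: Iterable[str], eps: float = 1e-6) -> float:
    """PSI = sum((q - p) * ln(q / p)) over categories, using sketch estimates."""
    score = 0.0
    for category in categories:
        # Floor each share at eps so an unseen category does not produce log(0).
        p = max(expected.estimate(category) / max(expected.total, 1), eps)
        q = max(actual.estimate(category) / max(actual.total, 1), eps)
        score += (q - p) * math.log(q / p)
    return score


baseline = sketch_stream(["a"] * 90 + ["b"] * 10)
current = sketch_stream(["a"] * 50 + ["b"] * 50)
drift = psi(baseline, current, ["a", "b"])
```

The PSI formula itself is unchanged here; only the source of the per-category counts differs, so the result is approximate to the extent that collisions inflate the estimates. A wider sketch reduces that error at the cost of memory.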