C++ extensions to a driver for reading virtual Zarr datasets described in JSON metadata format. The goals are to optimize S3 reading performance and the multi-dimensional reconstruction of array datasets.
See issue here: Unidata/netcdf-c#2777
- Enable absolute indexing (abstract chunks such that the index for h5 is automatically mapped)
- Enable local file reading (vs. just AWS remote bucket reading for stream of bytes)
- H5Coro Integration (see SlideRule repository)
- Call chain overview:
main.cpp -> json_parse.h -> kerchunk_read.h -> print_helpers.h -> mult_dim_form.h -> Finish
- main.cpp: main entry point for the program; key calls include json_parse() and kerchunk_read()
- config.h: inputs and settings for the program, e.g. HARDCODED_CHUNK_INDEX, HARDCODED_JSON_PATH
- custom_structs.h: holds custom structs shared across the program (excluding layer_t)
- json_parse.h: given a JSON path, parses out the metadata relevant to all chunks and to index-specific chunks
- json.hpp: nlohmann JSON processing library
- iter_chunk.h: coordinates metadata extraction and runs for multiple chunk indexes
- kerchunk_read.h: given JSON metadata, reads the S3 stream and performs decompression, shuffling, etc. until the original array is obtained. Calls on mult_dim_form.h to regain full dimensions
- mult_dim_form.h: given a flat array, reconstructs the full dimensions as originally stored (bytes are read from the S3 stream as a single flat dimension)
- print_helpers.h: debug printer functions, controlled by the constant DEBUG_PRINT_ON in config.h
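The reconstruction step described for mult_dim_form.h (a flat buffer read from the S3 stream turned back into its stored dimensions) can be sketched roughly as follows. This is only an illustration, assuming the chunk is stored in row-major (C) order; the names row_major_strides, flat_offset, and reshape_2d are hypothetical and are not taken from the header itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the mult_dim_form.h API):
// row-major (C-order) strides, in elements, for a given shape.
std::vector<std::size_t> row_major_strides(const std::vector<std::size_t>& shape) {
    std::vector<std::size_t> strides(shape.size(), 1);
    // Walk from the second-to-last dimension back to the first.
    for (std::size_t i = shape.size(); i-- > 1;) {
        strides[i - 1] = strides[i] * shape[i];
    }
    return strides;
}

// Hypothetical helper: offset of a multi-dimensional index in the flat buffer.
std::size_t flat_offset(const std::vector<std::size_t>& index,
                        const std::vector<std::size_t>& shape) {
    assert(index.size() == shape.size());
    const std::vector<std::size_t> strides = row_major_strides(shape);
    std::size_t offset = 0;
    for (std::size_t d = 0; d < shape.size(); ++d) {
        assert(index[d] < shape[d]);  // index must lie inside the array
        offset += index[d] * strides[d];
    }
    return offset;
}

// Hypothetical helper: rebuild a 2-D array from a flat row-major buffer.
template <typename T>
std::vector<std::vector<T>> reshape_2d(const std::vector<T>& flat,
                                       std::size_t rows, std::size_t cols) {
    assert(flat.size() == rows * cols);
    std::vector<std::vector<T>> out(rows, std::vector<T>(cols));
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            out[r][c] = flat[r * cols + c];
        }
    }
    return out;
}
```

For a shape of {2, 3, 4} the strides come out as {12, 4, 1}, so element (1, 2, 3) sits at flat offset 23. A column-major dataset would need the stride loop reversed.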
- make_kerchunk_refs.ipynb: notebook that generates JSON metadata from a selected S3 object
- range_req_dynamic.ipynb: Python version of the kerchunk process; used to verify and test the C++ addition. Includes S3 byte stream, zlib decompression, unshuffle, dtype processing, and xarray comparison
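The unshuffle step mentioned above can be sketched in C++ as shown here. This is a minimal sketch, assuming the data was written with an HDF5-style byte shuffle filter, which groups the first byte of every element together, then the second byte, and so on. The function name unshuffle_bytes is hypothetical and does not come from kerchunk_read.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the kerchunk_read.h API): reverse a byte shuffle.
// `shuffled` holds the already-decompressed bytes; `elem_size` is the dtype
// size in bytes (e.g. 4 for float32).
std::vector<std::uint8_t> unshuffle_bytes(const std::vector<std::uint8_t>& shuffled,
                                          std::size_t elem_size) {
    assert(elem_size > 0);
    const std::size_t count = shuffled.size() / elem_size;  // whole elements
    std::vector<std::uint8_t> out(shuffled.size());

    // Byte j of element i was stored at position j * count + i.
    for (std::size_t j = 0; j < elem_size; ++j) {
        for (std::size_t i = 0; i < count; ++i) {
            out[i * elem_size + j] = shuffled[j * count + i];
        }
    }

    // Bytes beyond the last whole element are copied through unchanged.
    for (std::size_t k = count * elem_size; k < shuffled.size(); ++k) {
        out[k] = shuffled[k];
    }
    return out;
}
```

With two 4-byte elements, the shuffled bytes {1, 5, 2, 6, 3, 7, 4, 8} unshuffle to {1, 2, 3, 4, 5, 6, 7, 8}. The result can be compared byte-for-byte against the output of the Python notebook as a check.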