Skip to content

Reduce the memory usage that is important for ne1024 simulation#414

Merged
billsacks merged 5 commits into
ESCOMP:mainfrom
sjsprecious:use_existing_esmf_mesh
Jul 2, 2026
Merged

Reduce the memory usage that is important for ne1024 simulation#414
billsacks merged 5 commits into
ESCOMP:mainfrom
sjsprecious:use_existing_esmf_mesh

Conversation

@sjsprecious

@sjsprecious sjsprecious commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

This PR introduces some changes in CDEPS that will be used in CTSM later and are critical to reduce memory usage of a simulation at ne1024 resolution. All the changes are done by Claude under my supervisory.

The goal is to avoid the construction of a duplicate ESMF mesh for streams whose data already live on the model grid. When a stream uses mapalgo = 'redist' the stream→model mapping is a 1-to-1 index copy, so the stream mesh and the model mesh are necessarily identical. Previously shr_strdata_init always read the stream mesh file and called ESMF_MeshCreate to build a second, full ESMF mesh — at ne1024 resolution that duplicate (NetCDF mesh-file read + ESMF mesh build) is a large initialization memory and time cost.

All edits are in streams/dshr_strdata_mod.F90:

  1. Reuse the model mesh for redist streams. At the mesh-creation site in shr_strdata_init, when mapalgo == 'redist' (and the stream has a mesh file) the stream mesh is set to the already-built model mesh (sdat%model_mesh) instead of building a new one from file:
if (trim(sdat%stream(ns)%mapalgo) == 'redist' .and. filename /= 'none') then
   call shr_strdata_check_redist_size(sdat, ns, trim(filename), rc=rc)
   if (ChkErr(rc,__LINE__,u_FILE_u)) return
   stream_mesh = sdat%model_mesh
else if (filename /= 'none') then
   stream_mesh = ESMF_MeshCreate(trim(filename), fileformat=ESMF_FILEFORMAT_ESMFMESH, rc=rc)
   if (ChkErr(rc,__LINE__,u_FILE_u)) return
endif

This is a handle copy (the same assignment already used for the file-built mesh), so no second ESMF mesh is constructed. The per-stream stream_mesh is only ever read downstream (there is no ESMF_MeshSet/ESMF_MeshDestroy anywhere in the streams code), so sharing the model-mesh handle is safe — it is never mutated or freed through the stream slot.

  1. New guard shr_strdata_check_redist_size. redist requires the stream and model grids to have the same global size. The new routine opens the stream mesh file and reads only its elementCount dimension — cheap metadata, not the full mesh, so the memory/time saving is preserved — and compares it to the model mesh global size (sdat%model_gsize). On a mismatch it aborts with a clear message stating that mapalgo='redist' requires matching grids and to use a non-redist mapalgo (consf/consd/nn/bilinear) for a different stream grid. If the mesh file is not in ESMF format (no elementCount dimension) the check is skipped and any problem is left to the downstream read. The PIO calls mirror the existing shr_strdata_get_stream_nlev idiom and are collective over the PIO subsystem; every rank holds the same model_gsize and reads the same elementCount, so the comparison is identical on all ranks.

@billsacks billsacks left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a useful feature.

However: Please change the ordering of arguments so that rc remains the last argument in the subroutine definitions - i.e., putting the new stream_mesh_in argument before the rc argument. Please do this for both (1) shr_strdata_init (this is private to this module so the argument order for any callers can easily be changed if needed), and (2) shr_strdata_init_from_inline. (2) is public so in theory could impact callers in other repositories; however, this seems pretty unlikely since there are already optional arguments appearing before rc, so presumably callers are already using named arguments at that point in the call... but this should be verified by doing builds of at least (a) an A compset (all data models) and (b) a B compset (e.g., B1850C_LTso).

@sjsprecious

Copy link
Copy Markdown
Contributor Author

Thanks @billsacks for your comment. I just made your suggested changes.

One thing I want to point out, putting rc as the last argument for the shr_strdata_init subroutine violates the Fortran convention because we now put rc (a mandatory argument) after stream_mesh_in (an optional argument).

@sjsprecious sjsprecious requested a review from billsacks June 26, 2026 04:31
@billsacks

Copy link
Copy Markdown
Member

Thank you for making that change, @sjsprecious .

One thing I want to point out, putting rc as the last argument for the shr_strdata_init subroutine violates the Fortran convention because we now put rc (a mandatory argument) after stream_mesh_in (an optional argument).

I wondered about that, too. But I noticed that there was already an optional argument before rc (stream_name), so I figured this was no worse, and was keeping things consistent. Do you have a reference for this Fortran convention stating that optional arguments should all come after mandatory arguments? I do see that as more intuitive, but from what I can tell, it isn't required... but, from a quick search, I haven't been able to find something definitive one way or the other. Of course, if a required argument comes after optional arguments, you will generally need to use name=value in the caller, but I find that to be a good practice anyway.

What testing have you done on the latest version? If you haven't already, can you please at least test the build for an A compset and a B compset?

@sjsprecious

Copy link
Copy Markdown
Contributor Author

Thanks @billsacks . Here is ECMWF IFS Fortran coding standard where it states "Optional arguments to a routine shall be after non-optional ones" (https://sites.ecmwf.int/docs/ifs-arpege-coding-standards/fortran/rules/L6.html).

However, my understanding is that this is a convention, not a hard requirement. Thus as long as the caller is using the named argument, the current changes should be fine. But if we rely on the caller to use named argument, then your suggested change to put rc as the last argument really does not matter here.

I have run the FHISTC_LTso compset successfully with these changes. Can you share the detailed instructions about building an A compset and a B compset on Derecho? I am happy to test them afterwards.

@jtruesdal

Copy link
Copy Markdown

Thanks @sjsprecious for these changes. I would recommend passing mapalgo as an optional parameter instead of a duplicate mesh_in parameter. If the user explicitly sets mapalgo = 'redist', that algorithm choice should dictate the subroutine's behavior. Passing an optional mesh_in parameter or relying on the lower-level routines to infer what to do based on whether a mesh file string is populated is confusing. If the model is performing a 1-to-1 index redistribution, the routine doesn't not need a mesh parameter at all since the model is just doing an array copy.

I'm also concerned about error checking. I assume the redistribution will fail if the model mesh is a different size from the offline file for the redist option. I'm not sure if the corrective action will be apparent from the existing error. Maybe just verifying the dimensions of the model and mesh file and erroring out with a message letting the user know that when mapalgo='redist' is set the grid dimensions of the model and mesh file must match.

@sjsprecious

Copy link
Copy Markdown
Contributor Author

Thanks @jtruesdal for commenting on the interface change and error checking. I have made the code changes you suggest at af456c0 and let me know if it looks better to you now.

@billsacks billsacks requested a review from mvertens June 29, 2026 22:27
@billsacks

Copy link
Copy Markdown
Member

@mvertens - I'd be very happy to have your eyes on this as well, if you have time... if you don't have time, I at least want you to be aware of it.

@billsacks billsacks left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtruesdal 's points were very good. @sjsprecious thank you for addressing them. The code looks good as far as I can tell, but I'm also requesting a review from @mvertens .

I do have a few requests for testing / documentation before this is merged:

(1) The new change seems better than the original version, but now has the potential to impact more configurations, so I'd like to see additional testing before this is merged. I'm thinking at least aux_cdeps with baseline comparisons and aux_cime_baselines with baseline comparisons (either proactively or keeping a careful eye on it the nightly testing after this is merged). @sjsprecious - if you are unclear on how to do this, can you please work with @fischer-ncar ? @fischer-ncar - also please feel free to give more guidance on what testing would be appropriate here.

(2) I also want to check: how confident are you that the assignment stream_mesh = sdat%model_mesh is safe and correct when the decomposition of the stream differs from the decomposition of the model - i.e., when the redist actually has something to do? (I'm not clear on whether the mesh returned from ESMF_MeshCreate is tied to the decomposition.) If you haven't already done so, can you verify that this is safe through some combination of research and testing (with a test case that covers this situation where the decompositions differ)?

(3) Now that the implementation differs from your original, please change the top-level PR comment accordingly.

Comment on lines +503 to +512
if (trim(sdat%stream(ns)%mapalgo) == 'redist' .and. filename /= 'none') then
! A 'redist' stream is on the same grid as the model: the stream->model mapping is a 1-to-1 index copy,
! so the stream mesh and the model mesh are identical. Reuse the already-built model mesh instead of reading
! the stream mesh file and constructing a duplicate full ESMF mesh -- that duplicate is a large init memory
! and time cost at high resolution.
! Only stream_mesh is read below (no MeshSet/MeshDestroy), so sharing the model-mesh handle is safe.
! First verify the grids really match (redist requires identical global sizes) and error clearly if not.
call shr_strdata_check_redist_size(sdat, ns, trim(filename), rc=rc)
if (ChkErr(rc,__LINE__,u_FILE_u)) return
stream_mesh = sdat%model_mesh

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvertens - I especially would welcome your input on whether the assignment of stream_mesh = sdat%model_mesh seems general and safe here.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks okay to me. There is already a copy being done in the inline call - so this is should be fine as well.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjsprecious - I talked with @mvertens about this. She has convinced me that it's fine. However, the reason this is safe is potentially subtle, and I think warrants an additional comment - which could add to the comment above (or possibly replace some of the text in the existing comment). Can you add something like this, if this sounds good / correct to you?

We call this a redist, but in fact the model decomposition is the same as the stream decomposition, so the redist ends up simply being a copy. This makes it safe to set the stream mesh equal to the model mesh - since both the underlying grids/meshes and their decompositions are the same.

@billsacks

Copy link
Copy Markdown
Member

(2) I also want to check: how confident are you that the assignment stream_mesh = sdat%model_mesh is safe and correct when the decomposition of the stream differs from the decomposition of the model - i.e., when the redist actually has something to do? (I'm not clear on whether the mesh returned from ESMF_MeshCreate is tied to the decomposition.) If you haven't already done so, can you verify that this is safe through some combination of research and testing (with a test case that covers this situation where the decompositions differ)?

I'm reaching out to the ESMF group to get their take on this but I also welcome your thoughts and would be happy to see a test that confirms that this works - though I'm not sure myself what configuration would exercise this.

@sjsprecious

sjsprecious commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @billsacks .

(1) The new change seems better than the original version, but now has the potential to impact more configurations, so I'd like to see additional testing before this is merged. I'm thinking at least aux_cdeps with baseline comparisons and aux_cime_baselines with baseline comparisons (either proactively or keeping a careful eye on it the nightly testing after this is merged). @sjsprecious - if you are unclear on how to do this, can you please work with @fischer-ncar ? @fischer-ncar - also please feel free to give more guidance on what testing would be appropriate here.

Happy to work with @fischer-ncar on these tests. While I am working on other PRs for similar memory fixes, I noticed that I did not have permission to access the baseline on Derecho. Can you and someone grant me the access before @fischer-ncar shares more details about the test setups?

(2) I also want to check: how confident are you that the assignment stream_mesh = sdat%model_mesh is safe and correct when the decomposition of the stream differs from the decomposition of the model - i.e., when the redist actually has something to do? (I'm not clear on whether the mesh returned from ESMF_MeshCreate is tied to the decomposition.) If you haven't already done so, can you verify that this is safe through some combination of research and testing (with a test case that covers this situation where the decompositions differ)?

I have no ideas about this part. If you have any suggested tests or hear back from ESMF team, I am happy to do them on Derecho for confirmation.

(3) Now that the implementation differs from your original, please change the top-level PR comment accordingly.

Done.

@billsacks

Copy link
Copy Markdown
Member

I did not have permission to access the baseline on Derecho

Are you having trouble with read permission or write permission? Spot-checking some baselines, it looks like they should have world read permission. If there are specific directories that you're having trouble accessing, @fischer-ncar can hopefully help with that. If the issue is writing: in general, writing one-off baselines should go somewhere else. (When I need to do both reads and writes of baselines, I create my own baseline directory in scratch space, sym link from there to the pre-existing baselines I want to compare against, and then point to that directory in my space for baseline generation & comparison.)

(2) I also want to check: how confident are you that the assignment stream_mesh = sdat%model_mesh is safe and correct when the decomposition of the stream differs from the decomposition of the model - i.e., when the redist actually has something to do? (I'm not clear on whether the mesh returned from ESMF_MeshCreate is tied to the decomposition.) If you haven't already done so, can you verify that this is safe through some combination of research and testing (with a test case that covers this situation where the decompositions differ)?

I have no ideas about this part. If you have any suggested tests or hear back from ESMF team, I am happy to do them on Derecho for confirmation.

Okay. This feels like an important thing to confirm, either from someone who has a deep enough understanding of the workings of these mesh structures or through figuring out a test that can be done. I want to hold off on bringing this in until we get confirmation through at least one of those means.

@sjsprecious

Copy link
Copy Markdown
Contributor Author

Are you having trouble with read permission or write permission? Spot-checking some baselines, it looks like they should have world read permission. If there are specific directories that you're having trouble accessing, @fischer-ncar can hopefully help with that. If the issue is writing: in general, writing one-off baselines should go somewhere else. (When I need to do both reads and writes of baselines, I create my own baseline directory in scratch space, sym link from there to the pre-existing baselines I want to compare against, and then point to that directory in my space for baseline generation & comparison.)

For my CTSM PR, I did not have read access to the directory /glade/campaign/cgd/tss/ctsm_baselines/ctsm5.4.044. I did not know where the baselines are saved for your suggested aux_cdeps and aux_cime_baselines tests, but I wonder that I will run into a similar issue later, before @fischer-ncar shares more details about the test setup.

Okay. This feels like an important thing to confirm, either from someone who has a deep enough understanding of the workings of these mesh structures or through figuring out a test that can be done. I want to hold off on bringing this in until we get confirmation through at least one of those means.

Thanks @billsacks. I understand your concern and hopefully we can find the answer soon.

@billsacks

Copy link
Copy Markdown
Member

Non-CTSM baselines are here:

/glade/campaign/cesm/cesmdata/cseg/cesm_baselines

I'm not sure about aux_cime_baselines

Comment on lines +503 to +512
if (trim(sdat%stream(ns)%mapalgo) == 'redist' .and. filename /= 'none') then
! A 'redist' stream is on the same grid as the model: the stream->model mapping is a 1-to-1 index copy,
! so the stream mesh and the model mesh are identical. Reuse the already-built model mesh instead of reading
! the stream mesh file and constructing a duplicate full ESMF mesh -- that duplicate is a large init memory
! and time cost at high resolution.
! Only stream_mesh is read below (no MeshSet/MeshDestroy), so sharing the model-mesh handle is safe.
! First verify the grids really match (redist requires identical global sizes) and error clearly if not.
call shr_strdata_check_redist_size(sdat, ns, trim(filename), rc=rc)
if (ChkErr(rc,__LINE__,u_FILE_u)) return
stream_mesh = sdat%model_mesh

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks okay to me. There is already a copy being done in the inline call - so this is should be fine as well.

@jtruesdal

Copy link
Copy Markdown

Thanks the changes look good to me too.

@fischer-ncar

Copy link
Copy Markdown
Collaborator

The aux_cime_baselines are in /glade/campaign/cesm/cesmdata/cesm_baselines/nightly_current. But they are updated nightly with updated externals. So it'll be hard to reproduce.

@billsacks billsacks left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just requesting an additional comment. Besides that, this looks good once the requested testing has been done.

Comment on lines +503 to +512
if (trim(sdat%stream(ns)%mapalgo) == 'redist' .and. filename /= 'none') then
! A 'redist' stream is on the same grid as the model: the stream->model mapping is a 1-to-1 index copy,
! so the stream mesh and the model mesh are identical. Reuse the already-built model mesh instead of reading
! the stream mesh file and constructing a duplicate full ESMF mesh -- that duplicate is a large init memory
! and time cost at high resolution.
! Only stream_mesh is read below (no MeshSet/MeshDestroy), so sharing the model-mesh handle is safe.
! First verify the grids really match (redist requires identical global sizes) and error clearly if not.
call shr_strdata_check_redist_size(sdat, ns, trim(filename), rc=rc)
if (ChkErr(rc,__LINE__,u_FILE_u)) return
stream_mesh = sdat%model_mesh

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjsprecious - I talked with @mvertens about this. She has convinced me that it's fine. However, the reason this is safe is potentially subtle, and I think warrants an additional comment - which could add to the comment above (or possibly replace some of the text in the existing comment). Can you add something like this, if this sounds good / correct to you?

We call this a redist, but in fact the model decomposition is the same as the stream decomposition, so the redist ends up simply being a copy. This makes it safe to set the stream mesh equal to the model mesh - since both the underlying grids/meshes and their decompositions are the same.

@fischer-ncar

Copy link
Copy Markdown
Collaborator

@sjsprecious I'll run the aux_cdeps and aux_cime_baselines tests for you. I'll let you know the results once they finished.

@sjsprecious

Copy link
Copy Markdown
Contributor Author

@sjsprecious I'll run the aux_cdeps and aux_cime_baselines tests for you. I'll let you know the results once they finished.

That is awesome! Thanks @fischer-ncar for your time and help.

@sjsprecious

Copy link
Copy Markdown
Contributor Author

Just requesting an additional comment. Besides that, this looks good once the requested testing has been done.

Thanks @billsacks . I have updated the comment based on your suggsetion.

@billsacks billsacks left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the way you worked this comment in - thank you!

So I feel this is ready once @fischer-ncar reports back on the testing.

@fischer-ncar

Copy link
Copy Markdown
Collaborator

The aux_cdeps and aux_cime_baselines expected passing tests all passed. So looks good.

@sjsprecious

Copy link
Copy Markdown
Contributor Author

The aux_cdeps and aux_cime_baselines expected passing tests all passed. So looks good.

Thanks @fischer-ncar for doing the tests so quickly and confirming the results.

@billsacks billsacks merged commit f13cffb into ESCOMP:main Jul 2, 2026
1 check passed
@sjsprecious sjsprecious deleted the use_existing_esmf_mesh branch July 2, 2026 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants