Reduce the memory usage that is important for ne1024 simulation #665

Open
sjsprecious wants to merge 5 commits into ESCOMP:main from sjsprecious:add_lnd2rof_map_files

@sjsprecious

Contributor

This PR reduces memory usage, which is critical for ultra-high-resolution simulations such as ne1024. All the code changes were made by Claude under my supervision.

The goal is to let the lnd→rof conservative coupling maps be read from offline weight files instead of being computed online, which OOM-kills the job at ne1024 during DataInitialize (ESMF_FieldRegridStore → GeomRendezvous → Zoltan_RCB). This is the implementation of the lnd2rof_consf OOM fix. There are two distinct lnd→rof maps handled, plus an aoflux memory optimization.

  1. Flux/runoff coupling map — bug fix

In main, the lnd2rof_map attribute (driven by the pre-existing LND2ROF_FMAPNAME XML var) was already read into the lnd2rof_map variable — but the five addmap_from calls for the runoff fields (Flrl_rofsur, Flrl_rofi, Flrl_rofgwl, Flrl_rofsub, Flrl_irrig) hardcoded unset, so the file was silently ignored and the map was always built online.
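The pattern behind this fix can be sketched in Python. This is only an illustration under stated assumptions: `register_runoff_maps`, the `add_map` callback, and the example path are hypothetical stand-ins, not the CMEPS `addmap_from` interface. Only the five field names and the `unset` sentinel come from the description above.

```python
# Field names taken from the PR description.
RUNOFF_FIELDS = [
    "Flrl_rofsur",
    "Flrl_rofi",
    "Flrl_rofgwl",
    "Flrl_rofsub",
    "Flrl_irrig",
]


def register_runoff_maps(add_map, lnd2rof_map="unset"):
    """Register every lnd->rof runoff field with the configured map file.

    `add_map` is a hypothetical callback taking (field_name, mapfile).
    The bug described above corresponds to passing the literal "unset"
    here instead of forwarding the `lnd2rof_map` variable.
    """
    for name in RUNOFF_FIELDS:
        add_map(name, lnd2rof_map)  # forward the variable, not a literal


# Usage: record what each field was registered with.
registered = {}
register_runoff_maps(registered.__setitem__, "/path/to/lnd2rof_weights.nc")
```

With the default argument, every field is still registered with `unset`, which mirrors the unchanged online-mapping behavior when no file is configured.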

  2. Fraction-init map — new feature

This is a separate map (dstarea, not fracarea) used during med_fraction_init, with no pre-existing namelist hook. The fix adds full new plumbing:

  • New XML entry LND2ROF_FRAC_FMAPNAME (default unset, env_run.xml, group run_domain) in config_component.xml.
  • New driver namelist lnd2rof_fmap (modify_via_xml="LND2ROF_FRAC_FMAPNAME") in namelist_definition_drv.xml.
  • med_fraction_mod.F90: reads the lnd2rof_fmap attribute via NUOPC_CompAttributeGet; if it is present, set, and the file exists on disk (inquire), it calls med_map_routehandles_init with mapfile= to read offline weights; otherwise it falls back to the original online path — no behavior change when unset. Adds a use NUOPC import and local vars (isPresent/isSet/lexist, lnd2rof_fmap).
  • med_map_mod.F90: threads a new optional, intent(in) :: mapfile argument through med_map_routehandles_initfrom_fieldbundle, forwarding it to the field-level routine (which already accepted mapfile) when present. Backward-compatible.
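The fallback decision described for med_fraction_mod.F90 can be sketched as below. This is a Python illustration of the logic only, not the Fortran implementation: `choose_map_source` and its arguments are hypothetical names, with `is_present` and `is_set` standing in for the flags returned by the attribute query and `os.path.isfile` standing in for the Fortran `inquire`.

```python
import os


def choose_map_source(attr_value, is_present, is_set):
    """Decide between offline weights and online weight generation.

    Returns ("offline", path) only when the attribute is present, set,
    and names a file that exists on disk; otherwise ("online", None),
    which corresponds to the original behavior.
    """
    if is_present and is_set and os.path.isfile(attr_value):
        return ("offline", attr_value)
    return ("online", None)
```

Because the sentinel value `unset` does not name an existing file, it falls through to the online path without a special case.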
  3. aoflux mesh memory optimization

Replaces is_local%wrap%aoflux_mesh = ESMF_MeshCreate(lmesh, rc=rc) with is_local%wrap%aoflux_mesh = lmesh — reuses the ocean field's existing mesh handle instead of allocating a duplicate full ESMF mesh per rank at ne1024pg2. Safe because lmesh persists in FBArea(compocn) and aoflux_mesh is never destroyed.

@billsacks billsacks requested review from billsacks and mvertens June 26, 2026 00:05
@billsacks

Member

@mvertens - If you have a chance, I'd like to hear any thoughts you have on this change. (I have just skimmed it so far - haven't had a chance to look closely yet.)

@billsacks billsacks left a comment

Member

The overall changes here look good to me. I appreciate your plugging in the use of lnd2rof_map, which seems like it previously wasn't hooked up. And I think I see why you needed to introduce a separate lnd2rof_fmap (though see my comments asking for this to be made more explicit in some documentation).

I do have a few requests... many about editing some comments, but a couple slightly more substantial... but still overall minor - for the most part this looks good to me.

Once you make the final changes, I'd like to see that at least one or two tests have been run with baseline comparisons to verify that these changes work and are bit-for-bit in out-of-the-box configurations. One test that I think would cover all of these changes would be a B compset test with the aoflux grid set to ogrid (the setting of ogrid is needed to cover the changes in med_phases_aofluxes_mod, and should still cover the other changes here). I'd like to see that run with comparisons against a baseline.

Comment thread mediator/med_phases_aofluxes_mod.F90 Outdated
Comment thread mediator/med_fraction_mod.F90 Outdated
Comment thread mediator/med_fraction_mod.F90 Outdated
Comment thread mediator/med_fraction_mod.F90
Comment thread mediator/med_fraction_mod.F90 Outdated
Comment thread cime_config/namelist_definition_drv.xml Outdated
Comment thread cime_config/config_component.xml Outdated
Comment thread cime_config/config_component.xml Outdated
Comment thread mediator/med_fraction_mod.F90 Outdated
@sjsprecious sjsprecious requested a review from billsacks June 26, 2026 20:26
@jtruesdal

Contributor

@sjsprecious and @billsacks I was also thinking about a test to show that the offline and online mappings are identical, but in the past I've run into roundoff errors in the way the model calculates the mesh compared to offline. If that's the case here, you may want to implement some type of tolerance. It may be a non-issue if they are identical in this case. Other than that, I think with the current set of updates the code looks good to me.

@sjsprecious

Contributor Author

Thanks @jtruesdal for your comments. I think one thing that can contribute to the difference here is that users may use different compiler/ESMF versions when generating the offline mesh.

Once Bill provides the instructions about running the tests he suggested, I will let you know at least whether the code changes affect the current behavior.

@billsacks billsacks left a comment

Member

The latest set of changes looks good - thank you @sjsprecious for making those changes.

A piece I am less confident about is whether the assignment of aoflux_mesh to lmesh (as opposed to doing an ESMF_MeshCreate there) is always safe. I am asking the ESMF group about that, along with the similar change you made in CDEPS.

Regarding testing, a good test would be SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid, since this covers the change to med_phases_aofluxes_mod in addition to your other changes. This is in the prebeta test list. It hasn't been run on recent versions of the code, so you'll need to generate baselines and then run with your latest changes with comparisons against baselines.

It would also be good to run aux_cime_baselines, or at least keep a careful eye on aux_cime_baselines when they are run in their nightly run following the merge of this PR. As with the associated CDEPS PR, @fischer-ncar could provide some input here on testing.

Regarding @jtruesdal 's point: I wasn't thinking of something as rigorous as testing behavior with online vs. offline generation of mappings - I'm mainly wanting to confirm that the changes here don't break anything or change answers for the case where we're still using online mapping.

@billsacks

Member

I talked with @mvertens today.

The good news is that she feels that this is NOT a problem:

> A piece I am less confident about is whether the assignment of aoflux_mesh to lmesh (as opposed to doing an ESMF_MeshCreate there) is always safe. I am asking the ESMF group about that, along with the similar change you made in CDEPS.

However, @mvertens raised a question / concern about having two different mapping files: LND2ROF_FRAC_FMAPNAME and LND2ROF_FMAPNAME. I think I see why this was done, but @mvertens questioned whether this was necessary.

Our initial question on this is: What is the difference in how these mapping files are generated? My sense is that LND2ROF_FMAPNAME should be generated with --norm_type fracarea whereas LND2ROF_FRAC_FMAPNAME should be generated with --norm_type dstarea (or no --norm_type, since dstarea is the default). Is that what was done? If so, can you add some documentation of this difference to the relevant xml and namelist definition documentation? We are also wondering if the resulting mapping files actually differ based on using these different --norm_type flags. Can you comment on this?

This discussion led me to find these related issues:

The summary of those issues is: Even though we currently have different mapping types for different lnd2rof mappings - and that's what you're supporting in the changes in this PR - this is considered to be a bug rather than a feature, and in fact we want to be using the same mapping type for all the mappings. So now I'm not sure what the right path forward is. I can see a few paths forward, which I lay out below... this probably needs some more discussion to figure out how we want to proceed.

(1) Keep the current changes and merge this as is

Pros:

  • Allows high-res runs to be done from main branch
  • Specifying pre-made mapping files will give the same behavior as you currently get

Cons:

  • Confusing for people using this, since the differences between the two LND2ROF maps are subtle, extra work is needed to create both maps, and a user needs to know that they need to create both maps
  • This will probably be reworked in the future, so we'll need to rework the changes here at that point

(2) Make an initial change where we change all of the lnd2rof mapping to be consd instead of consf, then change this PR to use a single LND2ROF mapping file and merge it

Pros:

  • Allows high-res runs to be done from main branch
  • Specifying pre-made mapping files will give the same behavior as online mapping (though note that this behavior will differ from the current behavior)
  • Simplifies the LND2ROF mapping files
  • Moves us in the direction we want to go, according to Change mapconsf mappings to mapconsd? #408

Cons:

  • I'd like to have more discussion of whether it's really correct to change all of the consf mappings to be consd as suggested in Change mapconsf mappings to mapconsd? #408... so this will take more time
  • The initial step may be answer changing; would want to investigate by how much
  • If anyone has lnd2rof mapping files already, and if these were generated with --norm_type fracarea, they will need to be regenerated with --norm_type dstarea

(3) Don't change online mapping behavior for now, but use the same mapping file for the lnd2rof frac mapping as for the other lnd2rof mappings... this relies on the assumption that the differences between consf and consd mapping are inconsequential (which #408 suggests, but I'm not sure it applies in all cases). Bring this to main in this way.

Pros:

  • Allows high-res runs to be done from main branch
  • Simplifies the LND2ROF mapping files
  • Moves us forward in the right direction, though not as far as option (2)

Cons:

  • Answers could differ between using a pre-computed mapping file vs. online mapping; this may be doing (slightly) the wrong thing for some of the lnd2rof mappings when using a pre-computed mapping file. We'd want to verify that the differences are small.
  • Puts some pressure on to resolve Change mapconsf mappings to mapconsd? #408 (at least for the lnd2rof mappings) to remove the above con.

(4) Defer bringing in these changes until post CESM3

Pros:

  • Avoids both (a) changing CESM results for now and (b) making changes that might be confusing or need to be redone later.
  • Specifying pre-made mapping files will give the same behavior as you currently get

Cons:

  • High-res runs will need to be done using this branch, which may need to be maintained for some period

(5) Hybrid between (1) and (4): bring in the changes on this branch that do NOT rely on the new mapping file (LND2ROF_FRAC_FMAPNAME); leave those additional changes on a branch for now.

@sjsprecious

Contributor Author

Thanks @billsacks .

> Our initial question on this is: What is the difference in how these mapping files are generated? My sense is that LND2ROF_FMAPNAME should be generated with --norm_type fracarea whereas LND2ROF_FRAC_FMAPNAME should be generated with --norm_type dstarea (or no --norm_type, since dstarea is the default). Is that what was done? If so, can you add some documentation of this difference to the relevant xml and namelist definition documentation?

Yes, LND2ROF_FMAPNAME is generated with --norm_type fracarea whereas LND2ROF_FRAC_FMAPNAME is generated with --norm_type dstarea. I have updated the comments in the XML files as suggested.

> We are also wondering if the resulting mapping files actually differ based on using these different --norm_type flags. Can you comment on this?

Most fields in these two mapping files, generated with different --norm_type options, are identical. But the weights variable S does differ, by about 1.45e-13 (maximum absolute difference) or 6.7e-13 (maximum relative difference). Thus I would say they are not exactly the same, but the difference is subtle and at roundoff level.
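For reference, the two metrics quoted above could be computed along the following lines. This is a minimal sketch, not the script that produced those numbers: `max_differences` is a hypothetical helper, the weights are assumed to be already loaded into plain Python sequences in matching order, and the relative difference is assumed to be scaled by the larger magnitude of each pair.

```python
def max_differences(weights_a, weights_b):
    """Return (max absolute difference, max relative difference).

    Assumes both sequences list the same weights in the same order.
    The relative difference is scaled by the larger magnitude of each
    pair; pairs where both values are zero contribute nothing.
    """
    if len(weights_a) != len(weights_b):
        raise ValueError("weight arrays must have the same length")
    max_abs = 0.0
    max_rel = 0.0
    for a, b in zip(weights_a, weights_b):
        diff = abs(a - b)
        max_abs = max(max_abs, diff)
        scale = max(abs(a), abs(b))
        if scale > 0.0:
            max_rel = max(max_rel, diff / scale)
    return max_abs, max_rel
```

A length mismatch raises an error on purpose: it would indicate that the two files do not have the same sparsity pattern, in which case an element-by-element comparison of S is not meaningful.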

> So now I'm not sure what the right path forward is. I can see a few paths forward, which I lay out below... this probably needs some more discussion to figure out how we want to proceed.

I appreciate that you list all the available options here and provide detailed explanations. My two cents is that if my changes pass the regression tests that you suggested, it means that they do not change the current CMEPS behavior and should be safe to bring into CESM. I understand that providing two similar mapping files may be confusing, but if those XML variables are unset, the online mapping is used and the user may not even notice the existence of those XML variables.

Regarding the option to use the same consd or lnd2rof mapping, I feel that it is out of the scope of this PR and should be addressed by a separate PR (you have also listed a few related open issues), especially since those revisions may change answers while this PR is expected to be BFB. I think it is fine to revise some of the work from this PR afterwards, once you decide how to address those issues.

@sjsprecious

Contributor Author

> Regarding testing, a good test would be SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid, since this covers the change to med_phases_aofluxes_mod in addition to your other changes. This is in the prebeta test list. It hasn't been run on recent versions of the code, so you'll need to generate baselines and then run with your latest changes with comparisons against baselines.

Thanks @billsacks . So I did the following steps:

  1. Checked out the CESM cesm3.0-alphabranch branch (commit: fafcb75) with the latest cmeps1.1.51 tag.
  2. Generated a new baseline for the test SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid on Derecho at /glade/derecho/scratch/sunjian/cesm_baselines/cesm_baseline_pr414.
  3. Checked out the same CESM cesm3.0-alphabranch branch (commit: fafcb75) but with my CMEPS changes here.
  4. Generated a test run for SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid on Derecho at /glade/derecho/scratch/sunjian/SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid.C.20260630_224152_f094m9 and compared to the baseline in Step 2.
  5. The comparison showed that, except for a namelist difference, all the results were identical.
  SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid (Overall: NLFAIL) details:
    PASS SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid CREATE_NEWCASE
    PASS SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid XML
    PASS SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid SETUP
    PASS SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid SHAREDLIB_BUILD time=233
    FAIL SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid NLCOMP
    PASS SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid MODEL_BUILD time=259
    PASS SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid SUBMIT
    PASS SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid RUN time=790
    PASS SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid BASELINE cesm_baseline_pr414:
    PASS SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid TPUTCOMP
    PASS SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid MEMLEAK insufficient data for memleak test
    PASS SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid SHORT_TERM_ARCHIVER

@billsacks

Member

@sjsprecious - sorry that we've had a lot of back and forths here, but I think @mvertens and I have more clarity on the path forward based on our discussions and investigations this week.

First, thanks for the responses and the test you have run.

We realized that, with the way we think you're generating the offline mapping file, the two files you're providing here will always be effectively the same, within roundoff. That's because we think the land mesh you're using probably doesn't have a mask: that's the case for land meshes in general in CESM: the land mask is added at runtime based on the ocean grid/mask. When you have an unmasked source mesh, fracarea and dstarea mapping give identical results (within roundoff), so there is no value in providing these two separately. Furthermore, the fracarea-based map that you generate offline will differ from the fracarea-based map that's generated in the typical online workflow.

This made us worry at first that you would get wrong answers with this offline fracarea-based map, but after digging more deeply, we have convinced ourselves that this is probably okay - for the reason that #408 asserts that consf mappings can safely change to consd. (Briefly: there is a normalization applied in the course of applying the mapping weights that makes it irrelevant whether the underlying mappings were created using dstarea or fracarea normalization.)
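A small numerical sketch may help make both points concrete. It is not CMEPS or ESMF code, and it rests on assumptions: dstarea weights are taken to be overlap area divided by destination cell area, fracarea weights are taken to be overlap area divided by the unmasked overlap area of the destination cell, and the normalization applied at mapping time is taken to be a ratio of the mapped fraction-weighted field to the mapped fraction. All names and numbers are made up for illustration.

```python
def conservative_weights(overlap, dst_area, src_mask, norm_type):
    """Dense weights w[j][i] for destination cell j and source cell i.

    overlap[j][i] is the shared area; src_mask[i] is 1 for active
    source cells and 0 for masked ones.
    """
    weights = []
    for j, row in enumerate(overlap):
        active = [a * m for a, m in zip(row, src_mask)]
        if norm_type == "dstarea":
            denom = dst_area[j]
        elif norm_type == "fracarea":
            denom = sum(active)
        else:
            raise ValueError("unknown norm_type: " + norm_type)
        weights.append([a / denom if denom > 0.0 else 0.0 for a in active])
    return weights


def apply_normalized(weights, field, frac):
    """Map field*frac and frac, then divide (assumed normalization)."""
    result = []
    for row in weights:
        num = sum(w * f * x for w, f, x in zip(row, frac, field))
        den = sum(w * f for w, f in zip(row, frac))
        result.append(num / den)
    return result


overlap = [[2.0, 1.0, 1.0], [0.0, 1.0, 3.0]]  # 2 dst cells, 3 src cells
dst_area = [4.0, 4.0]
field = [10.0, 20.0, 30.0]
frac = [1.0, 0.5, 1.0]

# Unmasked source: the two normalizations give the same weights.
unmasked = [1, 1, 1]
w_dst_unmasked = conservative_weights(overlap, dst_area, unmasked, "dstarea")
w_frac_unmasked = conservative_weights(overlap, dst_area, unmasked, "fracarea")

# Masked source: weights differ by a per-destination-cell factor,
# which cancels once the normalized mapping is applied.
masked = [1, 1, 0]
w_dst_masked = conservative_weights(overlap, dst_area, masked, "dstarea")
w_frac_masked = conservative_weights(overlap, dst_area, masked, "fracarea")
mapped_dst = apply_normalized(w_dst_masked, field, frac)
mapped_frac = apply_normalized(w_frac_masked, field, frac)
```

In this toy case the two masked weight matrices differ only by one factor per destination cell. Any such row-wise factor appears in both the numerator and the denominator of the normalized mapping, so it drops out of the result.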

What this all means for this PR is:

(1) We feel that it is confusing to have these two separate lnd2rof mapping files, since (a) they will always be identical given that the land mesh doesn't have a mask, (b) the fracarea one isn't really doing what fracarea is designed to do (given the lack of a land mask), and (c) it doesn't actually matter (beyond roundoff) whether you're using fracarea or dstarea normalization in these contexts. So we'd like you to remove the new xml and namelist variable, just using the single existing lnd2rof mapping file variable in all places in the code.

(2) We should change the description of this single lnd2rof variable to explain this. I can help with this, but before spending time on this, let's see what (3) shows.

(3) Since we aren't totally confident here, we feel it's important to see a demonstration that the lnd2rof mapped variables are the same, within roundoff, when providing an offline mapping file as when doing the standard online mapping (as @jtruesdal suggested). I'm thinking this could be easier if you (a) use a coarse grid so you don't run into the problems you have run into at this high resolution (just use the same method for creating the mapping file, but applied to the coarse grid); (b) use an I compset for a cheaper run and to avoid feedbacks, since I think changes in the LND2ROF fields shouldn't feed back to the rest of the system there (an F compset would also avoid these feedbacks but would be much more expensive); and (c) turn on cplhist files (HIST_OPTION / HIST_N) so you can look at the mapped lnd2rof fields on the rof grid (looking at these fields for just one time step a few hours or more into a run will be sufficient). Seeing that these mapped lnd2rof fields are the same within roundoff when you do vs. don't provide a mapping file will give confidence that it's safe to use a single offline-generated mapping file here.

Let me know if you have questions about this.

@billsacks

Member

(Referring back to my comment from a few days ago, we're suggesting going with option (3). Based on my analysis, the con listed there of doing slightly the wrong thing when using a pre-computed mapping file is actually a non-issue, but I'd like for this to be confirmed with a test run. Until we resolve #408, there is potential for confusion, in that we're using the same offline mapping file for two different mappings in the code, but hopefully a well-written comment can address this confusion, and it should be less confusing than needing to provide two different mapping files that end up being identical.)

@sjsprecious

Contributor Author

Ok, so I did the suggested simulations for option (3):

  1. Checked out the CESM cesm3.0-alphabranch branch (commit: fafcb75) with the latest cmeps1.1.51 tag.
  2. Generated a baseline for the FHISTC_LTso compset at ne30pg3 resolution on Derecho at /glade/derecho/scratch/sunjian/cam7/run/FHISTC_LTso.ne30pg3_ne30pg3_mg17.derecho.intel.mpich.gpu00_mpi128.online_mapping, using the default online mapping option and simulating one day.
  3. Generated a test run for the same FHISTC_LTso compset at ne30pg3 resolution on Derecho at /glade/derecho/scratch/sunjian/cam7/run/FHISTC_LTso.ne30pg3_ne30pg3_mg17.derecho.intel.mpich.gpu00_mpi128.offline_mapping, using the offline LND2ROF and ROF2LND map files and simulating one day.
  4. The comparison of cplhist files between these two runs shows that they are identical.

Note that I used the same esmf/8.9.1 version both to generate the offline map files and to do the online mapping.

Labels

enhancement New feature or request


4 participants