@samsrabin
Contributor

@samsrabin samsrabin commented Jul 9, 2025

As part of ESCOMP/CTSM#3292, I made two SystemTests that exercise a command (subset_data) to generate CTSM input data and then run with the results. That command is exercised in the SystemTest's custom build_phase(), because it's the earlier of the two methods inherited from SystemTestsCommon that we're allowed to override.
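A minimal sketch of that override pattern, for readers unfamiliar with it. The base class here is a local stand-in so the snippet runs without CIME; the `build_phase(sharedlib_only, model_only)` signature is my understanding of the real `SystemTestsCommon` and should be treated as an assumption, and the `"subset_data"` log entry is only a placeholder for running the actual command.

```python
class StubSystemTestsCommon:
    """Local stand-in for CIME's SystemTestsCommon so this sketch runs alone."""

    def __init__(self):
        self.log = []

    def build_phase(self, sharedlib_only=False, model_only=False):
        self.log.append(
            f"build(sharedlib_only={sharedlib_only}, model_only={model_only})"
        )

    def run_phase(self):
        self.log.append("run")


class SubsetDataPointSketch(StubSystemTestsCommon):
    """Hypothetical test that generates its own input data before building."""

    def build_phase(self, sharedlib_only=False, model_only=False):
        # Generate inputs once, on the first (shared-library) build pass only.
        if not model_only:
            self.log.append("subset_data")  # placeholder for the real command
        super().build_phase(sharedlib_only=sharedlib_only, model_only=model_only)


test = SubsetDataPointSketch()
test.build_phase(sharedlib_only=True)  # first pass: shared libraries
test.build_phase(model_only=True)      # second pass: model
test.run_phase()
print(test.log)
```

The `if not model_only` guard is a design choice of this sketch, not something taken from the CTSM tests: it keeps the data generation from running twice when the build is split into two passes.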

With the compset those tests use, certain variables (at least fsurdat) have no default. The tests worked fine when I wasn't comparing against or generating a baseline. When I did, though, I got namelist build failures because the default was missing and the custom setting hadn't been applied yet. TestStatus:

PASS SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default CREATE_NEWCASE
PASS SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default XML
FAIL SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default SETUP

And the end of TestStatus.log:

 ---------------------------------------------------
2025-07-09 12:07:57: SETUP PASSED for test 'SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default'.
Command: ./case.setup
Output: job is case.test USER_REQUESTED_WALLTIME 00:20:00 USER_REQUESTED_QUEUE None WALLTIME_FORMAT %H:%M:%S
Creating batch scripts
Writing case.test script from input template /glade/work/samrabin/ctsm_subsetdata-systemtest/ccs_config/machines/template.case.test
Creating file .case.test
Writing case.st_archive script from input template /glade/work/samrabin/ctsm_subsetdata-systemtest/ccs_config/machines/template.st_archive
Creating file case.st_archive
Writing case.cupid script from input template /glade/work/samrabin/ctsm_subsetdata-systemtest/ccs_config/machines/template.cupid
Creating file case.cupid
Creating user_nl_xxx files for components and cpl
If an old case build already exists, might want to run 'case.build --clean' before building
You can now run './preview_run' to get more info on how your case will be run


 ---------------------------------------------------
2025-07-09 12:08:03: Test 'SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default' failed in phase 'SETUP' with exception 'ERROR: Fatal error in case.cmpgen_namelists: 2025-07-09 12:08:01 atm
Create namelist for component datm
   Calling /glade/work/samrabin/ctsm_subsetdata-systemtest/components/cdeps/datm/cime_config/buildnml
  2025-07-09 12:08:02 lnd
Create namelist for component clm
   Calling /glade/work/samrabin/ctsm_subsetdata-systemtest/cime_config/buildnml
ERROR: Command /glade/work/samrabin/ctsm_subsetdata-systemtest/bld/build-namelist failed rc=255
out=
err=Attempt to call undefined import method with arguments ("main") via package "CLMBuildNamelist" (Perhaps you forgot to load the package?) at /glade/work/samrabin/ctsm_subsetdata-systemtest/bld/build-namelist line 21.
ERROR : CLM build-namelist::CLMBuildNamelist::add_default() : No default value found for fsurdat.
            Are defaults provided for this resolution and land mask?'
  File "/glade/work/samrabin/ctsm_subsetdata-systemtest/cime/CIME/test_scheduler.py", line 1126, in _run_catch_exceptions
    return run(test)
  File "/glade/work/samrabin/ctsm_subsetdata-systemtest/cime/CIME/test_scheduler.py", line 1012, in _setup_phase
    expect(
    ~~~~~~^
        cmdstat in [0, TESTS_FAILED_ERR_CODE],
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        "Fatal error in case.cmpgen_namelists: {}".format(output),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/glade/work/samrabin/ctsm_subsetdata-systemtest/cime/CIME/utils.py", line 176, in expect
    raise exc_type(msg)

I solved the problem by moving the call of case.cmpgen_namelists in test_scheduler from _setup_phase() to _sharedlib_build_phase(). So now TestStatus is all green:

PASS SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default CREATE_NEWCASE
PASS SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default XML
PASS SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default SETUP
PASS SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default SHAREDLIB_BUILD time=280
PASS SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default MODEL_BUILD time=45
PASS SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default NLCOMP
PASS SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default SUBMIT
PASS SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default RUN time=321
PASS SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default GENERATE ssrtest
PASS SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default MEMLEAK
PASS SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default SHORT_TERM_ARCHIVER

Note the NLCOMP step coming after both BUILD steps, instead of after SHAREDLIB_BUILD like it usually does. I'm sure this will cause problems in the test suite because of that rearrangement; I'll plan to fix those if y'all are amenable to this solution. (I will also update to the latest CIME tag.)
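To make the ordering problem concrete, here is a toy model of it. None of these function names are CIME's; `build_namelist`, `custom_build_phase`, and the file name are hypothetical stand-ins, and the phases are reduced to the two that matter.

```python
# Toy model of the phase ordering; none of these names are CIME's real API.
REQUIRED = ("fsurdat",)


def build_namelist(settings):
    """Mimic build-namelist: fail if a required variable has no value."""
    missing = [key for key in REQUIRED if key not in settings]
    if missing:
        raise RuntimeError(f"No default value found for {missing[0]}.")
    return dict(settings)


def custom_build_phase(settings):
    """Stand-in for the test's build_phase(), which runs subset_data."""
    settings["fsurdat"] = "surfdata_subset.nc"  # hypothetical file name


def run_phases(order):
    """Run the named phases in order and report the first failure, if any."""
    settings = {}
    steps = {
        "BUILD": lambda: custom_build_phase(settings),
        "NLCOMP": lambda: build_namelist(settings),
    }
    for name in order:
        try:
            steps[name]()
        except RuntimeError as err:
            return f"FAIL {name}: {err}"
    return "PASS"


old_order = run_phases(["NLCOMP", "BUILD"])  # namelist check before the build
new_order = run_phases(["BUILD", "NLCOMP"])  # namelist check after the build
print(old_order)
print(new_order)
```

With the namelist check first, the required value is not there yet and the check fails; with the check after the custom build, it passes.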

Other solutions I considered:

  • Calling subset_data during the SUBSETDATAPOINT __init__() method. That didn't work, seemingly because no SUBSETDATAPOINT object is initialized until after the problematic point in test_scheduler.
  • Writing the necessary lines to user_nl_clm and whatever else during SUBSETDATAPOINT __init__(). Won't work for the same reason as above.

Other ideas I could try if this PR is rejected:

  • Doing this with a testdef rather than a SystemTest; not sure it would work.

Test suite:
Test baseline:
Test namelist changes:
Test status: [bit for bit, roundoff, climate changing]

Fixes [CIME Github issue #]: None

User interface changes?: None

Update gh-pages html (Y/N)?: No

Calling case.cmpgen_namelists in _setup_phase (as before) results in a namelist build failure for custom SystemTests that need to set some values that have no defaults.
@codecov

codecov bot commented Jul 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (master@1006c39). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff            @@
##             master    #4819   +/-   ##
=========================================
  Coverage          ?   55.40%           
=========================================
  Files             ?      266           
  Lines             ?    38449           
  Branches          ?     8307           
=========================================
  Hits              ?    21304           
  Misses            ?    14763           
  Partials          ?     2382           


@samsrabin
Contributor Author

@jedwards4b Before I work on fixing the system tests, is this something you'd be amenable to bringing in? Or is there a better way you think I should handle this issue?

@jedwards4b
Contributor

@samsrabin I think that this is okay as long as you can fix what it breaks.

@samsrabin
Contributor Author

@jedwards4b I've fixed the failing system test, although I don't really understand how. Basically I needed to add back the call of case.cmpgen_namelists that I had moved elsewhere, so it's now called in both places, but the result is not checked at the first (original) call site.
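Roughly, the shape of that change looks like the sketch below. This is a simplified illustration, not the code in test_scheduler.py: the function bodies, return values, and file name are hypothetical, and it does not explain why the early call is needed.

```python
def cmpgen_namelists(settings):
    """Return (ok, message); ok is False when a required value is missing."""
    if "fsurdat" not in settings:
        return False, "No default value found for fsurdat."
    return True, "namelists generated"


def setup_phase(settings):
    # Early call kept in place, but its result no longer fails the phase.
    ok, _ = cmpgen_namelists(settings)
    return "PASS", ok


def sharedlib_build_phase(settings):
    # Stand-in for the custom build step that supplies the missing value.
    settings["fsurdat"] = "surfdata_subset.nc"  # hypothetical file name
    # Second call: this is the one whose result is actually checked.
    ok, message = cmpgen_namelists(settings)
    return ("PASS" if ok else "FAIL"), message


settings = {}
setup_status, early_ok = setup_phase(settings)
build_status, build_message = sharedlib_build_phase(settings)
print(setup_status, early_ok)
print(build_status, build_message)
```

The early call fails internally here, but SETUP still reports PASS because only the later call is allowed to fail the test.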

@jedwards4b jedwards4b merged commit 6fb4e9e into ESMCI:master Jul 31, 2025
9 checks passed
@jedwards4b
Contributor

@samsrabin @jgfouca writes: this PR appears to have broken some stuff for us (E3SM). We have tests that are showing NML diffs and regenerating them is not fixing the issue.

[T]he issue is that the nml bless/generate does not do the build phase, but the real run now does the nml later, so they won't ever match.

@samsrabin
Contributor Author

Hmmm, now that you mention it, this might have caused similar issues in our CTSM tests when we first updated to cime6.1.112. @jgfouca, are there any meaningful namelist differences shown in the TestStatus.log files, or are they seemingly empty like this?

 ---------------------------------------------------
2025-08-11 17:23:24: NLCOMP

 ---------------------------------------------------

If it's just like that, it should be a one-time thing: tags after the initial merge of cime6.1.112 did not have this issue.


For my reference:

  • We updated to cime6.1.112 in ctsm5.3.069.
  • Testing for that tag was at /glade/derecho/scratch/samrabin/tests_0811-170941de/.
  • get_test_nlcomp_section.sh showed seemingly empty NLCOMP diffs for all tests except the new ones.

@jgfouca
Contributor

jgfouca commented Aug 27, 2025

@samsrabin, as an example, for the case ERP_Ld3.ne4pg2_oQU480.F2010.mappy_gnu:

2025-08-26 15:13:24: NLCOMP
Comparison failed between '/home/jgfouca/acme/scratch/ERP_Ld3.ne4pg2_oQU480.F2010.mappy_gnu.C.20250826_150947_7o05wo/CaseDocs/drv_in' with '/sems-data-store/ACME/baselines/mappy/gnu/master/ERP_Ld3.ne4pg2_oQU480.F2010.mappy_gnu/CaseDocs/drv_in'
  BASE: restart_n = 3
  COMP: restart_n = 2

@samsrabin
Contributor Author

Ah, well I guess that would have been too easy.

@jedwards4b I might need your help for this one; will message you separately.

@jedwards4b
Contributor

I can help but I think that I need more clues. Can this problem be reproduced in any ERP test?

@jgfouca
Contributor

jgfouca commented Aug 27, 2025

@jedwards4b, since ERP fiddles with PE settings that impact namelists, I think it should be reproducible for any ERP test (or PET, etc.).
