fix(loss): exclude mixed_type padding atoms from the training loss#5738
fix(loss): exclude mixed_type padding atoms from the training loss#5738wanghan-iapcm wants to merge 10 commits into
Conversation
…sts can be excluded pt losses build model_pred internally and drop the model mask; recover it from atype in the TaskLoss base and call it from every pt loss forward; pt_expt already carries the mask; enables per-frame loss normalization in later tasks; no-op for non-mixed.
…e padding
Atomic terms (ados/acdf for DOSLoss; local tensor for TensorLoss) now use per-frame masked mean (idiom 1) instead of a cross-frame pooled masked mean, ensuring each frame is normalized by its own real-atom count. Global terms (dos/cdf for DOSLoss; global tensor for TensorLoss) now use plain mean (idiom 3), dropping the prior atom-count weighting that violated the grad-accumulation invariant. Both meet the invariant: a padded [3+5]-atom batch yields the same loss as processing each frame separately and averaging. Boolean fancy indexing removed from pt atomic branches. Changes are bit-identical for non-mixed batches (all-ones mask). Applies to deepmd/dpmodel/loss/{dos,tensor}.py (serves pt_expt) and deepmd/pt/loss/{dos,tensor}.py (mirror). TDD: invariant tests confirmed RED before and GREEN after (24/24 passed).
…padding Apply per-frame normalization to every loss term (energy, force, virial, atom_ener, atom_pref, gen_force) in both deepmd/dpmodel/loss/ener.py and deepmd/pt/loss/ener.py. When model_dict["mask"] is present (mixed_type batch), ghost-padded atoms are excluded from the loss signal via Idiom 1 (masked per-atom mean) for force/atom_ener/atom_pref and Idiom 2 (extensive inv**norm_exp weighting) for energy/virial; all sub-branches (mse/mae/huber, intensive/non-intensive) are covered. Non-mixed callers that never inject a mask hit the unchanged else-branch and are bit-identical to the pre-fix behavior. 40 new grad-accumulation-invariant unit tests (20 dpmodel + 20 pt) verify the fix and the no-op property.
…ay metric Add test_huber_grad_accum to TestDPModelEnergyLossAtomEnerGradAccum and TestPTEnergyLossAtomEnerGradAccum verifying the masked Huber atom_ener path satisfies the [3+5]==separate grad-accum invariant. Fix rmse_f in the masked force MSE branch to use the masked per-frame l2 (l2_force_masked / l2_f_masked) rather than the ghost-diluted unmasked l2_force_loss; hoist the masked MSE computation before the use_huber split so it is always in scope for the display metric in both huber and non-huber paths.
…ount In mixed_type batches, ghost atoms (atype=-1) pad short frames to a fixed nloc width. PropertyLoss previously divided pred/label by the scalar natoms (the padded count), causing frames with fewer real atoms to be incorrectly normalized. When model_dict["mask"] is present, use the per-frame real atom count (sum of mask along the atom axis, broadcast over task_dim) instead. The else branch retains the original /natoms behavior so mask-free callers are unchanged. Tests: RED→GREEN for mse/mae/smooth_mae grad-accum invariant in both dpmodel and pt; no-op and intensive guards added.
…type padding Apply the grad-accumulation-invariant per-frame normalization to EnergySpinLoss in both the dpmodel (deepmd/dpmodel/loss/ener_spin.py) and pt (deepmd/pt/loss/ener_spin.py) backends: - Energy (has_e, MSE): idiom 2 — per-frame 1/real_natoms^norm_exp normalization replaces the flat atom_norm**norm_exp * mean() formulation. - Force real (has_fr, MSE): idiom 1 — per-atom masked mean (ncomp=3) replaces the unmasked global mean over all nloc atoms. - Virial (has_v, MSE): idiom 2 — per-frame 1/real_natoms^norm_exp normalization (k=9) replaces atom_norm**norm_exp * mean(). The force_mag / mask_mag (has_fm) path is left completely untouched; mask_mag is the spin virtual-atom mask and is a distinct concept from the padding mask. Each masked branch is guarded by `if "mask" in model_dict/model_pred:` so mask-less callers (non-mixed batches without a padding mask) use the original code path unchanged — the fix is a bit-identical no-op for non-mixed batches. Tests added to source/tests/common/dpmodel/test_loss_padding.py and source/tests/pt/test_loss_padding.py: invariant tests for energy (norm_exp 1 and 2), force_real, and virial; no-op tests for each term; and a guard that confirms force_mag loss is numerically unchanged when a padding mask is present.
… padding Energy/force_real/virial MAE branches in EnergySpinLoss (dpmodel and pt) used padded-scalar natoms or unmasked global means, violating the grad-accumulation invariant for mixed_type batches. Apply the same per-frame masked pattern already used for MSE, mirroring the authoritative treatment in ener.py (Task 3). Also fix the mae=True display metric for energy to use per-frame inv weighting instead of padded atom_norm. force_mag/mask_mag and MSE branches are untouched.
The extensive property mask normalization called xp.sum(model_dict["mask"], -1) with axis positional; array_api_compat's torch namespace declares axis keyword-only, so this raised TypeError for every pt_expt extensive-property training. Use axis=-1, matching the dos/tensor/ener losses. Adds a torch-tensor test through PropertyLoss.call that reproduces the crash on the old form.
…pe padding Apply idiom 1 (per-atom masked mean, ncomp=1) to the has_ae block in both EnergySpinLoss backends so that ghost-padded atoms (mask=0) are excluded from the atom_ener loss normalization. Before this fix the loss was computed as mean(square(diff)) over a flattened [nf*nloc] vector, so ghost atoms diluted the denominator in mixed_type batches. The masked path uses xp.reshape(maskf, (_nf, _nloc, 1)) as a broadcastable column mask, then idiom-1 per-frame sum / dof → mean over frames. The unmasked (no-mask) fallback is unchanged. Also add grad-accum invariant tests for: ener_spin.has_ae (mse/mae, RED→GREEN), gen_force/has_gf (already GREEN — ghost forces are masked before projection so the mean(square) over [nf, ngen] is frame-decomposable), and force_mag MSE (GREEN — n_valid excludes ghost atoms via mask_mag=0; invariant holds when NM is equal across frames, which is the practical spin-model case). force_mag MAE uses xp.sum over frames instead of xp.mean (2x accumulation artifact, pre-existing, independent of mask_mag semantics) and is reported as NEEDS_CONTEXT.
Importing deepmd.pt installs a cuda:9999999 default-device trap, so the property torch test failed only when run after the pt test file (order-dependent). Pin device=cpu on the constructed tensors to make the test hermetic. Source behavior unchanged.
| _nloc = maskf.shape[1] | ||
| else: | ||
| maskf = None | ||
| inv = None |
| else: | ||
| maskf = None | ||
| inv = None | ||
| _nf = None |
| maskf = None | ||
| inv = None | ||
| _nf = None | ||
| _nloc = None |
| _nloc = maskf.shape[1] | ||
| else: | ||
| maskf = None | ||
| inv = None |
| else: | ||
| maskf = None | ||
| inv = None | ||
| _nf = None |
| else: | ||
| maskf = None | ||
| inv = None | ||
| _nf = None |
| maskf = None | ||
| inv = None | ||
| _nf = None | ||
| _nloc = None |
| _nloc = maskf.shape[1] | ||
| else: | ||
| maskf = None | ||
| inv = None |
| else: | ||
| maskf = None | ||
| inv = None | ||
| _nf = None |
| maskf = None | ||
| inv = None | ||
| _nf = None | ||
| _nloc = None |
📝 WalkthroughWalkthroughLoss functions across dpmodel and pt backends (DOS, energy, energy-spin, tensor, property) are refactored to support per-frame atom masking for padded/mixed-type batches. A new pt helper injects atom masks from atype, local losses use per-frame masked reductions, global losses use unweighted means with mask-derived atom counts, and extensive invariant tests are added for both backends. ChangesMask-aware loss normalization
Estimated code review effort: 5 (Critical) | ~120 minutes Possibly related PRs
Suggested reviewers: 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
source/tests/pt/test_loss_padding.py (2)
1976-2064: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winKnown force_mag MAE bug documented but left untested.
The comment acknowledges that
force_magMAE fails the grad-accum invariant by a 2x factor due to frame-sum normalization, but only the MSE variant is exercised as a test — there's noxfail/skiptest capturing the MAE case, so this documented gap indeepmd/pt/loss/ener_spin.pyisn't tracked by CI and could silently regress or be forgotten.Consider adding an explicit
@pytest.mark.xfailtest asserting the known MAE force_mag discrepancy, so the fix (or its continued absence) is tracked rather than only documented in a comment.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/tests/pt/test_loss_padding.py` around lines 1976 - 2064, The force_mag MAE grad-accum discrepancy is only described in comments and not covered by CI. Add an explicit pytest xfail (or skip) test near TestPTEnerSpinLossForceMagMSEGradAccum that exercises the MAE path through _spin_loss_fn/EnergySpinLossPT and asserts the known 2x invariant mismatch, so the documented behavior is tracked and won’t be forgotten. Keep the existing MSE test unchanged and use the same make_A, make_B, and make_padded helpers to locate the relevant loss behavior.
109-122: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueDuplicate mock model classes.
_MockModeland_EnerLossMockModelare functionally identical (same__init__/__call__). Consider consolidating into a single shared helper used across all test classes in this file.♻️ Suggested consolidation
-class _MockModel: - """Callable that ignores inputs and returns a fixed model_pred dict. - ... - """ - - def __init__(self, pred: dict): - self._pred = pred - - def __call__(self, **kwargs): - return dict(self._pred) - ... -class _EnerLossMockModel: - """Callable returning a fixed model_pred dict (mask pre-populated).""" - - def __init__(self, pred: dict): - self._pred = pred - - def __call__(self, **kwargs): - return dict(self._pred) +# Reuse the single _MockModel class defined earlier in this file instead of +# redefining an identical class for the energy-loss test section.Also applies to: 628-636
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/tests/pt/test_loss_padding.py` around lines 109 - 122, Consolidate the duplicate test helpers by removing the redundant `_EnerLossMockModel` and reusing `_MockModel` across the pt loss tests. Keep the shared behavior in one helper with the same `__init__` and `__call__` contract, and update the affected test classes in `test_loss_padding` to instantiate that single helper instead of maintaining two identical mock model classes. Ensure any existing expectations around the shallow copy behavior and pre-populated `pred` mask remain unchanged.deepmd/dpmodel/loss/dos.py (1)
133-143: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚖️ Poor tradeoffOptional: extract the repeated masked per-frame reduction.
This masked "per-frame sum / per-frame dof → mean" idiom is duplicated verbatim in the local CDF block (Lines 160-170) and mirrored in
tensor.pyand both pt backends. A small shared helper (e.g.masked_per_frame_mse(diff3d, maskf, ndof)) would reduce the maintenance surface. Behavior-preserving; safe to defer.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/dpmodel/loss/dos.py` around lines 133 - 143, The masked per-frame reduction in the DOS loss is duplicated across the local DOS and local CDF paths and mirrored in other backends, so extract it into a shared helper to reduce maintenance. Create a small utility such as masked_per_frame_mse(diff3d, maskf, ndof) and have the DOS block in dos.py call it instead of inlining the "per-frame sum / per-frame dof → mean" logic. Keep behavior identical and reuse the helper wherever the same masked reduction pattern appears, including the local CDF block and matching implementations in tensor.py and the pt backends.source/tests/common/dpmodel/test_loss_padding.py (1)
204-207: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueRedundant
acdf/cdftests add no independent coverage.
test_acdf_grad_accum_invariantsimply re-invokestest_ados_grad_accum_invariant(andtest_cdf_grad_accum_invariantat Lines 282-285 re-invokestest_dos_grad_accum_invariant). Theacdf/cdfpaths are only exercised incidentally because_make_losssets both*_ados/*_acdf(and*_dos/*_cdf) prefs to1.0, so these named tests don't isolate the cumulative-distribution path. Consider either isolating the cdf term (e.g., pref only on cdf/acdf) or dropping the alias tests to avoid implying coverage that isn't independent.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/tests/common/dpmodel/test_loss_padding.py` around lines 204 - 207, The acdf/cdf grad-accum tests are just aliases of the ados/dos tests and don’t add independent coverage. Update the relevant test methods in test_loss_padding (test_acdf_grad_accum_invariant and test_cdf_grad_accum_invariant) so they either isolate the cumulative-distribution path by configuring _make_loss prefs for only acdf/cdf, or remove the redundant alias tests altogether. Use the existing test_ados_grad_accum_invariant and test_dos_grad_accum_invariant helpers as the reference point when deciding whether to separate coverage or drop the wrappers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@source/tests/common/dpmodel/test_loss_padding.py`:
- Around line 2011-2019: The `force_mag` MAE path in `ener_spin.py` is still
frame-scaled because it uses a sum over per-frame averages instead of a true
frame-wise mean. Update the MAE computation in the `force_mag` loss logic to
normalize across frames consistently with the other `ener_spin` MAE branches,
using the relevant symbols around the `force_mag` reduction code path. If this
change is intentionally deferred, add a reference to the tracking issue in the
surrounding test/comment; otherwise make the loss independent of `nf` so the 2x
effect disappears.
---
Nitpick comments:
In `@deepmd/dpmodel/loss/dos.py`:
- Around line 133-143: The masked per-frame reduction in the DOS loss is
duplicated across the local DOS and local CDF paths and mirrored in other
backends, so extract it into a shared helper to reduce maintenance. Create a
small utility such as masked_per_frame_mse(diff3d, maskf, ndof) and have the DOS
block in dos.py call it instead of inlining the "per-frame sum / per-frame dof →
mean" logic. Keep behavior identical and reuse the helper wherever the same
masked reduction pattern appears, including the local CDF block and matching
implementations in tensor.py and the pt backends.
In `@source/tests/common/dpmodel/test_loss_padding.py`:
- Around line 204-207: The acdf/cdf grad-accum tests are just aliases of the
ados/dos tests and don’t add independent coverage. Update the relevant test
methods in test_loss_padding (test_acdf_grad_accum_invariant and
test_cdf_grad_accum_invariant) so they either isolate the
cumulative-distribution path by configuring _make_loss prefs for only acdf/cdf,
or remove the redundant alias tests altogether. Use the existing
test_ados_grad_accum_invariant and test_dos_grad_accum_invariant helpers as the
reference point when deciding whether to separate coverage or drop the wrappers.
In `@source/tests/pt/test_loss_padding.py`:
- Around line 1976-2064: The force_mag MAE grad-accum discrepancy is only
described in comments and not covered by CI. Add an explicit pytest xfail (or
skip) test near TestPTEnerSpinLossForceMagMSEGradAccum that exercises the MAE
path through _spin_loss_fn/EnergySpinLossPT and asserts the known 2x invariant
mismatch, so the documented behavior is tracked and won’t be forgotten. Keep the
existing MSE test unchanged and use the same make_A, make_B, and make_padded
helpers to locate the relevant loss behavior.
- Around line 109-122: Consolidate the duplicate test helpers by removing the
redundant `_EnerLossMockModel` and reusing `_MockModel` across the pt loss
tests. Keep the shared behavior in one helper with the same `__init__` and
`__call__` contract, and update the affected test classes in `test_loss_padding`
to instantiate that single helper instead of maintaining two identical mock
model classes. Ensure any existing expectations around the shallow copy behavior
and pre-populated `pred` mask remain unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 3120253d-2b09-40b1-9819-187f97feab86
📒 Files selected for processing (13)
deepmd/dpmodel/loss/dos.pydeepmd/dpmodel/loss/ener.pydeepmd/dpmodel/loss/ener_spin.pydeepmd/dpmodel/loss/property.pydeepmd/dpmodel/loss/tensor.pydeepmd/pt/loss/dos.pydeepmd/pt/loss/ener.pydeepmd/pt/loss/ener_spin.pydeepmd/pt/loss/loss.pydeepmd/pt/loss/property.pydeepmd/pt/loss/tensor.pysource/tests/common/dpmodel/test_loss_padding.pysource/tests/pt/test_loss_padding.py
| # --------------------------------------------------------------------------- | ||
| # force_mag MSE grad-accum invariant (NM equal across frames, ghost atoms non-magnetic) | ||
| # | ||
| # NOTE: force_mag MAE (dpmodel) uses xp.sum over frames instead of xp.mean, | ||
| # so MAE force_mag FAILS the invariant with a 2x factor when frames=2. | ||
| # This is a pre-existing frame-normalization artifact independent of ghost-atom | ||
| # masking. Ghost atoms are correctly excluded (mask_mag=0 there). The MAE | ||
| # artifact is reported as NEEDS_CONTEXT in the audit report. | ||
| # --------------------------------------------------------------------------- |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect force_mag MAE normalization in the dpmodel spin loss.
fd -t f 'ener_spin.py' deepmd/dpmodel/loss --exec ast-grep outline {}
rg -nP -C4 'force_mag|mask_mag' deepmd/dpmodel/loss/ener_spin.pyRepository: deepmodeling/deepmd-kit
Length of output: 2756
Normalize force_mag MAE across frames. deepmd/dpmodel/loss/ener_spin.py still computes the MAE term as xp.sum(per_frame_sum_fm / per_frame_count_fm), so the loss scales with the number of frames (the 2× behavior for nf=2 remains). If this is out of scope here, link the tracking issue; otherwise switch it to a frame-wise mean to match the other ener_spin MAE paths.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@source/tests/common/dpmodel/test_loss_padding.py` around lines 2011 - 2019,
The `force_mag` MAE path in `ener_spin.py` is still frame-scaled because it uses
a sum over per-frame averages instead of a true frame-wise mean. Update the MAE
computation in the `force_mag` loss logic to normalize across frames
consistently with the other `ener_spin` MAE branches, using the relevant symbols
around the `force_mag` reduction code path. If this change is intentionally
deferred, add a reference to the tracking issue in the surrounding test/comment;
otherwise make the loss independent of `nf` so the 2x effect disappears.
|
Thanks for the careful fix. The important invariant here should indeed be frame-wise equivalence: a padded mixed-type batch should give the same loss/gradient as processing the real frames separately and averaging. This PR consistently applies that criterion across the pt/pt_expt loss paths, and the all-ones-mask guards are a good way to protect non-mixed behavior. A few notes before merging:
No blocker from my side on the main masking/normalization approach once CI is green and the follow-up tracking is clear. Authored by OpenClaw 2026.6.8 (844f405) (model: custom-chat-jinzhezeng-group/gpt-5.5) |
njzjz
left a comment
There was a problem hiding this comment.
Approve but see above non-blocking comments.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5738 +/- ##
==========================================
- Coverage 81.28% 81.22% -0.06%
==========================================
Files 988 990 +2
Lines 110891 111535 +644
Branches 4234 4232 -2
==========================================
+ Hits 90137 90595 +458
- Misses 19229 19412 +183
- Partials 1525 1528 +3 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Problem
In the
mixed_typedata format, short frames are padded withtype = -1ghost atoms up to a fixednloc, and the real atom count varies per frame within a batch. The training loss normalized by the padded scalarnatomsand took unmasked or cross-frame-pooled means, so ghost atoms diluted the force/atomic denominators and mis-normalized the extensive energy/virial/property terms. As a result a padded[3-atom + 5-atom]batch did not produce the same loss/gradient as processing the 3-atom and 5-atom frames separately. Onlymixed_typebatches are affected; non-mixed training was already exact.Fix
The per-atom mask (
atype >= 0) now reaches the loss under the existingmodel_dict["mask"]convention: pt_expt already propagates it; the pt (torch.jit) backend recovers it fromatypevia a newTaskLoss._inject_atom_maskhelper called from every pt lossforward(the exported forward drops the model's per-atom mask, so it is recovered training-side only — the exported artifact is untouched).Every loss term is then normalized per frame so that a padded batch equals the grad-accumulation of the individual frames at their real sizes: per-atom terms (force, atom_ener, atom_pref, atomic dos/tensor, spin real-force, generalized force) use a per-frame masked mean; extensive terms (energy, virial, property) divide by the per-frame real atom count; global already-reduced terms (global dos/tensor) use a plain mean with the previous atom-count weighting dropped. Every change reduces exactly to the previous formula when the mask is all-ones, so non-mixed training is numerically identical (no-op to rounding). Covered across five shared loss types in both backends:
deepmd/dpmodel/loss/{ener,ener_spin,dos,tensor,property}.py(which serve pt_expt) anddeepmd/pt/loss/{ener,ener_spin,dos,tensor,property}.py.Two additional fixes surfaced during the work: the extensive property normalization called
xp.sum(mask, -1)with a positional axis, which raisesTypeErrorunder the array_api_compat torch namespace (every pt_expt extensive-property run) — nowaxis=-1; andener_spin's MAE energy and real-force terms were pre-existingly inconsistent withener.py(they summed over frames without per-atom normalization) — they are now aligned withener.py, which changes their non-mixed MAE loss values (a deliberate bug fix, see Known Limitations).Test
New
source/tests/common/dpmodel/test_loss_padding.pyandsource/tests/pt/test_loss_padding.pyassert, for every per-atom and extensive term of all five loss types in both backends, that a padded[3+5]batch loss equals the mean of the two frames processed separately, plus an all-ones-mask non-mixed no-op guard per term, and a torch-tensor path through the dpmodel property loss (which reproduces the positional-axis crash on the old form). An audit added invariant coverage for the generalized-force and spin magnetic-force terms and confirmed they are free of padding artifacts.Known limitations
The tf backend loss is unchanged and retains the same mixed_type behavior (follow-up). The pt-only losses
dens,population,denoiseare not covered (follow-up).ener_spin's magnetic-force (force_mag) MAE term uses a sum over frames rather than a mean, so it does not satisfy the frame-average invariant — this is not a padding artifact (ghost atoms are correctly excluded viamask_mag), but a separate pre-existing MAE frame-normalization inconsistency, left for a follow-up decision. Theenable_atom_ener_coeffpath sums ghost atomic energies before the energy reduction (pre-existing; ghost atom_ener is ~0 by convention). Existingmixed_typetrainings will not reproduce numerically — the new values are the correct ones. Ghost label forces are assumed ~0 by the dpdata convention; the mask makes the loss robust even if they are not.Summary by CodeRabbit
New Features
Bug Fixes