Skip to content

fix(loss): exclude mixed_type padding atoms from the training loss#5738

Open
wanghan-iapcm wants to merge 10 commits into
deepmodeling:masterfrom
wanghan-iapcm:fix-loss-mixed-type-padding
Open

fix(loss): exclude mixed_type padding atoms from the training loss#5738
wanghan-iapcm wants to merge 10 commits into
deepmodeling:masterfrom
wanghan-iapcm:fix-loss-mixed-type-padding

Conversation

@wanghan-iapcm

@wanghan-iapcm wanghan-iapcm commented Jul 5, 2026

Copy link
Copy Markdown
Collaborator

Problem

In the mixed_type data format, short frames are padded with type = -1 ghost atoms up to a fixed nloc, and the real atom count varies per frame within a batch. The training loss normalized by the padded scalar natoms and took unmasked or cross-frame-pooled means, so ghost atoms diluted the force/atomic denominators and mis-normalized the extensive energy/virial/property terms. As a result a padded [3-atom + 5-atom] batch did not produce the same loss/gradient as processing the 3-atom and 5-atom frames separately. Only mixed_type batches are affected; non-mixed training was already exact.

Fix

The per-atom mask (atype >= 0) now reaches the loss under the existing model_dict["mask"] convention: pt_expt already propagates it; the pt (torch.jit) backend recovers it from atype via a new TaskLoss._inject_atom_mask helper called from every pt loss forward (the exported forward drops the model's per-atom mask, so it is recovered training-side only — the exported artifact is untouched).

Every loss term is then normalized per frame so that a padded batch equals the grad-accumulation of the individual frames at their real sizes: per-atom terms (force, atom_ener, atom_pref, atomic dos/tensor, spin real-force, generalized force) use a per-frame masked mean; extensive terms (energy, virial, property) divide by the per-frame real atom count; global already-reduced terms (global dos/tensor) use a plain mean with the previous atom-count weighting dropped. Every change reduces exactly to the previous formula when the mask is all-ones, so non-mixed training is numerically identical (no-op to rounding). Covered across five shared loss types in both backends: deepmd/dpmodel/loss/{ener,ener_spin,dos,tensor,property}.py (which serve pt_expt) and deepmd/pt/loss/{ener,ener_spin,dos,tensor,property}.py.

Two additional fixes surfaced during the work: the extensive property normalization called xp.sum(mask, -1) with a positional axis, which raises TypeError under the array_api_compat torch namespace (every pt_expt extensive-property run) — now axis=-1; and ener_spin's MAE energy and real-force terms were pre-existingly inconsistent with ener.py (they summed over frames without per-atom normalization) — they are now aligned with ener.py, which changes their non-mixed MAE loss values (a deliberate bug fix, see Known Limitations).

Test

New source/tests/common/dpmodel/test_loss_padding.py and source/tests/pt/test_loss_padding.py assert, for every per-atom and extensive term of all five loss types in both backends, that a padded [3+5] batch loss equals the mean of the two frames processed separately, plus an all-ones-mask non-mixed no-op guard per term, and a torch-tensor path through the dpmodel property loss (which reproduces the positional-axis crash on the old form). An audit added invariant coverage for the generalized-force and spin magnetic-force terms and confirmed they are free of padding artifacts.

Known limitations

The tf backend loss is unchanged and retains the same mixed_type behavior (follow-up). The pt-only losses dens, population, denoise are not covered (follow-up). ener_spin's magnetic-force (force_mag) MAE term uses a sum over frames rather than a mean, so it does not satisfy the frame-average invariant — this is not a padding artifact (ghost atoms are correctly excluded via mask_mag), but a separate pre-existing MAE frame-normalization inconsistency, left for a follow-up decision. The enable_atom_ener_coeff path sums ghost atomic energies before the energy reduction (pre-existing; ghost atom_ener is ~0 by convention). Existing mixed_type trainings will not reproduce numerically — the new values are the correct ones. Ghost label forces are assumed ~0 by the dpdata convention; the mask makes the loss robust even if they are not.

Summary by CodeRabbit

  • New Features

    • Improved support for padded, mixed-atom batches across loss calculations.
    • Added more consistent handling of masked frames for energy, force, tensor, DOS, property, and spin-related losses.
  • Bug Fixes

    • Fixed loss normalization so ghost atoms and padded entries no longer skew results.
    • Made global loss metrics and reported RMSE/MAE values more consistent under masking.
    • Preserved intensive-property behavior while improving extensive-property scaling.

Han Wang added 10 commits July 4, 2026 19:03
…sts can be excluded

pt losses build model_pred internally and drop the model mask; recover it from atype in the TaskLoss base and call it from every pt loss forward; pt_expt already carries the mask; enables per-frame loss normalization in later tasks; no-op for non-mixed.
…e padding

Atomic terms (ados/acdf for DOSLoss; local tensor for TensorLoss) now use per-frame masked mean (idiom 1) instead of a cross-frame pooled masked mean, ensuring each frame is normalized by its own real-atom count. Global terms (dos/cdf for DOSLoss; global tensor for TensorLoss) now use plain mean (idiom 3), dropping the prior atom-count weighting that violated the grad-accumulation invariant. Both meet the invariant: a padded [3+5]-atom batch yields the same loss as processing each frame separately and averaging. Boolean fancy indexing removed from pt atomic branches. Changes are bit-identical for non-mixed batches (all-ones mask). Applies to deepmd/dpmodel/loss/{dos,tensor}.py (serves pt_expt) and deepmd/pt/loss/{dos,tensor}.py (mirror). TDD: invariant tests confirmed RED before and GREEN after (24/24 passed).
…padding

Apply per-frame normalization to every loss term (energy, force, virial,
atom_ener, atom_pref, gen_force) in both deepmd/dpmodel/loss/ener.py and
deepmd/pt/loss/ener.py. When model_dict["mask"] is present (mixed_type
batch), ghost-padded atoms are excluded from the loss signal via Idiom 1
(masked per-atom mean) for force/atom_ener/atom_pref and Idiom 2
(extensive inv**norm_exp weighting) for energy/virial; all sub-branches
(mse/mae/huber, intensive/non-intensive) are covered. Non-mixed callers
that never inject a mask hit the unchanged else-branch and are bit-identical
to the pre-fix behavior. 40 new grad-accumulation-invariant unit tests
(20 dpmodel + 20 pt) verify the fix and the no-op property.
…ay metric

Add test_huber_grad_accum to TestDPModelEnergyLossAtomEnerGradAccum and TestPTEnergyLossAtomEnerGradAccum verifying the masked Huber atom_ener path satisfies the [3+5]==separate grad-accum invariant. Fix rmse_f in the masked force MSE branch to use the masked per-frame l2 (l2_force_masked / l2_f_masked) rather than the ghost-diluted unmasked l2_force_loss; hoist the masked MSE computation before the use_huber split so it is always in scope for the display metric in both huber and non-huber paths.
…ount

In mixed_type batches, ghost atoms (atype=-1) pad short frames to a fixed
nloc width. PropertyLoss previously divided pred/label by the scalar natoms
(the padded count), causing frames with fewer real atoms to be incorrectly
normalized. When model_dict["mask"] is present, use the per-frame real atom
count (sum of mask along the atom axis, broadcast over task_dim) instead.
The else branch retains the original /natoms behavior so mask-free callers
are unchanged. Tests: RED→GREEN for mse/mae/smooth_mae grad-accum invariant
in both dpmodel and pt; no-op and intensive guards added.
…type padding

Apply the grad-accumulation-invariant per-frame normalization to EnergySpinLoss
in both the dpmodel (deepmd/dpmodel/loss/ener_spin.py) and pt
(deepmd/pt/loss/ener_spin.py) backends:

- Energy (has_e, MSE): idiom 2 — per-frame 1/real_natoms^norm_exp normalization
  replaces the flat atom_norm**norm_exp * mean() formulation.
- Force real (has_fr, MSE): idiom 1 — per-atom masked mean (ncomp=3) replaces
  the unmasked global mean over all nloc atoms.
- Virial (has_v, MSE): idiom 2 — per-frame 1/real_natoms^norm_exp normalization
  (k=9) replaces atom_norm**norm_exp * mean().

The force_mag / mask_mag (has_fm) path is left completely untouched; mask_mag
is the spin virtual-atom mask and is a distinct concept from the padding mask.

Each masked branch is guarded by `if "mask" in model_dict/model_pred:` so
mask-less callers (non-mixed batches without a padding mask) use the original
code path unchanged — the fix is a bit-identical no-op for non-mixed batches.

Tests added to source/tests/common/dpmodel/test_loss_padding.py and
source/tests/pt/test_loss_padding.py: invariant tests for energy (norm_exp 1
and 2), force_real, and virial; no-op tests for each term; and a guard that
confirms force_mag loss is numerically unchanged when a padding mask is present.
… padding

Energy/force_real/virial MAE branches in EnergySpinLoss (dpmodel and pt) used padded-scalar natoms or unmasked global means, violating the grad-accumulation invariant for mixed_type batches. Apply the same per-frame masked pattern already used for MSE, mirroring the authoritative treatment in ener.py (Task 3). Also fix the mae=True display metric for energy to use per-frame inv weighting instead of padded atom_norm. force_mag/mask_mag and MSE branches are untouched.
The extensive property mask normalization called xp.sum(model_dict["mask"], -1) with axis positional; array_api_compat's torch namespace declares axis keyword-only, so this raised TypeError for every pt_expt extensive-property training. Use axis=-1, matching the dos/tensor/ener losses. Adds a torch-tensor test through PropertyLoss.call that reproduces the crash on the old form.
…pe padding

Apply idiom 1 (per-atom masked mean, ncomp=1) to the has_ae block in both
EnergySpinLoss backends so that ghost-padded atoms (mask=0) are excluded from
the atom_ener loss normalization. Before this fix the loss was computed as
mean(square(diff)) over a flattened [nf*nloc] vector, so ghost atoms diluted
the denominator in mixed_type batches.

The masked path uses xp.reshape(maskf, (_nf, _nloc, 1)) as a broadcastable
column mask, then idiom-1 per-frame sum / dof → mean over frames. The unmasked
(no-mask) fallback is unchanged.

Also add grad-accum invariant tests for: ener_spin.has_ae (mse/mae, RED→GREEN),
gen_force/has_gf (already GREEN — ghost forces are masked before projection so
the mean(square) over [nf, ngen] is frame-decomposable), and force_mag MSE
(GREEN — n_valid excludes ghost atoms via mask_mag=0; invariant holds when NM
is equal across frames, which is the practical spin-model case). force_mag MAE
uses xp.sum over frames instead of xp.mean (2x accumulation artifact, pre-existing,
independent of mask_mag semantics) and is reported as NEEDS_CONTEXT.
Importing deepmd.pt installs a cuda:9999999 default-device trap, so the property torch test failed only when run after the pt test file (order-dependent). Pin device=cpu on the constructed tensors to make the test hermetic. Source behavior unchanged.
@dosubot dosubot Bot added the bug label Jul 5, 2026
@github-actions github-actions Bot added the Python label Jul 5, 2026
@wanghan-iapcm wanghan-iapcm requested a review from njzjz July 5, 2026 16:13
_nloc = maskf.shape[1]
else:
maskf = None
inv = None
else:
maskf = None
inv = None
_nf = None
maskf = None
inv = None
_nf = None
_nloc = None
_nloc = maskf.shape[1]
else:
maskf = None
inv = None
else:
maskf = None
inv = None
_nf = None
Comment thread deepmd/pt/loss/ener.py
else:
maskf = None
inv = None
_nf = None
Comment thread deepmd/pt/loss/ener.py
maskf = None
inv = None
_nf = None
_nloc = None
_nloc = maskf.shape[1]
else:
maskf = None
inv = None
else:
maskf = None
inv = None
_nf = None
maskf = None
inv = None
_nf = None
_nloc = None
@coderabbitai

coderabbitai Bot commented Jul 5, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Loss functions across dpmodel and pt backends (DOS, energy, energy-spin, tensor, property) are refactored to support per-frame atom masking for padded/mixed-type batches. A new pt helper injects atom masks from atype, local losses use per-frame masked reductions, global losses use unweighted means with mask-derived atom counts, and extensive invariant tests are added for both backends.

Changes

Mask-aware loss normalization

Layer / File(s) Summary
Atom mask injection helper
deepmd/pt/loss/loss.py
Adds TaskLoss._inject_atom_mask, reconstructing a per-atom mask from atype (ghost atoms → 0) when absent from predictions.
DOS loss masked normalization
deepmd/dpmodel/loss/dos.py, deepmd/pt/loss/dos.py
Local DOS/CDF use per-frame masked squared-error reductions normalized by masked degrees of freedom; global DOS/CDF use unweighted mean squared error with mask-derived atom_num for RMSE scaling; pt forward wraps model output with _inject_atom_mask.
Tensor loss masked normalization
deepmd/dpmodel/loss/tensor.py, deepmd/pt/loss/tensor.py
Local tensor loss reshapes into per-frame masked mean; global tensor loss becomes a plain mean of squared error with mask used only to derive atom_num.
Energy loss masked normalization
deepmd/dpmodel/loss/ener.py, deepmd/pt/loss/ener.py
Energy, force, virial, atomic energy, atomic-pref, and generalized-force terms branch on a per-frame mask for MSE/MAE/Huber modes; adds custom_huber_loss helper.
EnergySpin loss masked normalization
deepmd/dpmodel/loss/ener_spin.py, deepmd/pt/loss/ener_spin.py
Energy, real-force, atom-energy, and virial terms gain masked/unmasked branches with per-frame normalization by real atom counts.
Property loss masked normalization
deepmd/dpmodel/loss/property.py, deepmd/pt/loss/property.py
Non-intensive property normalization divides by per-frame real atom counts derived from a mask when present, else falls back to natoms.
Grad-accumulation invariant tests
source/tests/common/dpmodel/test_loss_padding.py, source/tests/pt/test_loss_padding.py
Adds shared padded-batch invariant harnesses and comprehensive test coverage across DOS, tensor, energy, energy-spin, and property losses for both backends, plus unit tests for _inject_atom_mask.

Estimated code review effort: 5 (Critical) | ~120 minutes

Possibly related PRs

  • deepmodeling/deepmd-kit#5156: Both PRs modify the spin-energy virial loss computation in deepmd/pt/loss/ener_spin.py.
  • deepmodeling/deepmd-kit#5250: The retrieved PR adds/aligns mask propagation in model outputs that this PR's losses now consume for masked normalization.
  • deepmodeling/deepmd-kit#5345: Both PRs touch the same loss implementations (dos.py, ener.py, ener_spin.py, tensor.py, property.py), with #5345 introducing the losses this PR later refactors.

Suggested reviewers: njzjz, iProzd

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: excluding mixed_type padding atoms from training loss calculations.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
source/tests/pt/test_loss_padding.py (2)

1976-2064: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Known force_mag MAE bug documented but left untested.

The comment acknowledges that force_mag MAE fails the grad-accum invariant by a 2x factor due to frame-sum normalization, but only the MSE variant is exercised as a test — there's no xfail/skip test capturing the MAE case, so this documented gap in deepmd/pt/loss/ener_spin.py isn't tracked by CI and could silently regress or be forgotten.

Consider adding an explicit @pytest.mark.xfail test asserting the known MAE force_mag discrepancy, so the fix (or its continued absence) is tracked rather than only documented in a comment.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@source/tests/pt/test_loss_padding.py` around lines 1976 - 2064, The force_mag
MAE grad-accum discrepancy is only described in comments and not covered by CI.
Add an explicit pytest xfail (or skip) test near
TestPTEnerSpinLossForceMagMSEGradAccum that exercises the MAE path through
_spin_loss_fn/EnergySpinLossPT and asserts the known 2x invariant mismatch, so
the documented behavior is tracked and won’t be forgotten. Keep the existing MSE
test unchanged and use the same make_A, make_B, and make_padded helpers to
locate the relevant loss behavior.

109-122: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Duplicate mock model classes.

_MockModel and _EnerLossMockModel are functionally identical (same __init__/__call__). Consider consolidating into a single shared helper used across all test classes in this file.

♻️ Suggested consolidation
-class _MockModel:
-    """Callable that ignores inputs and returns a fixed model_pred dict.
-    ...
-    """
-
-    def __init__(self, pred: dict):
-        self._pred = pred
-
-    def __call__(self, **kwargs):
-        return dict(self._pred)
-
...
-class _EnerLossMockModel:
-    """Callable returning a fixed model_pred dict (mask pre-populated)."""
-
-    def __init__(self, pred: dict):
-        self._pred = pred
-
-    def __call__(self, **kwargs):
-        return dict(self._pred)
+# Reuse the single _MockModel class defined earlier in this file instead of
+# redefining an identical class for the energy-loss test section.

Also applies to: 628-636

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@source/tests/pt/test_loss_padding.py` around lines 109 - 122, Consolidate the
duplicate test helpers by removing the redundant `_EnerLossMockModel` and
reusing `_MockModel` across the pt loss tests. Keep the shared behavior in one
helper with the same `__init__` and `__call__` contract, and update the affected
test classes in `test_loss_padding` to instantiate that single helper instead of
maintaining two identical mock model classes. Ensure any existing expectations
around the shallow copy behavior and pre-populated `pred` mask remain unchanged.
deepmd/dpmodel/loss/dos.py (1)

133-143: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚖️ Poor tradeoff

Optional: extract the repeated masked per-frame reduction.

This masked "per-frame sum / per-frame dof → mean" idiom is duplicated verbatim in the local CDF block (Lines 160-170) and mirrored in tensor.py and both pt backends. A small shared helper (e.g. masked_per_frame_mse(diff3d, maskf, ndof)) would reduce the maintenance surface. Behavior-preserving; safe to defer.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/dpmodel/loss/dos.py` around lines 133 - 143, The masked per-frame
reduction in the DOS loss is duplicated across the local DOS and local CDF paths
and mirrored in other backends, so extract it into a shared helper to reduce
maintenance. Create a small utility such as masked_per_frame_mse(diff3d, maskf,
ndof) and have the DOS block in dos.py call it instead of inlining the
"per-frame sum / per-frame dof → mean" logic. Keep behavior identical and reuse
the helper wherever the same masked reduction pattern appears, including the
local CDF block and matching implementations in tensor.py and the pt backends.
source/tests/common/dpmodel/test_loss_padding.py (1)

204-207: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Redundant acdf/cdf tests add no independent coverage.

test_acdf_grad_accum_invariant simply re-invokes test_ados_grad_accum_invariant (and test_cdf_grad_accum_invariant at Lines 282-285 re-invokes test_dos_grad_accum_invariant). The acdf/cdf paths are only exercised incidentally because _make_loss sets both *_ados/*_acdf (and *_dos/*_cdf) prefs to 1.0, so these named tests don't isolate the cumulative-distribution path. Consider either isolating the cdf term (e.g., pref only on cdf/acdf) or dropping the alias tests to avoid implying coverage that isn't independent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@source/tests/common/dpmodel/test_loss_padding.py` around lines 204 - 207, The
acdf/cdf grad-accum tests are just aliases of the ados/dos tests and don’t add
independent coverage. Update the relevant test methods in test_loss_padding
(test_acdf_grad_accum_invariant and test_cdf_grad_accum_invariant) so they
either isolate the cumulative-distribution path by configuring _make_loss prefs
for only acdf/cdf, or remove the redundant alias tests altogether. Use the
existing test_ados_grad_accum_invariant and test_dos_grad_accum_invariant
helpers as the reference point when deciding whether to separate coverage or
drop the wrappers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@source/tests/common/dpmodel/test_loss_padding.py`:
- Around line 2011-2019: The `force_mag` MAE path in `ener_spin.py` is still
frame-scaled because it uses a sum over per-frame averages instead of a true
frame-wise mean. Update the MAE computation in the `force_mag` loss logic to
normalize across frames consistently with the other `ener_spin` MAE branches,
using the relevant symbols around the `force_mag` reduction code path. If this
change is intentionally deferred, add a reference to the tracking issue in the
surrounding test/comment; otherwise make the loss independent of `nf` so the 2x
effect disappears.

---

Nitpick comments:
In `@deepmd/dpmodel/loss/dos.py`:
- Around line 133-143: The masked per-frame reduction in the DOS loss is
duplicated across the local DOS and local CDF paths and mirrored in other
backends, so extract it into a shared helper to reduce maintenance. Create a
small utility such as masked_per_frame_mse(diff3d, maskf, ndof) and have the DOS
block in dos.py call it instead of inlining the "per-frame sum / per-frame dof →
mean" logic. Keep behavior identical and reuse the helper wherever the same
masked reduction pattern appears, including the local CDF block and matching
implementations in tensor.py and the pt backends.

In `@source/tests/common/dpmodel/test_loss_padding.py`:
- Around line 204-207: The acdf/cdf grad-accum tests are just aliases of the
ados/dos tests and don’t add independent coverage. Update the relevant test
methods in test_loss_padding (test_acdf_grad_accum_invariant and
test_cdf_grad_accum_invariant) so they either isolate the
cumulative-distribution path by configuring _make_loss prefs for only acdf/cdf,
or remove the redundant alias tests altogether. Use the existing
test_ados_grad_accum_invariant and test_dos_grad_accum_invariant helpers as the
reference point when deciding whether to separate coverage or drop the wrappers.

In `@source/tests/pt/test_loss_padding.py`:
- Around line 1976-2064: The force_mag MAE grad-accum discrepancy is only
described in comments and not covered by CI. Add an explicit pytest xfail (or
skip) test near TestPTEnerSpinLossForceMagMSEGradAccum that exercises the MAE
path through _spin_loss_fn/EnergySpinLossPT and asserts the known 2x invariant
mismatch, so the documented behavior is tracked and won’t be forgotten. Keep the
existing MSE test unchanged and use the same make_A, make_B, and make_padded
helpers to locate the relevant loss behavior.
- Around line 109-122: Consolidate the duplicate test helpers by removing the
redundant `_EnerLossMockModel` and reusing `_MockModel` across the pt loss
tests. Keep the shared behavior in one helper with the same `__init__` and
`__call__` contract, and update the affected test classes in `test_loss_padding`
to instantiate that single helper instead of maintaining two identical mock
model classes. Ensure any existing expectations around the shallow copy behavior
and pre-populated `pred` mask remain unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3120253d-2b09-40b1-9819-187f97feab86

📥 Commits

Reviewing files that changed from the base of the PR and between ffe57a3 and 4990db2.

📒 Files selected for processing (13)
  • deepmd/dpmodel/loss/dos.py
  • deepmd/dpmodel/loss/ener.py
  • deepmd/dpmodel/loss/ener_spin.py
  • deepmd/dpmodel/loss/property.py
  • deepmd/dpmodel/loss/tensor.py
  • deepmd/pt/loss/dos.py
  • deepmd/pt/loss/ener.py
  • deepmd/pt/loss/ener_spin.py
  • deepmd/pt/loss/loss.py
  • deepmd/pt/loss/property.py
  • deepmd/pt/loss/tensor.py
  • source/tests/common/dpmodel/test_loss_padding.py
  • source/tests/pt/test_loss_padding.py

Comment on lines +2011 to +2019
# ---------------------------------------------------------------------------
# force_mag MSE grad-accum invariant (NM equal across frames, ghost atoms non-magnetic)
#
# NOTE: force_mag MAE (dpmodel) uses xp.sum over frames instead of xp.mean,
# so MAE force_mag FAILS the invariant with a 2x factor when frames=2.
# This is a pre-existing frame-normalization artifact independent of ghost-atom
# masking. Ghost atoms are correctly excluded (mask_mag=0 there). The MAE
# artifact is reported as NEEDS_CONTEXT in the audit report.
# ---------------------------------------------------------------------------

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect force_mag MAE normalization in the dpmodel spin loss.
fd -t f 'ener_spin.py' deepmd/dpmodel/loss --exec ast-grep outline {}
rg -nP -C4 'force_mag|mask_mag' deepmd/dpmodel/loss/ener_spin.py

Repository: deepmodeling/deepmd-kit

Length of output: 2756


Normalize force_mag MAE across frames. deepmd/dpmodel/loss/ener_spin.py still computes the MAE term as xp.sum(per_frame_sum_fm / per_frame_count_fm), so the loss scales with the number of frames (the 2× behavior for nf=2 remains). If this is out of scope here, link the tracking issue; otherwise switch it to a frame-wise mean to match the other ener_spin MAE paths.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@source/tests/common/dpmodel/test_loss_padding.py` around lines 2011 - 2019,
The `force_mag` MAE path in `ener_spin.py` is still frame-scaled because it uses
a sum over per-frame averages instead of a true frame-wise mean. Update the MAE
computation in the `force_mag` loss logic to normalize across frames
consistently with the other `ener_spin` MAE branches, using the relevant symbols
around the `force_mag` reduction code path. If this change is intentionally
deferred, add a reference to the tracking issue in the surrounding test/comment;
otherwise make the loss independent of `nf` so the 2x effect disappears.

@njzjz-bot

Copy link
Copy Markdown
Contributor

Thanks for the careful fix. The important invariant here should indeed be frame-wise equivalence: a padded mixed-type batch should give the same loss/gradient as processing the real frames separately and averaging. This PR consistently applies that criterion across the pt/pt_expt loss paths, and the all-ones-mask guards are a good way to protect non-mixed behavior.

A few notes before merging:

  • Please wait for the remaining Python/C++ CI shards to finish; several were still pending when I checked.
  • The documented follow-ups are real behavior gaps, especially the unchanged TF backend and the pt-only dens/population/denoise losses. I would prefer these to have explicit tracking issues or TODO references before merge, so the scope boundary is not only in the PR body.
  • The ener_spin MAE normalization change for non-mixed cases is reasonable as a bug fix, but it is user-visible for existing trainings. It should be called out in the release note/changelog if this PR is merged.
  • I agree with tracking the known force_mag MAE frame-normalization inconsistency separately; an xfail/skip test or linked issue would make that debt harder to lose.

No blocker from my side on the main masking/normalization approach once CI is green and the follow-up tracking is clear.


Authored by OpenClaw 2026.6.8 (844f405) (model: custom-chat-jinzhezeng-group/gpt-5.5)

@njzjz njzjz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve but see above non-blocking comments.

@codecov

codecov Bot commented Jul 5, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.91786% with 85 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.22%. Comparing base (c381452) to head (4990db2).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
deepmd/dpmodel/loss/ener.py 85.04% 32 Missing ⚠️
deepmd/pt/loss/ener.py 86.32% 29 Missing ⚠️
deepmd/pt/loss/ener_spin.py 90.90% 13 Missing ⚠️
deepmd/dpmodel/loss/ener_spin.py 90.59% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5738      +/-   ##
==========================================
- Coverage   81.28%   81.22%   -0.06%     
==========================================
  Files         988      990       +2     
  Lines      110891   111535     +644     
  Branches     4234     4232       -2     
==========================================
+ Hits        90137    90595     +458     
- Misses      19229    19412     +183     
- Partials     1525     1528       +3     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants