24.04_linux-nvidia-6.17-next: MPAM: Please pull arm_mpam: Consider overflow in bandwidth counter state #446

Open
fyu1 wants to merge 868 commits into NVIDIA:24.04_linux-nvidia-6.17-next from fyu1:24.04_linux-nvidia-6.17-next.mpam.extras.fixes3

fyu1 (Collaborator) commented May 28, 2026

Backport this commit to fix a bug: https://nvbugspro.nvidia.com/bug/6207279

The commit is already upstream in 6.19, so 7.0 bos and lts do not have this issue.

Use the overflow status bit to track overflow on each bandwidth counter read and add the counter size to the correction when overflow is detected.

This assumes that only a single overflow has occurred since the last read of the counter. Overflow interrupts, on hardware that supports them, could be used to remove this limitation.
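
As a rough illustration of the correction described above, here is a minimal userspace sketch. It is not the arm_mpam driver code: the struct, the function names and the counter width used in the example are all hypothetical, and it inherits the same single-overflow assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software state kept per bandwidth counter. */
struct bw_counter_state {
	uint64_t correction;	/* accumulated wraparound correction */
};

/* Size of the counter's value space, 2^width. Assumes 0 < width < 64. */
static uint64_t counter_size(unsigned int width)
{
	assert(width > 0 && width < 64);
	return UINT64_C(1) << width;
}

/*
 * Fold one hardware read into the running total. `raw` is the value read
 * from the counter and `overflowed` is the overflow status bit sampled
 * with it. At most one wrap since the previous read is assumed.
 */
static uint64_t bw_counter_read(struct bw_counter_state *st, uint64_t raw,
				bool overflowed, unsigned int width)
{
	if (overflowed)
		st->correction += counter_size(width);

	return st->correction + raw;
}
```

With a hypothetical 31-bit counter, a read that reports overflow adds 2^31 to the correction, and every later read is offset by that amount. Two wraps between reads would still be counted as one, which is the limitation noted above.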

Cc: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>

(backported from commit b353637)

[fenghuay: Fix mem bw monitoring counter overflow issue.

  • Resolve conflict in mpam_msmon_overflow_val();
  • Resolve conflict in __ris_msmon_read(); ]

davejiang and others added 30 commits May 21, 2026 17:27
BugLink: https://bugs.launchpad.net/bugs/2143032

The HPA to DPA translation for poison injection assumes that the
base address starts from where the CXL region begins. When the
extended linear cache is active, the offset can be within the DRAM
region. Adjust the offset so that it correctly reflects the offset
within the CXL region.

[ dj: Add fixes tag from Alison ]

Fixes: c3dd676 ("cxl/region: Add inject and clear poison by region offset")
Link: https://patch.msgid.link/20251031173224.3537030-5-dave.jiang@intel.com
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit b6cfddd)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

The node/zone quirk section of the cxl documentation is incorrect.
The actual reason for fallback allocation misbehavior in the
described configuration is due to a kswapd/reclaim thrashing scenario
fixed by the linked patch.  Remove this section.

Link: https://lore.kernel.org/linux-mm/20250919162134.1098208-1-hannes@cmpxchg.org/
Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 82b5d7e)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

devm_cxl_port_enumerate_dports() is no longer used after
commit 4f06d81 ("cxl: Defer dport allocation for switch ports").

Delete it and the relevant interface implemented in cxl_test.

Signed-off-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 3f5b8f7)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

- Corrected spelling of "bandwdith" -> "bandwidth"
- Fixed "wht" -> "with"

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 040acb4)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again
uses WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_wq should be the per-cpu workqueue, yet in this name nothing makes
that clear, so replace system_wq with system_percpu_wq.

The old wq (system_wq) will be kept for a few release cycles.

See 128ea9f ("workqueue: Add system_percpu_wq and system_dfl_wq")
for cause of changes.

[ dj: Add reference to commit that initiated the change. ]

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20251030163839.307752-1-marco.crivellari@suse.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 952e905)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

In preparation for adding a test module that exercises the address
translation calculations, extract the core calculations into stand-
alone functions that operate on base parameters without dependencies
on struct cxl_region.

Perform additional parameter validation to protect against a test
module sending bad parameters. Export the validation function, as
well as the three core translation functions for use by test module
cxl_translate only.

This refactoring enables unit testing of the address translation logic
with controlled inputs, while preserving identical functionality in
the existing code paths.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit b78b9e7)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

In preparation for adding a test module that can exercise the address
translation functions performed by the CXL Driver, refactor the XOR
implementation like this:

- Extract the core calculation into a standalone helper function,
- Export the new function for use by test module cxl_translate only,
- Enhance the parameter validation since this new function will be
  called from a test module with no guarantee of valid parameters,
- Move the definition of struct cxl_cxims_data to cxl.h so the test module
  can build xormaps.

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 4fe516d)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Commit 4fe516d ("cxl/acpi: Make the XOR calculations available
for testing") split xormap handling code to create a reusable helper
function but inadvertently dropped the check of HBIW values before
dereferencing cxlrd->platform_data. When HBIW is 1 or 3, no xormaps
are needed and platform_data may be NULL, leading to a potential NULL
pointer dereference.

Affects platform configs using XOR Arithmetic with HBIWs of 1 or 3,
when performing DPA->HPA address translation for CXL events. Those
events would be any of poison ops, general media, or dram.

Restore the early return check for HBIW values of 1 and 3 before
dereferencing platform_data.
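
The shape of the restored guard can be sketched in userspace as follows. This is an illustration, not the cxl_acpi code: the demo_ types and the function name are hypothetical, the parity loop is only an assumed stand-in for the xormap calculation, and __builtin_ctzll()/__builtin_popcountll() are GCC/Clang builtins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_XORMAPS 4

/* Hypothetical stand-in for the xormap data hung off platform_data. */
struct demo_cxims_data {
	int nr_maps;
	uint64_t xormaps[DEMO_MAX_XORMAPS];
};

struct demo_root_decoder {
	struct demo_cxims_data *platform_data;	/* may be NULL for 1 or 3 ways */
};

static uint64_t demo_xor_hpa(const struct demo_root_decoder *rd,
			     uint64_t hpa, int hbiw)
{
	const struct demo_cxims_data *cximsd;

	/* No xormaps are needed for 1 or 3 ways: return before the dereference. */
	if (hbiw == 1 || hbiw == 3)
		return hpa;

	cximsd = rd->platform_data;
	assert(cximsd != NULL);

	for (int i = 0; i < cximsd->nr_maps; i++) {
		uint64_t map = cximsd->xormaps[i];
		unsigned int pos;
		uint64_t parity;

		if (!map)
			continue;

		/* Write the parity of the masked bits into the map's lowest bit. */
		pos = (unsigned int)__builtin_ctzll(map);
		parity = (uint64_t)__builtin_popcountll(hpa & map) & 1;
		hpa = (hpa & ~(UINT64_C(1) << pos)) | (parity << pos);
	}

	return hpa;
}
```

Calling demo_xor_hpa() with a NULL platform_data and an HBIW of 1 or 3 returns the address unchanged, which is the case the dropped check used to cover.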

Fixes: 4fe516d ("cxl/acpi: Make the XOR calculations available for testing")
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20260109194946.431083-1-alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 49d1063)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Add a loadable test module that validates CXL address translation
calculations using parameterized test vectors. The module tests both
host-to-device and device-to-host address translations for Modulo and
XOR interleave arithmetic.

Two types of testing are provided:

1. Parameterized test vectors:
   Test vectors are passed as module parameters in the format:
	"dpa pos r_eiw r_eig hb_ways math expected_spa".
   Round-trip validation is performed:
   - Translate a DPA and position to a SPA
   - Verify the result matches expected SPA
   - Translate that SPA back to a DPA and position
   - Verify round-trip consistency

2. Internal validation testing:
   When no test vectors are provided, the module performs validation
   of the translation functions by checking parameter boundaries and
   running 10,000 iterations of randomly generated valid parameters
   to exercise the core calculation functions.

The module uses the CXL Driver translation functions through symbols
exported exclusively for cxl_translate.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 06377c5)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

The cxl_acpi module spams "Extended linear cache calculation failed"
when the hmat memory target is not found for a node. This is normal
when the memory target does not contain extended linear cache
attributes. Adjust cxl_acpi_set_cache_size() to just return 0 if
hmat_get_extended_linear_cache_size() returns an error; the only error
it returns is -ENOENT.

Also remove the check for -EOPNOTSUPP in cxl_setup_extended_linear_cache()
since that errno is never returned by cxl_acpi_set_cache_size().

[dj: Flipped minor return logic suggested by Jonathan ]
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20251003185509.3215900-1-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit f0c5d3b)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Add a region sysfs attribute to show the size of the extended linear
cache if there is any. The attribute is invisible when the cache
size is 0, which indicates it does not exist.

Moved the cxl_region_visible() location in order to pick up the
new sysfs attribute definition.

[ dj: Fixed spelling errors noted by Benjamin ]

Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20251022203052.4078527-1-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit d6602e2)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Failing the first sysfs_update_group() needs to explicitly
kfree the resource as it is too early for cxl_region_iomem_release()
to do so.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Fixes: d6602e2 ("cxl/region: Add support to indicate region has extended linear cache")
Link: https://patch.msgid.link/20260202191330.245608-1-dave@stgolabs.net
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 77b310b)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

The size of this type is architecture specific, and the recommended
way to print it portably is through the custom %pap format string.

Fixes: d6602e2 ("cxl/region: Add support to indicate region has extended linear cache")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20251204095237.1032528-1-arnd@kernel.org
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 88c72ba)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

When a decoder is locked, it means that its configuration cannot be
changed. CXL spec r3.2 8.2.4.20.13 discusses the details regarding
locked decoders. Locking happens when bit 8 of the decoder control
register is set and then the decoder is committed afterwards (CXL
spec r3.2 8.2.4.20.7).

Given that the driver creates a virtual decoder for each CFMWS, the
Fixed Device Configuration (bit 4) of the Window Restriction field is
considered as locking for the virtual decoder by the driver.

The current driver code disregards the locked status and a region can
be destroyed regardless of the locking state.

Add a region flag to indicate the region is in a locked configuration.
The driver will consider a region locked if the CFMWS or any decoder
is configured as locked. The consideration is all or nothing regarding
the locked state. It is reasonable to determine the region "locked"
status while the region is being assembled based on the decoders.

Add a check in region commit_store() to intercept when a 0 is written
to the commit sysfs attribute in order to prevent the destruction of a
region when in locked state. This should be the only entry point from user
space to destroy a region.

Add a check to cxl_decoder_reset() to prevent resetting a locked
decoder within the kernel driver.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20251105201826.2901915-1-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 2230c4b)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

The CXL decoder flags are defined as bitmasks, not bit indices.
Using test_bit() to check them interprets the mask value as a bit
index, which is the wrong test.

For CXL_DECODER_F_LOCK the test reads beyond the defined bits, causing
the test to always return false and allowing resets that should have
been blocked.

Replace test_bit() with a bitmask check.
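
The difference between the two checks can be reproduced in userspace. The sketch below is an illustration under assumptions: demo_test_bit() is a simplified stand-in for the kernel's test_bit(), and DEMO_F_LOCK with its value of bit 4 is a hypothetical flag chosen for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define DEMO_BIT(n)	(1UL << (n))
#define DEMO_F_LOCK	DEMO_BIT(4)	/* hypothetical flag: a mask, value 0x10 */

static_assert(DEMO_F_LOCK == 0x10, "the flag is a mask, not a bit index");

/* Userspace stand-in for test_bit(): nr is a bit index into addr[]. */
static bool demo_test_bit(unsigned long nr, const unsigned long *addr)
{
	const unsigned long bits = sizeof(unsigned long) * CHAR_BIT;

	return (addr[nr / bits] >> (nr % bits)) & 1UL;
}

/* Wrong: passes the mask (16) as an index, so bit 16 is tested. */
static bool is_locked_wrong(unsigned long flags)
{
	return demo_test_bit(DEMO_F_LOCK, &flags);
}

/* Right: AND the flags word with the mask. */
static bool is_locked(unsigned long flags)
{
	return (flags & DEMO_F_LOCK) != 0;
}
```

With only the lock flag set, is_locked() reports true while is_locked_wrong() reports false, because no flag is defined at bit 16.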

Fixes: 2230c4b ("cxl: Add handling of locked CXL decoder")
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Tested-by: Gregory Price <gourry@gourry.net>
Link: https://patch.msgid.link/98851c4770e4631753cf9f75b58a3a6daeca2ea2.1771873256.git.alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(backported from commit 0a70b7c)
[jan: Resolve conflict in cxl_region_setup_flags() for handling CXL_DECODER_F_LOCK check]
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

With the current code flow, once the generic target is updated
target->registered is set and the remaining code is skipped.
So return immediately instead of going through the checks and
then skipping.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20251105235115.85062-2-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 15e1426)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

The function name region_res_match_cxl_range() does not accurately
convey the operation of address comparison with cache size. Rename
to spa_maps_hpa() to provide a better function name.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/linux-cxl/68eea19c7e67e_2f899100a8@dwillia2-mobl4.notmuch/
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Link: https://patch.msgid.link/20251106170108.1468304-2-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit c43521b)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Update the comment in spa_maps_hpa() to clearly convey the construction
of extended linear cache.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/linux-cxl/68eea19c7e67e_2f899100a8@dwillia2-mobl4.notmuch/
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Link: https://patch.msgid.link/20251106170108.1468304-3-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 8d27dd0)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

A root decoder's callback handlers are collected in struct cxl_rd_ops.
The structure is dynamically allocated, though it contains only a few
pointers. This also requires checking two pointers to test for
the existence of a callback.

Simplify the allocation, release and handler check by embedding the
ops statically in struct cxl_root_decoder.

Implementation is equivalent to how struct cxl_root_ops handles the
callbacks.

[ dj: Fix spelling error in commit log. ]

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Robert Richter <rrichter@amd.com>
Link: https://patch.msgid.link/20251114075844.1315805-2-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(backported from commit 6123133)
[jan: Resolve minor conflict due to code lines shift]
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Simplify the XOR arithmetic setup code by grouping it in a single
block. No need to split the block for QoS setup.

It is safe to reorder the call of cxl_setup_extended_linear_cache()
because there are no dependencies.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Robert Richter <rrichter@amd.com>
Tested-by: Gregory Price <gourry@gourry.net>
Link: https://patch.msgid.link/20251114075844.1315805-3-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit c42a4d2)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Simplify the code by removing local variable @inc. The variable is not
used elsewhere, so remove it and increment the target number directly.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Robert Richter <rrichter@amd.com>
Link: https://patch.msgid.link/20251114075844.1315805-4-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 7e71fa6)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Create a global define for the size of the mock CXL auto region used
in cxl_test. Remove the declared size in mock_init_hdm_decoder()
function.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
Link: https://patch.msgid.link/20251117144611.903692-2-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit fa59c35)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Add a module parameter to allow activation of extended linear cache
on the auto region for cxl_test. The current platform implementation
for extended linear cache is 1:1 of DRAM and CXL memory. A CFMWS is
created with the size of both memory together where DRAM takes the
first part of the memory range and CXL covers the second part. The
current CXL auto region on cxl_test consists of two 256M devices that
create a 512M region. The new extended linear cache setup will have
512M DRAM and 512M CXL memory for a total of 1G CFMWS. The hardware
decoders must have their starting offset moved to after the DRAM region
to handle the CXL regions.
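
The sizes above follow from simple arithmetic, sketched here in userspace. The struct, the function name and the window base address used in the example are hypothetical; only the 1:1 ratio and the two 256M devices come from the description.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SZ_256M	(UINT64_C(256) << 20)

static_assert(DEMO_SZ_256M == 0x10000000, "256M expressed in bytes");

/* Hypothetical description of the mock window layout. */
struct demo_elc_layout {
	uint64_t dram_size;	/* DRAM, first part of the window */
	uint64_t cxl_size;	/* CXL memory, second part of the window */
	uint64_t window_size;	/* CFMWS size covering both parts */
	uint64_t decoder_base;	/* where the hardware decoders start */
};

static struct demo_elc_layout demo_elc_compute(uint64_t window_base,
					       unsigned int nr_devs,
					       uint64_t dev_size)
{
	struct demo_elc_layout l;

	l.cxl_size = (uint64_t)nr_devs * dev_size;
	l.dram_size = l.cxl_size;		/* 1:1 DRAM to CXL memory */
	l.window_size = l.dram_size + l.cxl_size;
	l.decoder_base = window_base + l.dram_size;

	return l;
}
```

For two 256M devices this gives 512M of CXL memory, 512M of DRAM and a 1G window, with the decoders starting 512M past the window base.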

[ dj: Fixup commenting style. (Jonathan) ]

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
Link: https://patch.msgid.link/20251117144611.903692-3-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(backported from commit 4b1c046)
[jan: Resolve minor conflict due to code line "base = window->base_hpa" being moved]
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Add the mock wrappers for hmat_get_extended_linear_cache_size() in order
to emulate the ACPI helper function for the regions that are mock'd by
cxl_test.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
Link: https://patch.msgid.link/20251117144611.903692-4-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 68f4a85)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Since commit 733b57f ("cxl/pci: Early setup RCH dport component registers from RCRB")
is not necessary under mocking tests.

[ dj: Fixup commit representation flagged by checkpatch. ]
[ dj: Amend subject line to indicate which function. ]

Signed-off-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://patch.msgid.link/20251118182202.2083244-1-alejandro.lucero-palau@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 26c5b0d)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Commit 364ee9f ("cxl/test: Enhance event testing") changed the
loop iterator in mock_get_event() from a static constant,
CXL_TEST_EVENT_CNT, to a dynamic global variable, ret_limit. The
intent was to vary the number of events returned per call to simulate
events occurring while logs are being read.

However, ret_limit is modified without synchronization. When multiple
threads call mock_get_event() concurrently, one thread may read
ret_limit, another thread may increment it, and the first thread's
loop condition and size calculation see and use the updated value.

This is visible during cxl_test module load when all memdevs are
initializing simultaneously, which includes getting event records. It
is not tied to the cxl-events.sh unit test specifically, as that
operates on a single memdev.

While no actual harm results (the buffer is always large enough and
the record count fields correctly reflect what was written), this is
a correctness issue. The race creates an inconsistent state within
mock_get_event() and adding variability based on a race appears
unintended.

Make ret_limit a local variable populated from an atomic counter. Each
call gets a stable value that won't change during execution. That
preserves the intended behavior of varying the return counts across
calls while eliminating the race condition.

This implementation uses "+ 1" to produce the full range of 1 to
CXL_TEST_EVENT_RET_MAX (4) records. Previously only 1, 2, 3 were
produced.
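
The idea of taking a per-call limit from an atomic counter can be sketched with C11 atomics. This is a userspace illustration, not the cxl_test code: the demo_ names are hypothetical and the kernel's atomic API differs from <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_EVENT_RET_MAX 4	/* mirrors the stated maximum of 4 records */

static_assert(DEMO_EVENT_RET_MAX > 0, "modulo needs a non-zero maximum");

/* Shared counter; static storage starts at zero. */
static atomic_uint demo_event_counter;

/* Each caller gets its own stable limit in the range 1..DEMO_EVENT_RET_MAX. */
static unsigned int demo_next_ret_limit(void)
{
	/* atomic_fetch_add() returns the value held before the increment. */
	unsigned int ticket = atomic_fetch_add(&demo_event_counter, 1);

	return (ticket % DEMO_EVENT_RET_MAX) + 1;
}
```

Because the limit is held in a local once computed, a concurrent caller bumping the counter cannot change it mid-loop. The "+ 1" maps the modulo result 0..3 onto 1..4, so the full range is reachable.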

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20251116013819.1713780-1-alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit b6369da)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

mock_get_event() uses an uninitialized local variable, nr_overflow, to
populate the overflow_err_count field. That results in incorrect
overflow_err_count values in mocked cxl_overflow trace events, such as
this case where the records are reported as 0 and should be non-zero:

[] cxl_overflow: memdev=mem7 host=cxl_mem.6 serial=7: log=Failure : 0 records from 1763228189130895685 to 1763228193130896180

Fix by using log->nr_overflow and remove the unused local variable.

A follow-up change was considered in cxl_mem_get_records_log() to
confirm that the overflow_err_count is non-zero when the overflow flag
is set [1]. Since the driver has no functional dependency on this
constraint, and a device that violates this specific requirement does
not cause incorrect driver behavior, no validation check is added.

[1] CXL 3.2, Table 8-65 Get Event Records Output Payload

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20251116013036.1713313-1-alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit f1840ef)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Holding a reference to a device does not prevent its driver data from
going away so there is no point in keeping the reference after looking
up the sart device.

Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Sven Peter <sven@kernel.org>
(cherry picked from commit f95f3bc)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2143032

Simplify the canvas lookup error handling by dropping the OF node
reference sooner.

Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Link: https://patch.msgid.link/20250926142454.5929-3-johan@kernel.org
Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
(cherry picked from commit 075daf2)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
…ire SoC

BugLink: https://bugs.launchpad.net/bugs/2143032

"mss-top-sysreg" contains clocks, pinctrl, resets, an interrupt controller
and more. At this point, only the reset controller child is described as
that's all that is described by the existing bindings.
The clock controller already has a dedicated node, and will retain it as
there are other clock regions, so like the mailbox, a compatible-based
lookup of the syscon is sufficient to keep the clock driver working as
before, so no child is needed. There's also an interrupt multiplexing
service provided by this syscon, for which there is work in progress at
[1].

Link: https://lore.kernel.org/linux-gpio/20240723-uncouple-enforcer-7c48e4a4fefe@wendy/ [1]
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
(cherry picked from commit feaa716)
Signed-off-by: Jiandi An <jan@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
mrutland-arm and others added 4 commits June 12, 2026 11:46
BugLink: https://bugs.launchpad.net/bugs/2156557

Add cputype definitions for C1-Premium. These will be used for errata
detection in subsequent patches.

These values can be found in the C1-Premium TRM:

  https://developer.arm.com/documentation/109416/0100/

... in section A.5.1 ("MIDR_EL1, Main ID Register").

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
(backported from commit d28413b linux-next)
[mochs: Minor context adjustment due to absent definitions]
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2156557

A number of CPUs developed by Arm suffer from errata whereby a broadcast
TLBI;DSB sequence may complete before the global observation of writes
which are translated by an affected TLB entry.

These errata ONLY affect the completion of memory accesses which have
been translated by an invalidated TLB entry, and these errata DO NOT
affect the actual invalidation of TLB entries. TLB entries are removed
correctly.

This issue has been assigned CVE ID CVE-2025-10263.

To mitigate this issue, Arm recommends that software follows any
affected TLBI;DSB sequence with an additional TLBI;DSB, which will
ensure that all memory write effects affected by the first TLBI have
been globally observed. The additional TLBI can use any operation that
is broadcast to affected CPUs, and the additional DSB can use any option
that is sufficient to complete the additional TLBI.

The ARM64_WORKAROUND_REPEAT_TLBI workaround is sufficient to mitigate
the issue. Enable this workaround for affected CPUs, and update the
silicon errata documentation accordingly.

Note that due to the manner in which Arm develops IP and tracks errata,
some CPUs share a common erratum number.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
(backported from commit cfd391e linux-next)
[mochs: Minor context adjustment due to absent definitions]
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2156557

NVIDIA Olympus cores are affected by the TLBI completion issue tracked as
CVE-2025-10263. The existing ARM64_ERRATUM_4118414 handling already uses
ARM64_WORKAROUND_REPEAT_TLBI to issue an additional broadcast TLBI;DSB
sequence and ensure affected memory write effects are globally observed.

Add MIDR_NVIDIA_OLYMPUS to the repeat-TLBI match list so the same
mitigation is enabled on affected Olympus systems. Also document the
NVIDIA Olympus erratum in the arm64 silicon errata table and list it in
the Kconfig help text.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
(cherry picked from commit ec7216f linux-next)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
Enable ARM64_ERRATUM_4118414 to mitigate CVE-2025-10263 on NVIDIA platforms.
BugLink: https://bugs.launchpad.net/bugs/2156557

Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
@fyu1

fyu1 commented Jun 15, 2026

Collaborator Author

@fyu1 Codex is calling out these findings...

Line numbers below are from remotes/fyu1/24.04_linux-nvidia-6.17-next.mpam.extras.fixes3 at 65dbf0f.

  1. Overflow increment is off by one

At drivers/resctrl/mpam_devices.c:1383-1392:

    case mpam_feat_msmon_mbwu_63counter:
          return GENMASK_ULL(62, 0);
    case mpam_feat_msmon_mbwu_44counter:
          return GENMASK_ULL(43, 0);
    case mpam_feat_msmon_mbwu_31counter:
          return GENMASK_ULL(30, 0);

Those are maximum counter values: 2^63 - 1, 2^44 - 1, 2^31 - 1.

But after this patch, the value is added directly on overflow at :1488-1489:

    if (overflow)
          mbwu_state->correction += mpam_msmon_overflow_val(m->type);

The upstream commit adds the counter modulus instead: for the 31-bit counter, upstream computes BIT_ULL(31), i.e. 2^31, not GENMASK_ULL(30, 0). So every detected overflow undercounts by one counter unit. For 44/63-bit downstream counters, the same logic should be BIT_ULL(44) and BIT_ULL(63).

Fixed.
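
The off-by-one in finding 1 can be reproduced in isolation. The following is a minimal sketch, not the driver code: `BIT_ULL()` and `GENMASK_ULL()` are local stand-ins for the kernel macros, and `struct mbwu_model`, `counter_max()`, `counter_modulus()` and `model_read()` are hypothetical names invented for illustration. It models the running total as `correction + now` and shows what happens when a wrap adds the maximum value instead of the modulus.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel's BIT_ULL() and GENMASK_ULL() macros. */
#define BIT_ULL(n)        (1ULL << (n))
#define GENMASK_ULL(h, l) ((~0ULL >> (63 - (h))) & (~0ULL << (l)))

/* Hypothetical model of the per-monitor state: only the correction term. */
struct mbwu_model {
	uint64_t correction;
};

/* Largest value an n-bit counter can hold: 2^n - 1. */
static uint64_t counter_max(unsigned int width)
{
	assert(width >= 1 && width <= 63);
	return GENMASK_ULL(width - 1, 0);
}

/* Number of distinct values of an n-bit counter (its modulus): 2^n. */
static uint64_t counter_modulus(unsigned int width)
{
	assert(width >= 1 && width <= 63);
	return BIT_ULL(width);
}

/*
 * Fold one counter read into the running total. If the overflow status
 * bit was set, wrap_add is added to the correction first.
 */
static uint64_t model_read(struct mbwu_model *m, uint64_t now, int overflow,
			   uint64_t wrap_add)
{
	if (overflow)
		m->correction += wrap_add;
	return m->correction + now;
}
```

With a 31-bit counter sitting at its maximum, one more event wraps it to 0. Adding `counter_modulus(31)` gives a total of 2^31, one more than before the wrap. Adding `counter_max(31)` gives 2^31 - 1, the same total as before the wrap, so the event is lost.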

  2. T241 scaling was lost for overflow correction

Existing downstream code scales the sampled MBWU value at :1480-1481:

    if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc))
          now *= 64;

Before this backport, overflow correction also applied the same scale:

    overflow_val = mpam_msmon_overflow_val(m->type);
    if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc))
          overflow_val *= 64;
    overflow_val -= mbwu_state->prev_val;

The patch removed that path and now adds the unscaled helper result directly. That mixes units: now is bytes on T241, while correction is raw counter ticks. A 31-bit overflow should add roughly 2^31 * 64 bytes, but the current code adds only about 2^31.

The fix should make the overflow correction use the same unit as now.
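
The unit mismatch in finding 2 can be checked with a small model. This is a sketch under stated assumptions rather than the actual fix: `struct msc_model` and `scale()` are hypothetical stand-ins for the `mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc)` check, and the two `total_*()` helpers are invented names. A 31-bit counter is assumed throughout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define T241_SCALE 64ULL /* scale factor quoted in the review */

/* Hypothetical stand-in for the MSC and its T241 quirk flag. */
struct msc_model {
	bool t241_scale_64;
};

/* Convert raw counter ticks to bytes, applying the quirk when set. */
static uint64_t scale(const struct msc_model *msc, uint64_t raw)
{
	assert(msc);
	return msc->t241_scale_64 ? raw * T241_SCALE : raw;
}

/* Sample and overflow correction both pass through scale(). */
static uint64_t total_consistent(const struct msc_model *msc,
				 uint64_t raw_now, bool overflow)
{
	uint64_t correction = overflow ? scale(msc, 1ULL << 31) : 0;

	return correction + scale(msc, raw_now);
}

/* Sample is scaled, overflow correction is left in raw ticks. */
static uint64_t total_mixed(const struct msc_model *msc,
			    uint64_t raw_now, bool overflow)
{
	uint64_t correction = overflow ? (1ULL << 31) : 0;

	return correction + scale(msc, raw_now);
}
```

With the quirk set, one wrap followed by a raw reading of 1 yields (2^31 + 1) * 64 bytes in the consistent version but only 2^31 + 64 in the mixed version. Without the quirk the two agree.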

  3. Long MBWU overflow status is ignored

This branch has long MBWU counter support and defines two relevant status bits:

    MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L    /* bit 15 */
    MSMON_CFG_x_CTL_OFLOW_STATUS         /* bit 26 */

clean_msmon_ctl_val() already knows about the long-counter bit and clears it for config comparison at :1337-1343:

    *cur_ctl &= ~MSMON_CFG_x_CTL_OFLOW_STATUS;

    if (FIELD_GET(MSMON_CFG_x_CTL_TYPE, *cur_ctl) == MSMON_CFG_MBWU_CTL_TYPE_MBWU)
          *cur_ctl &= ~MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L;

But the new overflow detection only checks bit 26 at :1435:

overflow = cur_ctl & MSMON_CFG_x_CTL_OFLOW_STATUS;

So a long-counter overflow signaled only by MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L will not increment correction, and because overflow is false, the code also will not write the cleaned control value back to clear the sticky status bit.

The backport needs to treat OFLOW_STATUS_L as overflow for the 44/63-bit MBWU paths.
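
One possible shape for the check in finding 3 is sketched below. It is illustrative only: the bit positions (15 and 26) come from the review text, while `OFLOW_STATUS`, `OFLOW_STATUS_L`, `enum counter_width`, `overflow_seen()` and `clear_overflow()` are local, hypothetical names and not the driver's API. The sketch treats either bit as overflow for the long counters, which is one of the variants discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as quoted in the review; the names are local stand-ins. */
#define OFLOW_STATUS   (1U << 26) /* MSMON_CFG_x_CTL_OFLOW_STATUS */
#define OFLOW_STATUS_L (1U << 15) /* MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L */

/* Hypothetical counter widths mirroring the 31/44/63-bit MBWU features. */
enum counter_width { COUNTER_31, COUNTER_44, COUNTER_63 };

/* Return true if the control value reports an overflow for this width. */
static bool overflow_seen(uint32_t ctl, enum counter_width w)
{
	uint32_t mask = OFLOW_STATUS;

	assert(w == COUNTER_31 || w == COUNTER_44 || w == COUNTER_63);

	/* Long counters may signal overflow through the _L bit as well. */
	if (w == COUNTER_44 || w == COUNTER_63)
		mask |= OFLOW_STATUS_L;

	return (ctl & mask) != 0;
}

/* Value to write back so that both sticky status bits are cleared. */
static uint32_t clear_overflow(uint32_t ctl)
{
	return ctl & ~(OFLOW_STATUS | OFLOW_STATUS_L);
}
```

A control value with only bit 15 set is reported as an overflow for the 44-bit and 63-bit widths but not for the 31-bit width, and `clear_overflow()` leaves all other control bits untouched.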

  4. Dead/stale state remains after removing prev-value overflow detection

At :1405:

u64 now, overflow_val = 0;

overflow_val is no longer used after the patch. That should produce an unused-variable warning.
Fixed.

Also, struct msmon_mbwu_state still says prev_val is “Used to detect overflow” at drivers/resctrl/mpam_internal.h:325-326, and write_msmon_ctl_flt_vals() still resets it at mpam_devices.c:1374-1375. But this patch removed the only read of prev_val. That is no longer functionally harmful by itself, but it is stale backport debris and makes the state model misleading.
Won't fix since it's harmless.

@fyu1

fyu1 commented Jun 15, 2026

Collaborator Author

@fyu1 Codex is calling out these findings...
Line numbers below are from remotes/fyu1/24.04_linux-nvidia-6.17-next.mpam.extras.fixes3 at 65dbf0f.

  1. Overflow increment is off by one

At drivers/resctrl/mpam_devices.c:1383-1392:
    case mpam_feat_msmon_mbwu_63counter:
          return GENMASK_ULL(62, 0);
    case mpam_feat_msmon_mbwu_44counter:
          return GENMASK_ULL(43, 0);
    case mpam_feat_msmon_mbwu_31counter:
          return GENMASK_ULL(30, 0);
Those are maximum counter values: 2^63 - 1, 2^44 - 1, 2^31 - 1.
But after this patch, the value is added directly on overflow at :1488-1489:
    if (overflow)
          mbwu_state->correction += mpam_msmon_overflow_val(m->type);
The upstream commit adds the counter modulus instead: for the 31-bit counter, upstream computes BIT_ULL(31), i.e. 2^31, not GENMASK_ULL(30, 0). So every detected overflow undercounts by one counter unit. For 44/63-bit downstream counters, the same logic should be BIT_ULL(44) and BIT_ULL(63).

  2. T241 scaling was lost for overflow correction

Existing downstream code scales the sampled MBWU value at :1480-1481:
    if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc))
          now *= 64;
Before this backport, overflow correction also applied the same scale:
    overflow_val = mpam_msmon_overflow_val(m->type);
    if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc))
          overflow_val *= 64;
    overflow_val -= mbwu_state->prev_val;
The patch removed that path and now adds the unscaled helper result directly. That mixes units: now is bytes on T241, while correction is raw counter ticks. A 31-bit overflow should add roughly 2^31 * 64 bytes, but the current code adds only about 2^31.
The fix should make the overflow correction use the same unit as now.

  3. Long MBWU overflow status is ignored

This branch has long MBWU counter support and defines two relevant status bits:
    MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L    /* bit 15 */
    MSMON_CFG_x_CTL_OFLOW_STATUS         /* bit 26 */
clean_msmon_ctl_val() already knows about the long-counter bit and clears it for config comparison at :1337-1343:
    *cur_ctl &= ~MSMON_CFG_x_CTL_OFLOW_STATUS;
    if (FIELD_GET(MSMON_CFG_x_CTL_TYPE, *cur_ctl) == MSMON_CFG_MBWU_CTL_TYPE_MBWU)
          *cur_ctl &= ~MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L;
But the new overflow detection only checks bit 26 at :1435:
overflow = cur_ctl & MSMON_CFG_x_CTL_OFLOW_STATUS;
So a long-counter overflow signaled only by MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L will not increment correction, and because overflow is false, the code also will not write the cleaned control value back to clear the sticky status bit.
The backport needs to treat OFLOW_STATUS_L as overflow for the 44/63-bit MBWU paths.

  4. Dead/stale state remains after removing prev-value overflow detection

At :1405:
u64 now, overflow_val = 0;
overflow_val is no longer used after the patch. That should produce an unused-variable warning.
Also, struct msmon_mbwu_state still says prev_val is “Used to detect overflow” at drivers/resctrl/mpam_internal.h:325-326, and write_msmon_ctl_flt_vals() still resets it at mpam_devices.c:1374-1375. But this patch removed the only read of prev_val. That is no longer functionally harmful by itself, but it is stale backport debris and makes the state model misleading.

Claude missed 1, but also flagged 2-4.

Additionally, the commit message ought to have your SOB go after the backport notes:

...
[fenghuay: Fix mem bw monitoring counter overflow issue.
 - Resolve conflict in mpam_msmon_overflow_val();
 - Resolve conflict in __ris_msmon_read();
]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>

Fixed.

@fyu1 fyu1 force-pushed the 24.04_linux-nvidia-6.17-next.mpam.extras.fixes3 branch 3 times, most recently from 10c5aaa to d841a88 Compare June 15, 2026 16:51
@fyu1

fyu1 commented Jun 15, 2026

Collaborator Author

Rebased. No code change.

@clsotog

clsotog commented Jun 15, 2026

Collaborator

This is a finding from Codex:
drivers/resctrl/mpam_devices.c:1415 declares overflow without initializing it. Initialize overflow = false.

@nvmochs

nvmochs commented Jun 15, 2026

Collaborator

@fyu1

In addition to the initialization issue that Carol pointed out, my review also popped up these findings:

  • Medium: T241 63-bit overflow correction is still inconsistent if the 63-bit path is used. now is scaled for all MBWU widths at mpam_devices.c:1502-1503, but mpam_msmon_overflow_val() skips * 64 for mpam_feat_msmon_mbwu_63counter at :1403-1405. That mixes byte-scaled samples with raw-count overflow correction for 63-bit counters. Either both now and overflow correction should scale for 63-bit, or neither should.

More detail...

› In the final branch, for T241 MBWU reads:

    if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc))
          now *= 64;

That scales now for all MBWU counter widths, including mpam_feat_msmon_mbwu_63counter.

But overflow correction does:

    if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc) &&
        type != mpam_feat_msmon_mbwu_63counter)
          overflow_val *= 64;

So for 63-bit counters specifically:

now = scaled by 64
overflow correction = not scaled by 64


  • Medium/Low: the final branch still prefers 44-bit counters over 63-bit counters at mpam_devices.c:1574-1577. Upstream 9e5afb7 prefers mpam_feat_msmon_mbwu_63counter first. Since probe sets 44-bit when long counters exist and then also sets 63-bit when LWD exists, this branch will normally never choose 63-bit via the generic MBWU path. If intentional, it should be called out in the backport notes; otherwise swap the order.

Lastly, it appears Jamie's comment about the ordering of the annotation notes in relation to your SOB has not been addressed. I'm still seeing the notes come after the SOB, when they should come after the pick tag but before the SOB.

Example of the ordering I was anticipating:

   (backported from commit 9e5afb7c32830bcd123976a7729ef4e2dff0cd77)
    [fenghuay: Partial port; NV tree already has long-counter read paths,
     mpam_msc_read_mbwu_l(), and T241-MPAM-6 workaround (d7c811dd0171).
     This commit ports per-width overflow correction, OFLOW_STATUS_L overflow
     detection, and T241 overflow correction scaling via mpam_msmon_overflow_val().
     Differences from upstream 9e5afb7/dc48eb1:
     - 44/63-bit overflow: check OFLOW_STATUS_L | OFLOW_STATUS; upstream
       9e5afb7 uses OFLOW_STATUS_L only.
     - T241 now *= 64 applies to all counter widths; upstream dc48eb1 skips 63-bit.
    ]
    Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>

@fyu1

fyu1 commented Jun 15, 2026

Collaborator Author

To clarify, this is a patch from v6.19 (not v6.18) and was part of the "[PATCH v6 00/34] arm_mpam: Add basic mpam driver".

@fyu1 Are there other patches from this series that are missing from the 6.17-HWE kernel?

One more patch has been added to support the main patch. The other patches are cosmetic and do not need to be ported.

@fyu1 fyu1 force-pushed the 24.04_linux-nvidia-6.17-next.mpam.extras.fixes3 branch from d841a88 to 8e479a9 Compare June 16, 2026 23:16
@fyu1

fyu1 commented Jun 16, 2026

Collaborator Author

This is a finding from Codex: drivers/resctrl/mpam_devices.c:1415 declares overflow without initializing it. Initialize overflow = false.

Fixed.

@fyu1

fyu1 commented Jun 16, 2026

Collaborator Author

@fyu1

In addition to the initialization issue that Carol pointed out, my review also popped up these findings:

  • Medium: T241 63-bit overflow correction is still inconsistent if the 63-bit path is used. now is scaled for all MBWU widths at mpam_devices.c:1502-1503, but mpam_msmon_overflow_val() skips * 64 for mpam_feat_msmon_mbwu_63counter at :1403-1405. That mixes byte-scaled samples with raw-count overflow correction for 63-bit counters. Either both now and overflow correction should scale for 63-bit, or neither should.

More detail...

› In the final branch, for T241 MBWU reads:

    if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc))
          now *= 64;

That scales now for all MBWU counter widths, including mpam_feat_msmon_mbwu_63counter.

But overflow correction does:

    if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc) &&
        type != mpam_feat_msmon_mbwu_63counter)
          overflow_val *= 64;

So for 63-bit counters specifically:

now = scaled by 64
overflow correction = not scaled by 64

Fixed.

  • Medium/Low: the final branch still prefers 44-bit counters over 63-bit counters at mpam_devices.c:1574-1577. Upstream 9e5afb7 prefers mpam_feat_msmon_mbwu_63counter first. Since probe sets 44-bit when long counters exist and then also sets 63-bit when LWD exists, this branch will normally never choose 63-bit via the generic MBWU path. If intentional, it should be called out in the backport notes; otherwise swap the order.

Fixed.

Lastly, it appears Jamie's comment about the ordering of the annotation notes in relation to your SOB has not been addressed. I'm still seeing the notes come after the SOB, when they should come after the pick tag but before the SOB.

Example of the ordering I was anticipating:

   (backported from commit 9e5afb7c32830bcd123976a7729ef4e2dff0cd77)
    [fenghuay: Partial port; NV tree already has long-counter read paths,
     mpam_msc_read_mbwu_l(), and T241-MPAM-6 workaround (d7c811dd0171).
     This commit ports per-width overflow correction, OFLOW_STATUS_L overflow
     detection, and T241 overflow correction scaling via mpam_msmon_overflow_val().
     Differences from upstream 9e5afb7/dc48eb1:
     - 44/63-bit overflow: check OFLOW_STATUS_L | OFLOW_STATUS; upstream
       9e5afb7 uses OFLOW_STATUS_L only.
     - T241 now *= 64 applies to all counter widths; upstream dc48eb1 skips 63-bit.
    ]
    Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>

Fixed.

@fyu1

fyu1 commented Jun 16, 2026

Collaborator Author

The updated PR addressed comments from Mat and Carol.

Please review and merge it if it's good.

Thanks.

-Fenghua

@clsotog

clsotog commented Jun 16, 2026

Collaborator

@fyu1 Can you rebase the PR?

@nvmochs

nvmochs commented Jun 16, 2026

Collaborator

@fyu1 All of the coding issues have been addressed.

The remaining issue is commit-message ordering in the second commit 8e479a9.

It still has:

  (backported from commit 9e5afb7c32830bcd123976a7729ef4e2dff0cd77)
  Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
  [fenghuay: Partial port; this tree already has long-counter read paths,
  ...
  ]

That should be:

  (backported from commit 9e5afb7c32830bcd123976a7729ef4e2dff0cd77)
  [fenghuay: ...]
  Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>

The first commit 3d2e100 has the ordering fixed correctly.

Use the overflow status bit to track overflow on each bandwidth counter
read and add the counter size to the correction when overflow is detected.

This assumes that only a single overflow has occurred since the last read
of the counter. Overflow interrupts, on hardware that supports them, could
be used to remove this limitation.

Cc: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(backported from commit b353637)
[fenghuay: Differences from upstream b353637:
 - Retains NV long-counter read paths, reset_on_next_read, and T241 now *= 64.
 - Applies T241 overflow correction * 64 inline in __ris_msmon_read() (moved
   to mpam_msmon_overflow_val() wrapper in the 9e5afb7 follow-up commit).
 - Initialize overflow to false.
]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
@fyu1 fyu1 force-pushed the 24.04_linux-nvidia-6.17-next.mpam.extras.fixes3 branch from 8e479a9 to 1bdbe45 Compare June 17, 2026 00:06
Now that the larger counter sizes are probed, make use of them.

Callers of mpam_msmon_read() may not know (or care!) about the different
counter sizes. Allow them to specify mpam_feat_msmon_mbwu and have the
driver pick the counter to use.

Only 32bit accesses to the MSC are required to be supported by the
spec, but these registers are 64bits. The lower half may overflow
into the higher half between two 32bit reads. To avoid this, use
a helper that reads the top half multiple times to check for overflow.

Signed-off-by: Rohit Mathew <rohit.mathew@arm.com>
[morse: merged multiple patches from Rohit, added explicit counter selection ]
Signed-off-by: James Morse <james.morse@arm.com>
Cc: Peter Newman <peternewman@google.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(backported from commit 9e5afb7)
[fenghuay: Partial port; this tree already has long-counter read paths,
 mpam_msc_read_mbwu_l(), and T241-MPAM-6 workaround.
 This commit ports per-width overflow correction, OFLOW_STATUS_L overflow
 detection, and T241 overflow correction scaling via mpam_msmon_overflow_val().
 Apply T241-MPAM-6 * 64 scaling to 31/44-bit counters only; skip 63-bit
 counters for both the counter read and overflow correction.
]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
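
The split read described in the commit message above (two 32-bit accesses to a 64-bit register, re-reading the top half to catch a carry) can be sketched as follows. This is an illustration under stated assumptions, not the driver's helper: `struct split_reg`, `read_lo()`, `read_hi()`, `read_split64()` and `read_naive64()` are hypothetical names, and the register is simulated so that it advances by `step` after every 32-bit access.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-bit counter that can only be read 32 bits at a time. */
struct split_reg {
	uint64_t value;
	uint64_t step; /* simulated traffic: added after each 32-bit read */
};

static uint32_t read_lo(struct split_reg *r)
{
	uint32_t v = (uint32_t)r->value;

	r->value += r->step;
	return v;
}

static uint32_t read_hi(struct split_reg *r)
{
	uint32_t v = (uint32_t)(r->value >> 32);

	r->value += r->step;
	return v;
}

/* Naive read: the low half may carry into the high half in between. */
static uint64_t read_naive64(struct split_reg *r)
{
	uint32_t hi = read_hi(r);
	uint32_t lo = read_lo(r);

	return ((uint64_t)hi << 32) | lo;
}

/* Re-read the high half and retry until it is stable around the low read. */
static uint64_t read_split64(struct split_reg *r)
{
	uint32_t hi, lo, hi2;

	assert(r);
	do {
		hi = read_hi(r);
		lo = read_lo(r);
		hi2 = read_hi(r);
	} while (hi != hi2);

	return ((uint64_t)hi << 32) | lo;
}
```

Starting from 0x1FFFFFFFF with a step of 1, the naive read returns 0x100000000, a value the counter never held. The retrying read notices that the high half changed, loops once, and returns a consistent snapshot. A real helper would likely bound the number of retries; this sketch loops until the high half is stable.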
@fyu1

fyu1 commented Jun 17, 2026

Collaborator Author

@fyu1 Can you rebase the PR?

Rebased.

@fyu1

fyu1 commented Jun 17, 2026

Collaborator Author

@fyu1 All of the coding issues have been addressed.

The remaining issue is commit-message ordering in the second commit 8e479a9.

It still has:

  (backported from commit 9e5afb7c32830bcd123976a7729ef4e2dff0cd77)
  Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
  [fenghuay: Partial port; this tree already has long-counter read paths,
  ...
  ]

That should be:

  (backported from commit 9e5afb7c32830bcd123976a7729ef4e2dff0cd77)
  [fenghuay: ...]
  Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>

The first commit 3d2e100 has the ordering fixed correctly.

Fixed.

@nvmochs

nvmochs commented Jun 17, 2026

Collaborator

Thanks @fyu1! No further issues from me.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

@jamieNguyenNVIDIA

Collaborator

Acked-by: Jamie Nguyen <jamien@nvidia.com>

@nirmoy nirmoy added has_2_acks and removed help wanted Extra attention is needed labels Jun 17, 2026
@clsotog

clsotog commented Jun 17, 2026

Collaborator

Acked-by: Carol L Soto <csoto@nvidia.com>

@nvidia-bfigg nvidia-bfigg force-pushed the 24.04_linux-nvidia-6.17-next branch from a53b6ed to 7a62271 Compare June 18, 2026 12:01