fix(configurator): make env_params first-class to fix the trajectory cache key#901
fix(configurator): make env_params first-class to fix the trajectory cache key#901rutayan-nv wants to merge 10 commits into
Conversation
📝 WalkthroughWalkthroughThis PR implements domain randomization (environment parameter sampling) for CloudAI DSE workloads, refactors agent execution to a delegated run() method, introduces a Gymnasium-compatible adapter, and updates trajectory caching and observation logic to account for sampled environment state. ChangesDomain Randomization and Agent Execution Refactoring
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
- Agents that set HAS_CUSTOM_TRAINING_LOOP = True drive their own training loop; handle_dse_job calls agent.train() and skips the per-step env.step loop. - New _run_custom_training_loop helper logs exceptions, returns a process-style exit code, and always invokes agent.shutdown() (when defined) in a finally block so resources are released on both success and failure paths. - CustomTrainingLoopAgent Protocol documents the opt-in contract for type checkers and IDEs.
Pyright rejected calling _run_custom_training_loop(agent, ...) because the plain bool predicate did not narrow agent's static type from BaseAgent to CustomTrainingLoopAgent. Return TypeGuard[CustomTrainingLoopAgent] from _has_custom_training_loop so the truthy branch in handle_dse_job sees the opted-in shape and the helper can call agent.train() directly.
If agent.shutdown() raised from the finally block, Python suppressed the earlier return 0/1 from agent.train() and propagated the exception, breaking the outer test-run loop in handle_dse_job (skipped remaining scenarios, failed to accumulate err |= rc). Wrap shutdown() in its own try/except, log via logging.exception, set rc = 1, and return rc after finally so the helper always honours the (int) -> int contract. Adds tests for shutdown-only failure and combined train+shutdown failure.
…mutator Previously, ``test_run.step`` had no clear owner: the dispatcher set it from outside, the adapter rewound it on ``reset()``, and other callers wrote to it ad hoc. In RLlib custom-loop runs this collapsed every trial onto ``step=1``, overwriting ``trajectory.csv`` and ``env.csv`` rows. Centralize the invariant: ``TestRun.increment_step()`` is the single named mutator, and ``CloudAIGymEnv.step()`` is its only caller. One ``env.step()`` call advances the trial counter by exactly one — independent of any episode or dispatcher concept above the gym env. Contract tests in ``TestIncrementStep`` cover the API; ``test_cloudaigym`` asserts ``step`` is advanced *before* ``output_path`` and trajectory rows are computed, so cached and live trials both record the post-increment value.
…orphism Earlier commits in this PR introduced ``HAS_CUSTOM_TRAINING_LOOP`` + a ``CustomTrainingLoopAgent`` Protocol + a TypeGuard helper + an ``if/else`` in ``handle_dse_job`` to switch between the cloudai step loop and an agent-owned ``train()`` loop. That is a type-tagged conditional dispatching on agent identity — the textbook signal to replace conditional with polymorphism (Fowler). Add a default ``BaseAgent.run() -> int`` that holds the step-loop body (``select_action`` / ``env.step`` / ``update_policy`` per trial). Agents that drive their own training (RLlib, etc.) override ``run()`` to delegate to whatever loop they own and return a process-style exit code. ``handle_dse_job`` collapses to ``err |= agent.run()`` — one line, no branching, no Protocol vocabulary. The handler no longer knows that "custom training loops" exist as a category; that's an agent implementation detail. Net: -89 lines on cloudai. Surface area shrinks (no Protocol, no TypeGuard, no flag). ``test_handlers`` replaces the 5 helper unit tests + 2 dispatcher integration tests with 2 polymorphic tests asserting ``handle_dse_job`` delegates to ``agent.run()`` and propagates its return code.
- GymnasiumAdapter wraps a CloudAI BaseGym as a gymnasium-compatible env: Dict of Discrete actions over the tunable params, float32 Box observation derived from define_observation_space(), gymnasium 5-tuple from step/step_raw. - Fixed (single-value) params are injected on every step so agents only ever see the tunable subset; step_raw bypasses index decoding and fixed injection for callers that already have a fully-formed parameter dict. - The adapter mirrors handle_dse_job's 1-based test_run.step so artifact paths match whether the env is driven by the DSE loop or an external training loop. - CloudAIGymEnv.define_observation_space() now returns one slot per agent metric (at least one), giving adapters the correct Box shape. - gymnasium is an optional dependency: lazy-imported inside the adapter and exposed via a new rl extra (also added to dev) pinned to ~=1.2.
decode_action used to accept partial action dicts and step_raw accepted arbitrary key sets, both silently propagating incomplete params to the underlying env. Validate up front: - decode_action requires keys exactly == self._tunable_params, and each discrete index must be in range. - step_raw requires keys exactly == self._tunable_params | self._fixed_params. A new _assert_keys helper raises ValueError with sorted missing / extra lists so callers see exactly what was wrong. Adds tests for missing key, unknown key, out-of-range index, missing fixed param, and unknown raw key paths.
CloudAIGymEnv.get_cached_trajectory_result(action) keys the trajectory cache on action alone. When a workload declares env_params (e.g. drop_rate) and the agent re-selects the same action under a different env_params sample, the cache returns a stale reward measured under a different env, silently invalidating any domain-randomization workflow. Four tests pin the contract; today two FAIL (bug-exposing), two PASS (back-compat sanity). All 27 pre-existing tests are untouched. FAIL test_cache_miss_when_env_params_differ FAIL test_step_reruns_workload_when_env_params_change PASS test_cache_hit_when_action_and_env_params_match PASS test_cache_hit_when_neither_has_env_params Driver for the follow-up PR that adds env_params as a first-class field on TestDefinition + TrajectoryEntry and rewrites the cache key as (action, env_params). Both unit and integration shapes are covered so the fix can be validated end-to-end through env.step().
Domain randomization broke under the trajectory cache because the cache key was action-only: an agent revisiting the same action under a different env_params sample (e.g. drop_rate) got the *first* trial's reward served back, silently corrupting the reward signal. env.csv was also workload-owned and skipped on cache hits, so it was sparse and mis-aligned with trajectory.csv. This change makes env_params first-class in cloudai so the design works the same way for every agent (PPO, BO, GA, MAB) and every workload: - TestDefinition.env_params: dict[str, EnvParamSpec] - sibling to cmd_args, not part of the action space. - TestRun.current_env_params: dict[str, Any] - per-trial sample attached by the env so the cache key and downstream sinks see it. - TrajectoryEntry gains env_params; the cache key is now (action, current_env_params). - CloudAIGymEnv builds an EnvParamsObserver when env_params is declared. before_step() samples deterministically by (seed, name, trial), stashes the sample on test_run, and appends to env.csv *before* the cache lookup - so env.csv is written on every trial (cache hit or miss) and aligns 1:1 with trajectory.csv by step. - Workloads with no [env_params.*] block pay zero overhead: no observer, no env.csv, identical behaviour to today. Tests: turns the four TDD-red tests from the parent PR green and adds nine unit tests for EnvParamSpec / EnvParamsSampler / CsvSink / EnvParamsObserver plus two integration tests pinning the env.csv ↔ trajectory.csv contract.
Closes the most important gap in the env_params test matrix: the
cache-HIT path under an observer-driven TestDefinition.env_params.
The previous tests covered each axis separately - cache key unit tests
for env_params, a step()-level cache-miss integration with manually
set current_env_params, and a step-alignment test where all trials
were cache misses. None of them proved that a cache HIT still drives
the observer (and therefore env.csv) - which is precisely the
sparsity contract the fix was about.
The new test:
- declares env_params on the TestDefinition (real observer wired in),
- pre-seeds the trajectory with an entry whose action AND deterministic
env_params sample match what the observer will produce for step 1,
- calls env.step() once and asserts:
(a) runner.run is NOT called (cache short-circuit honored),
(b) env.csv gains a row (observer fired before the cache lookup),
(c) the new trajectory entry carries the per-trial env_params.
This is the end-to-end proof that the env.csv sparsity bug cannot
recur, and that env.csv ↔ trajectory.csv stay step-aligned on the
cache-hit path too.
d0375b4 to
d076095
Compare
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/cloudai/configurator/cloudai_gym.py (2)
143-171: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winKeep
trajectory.csvaligned on the constraint-failure path.
observer.before_step()has already persisted the sampled env row by Line 147, but the early return on Lines 169-171 skipswrite_trajectory(). Any trial rejected byconstraint_check()now leaves an orphanenv.csvrow, which breaks the PR’s stated 1:1env.csv↔trajectory.csvcontract and makes per-trial artifacts inconsistent.Suggested fix
if not self.test_run.test.constraint_check(self.test_run, self.runner.system): logging.info("Constraint check failed. Skipping step.") - return [-1.0], self.rewards.constraint_failure, True, {} + observation = [-1.0] + reward = self.rewards.constraint_failure + self.write_trajectory( + TrajectoryEntry( + step=self.test_run.step, + action=action, + reward=reward, + observation=observation, + env_params=dict(self.test_run.current_env_params), + ) + ) + for observer in self.observers: + observer.after_step(self.test_run, observation, reward) + return observation, reward, True, {}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/configurator/cloudai_gym.py` around lines 143 - 171, The early return on constraint_check leaves an env row without a matching trajectory; to fix, when test_run.test.constraint_check(test_run, runner.system) returns False, write a TrajectoryEntry (use self.write_trajectory with step=self.test_run.step, action=action, reward=self.rewards.constraint_failure, observation=[-1.0], env_params=dict(self.test_run.current_env_params)) and call each observer.after_step(self.test_run, observation, reward) before returning the same quadruple (observation, reward, True, {}); this mirrors the cached-result path and preserves the 1:1 env.csv ↔ trajectory.csv contract.
186-202: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winPreserve
current_env_paramswhen recovering from runner failures.On Line 189 the fallback rebuilds
self.test_runfromself.original_test_run, which resetscurrent_env_paramsto its default. The trajectory row written on Line 202 then records{}even thoughenv.csvalready contains the sampled params, and the cached failure is stored under(action, {})instead of(action, sampled_env_params). Repeating the same failing trial will miss the cache and produce disagreeing artifacts.Suggested fix
if self.runner.test_scenario.test_runs and self.runner.test_scenario.test_runs[0].output_path.exists(): self.test_run = self.runner.test_scenario.test_runs[0] else: - self.test_run = copy.deepcopy(self.original_test_run) + self.test_run = copy.deepcopy(self.original_test_run).apply_params_set(action) self.test_run.step = new_tr.step self.test_run.output_path = new_tr.output_path + self.test_run.current_env_params = dict(new_tr.current_env_params)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/configurator/cloudai_gym.py` around lines 186 - 202, When rebuilding self.test_run in the fallback branch (inside cloudai_gym.py) you must preserve the sampled env params so the trajectory/cache keys match the actual env.csv; after creating the deep-copied self.test_run from self.original_test_run and after setting step/output_path, copy the sampled params into self.test_run.current_env_params (e.g. assign a deepcopy of self.original_test_run.current_env_params or the sampled params present on the runner) before calling get_observation/compute_reward and write_trajectory; this ensures dict(self.test_run.current_env_params) in the TrajectoryEntry reflects the real sampled parameters.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/configurator/base_agent.py`:
- Around line 116-149: In run(), replace the eager f-string logging calls with
%-style deferred logging so formatting happens only if the message will be
emitted; specifically update the two logging.info calls inside the run method
(the "Running step ..." and "Step {step}: Observation..." messages) to use
logging.info("Running step %s (of %s) with action %s", step, self.max_steps,
action) and logging.info("Step %s: Observation: %s, Reward: %.4f", step, ...,
reward) respectively, ensuring you format the observation list without using
f-strings so the logging framework can lazily interpolate.
In `@src/cloudai/configurator/env_params.py`:
- Around line 1-2: The file header years in
src/cloudai/configurator/env_params.py are incorrect for the touched year;
update the two top lines (the SPDX copyright line and the subsequent Copyright
line) by replacing "2024-2026" with the exact year string the repo's header test
expects for this file (e.g., "2026") so the header matches
tests/test_check_copyright_headers.py.
In `@src/cloudai/configurator/gymnasium_adapter.py`:
- Around line 133-165: Add a public close(self) -> None method to
GymnasiumAdapter that safely delegates to the wrapped environment's close
method: check for an attribute (e.g. getattr(self._env, "close", None)), call it
if callable, and return None; do not assume inheritance from gymnasium.Env. Add
the method near other lifecycle helpers (e.g. alongside
_sync_underlying_step_counter, _step_with_params) so callers that call
env.close() won't raise AttributeError.
- Around line 77-80: GymnasiumAdapter.__init__ currently filters action space
lists into _tunable_params (len>1) and _fixed_params (len==1) but silently drops
any keys with empty lists; modify GymnasiumAdapter.__init__ to detect any entry
in the raw_action_space where len(v) == 0 and raise a ValueError identifying the
offending key(s) so callers (e.g., TestRun.param_space) fail fast; add a unit
test that constructs a TestRun or raw_action_space containing an empty list for
a cmd_args key and asserts that GymnasiumAdapter(...) raises ValueError with a
message mentioning the empty key(s).
In `@tests/test_env_params.py`:
- Around line 86-87: The test uses a broad pytest.raises(ValueError) when
calling sink.write(0, {"drop_rate": 0.0}); narrow it by adding a match parameter
to pytest.raises that matches the specific ValueError message expected (e.g.,
the message about invalid drop_rate or index) so the test asserts the exact
error condition raised by sink.write; update the pytest.raises invocation around
sink.write to include match="expected message substring" referencing the
sink.write call and pytest.raises.
- Around line 1-2: The copyright header in the file contains the range
"2025-2026" which fails CI; update the header lines (the SPDX line and the
Copyright line that currently reads "2025-2026 NVIDIA CORPORATION & AFFILIATES")
to use a single year "2026" so they match the expected format for new 2026
files.
In `@tests/test_handlers.py`:
- Around line 287-288: The override of select_action in tests/test_handlers.py
has the wrong signature; change the function declaration for select_action in
that file to match the base class signature BaseAgent.select_action(self,
observation: list[float] | None = None) (i.e., add the observation parameter
with the same type and default) and keep the existing body (raising
AssertionError) so pyright/CI type-checking passes while preserving the test
intent.
---
Outside diff comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 143-171: The early return on constraint_check leaves an env row
without a matching trajectory; to fix, when
test_run.test.constraint_check(test_run, runner.system) returns False, write a
TrajectoryEntry (use self.write_trajectory with step=self.test_run.step,
action=action, reward=self.rewards.constraint_failure, observation=[-1.0],
env_params=dict(self.test_run.current_env_params)) and call each
observer.after_step(self.test_run, observation, reward) before returning the
same quadruple (observation, reward, True, {}); this mirrors the cached-result
path and preserves the 1:1 env.csv ↔ trajectory.csv contract.
- Around line 186-202: When rebuilding self.test_run in the fallback branch
(inside cloudai_gym.py) you must preserve the sampled env params so the
trajectory/cache keys match the actual env.csv; after creating the deep-copied
self.test_run from self.original_test_run and after setting step/output_path,
copy the sampled params into self.test_run.current_env_params (e.g. assign a
deepcopy of self.original_test_run.current_env_params or the sampled params
present on the runner) before calling get_observation/compute_reward and
write_trajectory; this ensures dict(self.test_run.current_env_params) in the
TrajectoryEntry reflects the real sampled parameters.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 0097c65c-5ab7-4bd2-8683-6e1ea26cc182
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (14)
pyproject.tomlsrc/cloudai/_core/test_scenario.pysrc/cloudai/cli/handlers.pysrc/cloudai/configurator/__init__.pysrc/cloudai/configurator/base_agent.pysrc/cloudai/configurator/cloudai_gym.pysrc/cloudai/configurator/env_params.pysrc/cloudai/configurator/gymnasium_adapter.pysrc/cloudai/models/workload.pytests/test_cloudaigym.pytests/test_env_params.pytests/test_gymnasium_adapter.pytests/test_handlers.pytests/test_test_scenario.py
| def run(self) -> int: | ||
| """ | ||
| Orchestrate this agent's exploration over ``self.env``. | ||
|
|
||
| Default: a step loop driven by the dispatcher (``select_action`` → | ||
| ``env.step`` → ``update_policy`` per trial). Agents that drive their | ||
| own training loop (e.g. RLlib-based agents calling ``algo.train()``) | ||
| override this method. | ||
|
|
||
| Returns: | ||
| int: Process-style return code (``0`` success, non-zero failure). | ||
| ``handle_dse_job`` accumulates this via ``err |= agent.run()``. | ||
| """ | ||
| observation, _ = self.env.reset() | ||
| for _ in range(self.max_steps): | ||
| result = self.select_action(observation=observation) | ||
| if result is None: | ||
| break | ||
| step, action = result | ||
| logging.info(f"Running step {step} (of {self.max_steps}) with action {action}") | ||
| prev_observation = observation | ||
| observation, reward, done, *_ = self.env.step(action) | ||
| self.update_policy( | ||
| { | ||
| "trial_index": step, | ||
| "value": reward, | ||
| "observation": observation, | ||
| "prev_observation": prev_observation, | ||
| "action": action, | ||
| "done": done, | ||
| } | ||
| ) | ||
| logging.info(f"Step {step}: Observation: {[round(obs, 4) for obs in observation]}, Reward: {reward:.4f}") | ||
| return 0 |
There was a problem hiding this comment.
🚀 Performance & Scalability | 🔵 Trivial | 💤 Low value
Consider lazy-evaluation logging for consistency with Ruff G004.
The f-string logging on lines 135 and 148 works correctly but eagerly evaluates formatting arguments even when the log level would skip the message. Using %-style formatting (logging.info("Running step %s (of %s) with action %s", step, self.max_steps, action)) allows the logging framework to defer evaluation, improving performance in production when info-level logs may be disabled.
This is a minor style and performance optimization; the current implementation is functionally correct.
♻️ Proposed refactor to %-style logging
- logging.info(f"Running step {step} (of {self.max_steps}) with action {action}")
+ logging.info("Running step %s (of %s) with action %s", step, self.max_steps, action)
prev_observation = observation
observation, reward, done, *_ = self.env.step(action)
self.update_policy(
{
"trial_index": step,
"value": reward,
"observation": observation,
"prev_observation": prev_observation,
"action": action,
"done": done,
}
)
- logging.info(f"Step {step}: Observation: {[round(obs, 4) for obs in observation]}, Reward: {reward:.4f}")
+ logging.info(
+ "Step %s: Observation: %s, Reward: %.4f",
+ step,
+ [round(obs, 4) for obs in observation],
+ reward,
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def run(self) -> int: | |
| """ | |
| Orchestrate this agent's exploration over ``self.env``. | |
| Default: a step loop driven by the dispatcher (``select_action`` → | |
| ``env.step`` → ``update_policy`` per trial). Agents that drive their | |
| own training loop (e.g. RLlib-based agents calling ``algo.train()``) | |
| override this method. | |
| Returns: | |
| int: Process-style return code (``0`` success, non-zero failure). | |
| ``handle_dse_job`` accumulates this via ``err |= agent.run()``. | |
| """ | |
| observation, _ = self.env.reset() | |
| for _ in range(self.max_steps): | |
| result = self.select_action(observation=observation) | |
| if result is None: | |
| break | |
| step, action = result | |
| logging.info(f"Running step {step} (of {self.max_steps}) with action {action}") | |
| prev_observation = observation | |
| observation, reward, done, *_ = self.env.step(action) | |
| self.update_policy( | |
| { | |
| "trial_index": step, | |
| "value": reward, | |
| "observation": observation, | |
| "prev_observation": prev_observation, | |
| "action": action, | |
| "done": done, | |
| } | |
| ) | |
| logging.info(f"Step {step}: Observation: {[round(obs, 4) for obs in observation]}, Reward: {reward:.4f}") | |
| return 0 | |
| def run(self) -> int: | |
| """ | |
| Orchestrate this agent's exploration over ``self.env``. | |
| Default: a step loop driven by the dispatcher (``select_action`` → | |
| ``env.step`` → ``update_policy`` per trial). Agents that drive their | |
| own training loop (e.g. RLlib-based agents calling ``algo.train()``) | |
| override this method. | |
| Returns: | |
| int: Process-style return code (``0`` success, non-zero failure). | |
| ``handle_dse_job`` accumulates this via ``err |= agent.run()``. | |
| """ | |
| observation, _ = self.env.reset() | |
| for _ in range(self.max_steps): | |
| result = self.select_action(observation=observation) | |
| if result is None: | |
| break | |
| step, action = result | |
| logging.info("Running step %s (of %s) with action %s", step, self.max_steps, action) | |
| prev_observation = observation | |
| observation, reward, done, *_ = self.env.step(action) | |
| self.update_policy( | |
| { | |
| "trial_index": step, | |
| "value": reward, | |
| "observation": observation, | |
| "prev_observation": prev_observation, | |
| "action": action, | |
| "done": done, | |
| } | |
| ) | |
| logging.info( | |
| "Step %s: Observation: %s, Reward: %.4f", | |
| step, | |
| [round(obs, 4) for obs in observation], | |
| reward, | |
| ) | |
| return 0 |
🧰 Tools
🪛 Ruff (0.15.15)
[warning] 135-135: Logging statement uses f-string
(G004)
[warning] 148-148: Logging statement uses f-string
(G004)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/configurator/base_agent.py` around lines 116 - 149, In run(),
replace the eager f-string logging calls with %-style deferred logging so
formatting happens only if the message will be emitted; specifically update the
two logging.info calls inside the run method (the "Running step ..." and "Step
{step}: Observation..." messages) to use logging.info("Running step %s (of %s)
with action %s", step, self.max_steps, action) and logging.info("Step %s:
Observation: %s, Reward: %.4f", step, ..., reward) respectively, ensuring you
format the observation list without using f-strings so the logging framework can
lazily interpolate.
| # SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES | ||
| # Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Fix the copyright year header.
CI is already failing here: the new file is checked as touched in 2026, but the header still says 2024-2026. Update it to the exact year string the repo’s header test expects for this file, otherwise the PR cannot merge. Based on learnings, this repository’s copyright check requires touched-year formatting to match tests/test_check_copyright_headers.py.
🧰 Tools
🪛 GitHub Actions: CI / 0_Run pytest (3.12).txt
[error] 1-1: Copyright header year check failed in tests/test_check_copyright_headers.py. AssertionError: Copyright year is not valid. Expected '# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.' but found '# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.'
🪛 GitHub Actions: CI / 1_Run pytest (3.10).txt
[error] 1-1: Copyright header check failed in tests/test_check_copyright_headers.py. Copyright year is not valid. Expected '# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.' but found '# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.'
🪛 GitHub Actions: CI / Run pytest (3.10)
[error] 1-1: pytest tests/test_check_copyright_headers.py::test_src_copyright_header failed: Copyright year is not valid. Expected '# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.' but found '# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.'
🪛 GitHub Actions: CI / Run pytest (3.12)
[error] 1-1: test_src_copyright_header failed (Copyright year is not valid). Expected: '# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.'; Found: '# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/configurator/env_params.py` around lines 1 - 2, The file header
years in src/cloudai/configurator/env_params.py are incorrect for the touched
year; update the two top lines (the SPDX copyright line and the subsequent
Copyright line) by replacing "2024-2026" with the exact year string the repo's
header test expects for this file (e.g., "2026") so the header matches
tests/test_check_copyright_headers.py.
Sources: Learnings, Pipeline failures
| raw_action_space = env.define_action_space() | ||
| self._tunable_params: dict[str, list] = {k: v for k, v in raw_action_space.items() if len(v) > 1} | ||
| self._fixed_params: dict[str, Any] = {k: v[0] for k, v in raw_action_space.items() if len(v) == 1} | ||
|
|
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== BaseGym contract and declarations ==="
rg -nC3 --type=py '\bclass\s+BaseGym\b|def define_action_space\s*\(' src tests
echo
echo "=== Concrete define_action_space implementations ==="
rg -nC3 --type=py 'def define_action_space\s*\(' src tests
echo
echo "=== Existing validation for empty candidate lists ==="
rg -nC3 --type=py 'ValueError|assert|len\([^)]*\)\s*==\s*0|if not [^:]+:' src testsRepository: NVIDIA/cloudai
Length of output: 50372
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Inspect gymnasium_adapter.py around lines 60-110 ==="
nl -ba src/cloudai/configurator/gymnasium_adapter.py | sed -n '50,130p'
echo
echo "=== Search for empty candidate validation in gymnasium_adapter.py ==="
rg -n --type=py 'len\([^)]*\)\s*==\s*0|==\s*0|empty|candidate|action_space' src/cloudai/configurator/gymnasium_adapter.py
echo
echo "=== Search tests for empty candidate cases ==="
rg -n --type=py 'empty|len\([^)]*\)\s*==\s*0|ValueError|candidate' tests/test_gymnasium_adapter.py
echo
echo "=== Show entire tests/test_gymnasium_adapter.py (small file?) ==="
wc -l tests/test_gymnasium_adapter.py
nl -ba tests/test_gymnasium_adapter.py | sed -n '1,220p'Repository: NVIDIA/cloudai
Length of output: 176
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Inspect gymnasium_adapter.py around lines 60-110 ==="
nl -ba src/cloudai/configurator/gymnasium_adapter.py | sed -n '50,130p'
echo
echo "=== Search for empty candidate validation in gymnasium_adapter.py ==="
rg -n --type=py 'len\([^)]*\)\s*==\s*0|==\s*0|empty|candidate|action_space' src/cloudai/configurator/gymnasium_adapter.py
echo
echo "=== Search tests for empty candidate cases ==="
rg -n --type=py 'empty|len\([^)]*\)\s*==\s*0|ValueError|candidate' tests/test_gymnasium_adapter.py
echo
echo "=== Show tests/test_gymnasium_adapter.py size and contents (if small) ==="
wc -l tests/test_gymnasium_adapter.py
nl -ba tests/test_gymnasium_adapter.py | sed -n '1,260p'Repository: NVIDIA/cloudai
Length of output: 176
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Inspect gymnasium_adapter.py around lines 60-130 ==="
sed -n '60,130p' src/cloudai/configurator/gymnasium_adapter.py
echo
echo "=== Search for empty candidate validation in gymnasium_adapter.py ==="
rg -n --type=py 'len\([^)]*\)\s*==\s*0|==\s*0|empty|candidate|action_space' src/cloudai/configurator/gymnasium_adapter.py
echo
echo "=== Search tests for empty candidate cases ==="
rg -n --type=py 'empty|len\([^)]*\)\s*==\s*0|ValueError|candidate' tests/test_gymnasium_adapter.py
echo
echo "=== Test file size and contents (first 260 lines) ==="
wc -l tests/test_gymnasium_adapter.py
sed -n '1,260p' tests/test_gymnasium_adapter.pyRepository: NVIDIA/cloudai
Length of output: 12033
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Inspect BaseGym define_action_space docstring ==="
sed -n '1,120p' src/cloudai/configurator/base_gym.py
echo
echo "=== Inspect cloudai_gym define_action_space and where param_space comes from ==="
sed -n '1,200p' src/cloudai/configurator/cloudai_gym.py
echo
echo "=== Search for define_action_space usage/validation of empty lists ==="
rg -n --type=py 'define_action_space|param_space|action_space' src/cloudai/configurator
echo
echo "=== Search for explicit checks against empty candidate lists ([] or len==0) ==="
rg -n --type=py 'len\([^)]*\)\s*==\s*0|not\s+[^:]+|empty\s+|ValueError|assert' src/cloudai/configurator/gymnasium_adapter.py src/cloudai/configurator/base_gym.py src/cloudai/configurator/cloudai_gym.pyRepository: NVIDIA/cloudai
Length of output: 15364
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Search for 'param_space' definition/assignment ==="
rg -n --type=py '\bparam_space\b\s*=' src
echo
echo "=== Search for 'apply_params_set' and how action keys/values are validated ==="
rg -n --type=py 'def apply_params_set\s*\(' src/cloudai
echo
echo "=== Show BaseGym / adapter helper _assert_keys and surrounding logic ==="
sed -n '110,210p' src/cloudai/configurator/gymnasium_adapter.py
echo
echo "=== Search for empties in param_space values (len(...)==0) across src ==="
rg -n --type=py 'len\([^)]*\)\s*==\s*0|==\s*\[\s*\]|\[\s*\]\s*$' src/cloudaiRepository: NVIDIA/cloudai
Length of output: 113
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Locate all param_space references ==="
rg -n --type=py '\bparam_space\b' src
echo
echo "=== Search for 'explorable' / action space construction utilities ==="
rg -n --type=py '(explorable|action_space|param_space|candidate|candidates|param|grid_search)' src/cloudai/configurator
echo
echo "=== Search for where TestRun params set are applied/validated ==="
rg -n --type=py 'apply_params_set|current_env_params|increment_step' src/cloudai
echo
echo "=== Search for any explicit validation of action values (empty list, len==0) ==="
rg -n --type=py 'len\([^)]*\)\s*==\s*0|ValueError\([^)]*empty|empty candidate|candidate list' src/cloudaiRepository: NVIDIA/cloudai
Length of output: 12650
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Inspect src/cloudai/_core/test_scenario.py around param_space property ==="
sed -n '130,230p' src/cloudai/_core/test_scenario.py
echo
echo "=== Inspect apply_params_set in src/cloudai/_core/test_scenario.py ==="
sed -n '230,340p' src/cloudai/_core/test_scenario.py
echo
echo "=== Inspect env_params categorical spec validation for candidate lists ==="
sed -n '1,220p' src/cloudai/configurator/env_params.pyRepository: NVIDIA/cloudai
Length of output: 11074
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Locate TestDefinition / cmd_args model types ==="
rg -n --type=py '\bclass\s+TestDefinition\b|cmd_args.*BaseModel|extra_env_vars|is_dse_excluded_arg|is_dse_job' src/cloudai/_core src/cloudai
echo
echo "=== Locate implementation of is_dse_excluded_arg ==="
rg -n --type=py 'def\s+is_dse_excluded_arg\s*\(' src/cloudai
echo
echo "=== Inspect likely DSE candidate-list validations (min_length, non-empty, ValueError) ==="
rg -n --type=py 'min_length\s*=\s*1|min_length\s*=\s*2|non-empty|nonempty|empty|len\(.*\)\s*==\s*0|min_items|min_length|validate.*list|model_validator.*list' src/cloudai/_core src/cloudai/configurator src/cloudaiRepository: NVIDIA/cloudai
Length of output: 13895
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Inspect src/cloudai/models/workload.py around TestDefinition and DSE helpers ==="
sed -n '1,260p' src/cloudai/models/workload.py
echo
echo "=== Inspect remaining part of workload.py for cmd_args model validators ==="
sed -n '260,520p' src/cloudai/models/workload.py
echo
echo "=== Search within workload.py for list validators on cmd_args / action params ==="
rg -n --type=py 'cmd_args|model_validator|validator|Field\(|min_length|min_items|empty|ValueError' src/cloudai/models/workload.pyRepository: NVIDIA/cloudai
Length of output: 9136
Add a fail-fast guard for empty candidate lists
TestRun.param_space includes cmd_args entries whenever the value is a list (even []), but GymnasiumAdapter.__init__ only treats len(v) > 1 as tunable and len(v) == 1 as fixed—so len(v) == 0 keys are dropped from both _tunable_params and _fixed_params, silently changing the action schema and leaving the underlying cmd_args unset. Add a ValueError guard for len(v) == 0 in GymnasiumAdapter and a unit test for the empty-list case.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/configurator/gymnasium_adapter.py` around lines 77 - 80,
GymnasiumAdapter.__init__ currently filters action space lists into
_tunable_params (len>1) and _fixed_params (len==1) but silently drops any keys
with empty lists; modify GymnasiumAdapter.__init__ to detect any entry in the
raw_action_space where len(v) == 0 and raise a ValueError identifying the
offending key(s) so callers (e.g., TestRun.param_space) fail fast; add a unit
test that constructs a TestRun or raw_action_space containing an empty list for
a cmd_args key and asserts that GymnasiumAdapter(...) raises ValueError with a
message mentioning the empty key(s).
| def render(self) -> None: | ||
| self._env.render() | ||
|
|
||
| @staticmethod | ||
| def _assert_keys(received: Any, expected: set[str], ctx: str) -> None: | ||
| received_set = set(received) | ||
| if received_set == expected: | ||
| return | ||
| missing = sorted(expected - received_set) | ||
| extra = sorted(received_set - expected) | ||
| raise ValueError(f"{ctx} keys mismatch; missing={missing}, extra={extra}") | ||
|
|
||
| def _step_with_params(self, params: dict[str, Any]) -> tuple[Any, float, bool, bool, dict[str, Any]]: | ||
| self._sync_underlying_step_counter() | ||
| obs, reward, done, info = self._env.step(params) | ||
| self._step_count += 1 | ||
| return self._as_obs_array(obs), float(reward), bool(done), False, info | ||
|
|
||
| def _sync_underlying_step_counter(self) -> None: | ||
| """ | ||
| Mirror ``handle_dse_job``'s 1-based ``test_run.step`` so artifact paths match. | ||
|
|
||
| The first step is written under ``…/<iteration>/1/``, matching how | ||
| ``handle_dse_job`` numbers steps; this keeps reports and trajectory | ||
| analysis consistent regardless of whether the env is driven by the | ||
| DSE loop or by an external training loop wrapping the adapter. | ||
| """ | ||
| test_run = getattr(self._env, "test_run", None) | ||
| if test_run is not None: | ||
| test_run.step = self._step_count + 1 | ||
|
|
||
| def _as_obs_array(self, obs: Any) -> Any: | ||
| return self._np.asarray(obs, dtype=self._np.float32) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="src/cloudai/configurator/gymnasium_adapter.py"
echo "== File head =="
sed -n '1,220p' "$FILE"
echo "== Search for close() in this file =="
rg -n "def close\\(" "$FILE" || true
echo "== Search for close usage in repo =="
rg -n "\\.close\\(" src || true
echo "== Search for gymnasium_adapter usage =="
rg -n "gymnasium_adapter" -S . || trueRepository: NVIDIA/cloudai
Length of output: 7178
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "close" tests/test_gymnasium_adapter.py src/cloudai/configurator/gymnasium_adapter.py || true
sed -n '1,260p' tests/test_gymnasium_adapter.pyRepository: NVIDIA/cloudai
Length of output: 7906
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect BaseGym for close()
fd -a "base_gym.py" src/cloudai/configurator -t f -x 'echo {}' || true
BASE_GYM_FILE="$(fd -a "base_gym.py" src/cloudai/configurator -t f | head -n 1)"
echo "Using: $BASE_GYM_FILE"
sed -n '1,240p' "$BASE_GYM_FILE"
echo "== Search for close() usage on GymnasiumAdapter instances =="
rg -n "GymnasiumAdapter\\(" -S src tests || true
rg -n "gymnasium_adapter\\.GymnasiumAdapter|configurator\\.GymnasiumAdapter|from cloudai\\.configurator import GymnasiumAdapter" -S src tests || true
rg -n "\\.close\\(\\)" -S src tests || true
echo "== Search for __getattr__ / delegation patterns in gymnasium_adapter.py =="
rg -n "__getattr__|__getattribute__|wrap|delegate" src/cloudai/configurator/gymnasium_adapter.py || trueRepository: NVIDIA/cloudai
Length of output: 4640
🌐 Web query:
Gymnasium Env close method API optional wrapper close
💡 Result:
In the Gymnasium API, the close method is used to release resources (such as rendering windows, open files, or database connections) used by an environment [1][2]. Implementation of the close method is optional in custom environments [2]. If an environment does not require any specific resource cleanup, you do not need to implement this method [2]. When using Gymnasium wrappers: 1. The base gymnasium.Wrapper class provides a default close method that calls close on the underlying environment attribute (self.env.close) [3]. 2. If you are creating a custom wrapper that adds its own resource-dependent logic, you can override the close method [4][5]. In such cases, it is standard practice to ensure the underlying environment's close method is still called, typically by including super.close in your override to maintain the chain of closing across wrapped environments [3]. You should not interact with an environment after calling its close method [2]. Additionally, the Gymnasium environment checker utility may test the close method to ensure it functions correctly and can be called multiple times without raising an error [6].
Citations:
- 1: https://gymnasium.farama.org/api/env/
- 2: https://gymnasium.farama.org/main/tutorials/gymnasium_basics/environment_creation/
- 3: https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/core.py
- 4: https://gymnasium.farama.org/main/tutorials/gymnasium_basics/implementing_custom_wrappers/
- 5: https://gymnasium.farama.org/tutorials/gymnasium_basics/implementing_custom_wrappers/
- 6: https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/utils/env_checker.py
🌐 Web query:
Gymnasium Env class has close() default implementation no-op core.py
💡 Result:
In the Gymnasium library, the Env class does not define a default, no-op implementation of the close method in its core.py file. Rather, close is defined as an abstract method or a required interface that custom environments are expected to implement if they need to release external resources [1][2][3]. While Gymnasium's documentation acknowledges that in many cases a custom environment may not need to implement close, the base Env class does not provide a functional default [4][5]. Users are encouraged to override this method to perform necessary cleanup, such as closing windows (e.g., Pygame) or releasing file handles [4][5]. Because there is no default implementation in the base Env class, calling close on an environment that has not explicitly defined it—or failing to implement it when resources are involved—can lead to issues where resources remain open [4][6]. Consequently, users must explicitly manage environment cleanup to ensure proper resource management [6].
Citations:
- 1: https://github.com/Farama-Foundation/Gymnasium/blob/58c6b7b49da20418b21df632866eb215f8c26876/gymnasium/core.py
- 2: https://gymnasium.farama.org/api/env/
- 3: https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/core.py
- 4: https://gymnasium.farama.org/main/tutorials/gymnasium_basics/environment_creation/
- 5: https://gymnasium.farama.org/v1.1.0/tutorials/gymnasium_basics/environment_creation/
- 6: Removed destructor of VectorEnv as it causes VectorEnv.close() to be called unintentionally. Fixes #1521. Farama-Foundation/Gymnasium#1522
Add close() to the Gymnasium-shaped adapter API
GymnasiumAdapter is a standalone “Env-shaped” object (it doesn’t inherit from gymnasium.Env/Wrapper) and the class provides no close() method, so teardown code that calls env.close() can fail with AttributeError. Delegate to the wrapped BaseGym.close when available.
♻️ Proposed fix
def render(self) -> None:
self._env.render()
+ def close(self) -> None:
+ close = getattr(self._env, "close", None)
+ if callable(close):
+ close()
+
`@staticmethod`
def _assert_keys(received: Any, expected: set[str], ctx: str) -> None:
received_set = set(received)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def render(self) -> None: | |
| self._env.render() | |
| @staticmethod | |
| def _assert_keys(received: Any, expected: set[str], ctx: str) -> None: | |
| received_set = set(received) | |
| if received_set == expected: | |
| return | |
| missing = sorted(expected - received_set) | |
| extra = sorted(received_set - expected) | |
| raise ValueError(f"{ctx} keys mismatch; missing={missing}, extra={extra}") | |
| def _step_with_params(self, params: dict[str, Any]) -> tuple[Any, float, bool, bool, dict[str, Any]]: | |
| self._sync_underlying_step_counter() | |
| obs, reward, done, info = self._env.step(params) | |
| self._step_count += 1 | |
| return self._as_obs_array(obs), float(reward), bool(done), False, info | |
| def _sync_underlying_step_counter(self) -> None: | |
| """ | |
| Mirror ``handle_dse_job``'s 1-based ``test_run.step`` so artifact paths match. | |
| The first step is written under ``…/<iteration>/1/``, matching how | |
| ``handle_dse_job`` numbers steps; this keeps reports and trajectory | |
| analysis consistent regardless of whether the env is driven by the | |
| DSE loop or by an external training loop wrapping the adapter. | |
| """ | |
| test_run = getattr(self._env, "test_run", None) | |
| if test_run is not None: | |
| test_run.step = self._step_count + 1 | |
| def _as_obs_array(self, obs: Any) -> Any: | |
| return self._np.asarray(obs, dtype=self._np.float32) | |
| def render(self) -> None: | |
| self._env.render() | |
| def close(self) -> None: | |
| close = getattr(self._env, "close", None) | |
| if callable(close): | |
| close() | |
| `@staticmethod` | |
| def _assert_keys(received: Any, expected: set[str], ctx: str) -> None: | |
| received_set = set(received) | |
| if received_set == expected: | |
| return | |
| missing = sorted(expected - received_set) | |
| extra = sorted(received_set - expected) | |
| raise ValueError(f"{ctx} keys mismatch; missing={missing}, extra={extra}") | |
| def _step_with_params(self, params: dict[str, Any]) -> tuple[Any, float, bool, bool, dict[str, Any]]: | |
| self._sync_underlying_step_counter() | |
| obs, reward, done, info = self._env.step(params) | |
| self._step_count += 1 | |
| return self._as_obs_array(obs), float(reward), bool(done), False, info | |
| def _sync_underlying_step_counter(self) -> None: | |
| """ | |
| Mirror ``handle_dse_job``'s 1-based ``test_run.step`` so artifact paths match. | |
| The first step is written under ``…/<iteration>/1/``, matching how | |
| ``handle_dse_job`` numbers steps; this keeps reports and trajectory | |
| analysis consistent regardless of whether the env is driven by the | |
| DSE loop or by an external training loop wrapping the adapter. | |
| """ | |
| test_run = getattr(self._env, "test_run", None) | |
| if test_run is not None: | |
| test_run.step = self._step_count + 1 | |
| def _as_obs_array(self, obs: Any) -> Any: | |
| return self._np.asarray(obs, dtype=self._np.float32) |
🧰 Tools
🪛 Ruff (0.15.15)
[warning] 137-137: Dynamically typed expressions (typing.Any) are disallowed in received
(ANN401)
[warning] 143-143: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 164-164: Dynamically typed expressions (typing.Any) are disallowed in obs
(ANN401)
[warning] 164-164: Dynamically typed expressions (typing.Any) are disallowed in _as_obs_array
(ANN401)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/configurator/gymnasium_adapter.py` around lines 133 - 165, Add a
public close(self) -> None method to GymnasiumAdapter that safely delegates to
the wrapped environment's close method: check for an attribute (e.g.
getattr(self._env, "close", None)), call it if callable, and return None; do not
assume inheritance from gymnasium.Env. Add the method near other lifecycle
helpers (e.g. alongside _sync_underlying_step_counter, _step_with_params) so
callers that call env.close() won't raise AttributeError.
| # SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES | ||
| # Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🔴 Critical | ⚡ Quick win
Fix copyright header year to match CI expectations.
The copyright header shows 2025-2026 but CI expects 2026 for new files created in 2026. This is breaking the copyright header check in CI.
📅 Proposed fix
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
-# Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0Based on learnings: In the NVIDIA/cloudai repo, copyright headers' year formatting must match the test logic enforced by tests/test_check_copyright_headers.py. New files should list only the year they were created.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES | |
| # Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES | |
| # Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
🧰 Tools
🪛 GitHub Actions: CI / 0_Run pytest (3.12).txt
[error] 1-1: Copyright header year check failed in tests/test_check_copyright_headers.py. AssertionError: Copyright year is not valid. Expected '# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.' but found '# Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.'
🪛 GitHub Actions: CI / 1_Run pytest (3.10).txt
[error] 1-1: Copyright header check failed in tests/test_check_copyright_headers.py. Copyright year is not valid. Expected '# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.' but found '# Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.'
🪛 GitHub Actions: CI / Run pytest (3.10)
[error] 1-1: pytest tests/test_check_copyright_headers.py::test_src_copyright_header failed: Copyright year is not valid. Expected '# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.' but found '# Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.'
🪛 GitHub Actions: CI / Run pytest (3.12)
[error] 1-1: test_src_copyright_header failed (Copyright year is not valid). Expected: '# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.'; Found: '# Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_env_params.py` around lines 1 - 2, The copyright header in the
file contains the range "2025-2026" which fails CI; update the header lines (the
SPDX line and the Copyright line that currently reads "2025-2026 NVIDIA
CORPORATION & AFFILIATES") to use a single year "2026" so they match the
expected format for new 2026 files.
Sources: Learnings, Pipeline failures
| with pytest.raises(ValueError): | ||
| sink.write(0, {"drop_rate": 0.0}) |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Add match parameter to narrow the pytest.raises scope.
The bare pytest.raises(ValueError) is broad. Adding a match parameter would verify the specific error condition being tested.
🔍 Proposed fix
- with pytest.raises(ValueError):
- sink.write(0, {"drop_rate": 0.0})
+ with pytest.raises(ValueError, match=r"step.*must.*positive|step.*0"):
+ sink.write(0, {"drop_rate": 0.0})📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| with pytest.raises(ValueError): | |
| sink.write(0, {"drop_rate": 0.0}) | |
| with pytest.raises(ValueError, match=r"step.*must.*positive|step.*0"): | |
| sink.write(0, {"drop_rate": 0.0}) |
🧰 Tools
🪛 Ruff (0.15.15)
[warning] 86-86: pytest.raises(ValueError) is too broad, set the match parameter or use a more specific exception
(PT011)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_env_params.py` around lines 86 - 87, The test uses a broad
pytest.raises(ValueError) when calling sink.write(0, {"drop_rate": 0.0}); narrow
it by adding a match parameter to pytest.raises that matches the specific
ValueError message expected (e.g., the message about invalid drop_rate or index)
so the test asserts the exact error condition raised by sink.write; update the
pytest.raises invocation around sink.write to include match="expected message
substring" referencing the sink.write call and pytest.raises.
Source: Linters/SAST tools
| def select_action(self) -> tuple[int, dict[str, Any]]: # pragma: no cover - never called | ||
| raise AssertionError("select_action must not be called when run() is overridden") |
There was a problem hiding this comment.
🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win
Fix select_action() signature to match base class.
The override is missing the observation parameter that exists in BaseAgent.select_action(self, observation: list[float] | None = None). This breaks pyright type checking and CI.
🔧 Proposed fix
- def select_action(self) -> tuple[int, dict[str, Any]]: # pragma: no cover - never called
+ def select_action(self, observation: list[float] | None = None) -> tuple[int, dict[str, Any]]: # pragma: no cover
raise AssertionError("select_action must not be called when run() is overridden")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def select_action(self) -> tuple[int, dict[str, Any]]: # pragma: no cover - never called | |
| raise AssertionError("select_action must not be called when run() is overridden") | |
| def select_action(self, observation: list[float] | None = None) -> tuple[int, dict[str, Any]]: # pragma: no cover | |
| raise AssertionError("select_action must not be called when run() is overridden") |
🧰 Tools
🪛 GitHub Actions: CI / 2_Linting.txt
[error] 287-287: pyright failed (hook id: pyright). Method "select_action" overrides class "BaseAgent" incompatibly: positional parameter count mismatch (base has 2, override has 1). Parameter 2 mismatch: base "observation" is keyword parameter, override parameter is position-only (reportIncompatibleMethodOverride).
🪛 GitHub Actions: CI / Linting
[error] 287-287: pyright (hook id: pyright) failed: Method "select_action" overrides class "BaseAgent" in an incompatible manner (reportIncompatibleMethodOverride). Positional parameter count mismatch (base has 2, override has 1). Parameter 2 mismatch: base parameter "observation" is keyword parameter, override parameter is position-only.
🪛 Ruff (0.15.15)
[warning] 288-288: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_handlers.py` around lines 287 - 288, The override of select_action
in tests/test_handlers.py has the wrong signature; change the function
declaration for select_action in that file to match the base class signature
BaseAgent.select_action(self, observation: list[float] | None = None) (i.e., add
the observation parameter with the same type and default) and keep the existing
body (raising AssertionError) so pyright/CI type-checking passes while
preserving the test intent.
Source: Pipeline failures
Summary
Turns the four TDD-red tests in #900 green and makes the underlying
design change required to fix domain randomization in cloudai.
Today,
CloudAIGymEnv.get_cached_trajectory_result()keys the cache onactiononly. An agent revisiting the same action under a differentenv_paramssample (e.g.drop_rate) gets the first trial's rewardserved back from cache, silently corrupting the reward signal. The
parallel artifact (
env.csv) lived in workload code and was skipped oncache hits, so it was sparse and mis-aligned with
trajectory.csv.This PR moves
env_paramsfrom a workload-private convention intofirst-class cloudai schema so the same fix benefits every agent (PPO,
BO, GA, MAB) and every workload uniformly:
TestDefinition.env_params: dict[str, EnvParamSpec]— sibling tocmd_args; not part of the agent's action space.TestRun.current_env_params: dict[str, Any]— per-trial sampleattached by the env, visible to the cache key and to downstream
sinks.
TrajectoryEntrygainsenv_params. The cache key is now(action, current_env_params).CloudAIGymEnvbuilds anEnvParamsObserverwhenever theTestDefinitiondeclaresenv_params.before_step()samplesdeterministically by
(seed, name, trial), stashes the sample ontest_run, and appends toenv.csvbefore the cache lookup — soenv.csvis written on every trial (cache hit or miss) and aligns1:1 with
trajectory.csvonstep.[env_params.*]block pay zero overhead: noobserver, no
env.csv, identical behaviour to today.A follow-up PR in cloudaix will delete the now-redundant
env_params.py/env_params_sampler.py/env_params_sink.pymodules in
domain_randomization/workloads/mrc_prt/and route thedemo workload through the cloudai-owned schema.
Test plan
test_cache_miss_when_env_params_differ,test_step_reruns_workload_when_env_params_change, plus two controltests) now pass.
tests/test_env_params.pycoverEnvParamSpecvalidation,
EnvParamsSamplerdeterminism + per-parameterindependence,
CsvSinkformat + edge cases, andEnvParamsObserverwiring.tests/test_cloudaigym.pypin theenv.csv↔trajectory.csvstep-alignment contract and thezero-overhead claim for workloads without
env_params.ruff check/ruff format/pyrightclean on all touched files.Stacking
Depends on #900 (the TDD-red tests). Either rebase #900 onto this PR
after merge, or merge them together. Both are stacked on #893
(
rpatro/step-ownership-tdd).Stack & rebase status (updated 2026-06-12)
Stack:
main← #893 ← #900 ← #901 (top)Re-stacked on #900. Carries the env_params first-class fix + the cache-hit integration test. Folded in the
increment_step()test adaptation — the env_params tests now letTestRun.increment_step()drive the trial counter, since #893's step-ownership change sits below this PR in the stack.main.