Rebased updates from Stanford and CEA #226

Merged
cedeyn merged 47 commits into cea-hpc:master from rezib:pr/all-updates
Jun 3, 2025
Conversation

@rezib commented Apr 22, 2025

Contributor

This pull request contains all updates retrieved from the Stanford and CEA production branches, including Python 3 support and the shine-HA feature.

Edit: 2 additional commits have been pushed to make the unit tests pass in GitHub Actions environments. A successful CI run can be found here: https://github.com/rezib/shine/actions/runs/15046341361/job/42289732238

rezib and others added 30 commits October 8, 2024 16:53
In Python 3, __import__ expects an absolute module name. This fixes
"ModuleNotFoundError: No module named 'File'" error for the following
tests:

- target with ha_node
- target with several ha_nodes
- test_index_external (Configuration.BackendFileTest.BackendFileTest
                       .test_index_external)
- test_multiple_matches (Configuration.BackendFileTest.BackendFileTest
                         .test_multiple_matches)
- test_simple_mgs (Configuration.BackendFileTest.BackendFileTest
                   .test_simple_mgs)

Change-Id: If8306c096c25e7486d128fdf72b9a44f883898bc
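
The absolute-name requirement described in the commit above can be illustrated with a minimal sketch. It uses the standard `json` package as a stand-in; `load_submodule` is a hypothetical helper and is not taken from Shine's backend loader.

```python
import importlib


def load_submodule(package, name):
    """Import `package.name` through its absolute dotted path."""
    return importlib.import_module("%s.%s" % (package, name))


# A bare __import__("decoder") would raise ModuleNotFoundError in Python 3
# (unless an unrelated top-level module happens to carry that name),
# whereas the absolute dotted name resolves reliably.
decoder = load_submodule("json", "decoder")
```
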
This commit fixes "TypeError: unhashable type: 'NodeSet'" error raised
in the following tests:

- Install a non-existent file is correctly reported
- Install to a bad node is correctly reported
- Install to a mix of bad and good nodes is correctly reported
- Install a simple file
- send a done message which fails update due to bad property
- message with compname value is backward compatible
- message shine version 2 is compatible
- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes
- send a start and done message
- install on unreachable nodes raises an error

Change-Id: Ia7625b1da6436543e8f9ec8cb8bebbd4004a3537
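
For context on the "unhashable type" error fixed above: in Python 3, a class that defines `__eq__` without `__hash__` cannot be used as a dictionary key. The sketch below reproduces this with `FakeNodeSet`, a hypothetical stand-in (not ClusterShell's `NodeSet`), and shows one possible workaround; the actual fix applied in the commit may differ.

```python
class FakeNodeSet(object):
    """Hypothetical stand-in: defines __eq__ but no __hash__."""

    def __init__(self, *nodes):
        self.nodes = frozenset(nodes)

    def __eq__(self, other):
        return self.nodes == other.nodes

    def __str__(self):
        return ",".join(sorted(self.nodes))


nodes = FakeNodeSet("node2", "node1")

try:
    results = {nodes: "ok"}  # Python 3: TypeError (unhashable type)
    hashable = True
except TypeError:
    hashable = False

# One possible workaround: key the dictionary on the string form instead.
results = {str(nodes): "ok"}
```
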
This commit fixes "TypeError: startswith first arg must be bytes or a
tuple of bytes, not str" in the following tests:

- send a done message which fails update due to bad property
- message with compname value is backward compatible
- message shine version 2 is compatible
- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes
- send a start and done message

This also fixes "TypeError: a bytes-like object is required, not 'str'"
error in this test:

- send a forged message which fails due to bad pickle content

Change-Id: Ic2cf841306d3689d48979311d72bb9435371aa06
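
The `startswith` error above comes from mixing `bytes` and `str` operands. A minimal sketch of a prefix check that normalizes its input to bytes first; the `MSG:` marker and the `has_prefix` helper are hypothetical and do not describe Shine's message format.

```python
PREFIX = b"MSG:"  # hypothetical marker, for illustration only


def has_prefix(line, prefix=PREFIX):
    """Check a prefix on bytes, encoding text input beforehand."""
    if not isinstance(line, bytes):
        line = line.encode("utf-8")
    return line.startswith(prefix)


from_bytes = has_prefix(b"MSG:payload")
from_text = has_prefix("MSG:payload")
unrelated = has_prefix(b"other output")
```
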
Starting with version 1.8, the ClusterShell MsgTreeElem __str__() method
raises an error. We have to call .message().decode() to get the string.
Also fix MsgTree .add() expecting bytes.

This fixes "TypeError: cannot get string from MsgTreeElem, use bytes
instead" on the following tests:

- Install a non-existent file is correctly reported
- Install to a bad node is correctly reported
- Install to a mix of bad and good nodes is correctly reported
- Install a simple file
- install on unreachable nodes raises an error
- send a done message which fails update due to bad property
- message with compname value is backward compatible
- message shine version 2 is compatible
- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes
- send a start and done message
- send a done message which fails update due to bad property
- send a forged message which fails due to bad pickle content

This fixes "TypeError: sequence item 0: expected a bytes-like object, str found" on the following tests:

- send a done message which fails update due to bad property
- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes
- send a forged message which fails due to bad pickle content

This fixes "TypeError: string argument without an encoding" on the
following tests:

- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes

Change-Id: I8ecf1797bfb7879d434576770fc1232846088d6a
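
The last two errors listed above are generic Python 3 bytes/str mismatches: `b"\n".join(["start"])` raises the "sequence item 0" error and `bytes("start")` raises "string argument without an encoding". A minimal sketch of conversion helpers that avoid both; `to_bytes` and `to_text` are hypothetical names and do not use the ClusterShell API.

```python
def to_bytes(text, encoding="utf-8"):
    """Return `text` as bytes, encoding it only when it is not bytes yet."""
    if isinstance(text, bytes):
        return text
    return text.encode(encoding)


def to_text(data, encoding="utf-8"):
    """Return `data` as text, decoding it only when it is bytes."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


# Mixed input is normalized to bytes before joining, then decoded once.
lines = [to_bytes("start"), to_bytes(b"done")]
joined = b"\n".join(lines)
text = to_text(joined)
```
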
Python 3 does not support sorted() with None values anymore. Replace
None values with -inf instead. Note that float('-inf') is preferred over
-math.inf to keep Python 2.7 compatibility.

This fixes "TypeError: '<' not supported between instances of 'int' and
'NoneType'" raised in the following tests:

- test failover node has a state and all others have none
- test master node has a state and all others have none
- test component started on two different nodes

Change-Id: I93f751889d00580ba700f31c656a2f102bdae880
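
A minimal sketch of the None-to-minus-infinity replacement described above; `none_safe_key` is a hypothetical helper name.

```python
def none_safe_key(value):
    """Sort None before any number; float('-inf') also works on Python 2.7."""
    return float('-inf') if value is None else value


ordered = sorted([3, None, 1], key=none_safe_key)
```

With this key, `ordered` is `[None, 1, 3]`, whereas a plain `sorted([3, None, 1])` raises the TypeError quoted above on Python 3.
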
Dictionaries are insertion-ordered in Python 3.6+, which changes the
expected results for these tests. Also use collections.OrderedDict() in
order to keep consistent results with Python < 3.6.

Change-Id: Icce89e2fde77bb64cda956b0202181879fbe80ca
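
As an illustration of the ordering guarantee relied on above, `collections.OrderedDict` preserves insertion order on every Python version. The keys below are hypothetical.

```python
from collections import OrderedDict

targets = OrderedDict()
targets["mgt"] = 1
targets["mdt"] = 2
targets["ost"] = 3

# Iteration follows insertion order, independently of the Python version.
order = list(targets)
```
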
This test expects integer division to get the expected result. Use the
integer division operator to avoid getting a float value that eventually
makes the test fail with "AssertionError: [6] != None".

Change-Id: Iaf493ee8c27f244f443edbc4c140ec49aba6d9da
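
The difference between the two division operators, as a short sketch with arbitrary operands:

```python
true_div = 7 / 2    # 3.5 in Python 3 (was 3 in Python 2)
floor_div = 7 // 2  # 3 in both Python 2 and Python 3
```
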
Fix regression introduced in e3e5be4 that makes the sort_key argument a
requirement to get predictable sorted output. This notably fixes random
AssertionError failures in the following test:

- fill with a support filter

Change-Id: Ic50bf0b2e52770d654c9fd7a07251b89ba04d4ee
Instead of checking against a hard-coded UnpicklingError message that is
subject to change between Python versions, check against a more generic
regex to simply check the error is properly managed by
shine_msg_unpack().

For reference with Python 3.6, the error is "unpickling stack underflow"
instead of "pop from empty list".

Change-Id: I2d9d54c1137157df94069fad11b9dcc98a2fa608
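
A minimal sketch of the idea above: match a stable, generic part of the error rather than the interpreter-specific detail. `MessageError`, `unpack` and the message wording are hypothetical and do not reproduce `shine_msg_unpack()`.

```python
import pickle
import re


class MessageError(Exception):
    """Hypothetical error type raised for unreadable messages."""


def unpack(data):
    """Unpickle `data`, wrapping any failure in a MessageError."""
    try:
        return pickle.loads(data)
    except Exception as exc:  # pickle raises various types on bad input
        raise MessageError("Cannot unpickle message: %s" % exc)


try:
    unpack(b".")  # a lone STOP opcode is not a valid pickle
    matched = False
except MessageError as exc:
    # Only the stable prefix is checked, not the version-dependent detail.
    matched = re.search(r"^Cannot unpickle message", str(exc)) is not None
```
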
Sets are not ordered in Python 3. In TuningParameter class, it has the
effect of producing unpredictable results for __str__() method. The
_node_types attribute is converted to a list to avoid spurious random
failure with "AssertionError: 'foo=0 types=client,mds nodes=toto15' !=
'foo=0 types=mds,client nodes=toto15'" in the following test:

- test TuningParameter.__str__()

Change-Id: Ia99b2e622d11d95f96a9e5fd78bf3c7c0e6d9bca
Sets in Python 3 are not ordered, thus producing unpredictable results
in the order of modules in the graph. This is fixed by converting into a
list and checking for duplicate values before insertion. This fixes the
following tests that failed randomly with "AssertionError: 'ldiskfs' !=
'lustre'":

- prepare is ok with or without tunings
- prepare a simple action on a local component

Change-Id: I2f1ba3c5cb2cbfb4cf543014282c5d1939a18d6a
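
The list-with-duplicate-check approach described above can be sketched as follows; `add_unique` is a hypothetical helper name.

```python
def add_unique(items, item):
    """Append `item` only if absent, so first-seen order is preserved."""
    if item not in items:
        items.append(item)
    return items


modules = []
for name in ("lustre", "ldiskfs", "lustre"):
    add_unique(modules, name)
```

Unlike a set, the resulting list always yields `lustre` before `ldiskfs` here.
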
Update the test to adopt standard assertRaisesRegex[p] method and update
the regex to accept error message generated on el8. This fixes the
following test:

======================================================================
FAIL: install on unreachable nodes raises an error
----------------------------------------------------------------------
Traceback (most recent call last):
  File "shine/tests/Lustre/FileSystemTest.py", line 156, in
  test_install_unreachable
    fs.install(fs_config_file=Utils.makeTempFilename())
Shine.Lustre.FileSystem.FSRemoteError: badnode[1-2]: Copy failed: ssh:
Could not resolve hostname badnode[1-2]: No address associated with
hostname
lost connection [rc=1]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "shine/tests/Lustre/FileSystemTest.py", line 162, in
  test_install_unreachable
    self.assertTrue(str(ex).endswith("badnode[1-2]: Name or service not"
AssertionError: False is not true

Change-Id: I1606b3eb04b6f7cd23d2150be8bda3cbf6d85196
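
A minimal sketch of a test that accepts either resolver message through `assertRaisesRegex` (Python 3 spelling). `CopyError` and `failing_copy` are hypothetical stand-ins, and the regex only uses the message fragments visible in the traceback above.

```python
import unittest


class CopyError(Exception):
    """Hypothetical stand-in for the remote copy error."""


def failing_copy(message):
    raise CopyError(message)


RESOLVE_ERRORS = r"(Name or service not|No address associated with hostname)"


class UnreachableTest(unittest.TestCase):
    def test_both_messages(self):
        for msg in ("badnode[1-2]: Name or service not",
                    "badnode[1-2]: No address associated with hostname"):
            with self.assertRaisesRegex(CopyError, RESOLVE_ERRORS):
                failing_copy(msg)


# Run the test case programmatically and collect the outcome.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(UnreachableTest)
result = unittest.TestResult()
suite.run(result)
```
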
ModelFile supports spaces in values when linesep='\n' but not when
linesep=' '. This patch adds quote support to allow constructions like:

    foo: bar="my test" id=7

Change-Id: I18c833cc40c4b09ba6ac0414c520c9c9c553a017
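
To illustrate the quoted-value syntax above, here is a sketch that parses the value part of such a line with the standard `shlex` module. This is an assumption made for illustration; the ModelFile parser may implement quoting differently.

```python
import shlex


def parse_values(text):
    """Split space-separated key=value pairs, honouring double quotes."""
    pairs = {}
    for token in shlex.split(text):
        key, _, value = token.partition("=")
        pairs[key] = value
    return pairs


parsed = parse_values('bar="my test" id=7')
```
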
This is a first implementation of device start/stop actions that are
defined in the model file as dev_action and associated with any targets
using dev_run=<action_alias>.

If dev_run is defined for a target, its device doesn't need to be
present on start, as the start command of the associated dev_action
will be executed prior to the target mount to allow the device to be
created. For stop, the Lustre target is first unmounted, and then the
device stop action is called.

Change-Id: I2afe0bc257e5c6742acee4ed7ea7e308126a7e66
Add a new target status: no_device when the target device is not
available on the host.

A missing device is not always an error, like with mdadm.

Change-Id: Ifd402c6b5b4f827995e5bc664a1a5ab3c8fe8ebf
Change-Id: If74f6f277a6cc79851ecf5c0d9cda59c1498dc58
The 'shine' remote executable path was based on sys.argv[0] and thus the
Shine module could only work with the provided 'shine' CLI.

Please note that with this change, 'shine' must be found in the remote
node $PATH.

Change-Id: I485d7ec70ea2ecbf41596565243a72820ec81327
Change-Id: Ic33aa9adb393a434615be18153a55f77e3cad368
This patch introduces shine-HA, an extra tool for monitoring, alerting
and performing automatic Lustre target failover.

This first patch supports monitoring and alerting of Lustre targets.

Please take a look at conf/ha.yaml as an example of configuration file.
All defined parameters are supported by the current code.

Two Alert plugins are already provided: local emails and Slack
notifications.  Additional Alert plugins may be installed in
Shine/HA/plugins/.

Three alert levels can be configured: INFO, WARN and CRIT.

Alerts on transient Lustre target states can be avoided by tuning the
thresholds found in the fs_monitor_state_count_thresholds section in
ha.yaml.

Change-Id: I222ea0b26ba9cd1d23c98fa74c780bcae562ebe1
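
The threshold idea above (ignore a state until it has been observed several times in a row) can be sketched as follows. `StateDebouncer` is a hypothetical class; it does not reflect the shine-HA code or the layout of ha.yaml.

```python
class StateDebouncer(object):
    """Report a state only after `threshold` consecutive observations."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.state = None
        self.count = 0

    def observe(self, state):
        """Return True exactly when `state` reaches the threshold."""
        if state == self.state:
            self.count += 1
        else:
            self.state = state
            self.count = 1
        return self.count == self.threshold


watcher = StateDebouncer(threshold=3)
samples = ["offline", "online", "offline", "offline", "offline", "offline"]
alerts = [watcher.observe(state) for state in samples]
```

Only the fifth sample triggers an alert: the first "offline" is interrupted by "online", and the alert fires once rather than on every later observation.
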
Change-Id: I94f86187f5ceca583ba9442757650aa616d7a6c2
Change-Id: I17eae7096ea068539b473c2aa91ae0bc0d43be0b
Change-Id: Ib00080289466d014a8218f18790d1ba8ce7ec7ea
In config (ha.yaml):
- custom pinger command
- command timeout option
- alert thresholds

Change-Id: I8f62a79b670d3baaa770da3fd312c042696d14ab
Change-Id: Icf67c14240049d14953f6d4d81fcf834f32f8782
This patch adds a basic Lustre HA feature to shine-HA. It is now able to
perform fencing of a server having all targets in error and lnet
unreachable (e.g. server crash). If the fence command is successful,
HACore will perform a failover of the targets through the shine API, of
both resident and non-resident targets on the selected failover servers.

Change-Id: I9adf723e24f57b5418ce577241f98bcaf6b4cb78
- keep track of current StatusThread
- add support for action thread invalidation
- invalidate StatusThread if still active after polling_interval
- retry status after invalidation instead of waiting forever

Change-Id: I40c91cc92e008009898c0d7c06487c0a495114fd
Change-Id: Icf4f18c3948e4733bd4f0270f190974f3acddc28
Method execute_fs() should not return None.

Change-Id: I5f9ba34c8a944b8e41df3c413f9844b19bedc1cf
Change-Id: Iea9bf672bf803f71656388e1709e251353eb87e6
Change-Id: I1c8e97183c74a2203c51ab76a6cb6b7f89cf4fe5
thiell and others added 15 commits October 11, 2024 14:58
Signed-off-by: Stephane Thiell <sthiell@stanford.edu>
Change-Id: Ia6fc00d38349bc7eb9a21f5efca4e4022ae719f1
/sys/fs/lustre/*/%s/mntdev also catches /sys/fs/lustre/mgs/%s/mntdev
which we don't want.

Change glob pattern to:

	/sys/fs/lustre/osd-*/%s/mntdev

Signed-off-by: Stephane Thiell <sthiell@stanford.edu>
Change-Id: I12faaa57bd3ab0931faf505960b749d297914d58
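
The effect of the narrower glob pattern can be checked with `fnmatch` on a list of sample paths. The two paths and the `MGS` label below are illustrative assumptions, not output from a real system.

```python
import fnmatch

paths = [
    "/sys/fs/lustre/mgs/MGS/mntdev",
    "/sys/fs/lustre/osd-ldiskfs/MGS/mntdev",
]
label = "MGS"

# The old pattern matches both entries; the new one keeps only osd-*.
old = fnmatch.filter(paths, "/sys/fs/lustre/*/%s/mntdev" % label)
new = fnmatch.filter(paths, "/sys/fs/lustre/osd-*/%s/mntdev" % label)
```
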
Signed-off-by: Stephane Thiell <sthiell@stanford.edu>
Change-Id: Ie19775cd4f97edf988dce2ff3049a4b4a6f020c8
Shine now tries to mount both lustre_tgt and lustre filesystem types.
With this option, the mount command initially tries to execute
mount.lustre_tgt. When this command is not found, it falls back to
mount.lustre. With this option, Shine is future-proof while keeping
backward compatibility.

Note that lustre_tgt is tried first as it is more likely to succeed in
the future.

Co-authored-by: Rémi Palancher <remi@rackslab.io>
Change-Id: Ibed0c8936ca0c16da26cb48577973d1408db897d
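
A minimal sketch of trying each filesystem type in order. It is an illustration only: `mount_with_fallback` and the injected `run` callable are hypothetical, the sketch falls back on any non-zero return code, and the commit itself may rely on the mount command to do the fallback.

```python
def mount_with_fallback(run, device, mountpoint,
                        fstypes=("lustre_tgt", "lustre")):
    """Try each type in order; return the first one that mounts."""
    for fstype in fstypes:
        if run(["mount", "-t", fstype, device, mountpoint]) == 0:
            return fstype
    raise RuntimeError("mount failed for types: %s" % ", ".join(fstypes))


def fake_run(cmd):
    """Test double: pretend only the 'lustre' type works on this host."""
    return 0 if cmd[2] == "lustre" else 1


used = mount_with_fallback(fake_run, "/dev/sdb", "/mnt/lustre/ost0")
```
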
Change-Id: I7eff3e77580b562aed810516c9dbe4c38b1005a3
If lnet_conf is set, (un)configure the lnet network like the lnet service
would after loading/before unloading modules.

Closes #211.

Change-Id: I9be102223248364e9b905f38e89eb9583e1f2d4a
Trying to figure out which target to start first based on flags is horrible
with "extended" mode, because that status check is done on all nodes, which
is neither necessary nor always possible.

Change-Id: I59f2115cbd444b51b795bdf0c1d7b0f9e8c5fe63
This is a bit intrusive but will eventually be necessary, for example to
avoid checking all failover targets on status.

Change-Id: I4dcdf855e7fdec88cfe95fa12dabd0aa40001fe6
…evel

Status doesn't actually need to check disk level details for most
usages:
 - when the filesystem is started, info from /proc is usually enough
 - even if it is offline, the info displayed doesn't require flags, etc.
unless the formatting option requires it (-V target or -O %flags)

Change-Id: I1a64a0f490bb287f965607d177e0d6511ca980f9
Change-Id: I9f0a10bee0cb7d3282d60142cec79c864f7731cb
Lets us not unload modules after stop or umount.
This is useful for e.g. HA targets where we do not want to unload
modules on voluntary migration

Change-Id: I96e196b077812f0e18e590507a20d6feaaa65172
There are problems with writeconf if secondary MDTs register before MDT 0

It might be OK to start OSTs with secondary MDTs, though -- should we,
or is delaying OK?

Change-Id: I865e5a386d7d486d12c3d4e2459bed1cafb7d4fe
Only implemented for tune right now as it is the most easily problematic

Change-Id: I0e7326a36214a0e01ad7aec8038b54b74d73fab5
The change to a Python setup entry script made shine lose its return code
value (it used to call sys.exit()).
Adding a return statement makes the script properly exit with the error code.

Fixes: b398a56 ("packaging: lots of cleaning of shine distribution")
Change-Id: I12c9e3e570bb8e5fcde88746bf2ed8811a97a72d
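
Why the return statement matters: a generated entry-point script typically calls `sys.exit(main())`, so a `main()` that returns nothing always exits with 0. The sketch below mimics that wrapper; `main` and `console_script_wrapper` are hypothetical and are not Shine's code.

```python
import sys


def main(argv):
    """Return an exit status instead of calling sys.exit() directly."""
    if not argv:
        return 2  # usage error
    return 0


def console_script_wrapper(argv):
    """Mimic an entry-point script and capture the resulting exit code."""
    try:
        sys.exit(main(argv))
    except SystemExit as exc:
        return exc.code


failure_code = console_script_wrapper([])
success_code = console_script_wrapper(["status"])
```
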
Allow same fs to be mounted multiple times on a client
Will be removed when we're done with store_ct

Change-Id: Id12c7be3b78b66dcf2303bb0edcefb7b4f7b2a36
@rezib changed the title from "Rebased updates from stanford and CEA" to "Rebased updates from Stanford and CEA" on Apr 22, 2025
@cedeyn self-assigned this Apr 29, 2025
rezib added 2 commits May 15, 2025 15:25
GitHub Actions workers are slow; this test can need more than 3
seconds to complete.
Accept DNS error message reported by scp in Ubuntu on GitHub Actions
workers.
@cedeyn merged commit 06d760c into cea-hpc:master Jun 3, 2025
1 check passed

4 participants