Rebased updates from Stanford and CEA by rezib · Pull Request #226 · cea-hpc/shine

rezib · 2025-04-22T15:18:37Z

This pull request contains all updates retrieved from the Stanford and CEA production branches, including Python 3 support and the Shine-HA feature.

Edit: 2 additional commits have been pushed to make the unit tests pass in GitHub Actions environments. A successful CI run can be found here: https://github.com/rezib/shine/actions/runs/15046341361/job/42289732238

In Python 3, __import__ expects an absolute module name. This fixes the "ModuleNotFoundError: No module named 'File'" error for the following tests:
- target with ha_node
- target with several ha_nodes
- test_index_external (Configuration.BackendFileTest.BackendFileTest.test_index_external)
- test_multiple_matches (Configuration.BackendFileTest.BackendFileTest.test_multiple_matches)
- test_simple_mgs (Configuration.BackendFileTest.BackendFileTest.test_simple_mgs)

Change-Id: If8306c096c25e7486d128fdf72b9a44f883898bc
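As an illustration of the import behaviour described above, here is a minimal sketch. It builds a throwaway package on disk (the names `demo_backend` and `BACKEND_NAME` are hypothetical, not Shine's) and uses `importlib.import_module` with an absolute dotted name; the actual commit may well keep using `__import__`.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk: demo_backend/File.py (hypothetical names).
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_backend")
os.mkdir(pkg_dir)
open(os.path.join(pkg_dir, "__init__.py"), "w").close()
with open(os.path.join(pkg_dir, "File.py"), "w") as handle:
    handle.write("BACKEND_NAME = 'File'\n")
sys.path.insert(0, root)


def load_backend(name):
    """Load a backend submodule through its absolute dotted name."""
    return importlib.import_module("demo_backend." + name)


# The bare submodule name is not importable in Python 3: there is no
# implicit relative import anymore.
try:
    __import__("File")
    bare_import_worked = True
except ImportError:  # ModuleNotFoundError is a subclass of ImportError
    bare_import_worked = False

backend = load_backend("File")
print(bare_import_worked, backend.BACKEND_NAME)
```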

This commit fixes the "TypeError: unhashable type: 'NodeSet'" error raised in the following tests:
- Install a non-existent file is correctly reported
- Install to a bad node is correctly reported
- Install to a mix of bad and good nodes is correctly reported
- Install a simple file
- send a done message which fails update due to bad property
- message with compname value is backward compatible
- message shine version 2 is compatible
- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes
- send a start and done message
- install on unreachable nodes raises an error

Change-Id: Ia7625b1da6436543e8f9ec8cb8bebbd4004a3537

This commit fixes "TypeError: startswith first arg must be bytes or a tuple of bytes, not str" in the following tests:
- send a done message which fails update due to bad property
- message with compname value is backward compatible
- message shine version 2 is compatible
- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes
- send a start and done message

This also fixes the "TypeError: a bytes-like object is required, not 'str'" error in this test:
- send a forged message which fails due to bad pickle content

Change-Id: Ic2cf841306d3689d48979311d72bb9435371aa06
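The first error above comes from mixing `bytes` and `str` in Python 3. The sketch below reproduces it with a made-up prefix (`MSG:` is a hypothetical marker, not Shine's wire format) and shows one way to normalise the input before comparing.

```python
PREFIX = b"MSG:"  # hypothetical marker, for illustration only


def is_message(line):
    """Return True when the line starts with PREFIX, for bytes or str input."""
    if isinstance(line, str):
        line = line.encode()
    return line.startswith(PREFIX)


# In Python 3, calling bytes.startswith() with a str argument raises TypeError.
try:
    b"MSG:payload".startswith("MSG:")
    mixed_types_ok = True
except TypeError:
    mixed_types_ok = False

print(mixed_types_ok, is_message(b"MSG:payload"), is_message("MSG:payload"))
```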

Starting with version 1.8, the ClusterShell MsgTreeElem __str__() method raises an error. We have to call .message().decode() to get the string. Also fix MsgTree .add() expecting bytes.

This fixes "TypeError: cannot get string from MsgTreeElem, use bytes instead" on the following tests:
- Install a non-existent file is correctly reported
- Install to a bad node is correctly reported
- Install to a mix of bad and good nodes is correctly reported
- Install a simple file
- install on unreachable nodes raises an error
- send a done message which fails update due to bad property
- message with compname value is backward compatible
- message shine version 2 is compatible
- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes
- send a start and done message
- send a forged message which fails due to bad pickle content

This fixes "TypeError: sequence item 0: expected a bytes-like object, str found" on the following tests:
- send a done message which fails update due to bad property
- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes
- send a forged message which fails due to bad pickle content

This fixes "TypeError: string argument without an encoding" on the following tests:
- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes

Change-Id: I8ecf1797bfb7879d434576770fc1232846088d6a

Python 3 does not support sorted() with None values anymore. Replace None values with -inf instead. Note that float('-inf') is preferred over -math.inf to keep Python 2.7 compatibility.

This fixes "TypeError: '<' not supported between instances of 'int' and 'NoneType'" raised in the following tests:
- test failover node has a state and all others have none
- test master node has a state and all others have none
- test component started on two different nodes

Change-Id: I93f751889d00580ba700f31c656a2f102bdae880
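A minimal sketch of the None-to-minus-infinity substitution described above. The `states` list and the `state_sort_key` helper are made up for illustration and are not taken from the Shine code.

```python
def state_sort_key(value):
    """Map None to -inf so it sorts before any number (works on 2.7 and 3)."""
    return float('-inf') if value is None else value


states = [3, None, 0, None, 7]  # illustrative values only

# Without a key, Python 3 raises TypeError when comparing int and None.
try:
    sorted(states)
    plain_sort_ok = True
except TypeError:
    plain_sort_ok = False

ordered = sorted(states, key=state_sort_key)
print(plain_sort_ok, ordered)
```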

Dictionaries preserve insertion order in Python 3.6+, which changes the expected results for these tests. Also use collections.OrderedDict() in order to keep consistent results with Python < 3.6.

Change-Id: Icce89e2fde77bb64cda956b0202181879fbe80ca
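For reference, a short sketch of how `collections.OrderedDict` pins the iteration order regardless of the Python version. The keys and values are placeholders, not data from the test suite.

```python
from collections import OrderedDict

# Placeholder entries; the order of insertion is the order of iteration.
targets = OrderedDict()
targets["mgt"] = "online"
targets["mdt"] = "online"
targets["ost"] = "offline"

keys_in_order = list(targets)
print(keys_in_order)
```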

This test expects integer division to get the expected result. Use the integer division operator to avoid getting a float value, which eventually makes the test fail with "AssertionError: [6] != None".

Change-Id: Iaf493ee8c27f244f443edbc4c140ec49aba6d9da
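The difference between the two operators in Python 3 can be sketched as follows; the `targets` list is an arbitrary example, not the data used by the test.

```python
targets = ["ost0", "ost1", "ost2", "ost3", "ost4", "ost5", "ost6"]

true_div = len(targets) / 2    # true division: a float in Python 3
floor_div = len(targets) // 2  # floor division: an int on both 2.7 and 3

# Only the integer result can be used as a list index.
middle = targets[floor_div]
print(true_div, floor_div, middle)
```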

Fix regression introduced in e3e5be4 that makes the sort_key argument a requirement to get predictable sorted output. This notably fixes random AssertionError failures in the following test:
- fill with a support filter

Change-Id: Ic50bf0b2e52770d654c9fd7a07251b89ba04d4ee

Instead of checking against a hard-coded UnpicklingError message that is subject to change between Python versions, check against a more generic regex to simply verify that the error is properly handled by shine_msg_unpack(). For reference, with Python 3.6 the error is "unpickling stack underflow" instead of "pop from empty list".

Change-Id: I2d9d54c1137157df94069fad11b9dcc98a2fa608
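The idea of matching a stable prefix rather than the interpreter's exact wording can be sketched like this. `UnpackError` and `unpack()` are hypothetical stand-ins, not the real shine_msg_unpack() or its exception type, and the error text is invented for the example.

```python
import pickle
import unittest


class UnpackError(Exception):
    """Hypothetical error type standing in for what the real code raises."""


def unpack(payload):
    """Unpickle a payload, wrapping any failure in UnpackError."""
    try:
        return pickle.loads(payload)
    except Exception as exc:  # pickle raises several types on corrupt input
        raise UnpackError("Cannot unpickle message: %s" % exc)


class UnpackTest(unittest.TestCase):
    def test_forged_payload(self):
        # Match only the stable prefix, not the version-dependent detail.
        with self.assertRaisesRegex(UnpackError, r"^Cannot unpickle message"):
            unpack(b"0")  # a lone POP opcode is not a valid pickle


suite = unittest.defaultTestLoader.loadTestsFromTestCase(UnpackTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

On Python 2.7 the equivalent method is spelled assertRaisesRegexp.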

Sets are not ordered in Python 3. In the TuningParameter class, this produces unpredictable results for the __str__() method. The _node_types attribute is converted to a list to avoid spurious random failures with "AssertionError: 'foo=0 types=client,mds nodes=toto15' != 'foo=0 types=mds,client nodes=toto15'" in the following test:
- test TuningParameter.__str__()

Change-Id: Ia99b2e622d11d95f96a9e5fd78bf3c7c0e6d9bca

Sets in Python 3 are not ordered, thus producing unpredictable results in the ordering of modules in the graph. This is fixed by converting the set into a list and checking for duplicate values before insertion. This fixes the following tests that failed randomly with "AssertionError: 'ldiskfs' != 'lustre'":
- prepare is ok with or without tunings
- prepare a simple action on a local component

Change-Id: I2f1ba3c5cb2cbfb4cf543014282c5d1939a18d6a
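A minimal sketch of the "list plus duplicate check" approach described above. The `add_unique` helper is a hypothetical name, and the module names are only borrowed from the assertion message.

```python
def add_unique(items, value):
    """Append value only if it is not already present, preserving order."""
    if value not in items:
        items.append(value)


modules = []
for name in ["lustre", "ldiskfs", "lustre"]:
    add_unique(modules, name)

print(modules)
```

The membership test is linear in the list length, which should be fine for a handful of module names.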

Update the test to adopt the standard assertRaisesRegex[p] method and update the regex to accept the error message generated on el8. This fixes the following test:

======================================================================
FAIL: install on unreachable nodes raises an error
----------------------------------------------------------------------
Traceback (most recent call last):
  File "shine/tests/Lustre/FileSystemTest.py", line 156, in test_install_unreachable
    fs.install(fs_config_file=Utils.makeTempFilename())
Shine.Lustre.FileSystem.FSRemoteError: badnode[1-2]: Copy failed: ssh: Could not resolve hostname badnode[1-2]: No address associated with hostname lost connection [rc=1]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "shine/tests/Lustre/FileSystemTest.py", line 162, in test_install_unreachable
    self.assertTrue(str(ex).endswith("badnode[1-2]: Name or service not"
AssertionError: False is not true

Change-Id: I1606b3eb04b6f7cd23d2150be8bda3cbf6d85196

ModelFile supports spaces in values when linesep='\n' but not when linesep=' '. This patch adds quote support to allow constructions like:

foo: bar="my test" id=7

Change-Id: I18c833cc40c4b09ba6ac0414c520c9c9c553a017
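To illustrate what quote-aware splitting of such a line looks like, here is a sketch based on the standard `shlex` module. This is not ModelFile's actual parser; `parse_inline` is a hypothetical helper written only for this example.

```python
import shlex


def parse_inline(line):
    """Split 'key: a="x y" b=1' into (key, {'a': 'x y', 'b': '1'})."""
    key, _, rest = line.partition(":")
    fields = {}
    # shlex.split() keeps quoted text together and strips the quotes.
    for token in shlex.split(rest):
        name, _, value = token.partition("=")
        fields[name] = value
    return key.strip(), fields


key, fields = parse_inline('foo: bar="my test" id=7')
print(key, fields)
```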

This is a first implementation of device start/stop actions that are defined in the model file as dev_action and associated with any targets using dev_run=<action_alias>. If dev_run is defined for a target, its device doesn't need to be present on start, as the start command of the associated dev_action will be executed prior to the target mount to allow the device to be created. For stop, the Lustre target is first unmounted, and then the device stop action is called. Change-Id: I2afe0bc257e5c6742acee4ed7ea7e308126a7e66

Add a new target status: no_device when the target device is not available on the host. A missing device is not always an error, like with mdadm. Change-Id: Ifd402c6b5b4f827995e5bc664a1a5ab3c8fe8ebf

Change-Id: If74f6f277a6cc79851ecf5c0d9cda59c1498dc58

The 'shine' remote executable path was based on sys.argv[0], so the Shine module could only work with the provided 'shine' CLI. Please note that with this change, 'shine' must be found in the remote node $PATH.

Change-Id: I485d7ec70ea2ecbf41596565243a72820ec81327

Change-Id: Ic33aa9adb393a434615be18153a55f77e3cad368

This patch introduces shine-HA, an extra tool for monitoring, alerting and performing automatic Lustre target failover. This first patch supports monitoring and alerting of Lustre targets. Please take a look at conf/ha.yaml for an example configuration file. All defined parameters are supported by the current code. Two Alert plugins are already provided: local emails and Slack notifications. Additional Alert plugins may be installed in Shine/HA/plugins/. Three alert levels can be configured: INFO, WARN and CRIT. Alerts on transient Lustre target states can be avoided by tuning the thresholds found in the fs_monitor_state_count_thresholds section in ha.yaml.

Change-Id: I222ea0b26ba9cd1d23c98fa74c780bcae562ebe1

Change-Id: I94f86187f5ceca583ba9442757650aa616d7a6c2

Change-Id: I17eae7096ea068539b473c2aa91ae0bc0d43be0b

Change-Id: Ib00080289466d014a8218f18790d1ba8ce7ec7ea

In config (ha.yaml):
- custom pinger command
- command timeout option
- alert thresholds

Change-Id: I8f62a79b670d3baaa770da3fd312c042696d14ab

Change-Id: Icf67c14240049d14953f6d4d81fcf834f32f8782

This patch adds a basic Lustre HA feature to shine-HA. It is now able to fence a server that has all its targets in error and LNet unreachable (e.g. server crash). If the fence command is successful, HACore will perform a failover of the targets through the shine API, for both resident and non-resident targets on the selected failover servers.

Change-Id: I9adf723e24f57b5418ce577241f98bcaf6b4cb78

- keep track of current StatusThread
- add support for action thread invalidation
- invalidate StatusThread if still active after polling_interval
- retry status after invalidation instead of waiting forever

Change-Id: I40c91cc92e008009898c0d7c06487c0a495114fd

Change-Id: Icf4f18c3948e4733bd4f0270f190974f3acddc28

Method execute_fs() should not return None. Change-Id: I5f9ba34c8a944b8e41df3c413f9844b19bedc1cf

Change-Id: Iea9bf672bf803f71656388e1709e251353eb87e6

Change-Id: I1c8e97183c74a2203c51ab76a6cb6b7f89cf4fe5

Signed-off-by: Stephane Thiell <sthiell@stanford.edu> Change-Id: Ia6fc00d38349bc7eb9a21f5efca4e4022ae719f1

/sys/fs/lustre/*/%s/mntdev also catches /sys/fs/lustre/mgs/%s/mntdev, which we don't want. Change glob pattern to: /sys/fs/lustre/osd-*/%s/mntdev

Signed-off-by: Stephane Thiell <sthiell@stanford.edu>
Change-Id: I12faaa57bd3ab0931faf505960b749d297914d58
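The effect of narrowing the pattern can be checked without a Lustre node by matching sample paths with `fnmatch`. The two paths below are illustrative assumptions about what such a sysfs tree may contain, and `fnmatchcase` is used here only as a stand-in for a real `glob` call on the filesystem.

```python
from fnmatch import fnmatchcase

OLD_PATTERN = "/sys/fs/lustre/*/%s/mntdev"
NEW_PATTERN = "/sys/fs/lustre/osd-*/%s/mntdev"

# Illustrative paths only; a real node may expose different entries.
paths = [
    "/sys/fs/lustre/osd-ldiskfs/MGS/mntdev",
    "/sys/fs/lustre/mgs/MGS/mntdev",
]

old_hits = [p for p in paths if fnmatchcase(p, OLD_PATTERN % "MGS")]
new_hits = [p for p in paths if fnmatchcase(p, NEW_PATTERN % "MGS")]
print(len(old_hits), new_hits)
```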

Signed-off-by: Stephane Thiell <sthiell@stanford.edu> Change-Id: Ie19775cd4f97edf988dce2ff3049a4b4a6f020c8

Shine now tries to mount both lustre_tgt and lustre filesystem types. With this option, the mount command initially tries to execute mount.lustre_tgt. When this command is not found, it falls back to mount.lustre. With this option, Shine is future-proof while keeping backward compatibility. Note that lustre_tgt is tried first as it is more likely to succeed in the future.

Co-authored-by: Rémi Palancher <remi@rackslab.io>
Change-Id: Ibed0c8936ca0c16da26cb48577973d1408db897d
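The try-then-fall-back order described above could look roughly like the sketch below. Everything in it is hypothetical: `HelperNotFound`, `mount_with_fallback` and the fake runner are invented for illustration, and the way Shine actually detects a missing mount helper is not shown in this pull request text.

```python
class HelperNotFound(Exception):
    """Hypothetical signal that the mount.<fstype> helper is missing."""


def mount_with_fallback(run_mount, device, mountpoint,
                        fstypes=("lustre_tgt", "lustre")):
    """Try each filesystem type in order and return the one that worked."""
    for fstype in fstypes:
        try:
            run_mount(["mount", "-t", fstype, device, mountpoint])
        except HelperNotFound:
            continue  # helper missing: try the next type
        return fstype
    raise HelperNotFound("no mount helper found for %s" % ", ".join(fstypes))


# Fake runner for illustration: pretend only mount.lustre is installed.
attempts = []


def fake_run(argv):
    attempts.append(argv[2])
    if argv[2] != "lustre":
        raise HelperNotFound(argv[2])


chosen = mount_with_fallback(fake_run, "/dev/sdb", "/mnt/demo")
print(chosen, attempts)
```

Passing the runner as a parameter keeps the ordering logic testable without touching a real mount command.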

Change-Id: I7eff3e77580b562aed810516c9dbe4c38b1005a3

If lnet_conf is set, (un)configure the lnet network like the lnet service would after loading/before unloading modules. Closes #211. Change-Id: I9be102223248364e9b905f38e89eb9583e1f2d4a

Trying to figure out which target to start first based on flags is horrible with "extended" mode, because that status is run on all nodes, which is neither necessary nor always possible.

Change-Id: I59f2115cbd444b51b795bdf0c1d7b0f9e8c5fe63

This is a bit intrusive, but will eventually be necessary to avoid checking all failover targets on status, for example.

Change-Id: I4dcdf855e7fdec88cfe95fa12dabd0aa40001fe6

…evel

Status doesn't actually need to check disk-level details for most usages:
- when the filesystem is started, info from /proc is usually enough
- even if it is offline, the info displayed doesn't require flags/etc. unless the formatting option requires it (-V target or -O %flags)

Change-Id: I1a64a0f490bb287f965607d177e0d6511ca980f9

Change-Id: I9f0a10bee0cb7d3282d60142cec79c864f7731cb

Lets us not unload modules after stop or umount. This is useful for e.g. HA targets where we do not want to unload modules on voluntary migration Change-Id: I96e196b077812f0e18e590507a20d6feaaa65172

There are problems with writeconf if secondary MDTs register before MDT 0. It might be OK to start OSTs with secondary MDTs, though -- should we, or is delaying OK?

Change-Id: I865e5a386d7d486d12c3d4e2459bed1cafb7d4fe

Only implemented for tune right now as it is the most easily problematic Change-Id: I0e7326a36214a0e01ad7aec8038b54b74d73fab5

The change to the Python setup entry script made shine lose its return code value (it used to be passed to sys.exit()). Adding a return statement makes the script properly exit with an error code.

Fixes: b398a56 ("packaging: lots of cleaning of shine distribution")
Change-Id: I12c9e3e570bb8e5fcde88746bf2ed8811a97a72d
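The effect of the missing return can be sketched as follows. The function names are hypothetical, and the sketch relies on the fact that setuptools console-script wrappers pass the entry point's return value to sys.exit().

```python
import sys


def run(argv):
    """Hypothetical command: fail with code 2 when no arguments are given."""
    return 0 if argv else 2


def main_without_return(argv):
    run(argv)  # result dropped: the function returns None, i.e. exit status 0


def main(argv):
    return run(argv)  # result propagated to the caller


def exit_code(entry_point, argv):
    """Mimic a console-script wrapper: sys.exit(entry_point(...))."""
    try:
        sys.exit(entry_point(argv))
    except SystemExit as exc:
        return exc.code


lost = exit_code(main_without_return, [])
kept = exit_code(main, [])
print(lost, kept)
```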

Allow the same fs to be mounted multiple times on a client. Will be removed when we're done with store_ct.

Change-Id: Id12c7be3b78b66dcf2303bb0edcefb7b4f7b2a36

GitHub Actions workers are slow; this test happens to need more than 3 seconds to complete.

Accept DNS error message reported by scp in Ubuntu on GitHub Actions workers.

rezib and others added 30 commits October 8, 2024 16:53

Tests: use integer division operator

ba57a10


ModelFile: add support for quoted values for inline ModelFile

05fd4c7


New target status: "no_device"

28709f2


Fsck: start device if dev_action/dev_run is used

87d172c


FSProxyAction: use logging instead of print

741a5b2


shine-HA: add test library and first test scenario

eb98e46


shine-HA: add shine-ha.service and update specfile

3a17431


shine-HA: fix AlertManager issue with tests

a29e2a3


shine-HA: add LNet NIDs monitoring

2d746b9


HA.plugins: use TextTable to display filesystem status

a5e5143


Port HA code to Python 3 (+ spec, setup.py, mkrelease.sh)

6854190


Python 3: fix missing return code in Commands/Config.py

dd9a337


Python 3: result.duration may be None, add explicit check

3ef8b48


Fix various issues with Python 3.9.16 / EL9.2

1be5e6a


thiell and others added 15 commits October 11, 2024 14:58

Server: implement ordering (string based) to allow sort

67695e3


Command: fix MsgTree message handling for Python 3

f5799d8


Lustre/Target.py: support lustre_tgt

d2a8af7


Ticket #211: Add lnet_conf config option

074d30e


Lustre.Target: make mountdata multiple choice

4d49ef6


status: add option --no-ha to not check HA

32dea28


Lustre: add a --nounload option

0f59036


Start order: Always start MDT-0000 before other MDTs, too

d5ab592


Dry-run: do not call set_status directly from _launch

0ba73e1


LOCAL Client.py enhance multimount of same fs

cc3d4c6


rezib changed the title ~~Rebased updates from stanford and CEA~~ Rebased updates from Stanford and CEA Apr 22, 2025

cedeyn self-assigned this Apr 29, 2025

rezib added 2 commits May 15, 2025 15:25

tests: increase nid map threashold

8dae905


tests: add possible scp error message

83e8cc7


cedeyn merged commit 06d760c into cea-hpc:master Jun 3, 2025
1 check passed

cedeyn merged 47 commits into cea-hpc:master from rezib:pr/all-updates
