Rebased updates from Stanford and CEA #226
Merged
In Python 3, __import__ expects an absolute module name. This fixes
"ModuleNotFoundError: No module named 'File'" error for the following
tests:
- target with ha_node
- target with several ha_nodes
- test_index_external (Configuration.BackendFileTest.BackendFileTest
.test_index_external)
- test_multiple_matches (Configuration.BackendFileTest.BackendFileTest
.test_multiple_matches)
- test_simple_mgs (Configuration.BackendFileTest.BackendFileTest
.test_simple_mgs)
Change-Id: If8306c096c25e7486d128fdf72b9a44f883898bc
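As a rough illustration of the absolute-name requirement (not the actual Shine change), the sketch below builds a fully qualified module name before importing it. The helper name load_submodule and the use of the standard json package are assumptions made for this example only.

```python
import importlib


def load_submodule(package, name):
    """Import `name` from `package` using its absolute dotted name.

    Hypothetical helper: Python 3 dropped implicit relative imports, so a
    bare name such as "decoder" would only be searched as a top-level module.
    """
    return importlib.import_module("%s.%s" % (package, name))


module = load_submodule("json", "decoder")
print(module.__name__)
```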
This commit fixes "TypeError: unhashable type: 'NodeSet'" error raised in
the following tests:
- Install a non-existent file is correctly reported
- Install to a bad node is correctly reported
- Install to a mix of bad and good nodes is correctly reported
- Install a simple file
- send a done message which fails update due to bad property
- message with compname value is backward compatible
- message shine version 2 is compatible
- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes
- send a start and done message
- install on unreachable nodes raises an error
Change-Id: Ia7625b1da6436543e8f9ec8cb8bebbd4004a3537
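For context, a Python 3 class that defines __eq__ without __hash__ becomes unhashable, which is one common source of this kind of error. The sketch below uses a hypothetical Nodes class, not ClusterShell's NodeSet, and a string key as one possible workaround. Whether the commit uses the same approach is not stated here.

```python
class Nodes(object):
    """Hypothetical stand-in for a set-like class that only defines __eq__."""

    def __init__(self, names):
        self.names = frozenset(names)

    def __eq__(self, other):
        return self.names == other.names

    def __str__(self):
        return ",".join(sorted(self.names))


nodes = Nodes(["node1", "node2"])
results = {}
try:
    # On Python 3 this raises TypeError: unhashable type: 'Nodes'
    results[nodes] = "ok"
except TypeError:
    # A string representation is always hashable
    results[str(nodes)] = "ok"
print(results)
```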
This commit fixes "TypeError: startswith first arg must be bytes or a
tuple of bytes, not str" in the following tests:
- send a done message which fails update due to bad property
- message with compname value is backward compatible
- message shine version 2 is compatible
- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes
- send a start and done message
This also fixes "TypeError: a bytes-like object is required, not 'str'"
error in this test:
- send a forged message which fails due to bad pickle content
Change-Id: Ic2cf841306d3689d48979311d72bb9435371aa06
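The startswith error comes from mixing bytes and str, which Python 3 no longer converts implicitly. A minimal sketch of a prefix check that accepts both types follows. The function name and the "SHINE:" prefix are illustrative assumptions, not taken from the patch.

```python
def has_prefix(line, prefix="SHINE:"):
    """Return True if `line` (bytes or str) starts with `prefix` (str)."""
    if isinstance(line, bytes):
        # bytes.startswith() rejects str arguments on Python 3,
        # so encode the prefix before comparing.
        return line.startswith(prefix.encode("utf-8"))
    return line.startswith(prefix)


print(has_prefix(b"SHINE:2:payload"))
print(has_prefix("unrelated output"))
```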
Starting with version 1.8, the ClusterShell MsgTreeElem __str__() method
raises an error. We have to call .message().decode() to get the string.
Also fix MsgTree .add() expecting bytes.
This fixes "TypeError: cannot get string from MsgTreeElem, use bytes
instead" in the following tests:
- Install a non-existent file is correctly reported
- Install to a bad node is correctly reported
- Install to a mix of bad and good nodes is correctly reported
- Install a simple file
- install on unreachable nodes raises an error
- send a done message which fails update due to bad property
- message with compname value is backward compatible
- message shine version 2 is compatible
- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes
- send a start and done message
- send a forged message which fails due to bad pickle content
This fixes "TypeError: sequence item 0: expected a bytes-like object, str
found" in the following tests:
- send a done message which fails update due to bad property
- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes
- send a forged message which fails due to bad pickle content
This fixes "TypeError: string argument without an encoding" in the
following tests:
- send a start and done message and then crashes
- simulate unable to run python
- send a start message then crashes
Change-Id: I8ecf1797bfb7879d434576770fc1232846088d6a
Python 3 does not support sorted() with None values anymore. Replace
None values with -inf instead. Note that float('-inf') is preferred over
-math.inf to keep Python 2.7 compatibility.
This fixes "TypeError: '<' not supported between instances of 'int' and
'NoneType'" raised in the following tests:
- test failover node has a state and all others have none
- test master node has a state and all others have none
- test component started on two different nodes
Change-Id: I93f751889d00580ba700f31c656a2f102bdae880
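The idea can be sketched with a sort key that maps None to float('-inf'). This is an illustration only: the helper name and sample values are assumptions, and the patch may substitute the values directly rather than use a key function.

```python
def none_first(value):
    """Sort key mapping None to -inf so it compares with numbers.

    float('-inf') works on both Python 2.7 and Python 3.
    """
    return float("-inf") if value is None else value


states = [3, None, 1, None, 2]
print(sorted(states, key=none_first))  # [None, None, 1, 2, 3]
```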
Dictionaries are insertion-ordered in Python 3.6+, which changes the
expected results for these tests. Also use collections.OrderedDict() to
keep results consistent with Python < 3.6.
Change-Id: Icce89e2fde77bb64cda956b0202181879fbe80ca
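A short sketch of why OrderedDict gives the same output on every interpreter version. The keys and values are made up for the example and do not come from the Shine tests.

```python
from collections import OrderedDict

# OrderedDict preserves insertion order on Python 2.7 and on every
# Python 3 release, so the rendered string is the same everywhere.
params = OrderedDict()
params["fsname"] = "testfs"
params["mode"] = "managed"
params["index"] = "0"

rendered = " ".join("%s=%s" % item for item in params.items())
print(rendered)  # fsname=testfs mode=managed index=0
```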
This test expects integer division to get the expected result. Use the
integer division operator to avoid getting a float value, which eventually
makes the test fail with "AssertionError: [6] != None".
Change-Id: Iaf493ee8c27f244f443edbc4c140ec49aba6d9da
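The difference between the two operators in Python 3 can be shown in a few lines. The function name and numbers are arbitrary and unrelated to the test itself.

```python
def middle_index(count):
    """Return an integer index using floor division.

    With "/" Python 3 would return a float (for example 6.5), which cannot
    be used as a list index and does not compare equal to an int result.
    """
    return count // 2


print(13 / 2)            # 6.5 on Python 3
print(middle_index(13))  # 6
```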
Fix regression introduced in e3e5be4 that made the sort_key argument a
requirement to get predictable sorted output. This notably fixes random
AssertionError failures in the following test:
- fill with a support filter
Change-Id: Ic50bf0b2e52770d654c9fd7a07251b89ba04d4ee
Instead of checking against a hard-coded UnpicklingError message that is
subject to change between Python versions, match a more generic regex that
simply verifies the error is properly handled by shine_msg_unpack(). For
reference, with Python 3.6 the error is "unpickling stack underflow"
instead of "pop from empty list".
Change-Id: I2d9d54c1137157df94069fad11b9dcc98a2fa608
Sets are not ordered in Python 3. In the TuningParameter class, this makes
the output of the __str__() method unpredictable. The _node_types
attribute is converted to a list to avoid spurious random failures with
"AssertionError: 'foo=0 types=client,mds nodes=toto15' != 'foo=0
types=mds,client nodes=toto15'" in the following test:
- test TuningParameter.__str__()
Change-Id: Ia99b2e622d11d95f96a9e5fd78bf3c7c0e6d9bca
Sets in Python 3 are not ordered, thus producing unpredictable results in
the order of modules in the graph. This is fixed by converting the set
into a list and checking for duplicate values before insertion. This fixes
the following tests that failed randomly with "AssertionError: 'ldiskfs'
!= 'lustre'":
- prepare is ok with or without tunings
- prepare a simple action on a local component
Change-Id: I2f1ba3c5cb2cbfb4cf543014282c5d1939a18d6a
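The list-with-duplicate-check approach described above could look roughly like this. The helper name and the module order are assumptions for illustration, not the code from the patch.

```python
def add_unique(items, item):
    """Append `item` to the list `items` unless it is already present.

    Keeps set-like uniqueness while preserving insertion order.
    """
    if item not in items:
        items.append(item)
    return items


modules = []
for name in ["lustre", "ldiskfs", "lustre"]:
    add_unique(modules, name)
print(modules)  # ['lustre', 'ldiskfs']
```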
Update the test to adopt standard assertRaisesRegex[p] method and update
the regex to accept error message generated on el8. This fixes the
following test:
======================================================================
FAIL: install on unreachable nodes raises an error
----------------------------------------------------------------------
Traceback (most recent call last):
File "shine/tests/Lustre/FileSystemTest.py", line 156, in
test_install_unreachable
fs.install(fs_config_file=Utils.makeTempFilename())
Shine.Lustre.FileSystem.FSRemoteError: badnode[1-2]: Copy failed: ssh:
Could not resolve hostname badnode[1-2]: No address associated with
hostname
lost connection [rc=1]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "shine/tests/Lustre/FileSystemTest.py", line 162, in
test_install_unreachable
self.assertTrue(str(ex).endswith("badnode[1-2]: Name or service not"
AssertionError: False is not true
Change-Id: I1606b3eb04b6f7cd23d2150be8bda3cbf6d85196
ModelFile supports spaces in values when linesep='\n' but not when
linesep=' '. This patch adds quote support to allow constructions like:
foo: bar="my test" id=7
Change-Id: I18c833cc40c4b09ba6ac0414c520c9c9c553a017
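To illustrate the kind of quote-aware splitting this enables, here is a sketch based on the standard shlex module. It is not ModelFile's actual parser, and the parse_values name is hypothetical.

```python
import shlex


def parse_values(text):
    """Split space-separated key=value pairs, honouring double quotes.

    Illustrative only: shlex removes the quotes and keeps the quoted
    spaces inside a single token.
    """
    pairs = {}
    for token in shlex.split(text):
        key, _, value = token.partition("=")
        pairs[key] = value
    return pairs


print(parse_values('bar="my test" id=7'))  # {'bar': 'my test', 'id': '7'}
```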
This is a first implementation of device start/stop actions that are
defined in the model file as dev_action and associated with any targets
using dev_run=<action_alias>. If dev_run is defined for a target, its
device doesn't need to be present on start, as the start command of the
associated dev_action will be executed prior to the target mount to allow
the device to be created. For stop, the Lustre target is first unmounted,
and then the device stop action is called.
Change-Id: I2afe0bc257e5c6742acee4ed7ea7e308126a7e66
Add a new target status, no_device, for when the target device is not
available on the host. A missing device is not always an error, like with
mdadm.
Change-Id: Ifd402c6b5b4f827995e5bc664a1a5ab3c8fe8ebf
Change-Id: If74f6f277a6cc79851ecf5c0d9cda59c1498dc58
The 'shine' remote executable path was based on sys.argv[0], and thus the
Shine module could only work with the provided 'shine' CLI. Please note
that with this change, 'shine' must be found in the remote node $PATH.
Change-Id: I485d7ec70ea2ecbf41596565243a72820ec81327
Change-Id: Ic33aa9adb393a434615be18153a55f77e3cad368
This patch introduces shine-HA, an extra tool for monitoring, alerting
and performing automatic Lustre target failover. This first patch supports
monitoring and alerting of Lustre targets. Please take a look at
conf/ha.yaml for an example configuration file. All defined parameters
are supported by the current code. Two Alert plugins are already provided:
local emails and Slack notifications. Additional Alert plugins may be
installed in Shine/HA/plugins/. Three alert levels can be configured:
INFO, WARN and CRIT. Alerts on transient Lustre target states can be
avoided by tuning the thresholds found in the
fs_monitor_state_count_thresholds section in ha.yaml.
Change-Id: I222ea0b26ba9cd1d23c98fa74c780bcae562ebe1
Change-Id: I94f86187f5ceca583ba9442757650aa616d7a6c2
Change-Id: I17eae7096ea068539b473c2aa91ae0bc0d43be0b
Change-Id: Ib00080289466d014a8218f18790d1ba8ce7ec7ea
In config (ha.yaml):
- custom pinger command
- command timeout option
- alert thresholds
Change-Id: I8f62a79b670d3baaa770da3fd312c042696d14ab
Change-Id: Icf67c14240049d14953f6d4d81fcf834f32f8782
This patch adds a basic Lustre HA feature to shine-HA. It is now able to
fence a server that has all its targets in error and lnet unreachable
(e.g. server crash). If the fence command is successful, HACore will
perform a failover of the targets through the shine API, of both resident
and non-resident targets on the selected failover servers.
Change-Id: I9adf723e24f57b5418ce577241f98bcaf6b4cb78
- keep track of current StatusThread
- add support for action thread invalidation
- invalidate StatusThread if still active after polling_interval
- retry status after invalidation instead of waiting forever
Change-Id: I40c91cc92e008009898c0d7c06487c0a495114fd
Change-Id: Icf4f18c3948e4733bd4f0270f190974f3acddc28
Method execute_fs() should not return None.
Change-Id: I5f9ba34c8a944b8e41df3c413f9844b19bedc1cf
Change-Id: Iea9bf672bf803f71656388e1709e251353eb87e6
Change-Id: I1c8e97183c74a2203c51ab76a6cb6b7f89cf4fe5
Signed-off-by: Stephane Thiell <sthiell@stanford.edu>
Change-Id: Ia6fc00d38349bc7eb9a21f5efca4e4022ae719f1
/sys/fs/lustre/*/%s/mntdev also catches /sys/fs/lustre/mgs/%s/mntdev,
which we don't want. Change glob pattern to:
/sys/fs/lustre/osd-*/%s/mntdev
Signed-off-by: Stephane Thiell <sthiell@stanford.edu>
Change-Id: I12faaa57bd3ab0931faf505960b749d297914d58
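The effect of narrowing the pattern can be checked without a Lustre host by matching sample paths with fnmatch. The target name "MGS" and the "osd-ldiskfs" directory are illustrative assumptions. Real code would run glob against /sys.

```python
import fnmatch

TARGET = "MGS"  # hypothetical target name substituted for %s
paths = [
    "/sys/fs/lustre/mgs/%s/mntdev" % TARGET,
    "/sys/fs/lustre/osd-ldiskfs/%s/mntdev" % TARGET,
]

old_pattern = "/sys/fs/lustre/*/%s/mntdev" % TARGET
new_pattern = "/sys/fs/lustre/osd-*/%s/mntdev" % TARGET

# The old pattern matches both entries, the new one only the osd-* entry.
old_matches = [p for p in paths if fnmatch.fnmatchcase(p, old_pattern)]
new_matches = [p for p in paths if fnmatch.fnmatchcase(p, new_pattern)]
print(old_matches)
print(new_matches)
```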
Signed-off-by: Stephane Thiell <sthiell@stanford.edu>
Change-Id: Ie19775cd4f97edf988dce2ff3049a4b4a6f020c8
Shine now tries to mount both lustre_tgt and lustre filesystem types.
With this option, the mount command initially tries to execute
mount.lustre_tgt. When this command is not found, it falls back to
mount.lustre. This makes Shine future-proof while keeping backward
compatibility. Note that lustre_tgt comes first because it is more likely
to succeed in the future.
Co-authored-by: Rémi Palancher <remi@rackslab.io>
Change-Id: Ibed0c8936ca0c16da26cb48577973d1408db897d
Change-Id: I7eff3e77580b562aed810516c9dbe4c38b1005a3
If lnet_conf is set, (un)configure the lnet network like the lnet service
would after loading/before unloading modules.
Closes #211.
Change-Id: I9be102223248364e9b905f38e89eb9583e1f2d4a
Trying to figure out which target to start first based on flags is
horrible with "extended" mode, because that status is done on all nodes
when it is neither necessary nor always possible.
Change-Id: I59f2115cbd444b51b795bdf0c1d7b0f9e8c5fe63
This is a bit intrusive, but will eventually be necessary to avoid
checking all failover targets on status, for example.
Change-Id: I4dcdf855e7fdec88cfe95fa12dabd0aa40001fe6
…evel
Status doesn't actually need to check disk level details for most usages:
- when the filesystem is started, info from /proc is usually enough
- even if it is offline, the info displayed doesn't require flags/etc.
  unless the formatting option requires it (-V target or -O %flags)
Change-Id: I1a64a0f490bb287f965607d177e0d6511ca980f9
Change-Id: I9f0a10bee0cb7d3282d60142cec79c864f7731cb
Lets us skip unloading modules after stop or umount. This is useful for
e.g. HA targets, where we do not want to unload modules on voluntary
migration.
Change-Id: I96e196b077812f0e18e590507a20d6feaaa65172
There are problems with writeconf if secondary MDTs register before MDT 0.
It might be OK to start OSTs with secondary MDTs, though -- should we, or
is delaying OK?
Change-Id: I865e5a386d7d486d12c3d4e2459bed1cafb7d4fe
Only implemented for tune right now, as it is the most easily problematic.
Change-Id: I0e7326a36214a0e01ad7aec8038b54b74d73fab5
The change to the Python setup entry script made shine lose its return
code value (it used to call sys.exit()). Adding a return statement makes
the script properly exit with an error code.
Fixes: b398a56 ("packaging: lots of cleaning of shine distribution")
Change-Id: I12c9e3e570bb8e5fcde88746bf2ed8811a97a72d
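The mechanism can be sketched as follows, assuming a setuptools-style entry point whose generated wrapper calls sys.exit() on the function's return value. The names main and run and the --fail flag are hypothetical and not taken from Shine.

```python
import sys


def main(argv=None):
    """Hypothetical command body returning a shell-style status code."""
    argv = sys.argv[1:] if argv is None else argv
    return 1 if "--fail" in argv else 0


def run(argv=None):
    """Entry point referenced by the packaging metadata.

    Without the `return`, this function would return None and the
    generated wrapper's sys.exit(run()) would always exit with status 0.
    """
    return main(argv)


print(run(["--fail"]))  # 1
```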
Allow the same fs to be mounted multiple times on a client.
Will be removed when we're done with store_ct.
Change-Id: Id12c7be3b78b66dcf2303bb0edcefb7b4f7b2a36
GitHub Actions workers are slow, and this test can need more than 3 seconds to complete.
Accept DNS error message reported by scp in Ubuntu on GitHub Actions workers.
This pull request contains all updates retrieved from Stanford and CEA production branch, including Python 3 support and the Shine-HA feature.
Edit: 2 additional commits have been pushed to make unit tests successful in GitHub Actions environments. Successful execution of CI can be found here: https://github.com/rezib/shine/actions/runs/15046341361/job/42289732238