test_worker_metrics.py::test_gather_dep_network_error has recently become heavily flaky.
The failure has nothing to do with metrics; a Worker.gather_dep call that fails to open a new RPC channel to its peer now seems to be left dangling forever.
from distributed import wait
from distributed.utils_test import BlockedGatherDep, gen_cluster, inc

@gen_cluster(
    client=True,
    nthreads=[("", 1)],
    config={"distributed.comm.timeouts.connect": "500ms"},
)
async def test_gather_dep_network_error(c, s, a):
    x = c.submit(inc, 1, key="x")
    await wait(x)
    async with BlockedGatherDep(s.address) as b:
        y = c.submit(inc, x, key="y", workers=[b.address])
        await b.in_gather_dep.wait()  # <-- Before b tries to connect to a
        await a.close()
        b.block_gather_dep.set()  # <-- b will now attempt the RPC call to a.get_data() and fail gracefully
        await wait(y)  # <-- Times out here
This may be (happy to be proven wrong) a major regression that could deadlock entire clusters. I suggest treating it as a blocker for the next release.
This also shows we are missing a dedicated unit test; this regression should not have been fortuitously caught by a test that is about metrics.
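As a rough illustration of what such a dedicated test could assert, here is a minimal, stdlib-only sketch of the pattern: bound the wait explicitly, so that a dangling fetch is reported quickly and can be told apart from a graceful failure. Everything in it is hypothetical; `fake_gather_dep` and `gather_outcome` are stand-ins invented for this example and are not part of distributed.

```python
import asyncio


async def fake_gather_dep(peer_alive: bool, hang_on_error: bool) -> str:
    """Hypothetical stand-in for a worker fetching data from a peer."""
    if peer_alive:
        return "data"
    if hang_on_error:
        # Models the suspected regression: the failed connect is never
        # surfaced, so the caller waits on an event that is never set.
        await asyncio.Event().wait()
    # Models the expected behaviour: the failure is reported to the caller.
    raise OSError("could not connect to peer")


async def gather_outcome(
    peer_alive: bool, hang_on_error: bool, timeout: float = 0.2
) -> str:
    """Classify a fetch as 'ok', 'failed-gracefully', or 'dangling'."""
    try:
        await asyncio.wait_for(fake_gather_dep(peer_alive, hang_on_error), timeout)
        return "ok"
    except asyncio.TimeoutError:
        # Caught before OSError: on Python 3.11+ TimeoutError subclasses OSError.
        return "dangling"
    except OSError:
        return "failed-gracefully"
```

A real test would apply the same idea to the actual worker, for example by asserting that y finishes (or is rescheduled) within a short, explicit timeout rather than relying on the global test timeout.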
The test report seems to strongly correlate it with #7969.

CC @graingert @jrbourbeau @fjetter @hendrikmakait