Regression: frequent deadlocks when gather_dep fails to contact peer #8006

@crusaderky

test_worker_metrics.py::test_gather_dep_network_error has recently become heavily flaky.
The failure has nothing to do with metrics: when Worker.gather_dep fails to open a new RPC channel to its peer, it now seems to be left dangling forever.

from distributed import wait
from distributed.utils_test import BlockedGatherDep, gen_cluster, inc


@gen_cluster(
    client=True,
    nthreads=[("", 1)],
    config={"distributed.comm.timeouts.connect": "500ms"},
)
async def test_gather_dep_network_error(c, s, a):
    x = c.submit(inc, 1, key="x")
    await wait(x)
    async with BlockedGatherDep(s.address) as b:
        y = c.submit(inc, x, key="y", workers=[b.address])
        await b.in_gather_dep.wait()  # <-- b has not yet tried to connect to a
        await a.close()
        b.block_gather_dep.set()  # <-- b now attempts the RPC call to a.get_data(), which should fail gracefully
        await wait(y)  # <-- Times out here

This may be (happy to be proven wrong) a major regression that could deadlock entire clusters. I suggest treating it as a blocker for the next release.

This also shows that we are missing a dedicated unit test; a regression like this should not have been caught only by chance, by a test that is about metrics.
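As a rough illustration of what such a dedicated test would need to assert, here is a toy model in plain asyncio. It does not use distributed's API: `connect`, `gather_dep`, `PeerUnreachable`, and `check_no_hang` are all hypothetical stand-ins. The point is only the shape of the check: a failed connection must still produce an outcome, and the wait is bounded so that a dangling fetch fails quickly instead of hanging.

```python
import asyncio


class PeerUnreachable(Exception):
    """Hypothetical error raised when the peer cannot be contacted."""


async def connect(peer_alive: bool, connect_timeout: float) -> None:
    # Toy stand-in for opening an RPC channel to a peer worker.
    if not peer_alive:
        await asyncio.sleep(connect_timeout)
        raise PeerUnreachable("peer did not answer")


async def gather_dep(peer_alive, done, results, connect_timeout=0.05):
    # Toy stand-in for a fetch: whatever happens, record an outcome
    # and signal completion so that nobody waits forever.
    try:
        await connect(peer_alive, connect_timeout)
        results.append("fetched")
    except PeerUnreachable:
        results.append("missing")
    finally:
        done.set()


async def check_no_hang(peer_alive: bool) -> str:
    done = asyncio.Event()
    results = []
    task = asyncio.create_task(gather_dep(peer_alive, done, results))
    # Bounded wait: if the fetch were left dangling, this would raise
    # a timeout error after one second instead of blocking indefinitely.
    await asyncio.wait_for(done.wait(), timeout=1.0)
    await task
    return results[0]


def run_checks():
    # Run the check once with a reachable peer and once with a dead one.
    return asyncio.run(check_no_hang(True)), asyncio.run(check_no_hang(False))
```

A real test would replace the toy coroutines with a worker whose peer has been closed, as in the reproducer above, and assert on the resulting task state rather than on a string.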

The test report seems to strongly correlate it to #7969:

[image: test report]

CC @graingert @jrbourbeau @fjetter @hendrikmakait

Labels: deadlock, regression