CancelledError and TimeoutError during client.shutdown() due to unnecessary AMM retirement when futures are not explicitly released #9300

@hebian1994

Description

When shutting down a dedicated LocalCluster via client.shutdown() after batch processing is complete, Dask still executes the full graceful worker retirement path (retire_workers + Active Memory Manager replication/drop), even though all workers are being removed and the results are no longer needed.

If the client has not released its futures first, the completed tasks stay in the scheduler's "memory" state, with their results still held on the workers. During shutdown, the AMM attempts to replicate or drop those keys across workers that are themselves shutting down. This adds unnecessary network and memory overhead and slows worker teardown. As a result, teardown can exceed the Nanny’s default process.join timeout (~4s) and surface as a TimeoutError / Tornado ERROR log from SpecCluster._correct_state_internal().

Releasing all client-held keys before cluster.close() avoids the problem in practice, which suggests the current shutdown path is doing redundant work that a full-cluster teardown should not require.
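A minimal sketch of that workaround, assuming the application keeps its own list of the futures it submitted. The helper name release_then_shutdown is hypothetical, and the objects are only duck-typed: each future needs a release() method (as on distributed.Future) and the client needs shutdown().

```python
from contextlib import suppress


def release_then_shutdown(client, futures):
    """Release client-held futures, then shut the cluster down.

    `client` is expected to behave like distributed.Client and each item
    in `futures` like distributed.Future; nothing is imported from
    distributed here, so both are duck-typed assumptions.
    Returns the number of futures whose release() call completed.
    """
    released = 0
    for future in list(futures):
        # Tell the client it no longer needs this key. A failure on one
        # future should not block the rest of the teardown.
        with suppress(Exception):
            future.release()
            released += 1
    # Only now ask the scheduler and workers to shut down.
    client.shutdown()
    return released
```

In the teardown shown further down, this would be called as release_then_shutdown(cls.client, cls.futures), where cls.futures is a hypothetical list in which the class stored the futures it submitted. If other Future objects for the same keys are still alive elsewhere, the keys may stay held until those are released too.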

Sample Code from My Project

Create cluster and client
from dask.distributed import Client, LocalCluster

cls.cluster = LocalCluster(
    name="cluster",
    n_workers=cls.n_workers,
    threads_per_worker=cls.threads_per_worker,
    memory_limit=0,  # 0 disables the per-worker memory limit
    dashboard_address=dashboard_address,
)
cls.client = Client(
    name="client",
    address=cls.cluster,
    direct_to_workers=True,
)

Shutdown
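from contextlib import suppress

@classmethod
def system_teardown(cls):
    # Swallow any exception so a failed shutdown does not fail the teardown.
    with suppress(Exception):
        cls.client.shutdown()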

Error Stack Trace

Occasionally, the following error occurs during shutdown:

2026-06-10 14:56:47,407 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x00000288583A1D30>>, <Task finished name='Task-1174981' coro=<SpecCluster._correct_state_internal() done, defined at .venv\Lib\site-packages\distributed\deploy\spec.py:346> exception=TimeoutError()>)
Traceback (most recent call last):
  File ".venv\Lib\site-packages\distributed\utils.py", line 1910, in wait_for
    return await fut
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".venv\Lib\site-packages\tornado\ioloop.py", line 758, in _run_callback
    ret = callback()
  File ".venv\Lib\site-packages\tornado\ioloop.py", line 782, in _discard_future_result
    future.result()
TimeoutError

Root Cause Analysis

When client.shutdown() is called without first releasing keys, the completed tasks stay in the scheduler's "memory" state, with their results still held on the workers. During the shutdown sequence, the Active Memory Manager (AMM) attempts to replicate or drop these keys across workers that are concurrently being torn down. This triggers unnecessary network and memory operations and slows worker teardown. When teardown takes longer than the Nanny’s default process.join timeout (~4s), the result is a TimeoutError within SpecCluster._correct_state_internal().
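To check whether this condition applies in a given run, one option is to look at what the client still holds just before shutdown. The sketch below assumes client.futures is a mapping keyed by task key, as on distributed.Client. The helper names unreleased_keys and warn_if_unreleased are hypothetical, and the check only covers the client side, not the AMM itself.

```python
def unreleased_keys(client):
    """Return the keys the client still holds, sorted by their string form.

    Assumes `client.futures` is a mapping keyed by task key, as on
    distributed.Client; only that attribute is accessed.
    """
    # Keys may be strings or tuples, so sort on str() to avoid type errors.
    return sorted(client.futures, key=str)


def warn_if_unreleased(client, log=print):
    """Log a short warning if any keys are still held; return their count."""
    keys = unreleased_keys(client)
    if keys:
        log(f"{len(keys)} key(s) still held at shutdown, e.g. {keys[0]!r}")
    return len(keys)
```

Calling warn_if_unreleased(cls.client) right before cls.client.shutdown() gives a quick signal of whether the occasional TimeoutError coincides with unreleased futures.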

Suggestions / Recommendations

Optimize Full-Cluster Teardown: Consider bypassing or short-circuiting the AMM retirement logic (retire_workers, data replication/drop) when a full-cluster teardown is initiated via client.shutdown(). Since all workers are being removed anyway, preserving or migrating their data is redundant.

Auto-Release on Shutdown: Alternatively, automatically release all client-held keys before initiating the shutdown sequence to prevent the AMM from acting on stale future references.

Documentation: In the meantime, it might be helpful to document this behavior and recommend that users explicitly release their futures (e.g., by calling release() on each Future, or by dropping all references to them) before calling shutdown() as a best practice.

Thanks!
