Skip to content

Fix: Omit threading.get_ident() from cache token for async implementations in sync mode#2028

Merged
martindurant merged 13 commits into
fsspec:masterfrom
yuxin00j:fix-cache-thread-id
May 16, 2026
Merged

Fix: Omit threading.get_ident() from cache token for async implementations in sync mode#2028
martindurant merged 13 commits into
fsspec:masterfrom
yuxin00j:fix-cache-thread-id

Conversation

@yuxin00j

Copy link
Copy Markdown
Contributor

The Problem:
Currently, the AbstractFileSystem instance cache incorporates threading.get_ident() into its cache token. For workloads using multi-threading (like Hugging Face datasets fetching metadata), this completely bypasses the cache, forcing fsspec to create a brand-new filesystem instance and aiohttp.ClientSession for every single thread. This results in redundant authentication and connection overhead, defeating the purpose of the cache.

The Solution:
As discussed in #2020, because fsspec currently uses a single global fsspecIO event loop for all synchronous calls (asynchronous=False), all sync() coroutine executions are inherently funneled into that single background thread. Therefore, multiple calling Python threads can perfectly and safely share the exact same AsyncFileSystem instance.

This PR modifies fsspec/spec.py to conditionally remove threading.get_ident() from the caching token if:

  1. The class uses an async implementation (async_impl = True).
  2. The user requests a synchronous interface (asynchronous=False).

For purely synchronous filesystems (e.g., FTP) or when asynchronous=True, the thread ID remains in the cache token to ensure thread safety and avoid cross-loop boundary violations.

Changes:

  • Conditionally generated the cache token in _Cached.__call__.
  • Added tests in fsspec/tests/test_cache_threads.py to explicitly verify the caching behavior across multiple threads for both AsyncFileSystem and AbstractFileSystem under different modes.

Closes #2020

@yuxin00j yuxin00j force-pushed the fix-cache-thread-id branch from ec015b2 to 8f896d7 Compare April 30, 2026 12:06
yuxin00j added a commit to yuxin00j/filesystem_spec that referenced this pull request May 5, 2026
yuxin00j added a commit to yuxin00j/filesystem_spec that referenced this pull request May 5, 2026
@yuxin00j yuxin00j force-pushed the fix-cache-thread-id branch from 05366bf to 015a663 Compare May 5, 2026 04:27
@yuxin00j

Copy link
Copy Markdown
Contributor Author

Hi @martindurant, can we retry the CI for this PR? Thanks.

@martindurant

Copy link
Copy Markdown
Member

It's your specific new tests that are failing, but only for older python versions.

@yuxin00j

Copy link
Copy Markdown
Contributor Author

Hi @martindurant, I have updated the test. Could you help run the CI again?

@martindurant

Copy link
Copy Markdown
Member

We seem to be missing some test dependencies in the windows CI env. Maybe they should all be in the [test] env define in pyproject. I added pytest-asyncio directly to the env, but there are more.

@martindurant

Copy link
Copy Markdown
Member

@yuxin00j , let's leave investigating what's going on with the install process for windows another time.

@martindurant martindurant merged commit 5eec9f9 into fsspec:master May 16, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove threading.get_ident() from instance cache token for asynchronous=False instances

2 participants