Skip to content

Simplify CPU bucket cache locking#14

Open
TianyeGGBond wants to merge 2 commits into
rlops:zhenyu/m11-mvp-testfrom
TianyeGGBond:tianye/simplify-cache-locking
Open

Simplify CPU bucket cache locking#14
TianyeGGBond wants to merge 2 commits into
rlops:zhenyu/m11-mvp-testfrom
TianyeGGBond:tianye/simplify-cache-locking

Conversation

@TianyeGGBond

Copy link
Copy Markdown

Context

The F4 CPU bucket cache path currently has two layers of locking around the same state:

  • CPUBucketCache owns an internal threading.Lock for _buckets and _cache_ready_step.
  • MegatronTrainRayActor also creates _cache_lock and wraps both build_cpu_bucket_cache and run_sync_session with it.

In the current deployment model, train actors are created as default synchronous Ray actors without max_concurrency, and these methods are not async. That means build_cpu_bucket_cache and run_sync_session are already serialized by Ray on the same actor. The actor-level lock is therefore redundant, and its comments make the implementation look more concurrent than it is.

What changed

  • Removed MegatronTrainRayActor._cache_lock initialization.
  • Removed the outer actor-level lock around build_cpu_bucket_cache.
  • Removed the outer actor-level lock around run_sync_session.
  • Kept CPUBucketCache's internal lock so the cache object remains self-contained and protects its own state.
  • Shortened F4 cache docstrings and comments to describe the API behavior without carrying review-history or over-defensive concurrency rationale in code.

Why

This keeps the locking model to one layer while preserving behavior. The cache still publishes and reads _cache_ready_step through its own methods, but the train actor no longer adds a redundant critical section around synchronous Ray actor methods. This is closer to the style in miles main: rely on the actor execution model where it applies, and keep local state protection inside the small helper object.

Validation

  • python -m py_compile miles/backends/megatron_utils/actor.py miles/backends/megatron_utils/update_weight/cpu_bucket_cache.py
  • git diff --check
  • git grep -n _cache_lock -- miles/backends/megatron_utils/actor.py miles/backends/megatron_utils/update_weight/cpu_bucket_cache.py returns no matches

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant