Add OTA concurrency limit #815

Draft
TheJulianJES wants to merge 5 commits into dev from zigpy-bot/ota-update-concurrency-limit

@TheJulianJES TheJulianJES commented Jul 3, 2026

Contributor

Related to Core recently introducing an "Update all" button, with integrations needing to handle all entities being scheduled to update at once.

With ZHA being split off into an addon at some point, it makes sense to introduce this limit in ZHA itself, as multiple HA Core instances could theoretically be connected to one ZHA server.

Summary

A Zigbee network can only carry so many OTA image transfers at once before airtime saturation slows every update (and other traffic) down. This adds a network-wide cap on how many firmware updates transfer concurrently; updates started beyond the limit are queued (FIFO) rather than all running at once.

  • New max_concurrent_ota_updates device option (default 4).
  • A gateway-level asyncio.Semaphore sized from that option. asyncio wakes waiters FIFO, so the semaphore is the queue — no extra bookkeeping.
  • FirmwareUpdateEntity.async_install acquires a slot before transferring the image and releases it when done or on failure, so a failed/stuck update can't wedge the queue.
  • A new in_queue state attribute distinguishes queued from actively transferring.
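The acquire/release pattern described in the list above can be sketched in plain asyncio. This is a minimal illustration, not the PR's code: the `install` coroutine, the `log` list, and the limit of 2 are all hypothetical stand-ins, and the sleep stands in for the image transfer. It assumes, as the description says, that the semaphore wakes waiters in the order they started waiting.

```python
import asyncio

MAX_CONCURRENT_OTA_UPDATES = 2  # illustrative value; the PR's default is 4


async def install(name, slots, log, fail=False):
    """Wait for a slot, 'transfer' the image, and always give the slot back."""
    log.append(f"{name}: queued")
    async with slots:  # acquires a slot; releases it on success or exception
        log.append(f"{name}: transferring")
        await asyncio.sleep(0.01)  # stand-in for the OTA image transfer
        if fail:
            raise RuntimeError(f"{name}: transfer failed")
    log.append(f"{name}: done")


async def main():
    # Hypothetical gateway-level semaphore sized from the configured limit.
    slots = asyncio.Semaphore(MAX_CONCURRENT_OTA_UPDATES)
    log = []
    results = await asyncio.gather(
        install("a", slots, log),
        install("b", slots, log, fail=True),  # fails mid-transfer
        install("c", slots, log),
        install("d", slots, log),
        return_exceptions=True,
    )
    return log, results


log, results = asyncio.run(main())
print("\n".join(log))
```

Updates `c` and `d` wait while `a` and `b` hold the two slots. Because `async with` releases the slot even when `b` raises, the failed transfer does not block the updates queued behind it.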

Queued state

A queued update reports:

  • in_progress = True — the install was requested (keeps the Install button disabled)
  • in_queue = True
  • update_percentage = None

Once a slot frees and the transfer starts, in_queue flips to False and real progress percentages flow. This mirrors how the ESPHome Device Builder surfaces a queued firmware job.
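The state transitions above can be sketched with a small stand-in for the entity state. Everything here is an assumption for illustration: `UpdateState`, the `install` coroutine, and the `history` list are hypothetical, the progress values are made up, and the reset to idle values in the `finally` block is a guess at reasonable behaviour rather than something the PR text states.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class UpdateState:
    """Hypothetical stand-in for the update entity's state attributes."""
    in_progress: bool = False
    in_queue: bool = False
    update_percentage: Optional[int] = None


async def install(state, slots, history):
    # Install requested: report "in progress" and "queued", with no percentage.
    state.in_progress, state.in_queue, state.update_percentage = True, True, None
    history.append(replace(state))  # snapshot of the queued state
    try:
        async with slots:
            state.in_queue = False  # a slot is free, the transfer starts
            for pct in (0, 50, 100):  # made-up progress reports
                state.update_percentage = pct
                history.append(replace(state))
                await asyncio.sleep(0)
    finally:
        # Assumed idle state once the install ends, successfully or not.
        state.in_progress, state.in_queue, state.update_percentage = False, False, None


async def main():
    slots = asyncio.Semaphore(1)
    state, history = UpdateState(), []
    await slots.acquire()  # another update holds the only slot
    task = asyncio.create_task(install(state, slots, history))
    await asyncio.sleep(0)  # let the install run until it waits for a slot
    while_waiting = replace(state)
    slots.release()  # the slot frees up
    await task
    return while_waiting, history, state


while_waiting, history, final = asyncio.run(main())
print(while_waiting)
```

While the slot is held elsewhere, the snapshot shows `in_progress=True`, `in_queue=True`, and `update_percentage=None`; percentages only appear after `in_queue` flips to `False`.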

What this looks like in Home Assistant today (and a possible follow-up)

HA's update entity has no native "queued" concept, and zha's update entity in Core forwards only in_progress / update_percentage (not in_queue). So while queued, HA shows the generic "Installing…" with an indeterminate progress bar — the same as an update that has started but not yet reported a percentage. The queue works correctly; it's just not visually distinct from "installing" in the stock UI.

Making "queued" visually distinct would need a Core-side change (smallest: surface in_queue as an extra state attribute in homeassistant/components/zha/update.py; larger: a first-class queued state in UpdateEntity + frontend). This library exposes in_queue so either is possible as a follow-up without reworking this PR.

Note: zha's update platform in Core does not set PARALLEL_UPDATES, so today an "Update all" fires every install concurrently. This PR is what serializes/queues them — and the library semaphore is the right layer for it (network-wide, configurable N), rather than PARALLEL_UPDATES = 1, which can only express "one or unlimited".

Prior art: how Z-Wave, Matter, and ESPHome handle this
  • Z-Wave JS — hard-limits to 1 concurrent update, network-wide, enforced in the driver as a rejection (FirmwareUpdateCC_NetworkBusy); a second update throws. HA's zwave_js adds PARALLEL_UPDATES = 1, which serializes "Update all" so the driver never actually sees a concurrent attempt. Same airtime/reliability concern: firmware transfer is a long, fragmented, ACK-driven operation that saturates a low-bandwidth mesh.
  • Matter — no global limit at any layer; only a per-node guard in python-matter-server (_nodes_in_ota) stops the same node updating twice. Different nodes transfer concurrently by design (higher-bandwidth transports: Wi-Fi / Thread / Ethernet).
  • ESPHome — the real queue lives in the Device Builder backend: two serial lanes (compile + upload), asyncio.Queue, QUEUED → RUNNING status. The HA integration doesn't use PARALLEL_UPDATES (it's 0); for the dashboard-driven path it adds only asyncio.Locks — a global lock around the compile step, plus a per-device guard that rejects a second update of the same device — while the device-native OTA path has no limit at all.
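The difference between rejecting a concurrent update and queuing it can be sketched generically. Neither function below is Z-Wave JS, ESPHome, or ZHA code; `NetworkBusy`, `update_or_reject`, `update_or_queue`, and `transfer` are hypothetical names used only to contrast the two policies on a single-slot semaphore.

```python
import asyncio


class NetworkBusy(Exception):
    """Hypothetical error raised by the rejecting policy."""


async def transfer():
    await asyncio.sleep(0.01)  # stand-in for a firmware transfer
    return "ok"


async def update_or_reject(slots, do_transfer):
    # Rejecting policy: fail immediately when no slot is free.
    if slots.locked():
        raise NetworkBusy("another update is already running")
    async with slots:
        return await do_transfer()


async def update_or_queue(slots, do_transfer):
    # Queuing policy: wait for a slot, then transfer.
    async with slots:
        return await do_transfer()


async def main():
    slots = asyncio.Semaphore(1)
    rejected = await asyncio.gather(
        update_or_reject(slots, transfer),
        update_or_reject(slots, transfer),
        return_exceptions=True,
    )
    queued = await asyncio.gather(
        update_or_queue(slots, transfer),
        update_or_queue(slots, transfer),
    )
    return rejected, queued


rejected, queued = asyncio.run(main())
print(rejected, queued)
```

With the rejecting policy the second request fails and the caller has to retry; with the queuing policy both requests eventually complete, which is the behaviour an "Update all" action needs.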

Why 4: Zigbee is a low-bandwidth 2.4 GHz mesh like Z-Wave, so Z-Wave's conservatism is the right reference — but Zigbee has more headroom (faster PHY; OTA block transfers are device-paced, so idle devices don't hammer the air), and this PR queues rather than rejects (friendlier for "Update all", like ESPHome). 4 keeps meaningful parallelism (a batch updates ~4× faster than serial) while leaving the coordinator airtime for normal traffic. It stays configurable, so anyone wanting Z-Wave-style serialization can set it to 1.
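The "~4× faster than serial" figure follows from a simple idealized model, sketched below. The batch size of 12 and the 10 minutes per update are made-up numbers, and the model assumes every update takes the same time and that concurrent transfers do not slow each other down, which is optimistic given that airtime contention is the reason for the cap.

```python
import math


def batch_duration(num_updates, limit, minutes_per_update):
    """Idealized wall-clock time when updates run in waves of `limit`."""
    return math.ceil(num_updates / limit) * minutes_per_update


serial = batch_duration(12, 1, 10)  # one at a time
capped = batch_duration(12, 4, 10)  # four at a time
speedup = serial / capped
print(serial, capped, speedup)
```

Under these assumptions 12 updates take 120 minutes serially and 30 minutes with a limit of 4; real speedups will be lower when transfers compete for airtime.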

Testing

  • New tests in tests/test_update.py: the gateway sizes its semaphore from the configured limit; a second update queues (in_queue) behind a first holding the only slot; FIFO hand-off when the slot frees; and the slot is released on failure so a queued update still proceeds.
  • Regenerated device diagnostics — the update entity's state gained "in_queue": false, so every snapshot with a firmware update entity was updated via tools/regenerate_diagnostics.py. That diff is purely the added key.

A Zigbee network can only carry so many OTA image transfers at once
before airtime saturation slows every update (and other traffic) down.
Add a `max_concurrent_ota_updates` device option (default 5) and a
gateway-level semaphore that firmware update entities will acquire a
slot from, queuing any updates started beyond the limit.

Firmware update entities now acquire a slot from the gateway's OTA
update semaphore before transferring an image. When the concurrent
update limit is reached, further installs wait (FIFO) for a slot to
free up instead of all transferring at once.

Expose the wait via a new `in_queue` state attribute: a queued update
reports `in_progress` True (the install was requested) with `in_queue`
True and no percentage until its transfer actually starts, mirroring how
the ESPHome Device Builder surfaces a queued firmware job.

Cover the gateway sizing its OTA semaphore from the configured limit,
a second update queuing (in_queue) behind a first while a single slot is
held, FIFO hand-off when the slot frees, and slot release on failure so a
queued update can still proceed.

The update entity's state dict now carries an `in_queue` field, so every
device snapshot with a firmware update entity gains `"in_queue": false`
at rest. Regenerated via tools/regenerate_diagnostics.py.

Zigbee OTA saturates the same low-bandwidth 2.4GHz mesh airtime that
Z-Wave does, and Z-Wave JS serializes firmware updates to just one at a
time network-wide. Zigbee has more headroom (faster PHY, device-paced
block transfers) so full serialization is unnecessary, but 4 is a more
conservative default than 5 while still letting a batch update several
times faster than serial. Still configurable via max_concurrent_ota_updates.

TheJulianJES requested a review from Copilot on July 3, 2026, 04:47

Copilot AI left a comment

Contributor

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

codecov Bot commented Jul 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.29%. Comparing base (a0e3195) to head (f7be738).
⚠️ Report is 2 commits behind head on dev.

Additional details and impacted files
@@           Coverage Diff           @@
##              dev     #815   +/-   ##
=======================================
  Coverage   97.29%   97.29%           
=======================================
  Files          55       55           
  Lines       10933    10951   +18     
=======================================
+ Hits        10637    10655   +18     
  Misses        296      296           


3 participants