Add OTA concurrency limit #815
Draft
TheJulianJES wants to merge 5 commits into
A Zigbee network can only carry so many OTA image transfers at once before airtime saturation slows every update (and other traffic) down. Add a `max_concurrent_ota_updates` device option (default 5) and a gateway-level semaphore that firmware update entities will acquire a slot from, queuing any updates started beyond the limit.
Firmware update entities now acquire a slot from the gateway's OTA update semaphore before transferring an image. When the concurrent update limit is reached, further installs wait (FIFO) for a slot to free up instead of all transferring at once. Expose the wait via a new `in_queue` state attribute: a queued update reports `in_progress` True (the install was requested) with `in_queue` True and no percentage until its transfer actually starts, mirroring how the ESPHome Device Builder surfaces a queued firmware job.
Cover the gateway sizing its OTA semaphore from the configured limit, a second update queuing (in_queue) behind a first while a single slot is held, FIFO hand-off when the slot frees, and slot release on failure so a queued update can still proceed.
The update entity's state dict now carries an `in_queue` field, so every device snapshot with a firmware update entity gains `"in_queue": false` at rest. Regenerated via tools/regenerate_diagnostics.py.
Zigbee OTA saturates the same low-bandwidth 2.4GHz mesh airtime that Z-Wave does, and Z-Wave JS serializes firmware updates to just one at a time network-wide. Zigbee has more headroom (faster PHY, device-paced block transfers) so full serialization is unnecessary, but 4 is a more conservative default than 5 while still letting a batch update several times faster than serial. Still configurable via max_concurrent_ota_updates.
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## dev #815 +/- ##
=======================================
Coverage 97.29% 97.29%
=======================================
Files 55 55
Lines 10933 10951 +18
=======================================
+ Hits 10637 10655 +18
Misses 296 296
This relates to Core's recently introduced "Update all" button: integrations now need to handle every entity being scheduled to update at once.
With ZHA being split off into an addon at some point, it makes sense to introduce this limit in ZHA itself, as multiple HA Core instances could theoretically be connected to one ZHA server.
Summary
A Zigbee network can only carry so many OTA image transfers at once before airtime saturation slows every update (and other traffic) down. This adds a network-wide cap on how many firmware updates transfer concurrently; updates started beyond the limit are queued (FIFO) rather than all running at once.
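The cap described above can be sketched with a plain `asyncio.Semaphore`. This is a minimal illustration, not the code in this PR: `FakeFirmwareUpdate`, `run_batch`, the device names, and the limit of 2 are all hypothetical, and `asyncio.sleep` stands in for the image transfer.

```python
import asyncio


class FakeFirmwareUpdate:
    """Hypothetical stand-in for a firmware update entity (not zha's class)."""

    def __init__(self, name: str, semaphore: asyncio.Semaphore, log: list[str]) -> None:
        self.name = name
        self._semaphore = semaphore
        self._log = log
        self.in_progress = False
        self.in_queue = False

    async def async_install(self) -> None:
        self.in_progress = True
        # locked() is True when no slot can be taken immediately.
        self.in_queue = self._semaphore.locked()
        if self.in_queue:
            self._log.append(f"queued {self.name}")
        try:
            # Waiters are woken in the order they started waiting (FIFO).
            async with self._semaphore:
                self.in_queue = False
                self._log.append(f"start {self.name}")
                await asyncio.sleep(0.01)  # stand-in for the image transfer
                self._log.append(f"done {self.name}")
        finally:
            self.in_progress = False
            self.in_queue = False


async def run_batch(count: int = 4, limit: int = 2) -> list[str]:
    """Start `count` installs at once while at most `limit` transfer together."""
    semaphore = asyncio.Semaphore(limit)
    log: list[str] = []
    updates = [FakeFirmwareUpdate(f"dev{i}", semaphore, log) for i in range(count)]
    await asyncio.gather(*(update.async_install() for update in updates))
    return log


if __name__ == "__main__":
    for entry in asyncio.run(run_batch()):
        print(entry)
```

With four installs and a limit of 2, the first two start immediately and the other two are recorded as queued, then start in the order they were requested as slots free up.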
- `max_concurrent_ota_updates` device option (default 4).
- `asyncio.Semaphore` sized from that option. `asyncio` wakes waiters FIFO, so the semaphore is the queue; no extra bookkeeping.
- `FirmwareUpdateEntity.async_install` acquires a slot before transferring the image and releases it when done or on failure, so a failed/stuck update can't wedge the queue.
- `in_queue` state attribute distinguishes queued from actively transferring.

Queued state
A queued update reports:

- `in_progress = True`: the install was requested (keeps the Install button disabled)
- `in_queue = True`
- `update_percentage = None`

Once a slot frees and the transfer starts, `in_queue` flips to `False` and real progress percentages flow. This mirrors how the ESPHome Device Builder surfaces a queued firmware job.

What this looks like in Home Assistant today (and a possible follow-up)
HA's `update` entity has no native "queued" concept, and `zha`'s update entity in Core forwards only `in_progress`/`update_percentage` (not `in_queue`). So while queued, HA shows the generic "Installing…" with an indeterminate progress bar, the same as an update that has started but not yet reported a percentage. The queue works correctly; it's just not visually distinct from "installing" in the stock UI.

Making "queued" visually distinct would need a Core-side change (smallest: surface `in_queue` as an extra state attribute in `homeassistant/components/zha/update.py`; larger: a first-class queued state in `UpdateEntity` + frontend). This library exposes `in_queue` so either is possible as a follow-up without reworking this PR.

Note: `zha`'s update platform in Core does not set `PARALLEL_UPDATES`, so today an "Update all" fires every install concurrently. This PR is what serializes/queues them, and the library semaphore is the right layer for it (network-wide, configurable `N`), rather than `PARALLEL_UPDATES = 1`, which can only express "one or unlimited".

Prior art: how Z-Wave, Matter, and ESPHome handle this
- Z-Wave: `FirmwareUpdateCC_NetworkBusy`); a second update throws. HA's `zwave_js` adds `PARALLEL_UPDATES = 1`, which serializes "Update all" so the driver never actually sees a concurrent attempt. Same airtime/reliability concern: firmware transfer is a long, fragmented, ACK-driven operation that saturates a low-bandwidth mesh.
- Matter: `python-matter-server` (`_nodes_in_ota`) stops the same node updating twice. Different nodes transfer concurrently by design (higher-bandwidth transports: Wi-Fi / Thread / Ethernet).
- ESPHome: `asyncio.Queue`, `QUEUED → RUNNING` status. The HA integration doesn't use `PARALLEL_UPDATES` (it's `0`); for the dashboard-driven path it adds only `asyncio.Lock`s (a global lock around the compile step, plus a per-device guard that rejects a second update of the same device), while the device-native OTA path has no limit at all.

Why 4: Zigbee is a low-bandwidth 2.4 GHz mesh like Z-Wave, so Z-Wave's conservatism is the right reference. But Zigbee has more headroom (faster PHY; OTA block transfers are device-paced, so idle devices don't hammer the air), and this PR queues rather than rejects (friendlier for "Update all", like ESPHome). 4 keeps meaningful parallelism (a batch updates ~4× faster than serial) while leaving the coordinator airtime for normal traffic. It stays configurable, so anyone wanting Z-Wave-style serialization can set it to `1`.

Testing

- `tests/test_update.py`: the gateway sizes its semaphore from the configured limit; a second update queues (`in_queue`) behind a first holding the only slot; FIFO hand-off when the slot frees; and the slot is released on failure so a queued update still proceeds.
- `"in_queue": false`, so every snapshot with a firmware update entity was updated via `tools/regenerate_diagnostics.py`. That diff is purely the added key.
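
The release-on-failure behavior covered by the tests can be illustrated in isolation. This is a hedged sketch, not the contents of `tests/test_update.py`: `install`, `scenario`, and the event strings are hypothetical, and a raised `RuntimeError` stands in for a failed transfer.

```python
import asyncio


async def install(
    name: str, semaphore: asyncio.Semaphore, events: list[str], fail: bool = False
) -> None:
    """Hypothetical install: take a slot, 'transfer', optionally fail."""
    events.append(f"{name}: queued" if semaphore.locked() else f"{name}: immediate")
    # `async with` releases the slot even when the body raises.
    async with semaphore:
        events.append(f"{name}: transferring")
        await asyncio.sleep(0)  # yield so the other install can queue up
        if fail:
            raise RuntimeError(f"{name}: transfer failed")
        events.append(f"{name}: done")


async def scenario() -> tuple[list[str], list[object]]:
    """One slot: the first install fails, the queued second one still runs."""
    semaphore = asyncio.Semaphore(1)
    events: list[str] = []
    results = await asyncio.gather(
        install("first", semaphore, events, fail=True),
        install("second", semaphore, events),
        return_exceptions=True,
    )
    return events, results


if __name__ == "__main__":
    recorded, outcomes = asyncio.run(scenario())
    print(recorded)
    print(outcomes)
```

Because the slot is held through a context manager, the failing install cannot leave the semaphore permanently acquired, which is the property that keeps a queued update from being wedged.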