Migrate quirks out of zigpy and provide direct entity access#762
Conversation
|
Here's the zigpy side: zigpy/zigpy@dev...puddly:puddly/quirks-v3-resolver I wouldn't recommend it but you can run ZHA with these three library branches in tandem. Nothing is visibly broken 😅 |
zigpy-review-bot
left a comment
There was a problem hiding this comment.
RFC engagement — click to expand (architectural notes: inversion direction, three matching mechanisms, public/internal API line, dispatch duplication)
Engaging with this as an RFC rather than a code review — happy this is being explored. The user-facing API (a Device subclass + DeviceMatch + Replace tuple + discover_entities override) reads well; reaching for it from the Hue native-control case shows it can carry weight that v2's declarative builder can't.
A few architectural things to push on before promoting from draft:
1. The inversion (and what the title promises) needs the companion zigpy branch to land further
The PR title is "Migrate quirks out of zigpy." The diff on https://github.com/puddly/zigpy/tree/puddly/decouple-zigpy doesn't actually remove anything from zigpy — it adds a _device_resolver callable (plus register_device_resolver and a device_resolver= kwarg to ControllerApplication.new) and routes _finalize_device / appdb._load_attributes through it. The legacy zigpy.quirks.get_device and the v1/v2 registry are unchanged, and zha.zigbee.device.resolve_device still falls back to zigpy.quirks.get_device(...) for unmatched devices.
In the meantime ZHA gains a duplicate API surface that has to be kept in sync with zigpy's: zha.application.EntityType / EntityPlatform shadowing zigpy.quirks.v2.EntityType / EntityPlatform, full BinarySensor/Sensor/Number device-class enums copied into zha/application/platforms/*/device_class.py (~950 LOC), and unit constants in zha.units expanded to mirror more of homeassistant.const. The new tools/compare_constants.py exists specifically to catch drift between ZHA's copies and HA's canonical definitions — useful, but the fact it's needed is a signal we're maintaining three copies (HA Core, zigpy quirks v2, ZHA), not eliminating one.
What would actually deliver on the title: a corresponding zigpy PR that removes zigpy.quirks.v2.homeassistant.* (the enums/units now mirrored in ZHA), and ideally moves EntityMetadata/QuirksV2RegistryEntry toward their final home. Tests in this PR still from zigpy.quirks.v2.homeassistant import EntityPlatform, EntityType; production code still does from zigpy.quirks.v2 import EntityMetadata, ZCLEnumMetadata, … in 10 places across zha/. For the inversion to be real, those should be importing from zha.* and the zigpy module should be the one missing.
Worth pinning down in the RFC: is the target end-state "zigpy knows nothing about HA semantics, ZHA owns all of quirks-v2," or just "ZHA can register a resolver"? They're very different scopes.
2. Three matching mechanisms now (CustomDevice / QuirkBuilder / @register_device) — deprecation story?
The PR body says the functional builder remains for simple cases and class-based is for complex devices, but the cut is fuzzy: a QuirkBuilder with one .replaces() and one .switch() and a @register_device Device subclass with one Replace operation produce almost the same surface. The Hue example shows what's uniquely possible (subscribing to manufacturer cluster commands inside the entity, applying multi-attribute frames atomically) — but that uniqueness comes from subclassing Light and overriding _async_turn_on_impl, not from the Device subclass itself. A QuirkBuilder that could attach a custom entity class via something like .add_entity_class(HueNativeLight, on=HueLightCluster) would cover the same use case without a third matching path.
Open questions worth answering before this stabilizes:
- Is
CustomDevicegoing away on a timeline? It's described as "Not recommended" in the PR body but isn't touched here. - Will existing v2 quirks migrate, or just coexist?
zha-device-handlershas a lot of v2 quirks, and a migration script vs. a coexistence story have very different costs. - If
QuirkBuilderis "inverted to be a functional API built on top of this one" (PR body), what does that desugaring look like? That's the answer that makes "three becomes two becomes one" feel achievable rather than aspirational.
3. Device.discover_entities as a public override point — stability cost
Quirks now subclass the ZHA Device and override discover_entities, with direct access to endpoint.all_cluster_handlers, Light.__init__'s positional kwargs, self.cluster_handlers[…], maybe_emit_state_changed_event, etc. That freezes a lot of ZHA-internal shape that's churned recently — #657 just reworked cluster handlers / entity attributes, and #716/#763 are actively touching _is_entity_removed_by_quirk and adjacent surfaces. Once a non-trivial number of quirks reach into Light internals (or any other entity class), refactors like #657 become breaking changes for an out-of-tree package.
The Hue example reaches into:
endpoint.all_cluster_handlers(adictkeyed by"{ep_id}:0x{cluster_id:04x}"strings — a format choice that's an implementation detail right now)self.cluster_handlers[HueLightCluster.ep_attribute](also a dict; key shape differs)Light._state/Light._brightness/Light._xy_color/Light._color_temp/Light._effect(underscored attrs, currently free to rename)self._on_off_cluster_handler(presumably ZHA'sOnOffcluster handler — name shape is informal)- positional
cluster_handlers=,endpoint=,device=onLight.__init__
Which of these are intended as public quirks API and which were stable-by-accident? A small Device API doc — or even just # Public: / # Internal: annotations on the methods/attrs quirks may touch — would let ZHA refactor freely below the line while keeping the contract intact above it. Otherwise the next #657-scale refactor lands as "please update your quirks" tickets in zha-device-handlers.
The prevent_default_entity_creation path (added in #716, refined in #763) and the in-iterator if entity.PLATFORM == Platform.LIGHT: continue filter in the Hue example do overlapping work. Worth clarifying which path quirks-v3 is supposed to use, since the example chose the manual filter.
4. Subclass dispatch happens at two layers — easy to drift
Both resolve_device (zigpy-time, applies _operations) and Device.new (ZHA-time, picks subclass) iterate DEVICE_QUIRKS and call cls.matches(zigpy_device). They could disagree if matching ever depended on mutable state (e.g. firmware version changed between the two calls). Caching the matched class on the zigpy device in resolve_device and reading it in Device.new would remove that possibility — and the matching cost.
Related, CoordinatorDevice is chosen by ieee equality in Device.new but never registered via @register_device, so it doesn't go through resolve_device (no operations to apply — fine) but the asymmetry between "hardcoded subclass for coordinator" and "registry-driven subclass for everything else" is worth a comment.
Companion-branches reality check
For the record: I checked out the puddly/decouple-zigpy branches on both zigpy and zha-device-handlers. The zigpy side is one WIP commit (the resolver hook, no quirks-API removal). The quirks side is one WIP commit adding only zhaquirks/philips/{bifrost,hue_native}.py (the example from the PR body). The trio is consistent if all three are pinned together. "Nothing visibly broken" reads roughly right for the happy path; current CI failures (_FakeApp mocks missing _device_resolver) are just version skew against unbumped zigpy and will resolve once that lands.
Roadmap context
OpenHomeFoundation/roadmap#93 names "improve APIs to make adding new devices simpler" as in-scope. This PR is a credible early step there — the API ergonomics for a complex device end up much closer to ad-hoc Python than the current v2 builder lets you get. The Tuya-IR-module use case the PR mentions is exactly where v2's declarative model breaks down, and a class-based shape is the right shape for that.
Small things noted but not blockers (RFC stage)
Quickly, since this is RFC and ergonomics matter more than polish right now: Replace.apply() pops the existing cluster before constructing the replacement (loses the original on construction failure); resolve_device logs at warning level on every matched join (should be debug once de-experimentalized); the broad except Exception in _discover_new_entities will silently swallow real construction bugs and could use the entity / endpoint / cluster handler in the log message; pyproject.toml is now zha-quirks>=1.2.0, which until the first quirks release that imports ZHA will pull in versions that don't actually use the new API. None of these change the design discussion.
What would let me sign off later: clarity on items 1 (target end-state for zigpy) and 2 (deprecation story for the three mechanisms), and a small thinking-out-loud doc on item 3 (which Device/Light surfaces are public quirks API). The rest is concrete code asks that flow naturally once those are settled.
| NUMBER = "number" | ||
| SENSOR = "sensor" | ||
| SELECT = "select" | ||
| SWITCH = "switch" |
There was a problem hiding this comment.
EntityPlatform and EntityType are defined here, but tests still from zigpy.quirks.v2.homeassistant import EntityPlatform, EntityType (e.g. tests/test_sensor.py:18, tests/test_device.py:19) and production from zigpy.quirks.v2 import EntityMetadata, ZCLEnumMetadata, … in ~10 places. Coexistence during migration is fine; flagging so the cleanup pass doesn't get lost when the companion zigpy PR drops the zigpy copies. The goal-state is presumably that zha.application.EntityPlatform is the only definition and zigpy.quirks.v2.homeassistant is gone.
There was a problem hiding this comment.
Production side is done — zha.application owns EntityType/EntityPlatform and no zigpy.quirks.v2 imports remain under zha/. The test suite isn't migrated yet: tests/test_sensor.py:21, tests/test_device.py:19-20, tests/test_button.py:19 (and others) still from zigpy.quirks.v2[.homeassistant] import …, now resolving through the deprecation shims. Leaving this thread open until that cleanup lands.
|
Hey @puddly I stumbled upon this after doing a little investigating myself into what it would take to get the Hue bulb's 0xFC03 working myself - and came to a lot of the same conclusions you seem to have come to (that the responsibilities are backward and you should be implementing the quirk behavior in the quirk) in this PR, so happily surprised to see this being worked on. Crossing my fingers that something like this eventually lands, It looks like just a POC for now, lots of changes here. But it looks to my (again, relatively new to the codebase but experienced in SW development) eye to be a promising direction, at least for my use case Thanks for working on this! |
@jaredjxyz For reference, FYI, also see these related requests/discussions asking for zha and ZHA Device Handlers quirks to be able to nativly support and handle these type of "Hue native control" Dynamic Scenes Effects / Animations. Checkout: and Note! Another device that could also use native control of effect commands via quirks is Lidl Christmas led string HG0646: Lidl Christmas led string HG0646 uses cluster 0x00FC to invoke invoked native effects commands for their device: |
|
@puddly - I'm putting my hand up to help test when the time is right. If that is out of the normal beta cycle, that's OK too but I'll need a muppet-proof method to install it. I don't normally engage in testing the betas but I'm so keen to have full Hue support that I will for this :). |
77c0f49 to
4feb202
Compare
There was a problem hiding this comment.
_async_device_reinterviewed: the old_entry is not new_entry swap silently drops entity updates
The swap branch fires whenever the resolved quirk entry changes:
else:
await zha_device.async_teardown(emit_entity_events=True)
zha_device = Device.new(new_zigpy_device, self)
self._devices[new_zigpy_device.ieee] = zha_deviceThis is both broader than necessary and, for a common case, incorrect: a firmware change that selects a different v2 quirk (different entities, but still the base Device class) hits this branch, and the entity changes never reach HA. The in-place rebuild path it should take handles them correctly.
Full trace: why the swap drops entity updates and the in-place path doesn't (CLICK TO EXPAND)
The bug — firmware-version quirks lose their entity changes in HA. A very common reinterview trigger is a firmware change selecting a different v2 quirk that adds/removes entities but still resolves to the base Device class (no custom zha_device_factory). Those hit the else branch today, and the new entities never reach HA:
Device.newreturns a fresh object withself._initialized = False.- On its
async_initialize,await self._add_pending_entities(emit_event=self._initialized)runs withemit_event=False, so entities are added silently — noDeviceEntityAddedEvent. - HA's
handle_device_fully_initialized(the event you emit afterward) fetches the proxy and only rebuilds entity metadata whenevent.new_joinis True — false for a reinterview. And_async_get_or_create_device_proxyreturns the stale proxy (still bound to the torn-down object), so HA isn't tracking the new device at all.
Result: a firmware update that should add/remove entities produces no entity change in HA.
The in-place path already does this correctly. async_rebuild_from_zigpy_device keeps object identity, and _initialized stays True across it (neither async_teardown nor _init_from_zigpy_device resets it). So its async_initialize runs _add_pending_entities(emit_event=True) → DeviceEntityAddedEvent per entity → HA's handle_zha_device_entity_added_event creates them; teardown's soft removals (remove=False) unload the old ones while preserving their registry entries (so unchanged unique_ids re-attach with customizations intact). The emit_event=self._initialized gate is designed for exactly this ("Emit events only on re-initialization, not the first").
Fix: narrow the swap to genuine Device-class changes
Only swap when the resolved class actually differs (custom zha_device_factory); otherwise rebuild in place — _init_from_zigpy_device re-reads the new quirk's metadata, so the new entity set is picked up:
def _factory(entry):
return entry.zha_device_factory if (entry and entry.zha_device_factory) else Device
if _factory(old_entry) is _factory(new_entry):
await zha_device.async_rebuild_from_zigpy_device(new_zigpy_device) # same object, entities propagate
else:
... # Device.new — genuine class changeThis routes all standard/firmware-version quirk changes onto the working path and confines the swap to genuine base⇄custom-subclass transitions.
The residual class-change swap still needs a paired HA Core change (CLICK TO EXPAND)
For the remaining case — a device-level custom-Device quirk applied or removed via reinterview — the new object is unavoidable (you can't cleanly re-class a live instance), and current HA Core can't follow the swap:
ZHADeviceProxy.__init__bindsself.deviceand subscribesself.device.on_all_events(...)once, and.deviceis never reassigned anywhere._async_get_or_create_device_proxyis keyed by IEEE and returns the existing proxy unchanged — no repoint, no re-subscribe.- There's no
handle_device_reinterviewed; theDeviceFullInitEventyou emit lands on the stale proxy and skips entity rebuild (new_joinis False).
So after a genuine class-change swap, HA keeps a proxy bound to the torn-down object, the new device's events never reach it, and entities aren't refreshed. (The HA device-registry entry survives — it's keyed by IEEE — so it's "only" stale entities/events, but it's user-visible.)
The HA side needs:
- a repointable
ZHADeviceProxy(async_rebind: unsub → setself.device→ re-subscribeon_all_events+ the sw-version sub), _async_get_or_create_device_proxycalling it when an existing proxy wraps a different object, and- running the existing
_create_entity_metadata+SIGNAL_ADD_ENTITIESpath on a swap (today gated behindnew_join).
To make the rebind deterministic rather than inferred from DeviceFullInitEvent + an object-identity mismatch, it'd help for ZHA to emit a dedicated DeviceReinterviewedEvent/DeviceReplacedEvent before async_configure/async_initialize. Worth tracking on roadmap#93 as a multi-repo item.
Suggested test
Reinterview a device into a quirk (same Device class) that adds one entity and drops another; assert the corresponding DeviceEntityAddedEvent/DeviceEntityRemovedEvent fire and that re-init keeps _initialized=True (it's load-bearing and implicit — a future reset in teardown would silently reintroduce this bug).
Reference: re-interview event flow today
Event sequence for a re-interview today (CLICK TO EXPAND)
After zigpy fires device_reinterviewed, Gateway._async_device_reinterviewed runs the sequence below. Entity/configure events are emitted on the ZHA Device's stream (consumed by ZHADeviceProxy, subscribed via device.on_all_events); DeviceFullInitEvent is gateway-level (consumed by ZHAGatewayProxy, subscribed via gateway.on_all_events).
| # | ZHA step | Event emitted | HA handler → effect |
|---|---|---|---|
| 1 | async_teardown(emit_entity_events=True) |
DeviceEntityRemovedEvent (remove=False) × existing entities |
handle_zha_device_entity_removed_event → soft unload via SIGNAL_REMOVE_ENTITY_*; registry entry kept |
| 2a | in-place: _init_from_zigpy_device(new) |
— | — |
| 2b | swap: Device.new(new) + self._devices[ieee] = … |
— | — (proxy still bound to the old object) |
| 3 | async_configure() → emit_reconfigure_done() |
DeviceConfiguredEvent |
handle_zha_device_configured → SIGNAL_DEVICE_RECONFIGURE_EVENT (zha_channel_cfg_done), unsticks the reconfigure dialog |
| 4 | async_initialize() → _add_pending_entities(emit_event=self._initialized) |
DeviceEntityAddedEvent × new entities — only when _initialized is True |
handle_zha_device_entity_added_event → append EntityData + SIGNAL_ADD_ENTITIES |
| 5 | group.update_entity_subscriptions() |
group membership signals | group handlers refresh subscriptions |
| 6 | if all succeeded | DeviceFullInitEvent(CONFIGURED) (gateway-level) |
handle_device_fully_initialized → re-fetches proxy; rebuilds entity metadata only if new_join (False on reinterview) |
In-place path (same object, _initialized stays True): steps 1, 3 and 4 all fire on the object HA's proxy is subscribed to, so removals, the dialog-unstick, and the adds all reach HA → entities reconcile correctly.
Swap path (new object, fresh _initialized=False): step 1 still reaches HA (it fires on the old object, before the swap). But steps 3 and 4 fire on the new object — which HA's stale proxy isn't subscribed to — and step 4 is additionally silent (emit_event=False). The only thing that reaches HA is the gateway-level DeviceFullInitEvent (step 6), and it does nothing useful (new_join is False, and it just re-fetches the same stale proxy). Net: removals land, but the dialog-unstick and the new entities don't.
There was a problem hiding this comment.
The factory-identity check doesn't fix the common case (v2 QuirkBuilder changes still swap)
Following up on f17cb124: comparing zha_device_factory by identity (new_factory is old_factory) only newly changes behavior (vs the old old_entry is new_entry) for no-factory entries — still the minority. It's effectively a no-op for v2 QuirkBuilder quirks, which keep taking the swap branch and reproduce the original entity-event bug.
Why: QuirkBuilder mints a fresh per-quirk factory — QuirkV2Factory(base, quirk_definition) (zhaquirks/builder/builder.py:1086) — and for any standard v2 quirk base = QuirkV2Device (:1085), i.e. the same class for all of them. So two different v2 quirks hold distinct factory objects of the same class, and new_factory is old_factory is always False. Since the factory is 1:1 with its registry entry, for anything carrying a factory the new check is equivalent to the old old_entry is new_entry. The only transitions it newly sends to in-place are between no-factory entries — which the comparison normalizes to Device, and which include both no quirk and v1/legacy quirks (legacy entries set zha_device_factory=None, zhaquirks/__init__.py:551). That's exactly (and only) what test_reinterview_in_place_events exercises; test_reinterview_replaces_device_when_quirk_changes (:650) uses two QuirkBuilder quirks and asserts new_zha_device is not zha_device, locking the swap in for the common case.
What actually happens today (f17cb124)
| Reinterview transition | Path | HA outcome |
|---|---|---|
| No quirk → no quirk (clusters/endpoints change) | in-place | ✅ add/remove events reconcile |
| Same v2 quirk, re-interviewed | in-place | ✅ |
Different v2 quirk, same QuirkV2Device base (model/FW switch) |
swap | ❌ removals fire from the old object, but adds are silent and the proxy stays stale → new entities never appear |
| v2 quirk added / removed | swap | ❌ same as above |
| No-factory ⇄ no-factory (no quirk, v1/legacy — incl. one v1 → another) | in-place | ✅ (the one case the fix newly enables) |
Same custom ZHA Device subclass, re-interviewed |
in-place | ✅ if the subclass tolerates the _init_from_zigpy_device() rebuild |
Custom ZHA Device subclass changes (device-level quirk) |
swap | ❌ (swap genuinely required) |
Rows 3–4 are the common QuirkBuilder cases the fix was meant to cover but doesn't.
A class-only comparison is necessary but not sufficient
QuirkV2Device binds its definition at construction — self._quirk_definition = quirk_definition (zhaquirks/builder/device.py:37) — and doesn't override _init_from_zigpy_device, which re-reads the registry entry for quirk_class but never refreshes _quirk_definition. So forcing an in-place rebuild for a changed v2 quirk would keep the old definition → wrong entities/metadata, trading the missing-entity bug for a stale-metadata one. Any fix that just "rebuilds in place when the class is unchanged" has to also re-bind the new definition.
Two viable fixes
- Compare the produced class and refresh the definition in place — compare
QuirkV2Factory.base(and the directly-used class for custom-Devicequirks) instead of factory identity, and have the in-place rebuild re-bind_quirk_definitionfrom the newly-resolved entry. Same-class changes then reconcile via events with the correct definition; only a genuine class change swaps. - Keep the swap, make HA follow it — since
QuirkV2Deviceis definition-bound by design, a fresh object per definition change is arguably the cleaner model, in which case the fix belongs on the HA side: a repointableZHADeviceProxy(async_rebind) + entity refresh, ideally driven by a dedicatedDeviceReinterviewedEventemitted before configure/initialize (roadmap#93). This covers v2-definition changes and the genuine custom-Device-class swap uniformly, so it's the broader fix.
Test gap
Add a case that reinterviews between two QuirkBuilder quirks (e.g. firmware-version filters) that change the entity set, asserting the add/remove events fire and HA's entities reconcile — the scenario test_reinterview_in_place_events (no-factory entries) doesn't reach, and where the bug actually lives.
Correction: Option 1 can't cover applying or removing a quirk — so Option 2 is the load-bearing fix
The two options above aren't co-equal. A class-comparison fix (Option 1) can only ever keep object identity when the ZHA Device class is unchanged — but applying or removing a quirk is a class change:
- non-quirked device →
Device.newfalls through to baseDevice; - v2-quirked device →
QuirkV2FactoryproducesQuirkV2Device(a subclass ofDevice).
So Device ⇄ QuirkV2Device is a genuine class change, and Option 1 still swaps it:
| Transition | Class change | Under Option 1 |
|---|---|---|
v2_A → v2_B (same QuirkV2Device base) |
no | ✅ fixed (in-place + definition refresh) |
| no quirk → v2 quirk | Device → QuirkV2Device |
❌ still swaps |
| v2 quirk → no quirk | QuirkV2Device → Device |
❌ still swaps |
| v1 ↔ v2 | Device → QuirkV2Device |
❌ still swaps |
| custom subclass changes | yes | ❌ still swaps |
"No quirk → v2 quirk" is arguably the most common reinterview-driven transition (a firmware-gated quirk starting to match after an OTA, or a device that joined before its quirk existed and is reinterviewed after a zha-device-handlers update) — and Option 1 can't help it.
So Option 2 (make the swap HA-safe: repointable ZHADeviceProxy + entity refresh, ideally via a dedicated DeviceReinterviewedEvent) is required — it's the only thing that fixes applying/removing a quirk and the custom-subclass case. Option 1 is just a partial optimization that keeps v2→v2-same-base on the in-place path; it is not a substitute for Option 2.
|
The device swap out issue will be handled within Core: we can check device object identity to detect this. There is a circular dependency between zha-device-handlers and this repo (for CI) so this will have to be merged bypassing checks. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## dev #762 +/- ##
==========================================
- Coverage 97.41% 97.29% -0.13%
==========================================
Files 50 55 +5
Lines 10419 10932 +513
==========================================
+ Hits 10150 10636 +486
- Misses 269 296 +27 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
TheJulianJES
left a comment
There was a problem hiding this comment.
Looks good enough to me 😄
This is a test PR for inverting the quirks dependency chain. Currently, zigpy is responsible for defining the quirks API, ZHA defines the base entity models and ZCL autodiscovery, and quirks use the zigpy APIs to provide ZHA context on how to create custom entities. This PR inverts it: all quirks constants and definitions will move out of zigpy and into ZHA, quirks will import ZHA itself, and ZHA will unpin quirks as an
==dependency.Currently, we have a few APIs for matching devices for the purposes of modifying their clusters at runtime:
CustomDevice: this matches the full device signature exactly and provides a way to replace the device object directly. Not recommended.QuirkBuilder: matches by model and manufacturer, plus a firmware version filter. Also aggregates small operations for modifying clusters and endpoints but also provides a simplified API for exposing entities derived from ZCL attributes.@register_device: it sort of aggregates the two by having the ZHADevicesubclass "be" theQuirkBuilder. We'd need an API to also allow replacing the zigpyDeviceobject as well. This has been yanked out of zigpy by creating a "device resolver" callable that ZHA can provide. It replaceszigpy.quirks.get_deviceby allowing ZHA to itself provide a replacement device, avoiding the zigpy quirks registry entirely.As a test implementation, I've added support for the manufacturer-specific Hue lighting frame format in a fork of quirks. It look something like this (100% slop that manages to run):
I think we can greatly minimize the Python code required to implement custom entities with a better API. This has a few benefits:
This is part of / a requirement for: