Skip to content

Migrate quirks out of zigpy and provide direct entity access#762

Merged
puddly merged 42 commits into
zigpy:devfrom
puddly:puddly/decouple-zigpy
Jun 24, 2026
Merged

Migrate quirks out of zigpy and provide direct entity access#762
puddly merged 42 commits into
zigpy:devfrom
puddly:puddly/decouple-zigpy

Conversation

@puddly

@puddly puddly commented May 11, 2026

Copy link
Copy Markdown
Contributor

This is a test PR for inverting the quirks dependency chain. Currently, zigpy is responsible for defining the quirks API, ZHA defines the base entity models and ZCL autodiscovery, and quirks use the zigpy APIs to provide ZHA context on how to create custom entities. This PR inverts it: all quirks constants and definitions will move out of zigpy and into ZHA, quirks will import ZHA itself, and ZHA will unpin quirks as an == dependency.

Currently, we have a few APIs for matching devices for the purposes of modifying their clusters at runtime:

  1. CustomDevice: this matches the full device signature exactly and provides a way to replace the device object directly. Not recommended.
  2. QuirkBuilder: matches by model and manufacturer, plus a firmware version filter. Also aggregates small operations for modifying clusters and endpoints but also provides a simplified API for exposing entities derived from ZCL attributes.
  3. (proposed) @register_device: it sort of aggregates the two by having the ZHA Device subclass "be" the QuirkBuilder. We'd need an API to also allow replacing the zigpy Device object as well. This has been yanked out of zigpy by creating a "device resolver" callable that ZHA can provide. It replaces zigpy.quirks.get_device by allowing ZHA to itself provide a replacement device, avoiding the zigpy quirks registry entirely.

As a test implementation, I've added support for the manufacturer-specific Hue lighting frame format in a fork of quirks. It look something like this (100% slop that manages to run):

"""Philips Hue native control via cluster 0xFC03 — v3 device-level quirk."""

from __future__ import annotations

import asyncio
import logging
from typing import Any, Final

from zigpy.quirks import CustomCluster
import zigpy.types as t
from zigpy.zcl import ClusterType
from zigpy.zcl.clusters.general import LevelControl, OnOff
from zigpy.zcl.clusters.lighting import Color
from zigpy.zcl.foundation import BaseCommandDefs, ZCLCommandDef

from zha.application import Platform
from zha.application.platforms.light import Light
from zha.zigbee.device import Device, DeviceMatch, Replace, register_device

class HueNativeLight(Light):
    """Light driven by atomic Bifrost frames on cluster 0xFC03."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        self._hue_cluster_handler = self.cluster_handlers[HueLightCluster.ep_attribute]

    @property
    def effect_list(self) -> list[str] | None:
        return list(EFFECT_NAME_TO_HUE)

    def on_add(self) -> None:
        super().on_add()
        self._hue_cluster_handler.cluster.add_listener(self)

    def cluster_command(self, tsn: int, command_id: int, args: list[Any]) -> None:
        """Handle a server-issued cluster command on 0xFC03."""
        if command_id != HUE_NATIVE_COMMAND_ID:
            return
        payload: t.SerializableBytes = args[0]
        try:
            frame = decode(bytes(payload.serialize()))
        except ValueError:
            _LOGGER.exception("Failed to decode Hue native frame: %s", payload)
            return
        self._apply_frame(frame)

    def _apply_frame(self, frame: HueFrame) -> None:
        if frame.on_off is not None:
            self._state = frame.on_off
        if frame.brightness is not None:
            self._brightness = frame.brightness
        if frame.color_xy is not None:
            self._xy_color = frame.color_xy
        if frame.color_mirek is not None:
            self._color_temp = frame.color_mirek
        if frame.effect_type is not None:
            self._effect = HUE_TO_EFFECT_NAME.get(frame.effect_type, "none")
        self.maybe_emit_state_changed_event()

    async def _async_turn_on_impl(
        self,
        *,
        transition: float | None,
        brightness: int | None,
        effect: str | None,
        flash: Any,
        color_temp: int | None,
        xy_color: tuple[float, float] | None,
        duration: float,
        **_unused: Any,
    ) -> None:
        del flash, _unused  # Flash uses the Identify cluster, not Bifrost.
        frame = HueFrame(
            on_off=True,
            brightness=_clamp_brightness(brightness),
            color_mirek=color_temp,
            color_xy=xy_color,
            fade_speed=_seconds_to_fade_speed(transition or duration),
            effect_type=EFFECT_NAME_TO_HUE.get(effect) if effect else None,
        )
        await self._send_frame(frame)
        if effect and EFFECT_NAME_TO_HUE.get(effect) not in (None, HueEffect.NO_EFFECT):
            asyncio.get_running_loop().call_later(
                _POST_EFFECT_REFRESH_DELAY,
                lambda: asyncio.create_task(self._refresh_state()),
            )

    async def async_turn_off(self, *, transition: float | None = None) -> None:
        await self._send_frame(
            HueFrame(
                on_off=False,
                fade_speed=_seconds_to_fade_speed(transition),
            )
        )

    async def _send_frame(self, frame: HueFrame) -> None:
        payload = encode(frame)
        _LOGGER.debug("Hue native frame: %s", payload.hex())
        await self._hue_cluster_handler.cluster.hue_native(t.SerializableBytes(payload))

    async def _refresh_state(self) -> None:
        # OnOff/Level/Color report standard attributes; trigger a re-poll.
        await self._on_off_cluster_handler.cluster.read_attributes(
            ["on_off"], allow_cache=False
        )

@register_device
class HueLcx004(Device):
    """Hue Play gradient lightstrip (LCX004) — Bifrost-native control."""

    _device_match = DeviceMatch(
        manufacturers=frozenset({PHILIPS, SIGNIFY}),
        models=frozenset({"LCX004"}),
    )
    _operations = (
        Replace(
            endpoint_id=11,
            cluster_id=HUE_FC03_CLUSTER_ID,
            cluster_type=ClusterType.Server,
            replacement=HueLightCluster,
        ),
    )

    def discover_entities(self):
        for entity in super().discover_entities():
            # Skip existing Light entities
            if entity.PLATFORM == Platform.LIGHT:
                continue
            yield entity

        # Register new HueNativeLight entities
        for ep_id, endpoint in self.endpoints.items():
            if ep_id == 0:
                continue

            handlers = endpoint.all_cluster_handlers
            on_off = handlers.get(f"{ep_id}:0x{OnOff.cluster_id:04x}")
            fc03 = handlers.get(f"{ep_id}:0x{HUE_FC03_CLUSTER_ID:04x}")
            if on_off is None or fc03 is None:
                continue

            # Will be cleaned up once device handlers are removed
            cluster_handlers = [on_off, fc03]
            for opt_id in (LevelControl.cluster_id, Color.cluster_id):
                ch = handlers.get(f"{ep_id}:0x{opt_id:04x}")
                if ch is not None:
                    cluster_handlers.append(ch)

            yield HueNativeLight(
                cluster_handlers=cluster_handlers,
                endpoint=endpoint,
                device=self,
            )

I think we can greatly minimize the Python code required to implement custom entities with a better API. This has a few benefits:

  1. Complex devices can be supported via quirks directly. I originally planned this for a Tuya IR module and it would work the same.
  2. Quirks v2 can essentially be inverted to be a functional API built on top of this one. It would still be the preferred way to write quirks for simple devices. The only use case for this API is complex devices that have cross-entity interaction, hook into ZHA internals, etc.
  3. Zigpy can slowly shed the quirks APIs that (IMO) don't belong in it.

This is part of / a requirement for:

@puddly

puddly commented May 11, 2026

Copy link
Copy Markdown
Contributor Author

Here's the zigpy side: zigpy/zigpy@dev...puddly:puddly/quirks-v3-resolver
And quirks: zigpy/zha-device-handlers@dev...puddly:puddly/decouple-zigpy

I wouldn't recommend it but you can run ZHA with these three library branches in tandem. Nothing is visibly broken 😅

@zigpy-review-bot zigpy-review-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RFC engagement — click to expand (architectural notes: inversion direction, three matching mechanisms, public/internal API line, dispatch duplication)

Engaging with this as an RFC rather than a code review — happy this is being explored. The user-facing API (a Device subclass + DeviceMatch + Replace tuple + discover_entities override) reads well; reaching for it from the Hue native-control case shows it can carry weight that v2's declarative builder can't.

A few architectural things to push on before promoting from draft:

1. The inversion (and what the title promises) needs the companion zigpy branch to land further

The PR title is "Migrate quirks out of zigpy." The diff on https://github.com/puddly/zigpy/tree/puddly/decouple-zigpy doesn't actually remove anything from zigpy — it adds a _device_resolver callable (plus register_device_resolver and a device_resolver= kwarg to ControllerApplication.new) and routes _finalize_device / appdb._load_attributes through it. The legacy zigpy.quirks.get_device and the v1/v2 registry are unchanged, and zha.zigbee.device.resolve_device still falls back to zigpy.quirks.get_device(...) for unmatched devices.

In the meantime ZHA gains a duplicate API surface that has to be kept in sync with zigpy's: zha.application.EntityType / EntityPlatform shadowing zigpy.quirks.v2.EntityType / EntityPlatform, full BinarySensor/Sensor/Number device-class enums copied into zha/application/platforms/*/device_class.py (~950 LOC), and unit constants in zha.units expanded to mirror more of homeassistant.const. The new tools/compare_constants.py exists specifically to catch drift between ZHA's copies and HA's canonical definitions — useful, but the fact it's needed is a signal we're maintaining three copies (HA Core, zigpy quirks v2, ZHA), not eliminating one.

What would actually deliver on the title: a corresponding zigpy PR that removes zigpy.quirks.v2.homeassistant.* (the enums/units now mirrored in ZHA), and ideally moves EntityMetadata/QuirksV2RegistryEntry toward their final home. Tests in this PR still from zigpy.quirks.v2.homeassistant import EntityPlatform, EntityType; production code still does from zigpy.quirks.v2 import EntityMetadata, ZCLEnumMetadata, … in 10 places across zha/. For the inversion to be real, those should be importing from zha.* and the zigpy module should be the one missing.

Worth pinning down in the RFC: is the target end-state "zigpy knows nothing about HA semantics, ZHA owns all of quirks-v2," or just "ZHA can register a resolver"? They're very different scopes.

2. Three matching mechanisms now (CustomDevice / QuirkBuilder / @register_device) — deprecation story?

The PR body says the functional builder remains for simple cases and class-based is for complex devices, but the cut is fuzzy: a QuirkBuilder with one .replaces() and one .switch() and a @register_device Device subclass with one Replace operation produce almost the same surface. The Hue example shows what's uniquely possible (subscribing to manufacturer cluster commands inside the entity, applying multi-attribute frames atomically) — but that uniqueness comes from subclassing Light and overriding _async_turn_on_impl, not from the Device subclass itself. A QuirkBuilder that could attach a custom entity class via something like .add_entity_class(HueNativeLight, on=HueLightCluster) would cover the same use case without a third matching path.

Open questions worth answering before this stabilizes:

  • Is CustomDevice going away on a timeline? It's described as "Not recommended" in the PR body but isn't touched here.
  • Will existing v2 quirks migrate, or just coexist? zha-device-handlers has a lot of v2 quirks, and a migration script vs. a coexistence story have very different costs.
  • If QuirkBuilder is "inverted to be a functional API built on top of this one" (PR body), what does that desugaring look like? That's the answer that makes "three becomes two becomes one" feel achievable rather than aspirational.

3. Device.discover_entities as a public override point — stability cost

Quirks now subclass the ZHA Device and override discover_entities, with direct access to endpoint.all_cluster_handlers, Light.__init__'s positional kwargs, self.cluster_handlers[…], maybe_emit_state_changed_event, etc. That freezes a lot of ZHA-internal shape that's churned recently — #657 just reworked cluster handlers / entity attributes, and #716/#763 are actively touching _is_entity_removed_by_quirk and adjacent surfaces. Once a non-trivial number of quirks reach into Light internals (or any other entity class), refactors like #657 become breaking changes for an out-of-tree package.

The Hue example reaches into:

  • endpoint.all_cluster_handlers (a dict keyed by "{ep_id}:0x{cluster_id:04x}" strings — a format choice that's an implementation detail right now)
  • self.cluster_handlers[HueLightCluster.ep_attribute] (also a dict; key shape differs)
  • Light._state / Light._brightness / Light._xy_color / Light._color_temp / Light._effect (underscored attrs, currently free to rename)
  • self._on_off_cluster_handler (presumably ZHA's OnOff cluster handler — name shape is informal)
  • positional cluster_handlers=, endpoint=, device= on Light.__init__

Which of these are intended as public quirks API and which were stable-by-accident? A small Device API doc — or even just # Public: / # Internal: annotations on the methods/attrs quirks may touch — would let ZHA refactor freely below the line while keeping the contract intact above it. Otherwise the next #657-scale refactor lands as "please update your quirks" tickets in zha-device-handlers.

The prevent_default_entity_creation path (added in #716, refined in #763) and the in-iterator if entity.PLATFORM == Platform.LIGHT: continue filter in the Hue example do overlapping work. Worth clarifying which path quirks-v3 is supposed to use, since the example chose the manual filter.

4. Subclass dispatch happens at two layers — easy to drift

Both resolve_device (zigpy-time, applies _operations) and Device.new (ZHA-time, picks subclass) iterate DEVICE_QUIRKS and call cls.matches(zigpy_device). They could disagree if matching ever depended on mutable state (e.g. firmware version changed between the two calls). Caching the matched class on the zigpy device in resolve_device and reading it in Device.new would remove that possibility — and the matching cost.

Related, CoordinatorDevice is chosen by ieee equality in Device.new but never registered via @register_device, so it doesn't go through resolve_device (no operations to apply — fine) but the asymmetry between "hardcoded subclass for coordinator" and "registry-driven subclass for everything else" is worth a comment.

Companion-branches reality check

For the record: I checked out the puddly/decouple-zigpy branches on both zigpy and zha-device-handlers. The zigpy side is one WIP commit (the resolver hook, no quirks-API removal). The quirks side is one WIP commit adding only zhaquirks/philips/{bifrost,hue_native}.py (the example from the PR body). The trio is consistent if all three are pinned together. "Nothing visibly broken" reads roughly right for the happy path; current CI failures (_FakeApp mocks missing _device_resolver) are just version skew against unbumped zigpy and will resolve once that lands.

Roadmap context

OpenHomeFoundation/roadmap#93 names "improve APIs to make adding new devices simpler" as in-scope. This PR is a credible early step there — the API ergonomics for a complex device end up much closer to ad-hoc Python than the current v2 builder lets you get. The Tuya-IR-module use case the PR mentions is exactly where v2's declarative model breaks down, and a class-based shape is the right shape for that.

Small things noted but not blockers (RFC stage)

Quickly, since this is RFC and ergonomics matter more than polish right now: Replace.apply() pops the existing cluster before constructing the replacement (loses the original on construction failure); resolve_device logs at warning level on every matched join (should be debug once de-experimentalized); the broad except Exception in _discover_new_entities will silently swallow real construction bugs and could use the entity / endpoint / cluster handler in the log message; pyproject.toml is now zha-quirks>=1.2.0, which until the first quirks release that imports ZHA will pull in versions that don't actually use the new API. None of these change the design discussion.

What would let me sign off later: clarity on items 1 (target end-state for zigpy) and 2 (deprecation story for the three mechanisms), and a small thinking-out-loud doc on item 3 (which Device/Light surfaces are public quirks API). The rest is concrete code asks that flow naturally once those are settled.

Comment thread zha/zigbee/device.py Outdated
Comment thread zha/zigbee/device.py Outdated
Comment thread zha/zigbee/device.py
NUMBER = "number"
SENSOR = "sensor"
SELECT = "select"
SWITCH = "switch"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EntityPlatform and EntityType are defined here, but tests still from zigpy.quirks.v2.homeassistant import EntityPlatform, EntityType (e.g. tests/test_sensor.py:18, tests/test_device.py:19) and production from zigpy.quirks.v2 import EntityMetadata, ZCLEnumMetadata, … in ~10 places. Coexistence during migration is fine; flagging so the cleanup pass doesn't get lost when the companion zigpy PR drops the zigpy copies. The goal-state is presumably that zha.application.EntityPlatform is the only definition and zigpy.quirks.v2.homeassistant is gone.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Production side is done — zha.application owns EntityType/EntityPlatform and no zigpy.quirks.v2 imports remain under zha/. The test suite isn't migrated yet: tests/test_sensor.py:21, tests/test_device.py:19-20, tests/test_button.py:19 (and others) still from zigpy.quirks.v2[.homeassistant] import …, now resolving through the deprecation shims. Leaving this thread open until that cleanup lands.

@jaredjxyz

jaredjxyz commented Jun 2, 2026

Copy link
Copy Markdown

Hey @puddly I stumbled upon this after doing a little investigating myself into what it would take to get the Hue bulb's 0xFC03 working myself - and came to a lot of the same conclusions you seem to have come to (that the responsibilities are backward and you should be implementing the quirk behavior in the quirk) in this PR, so happily surprised to see this being worked on.

Crossing my fingers that something like this eventually lands, It looks like just a POC for now, lots of changes here. But it looks to my (again, relatively new to the codebase but experienced in SW development) eye to be a promising direction, at least for my use case

Thanks for working on this!

@Hedda

Hedda commented Jun 10, 2026

Copy link
Copy Markdown

Hey @puddly I stumbled upon this after doing a little investigating myself into what it would take to get the Hue bulb's 0xFC03 working myself - and came to a lot of the same conclusions you seem to have come to (that the responsibilities are backward and you should be implementing the quirk behavior in the quirk) in this PR, so happily surprised to see this being worked on.

@jaredjxyz For reference, FYI, also see these related requests/discussions asking for zha and ZHA Device Handlers quirks to be able to nativly support and handle these type of "Hue native control" Dynamic Scenes Effects / Animations. Checkout:

and

Note! Another device that could also use native control of effect commands via quirks is Lidl Christmas led string HG0646:

Lidl Christmas led string HG0646 uses cluster 0x00FC to invoke invoked native effects commands for their device:

@Orange-GT3

Copy link
Copy Markdown

@puddly - I'm putting my hand up to help test when the time is right. If that is out of the normal beta cycle, that's OK too but I'll need a muppet-proof method to install it.

I don't normally engage in testing the betas but I'm so keen to have full Hue support that I will for this :).

@puddly puddly force-pushed the puddly/decouple-zigpy branch from 77c0f49 to 4feb202 Compare June 19, 2026 00:01

@zigpy-review-bot zigpy-review-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_async_device_reinterviewed: the old_entry is not new_entry swap silently drops entity updates

The swap branch fires whenever the resolved quirk entry changes:

else:
    await zha_device.async_teardown(emit_entity_events=True)
    zha_device = Device.new(new_zigpy_device, self)
    self._devices[new_zigpy_device.ieee] = zha_device

This is both broader than necessary and, for a common case, incorrect: a firmware change that selects a different v2 quirk (different entities, but still the base Device class) hits this branch, and the entity changes never reach HA. The in-place rebuild path it should take handles them correctly.

Full trace: why the swap drops entity updates and the in-place path doesn't (CLICK TO EXPAND)

The bug — firmware-version quirks lose their entity changes in HA. A very common reinterview trigger is a firmware change selecting a different v2 quirk that adds/removes entities but still resolves to the base Device class (no custom zha_device_factory). Those hit the else branch today, and the new entities never reach HA:

  • Device.new returns a fresh object with self._initialized = False.
  • On its async_initialize, await self._add_pending_entities(emit_event=self._initialized) runs with emit_event=False, so entities are added silently — no DeviceEntityAddedEvent.
  • HA's handle_device_fully_initialized (the event you emit afterward) fetches the proxy and only rebuilds entity metadata when event.new_join is True — false for a reinterview. And _async_get_or_create_device_proxy returns the stale proxy (still bound to the torn-down object), so HA isn't tracking the new device at all.

Result: a firmware update that should add/remove entities produces no entity change in HA.

The in-place path already does this correctly. async_rebuild_from_zigpy_device keeps object identity, and _initialized stays True across it (neither async_teardown nor _init_from_zigpy_device resets it). So its async_initialize runs _add_pending_entities(emit_event=True)DeviceEntityAddedEvent per entity → HA's handle_zha_device_entity_added_event creates them; teardown's soft removals (remove=False) unload the old ones while preserving their registry entries (so unchanged unique_ids re-attach with customizations intact). The emit_event=self._initialized gate is designed for exactly this ("Emit events only on re-initialization, not the first").

Fix: narrow the swap to genuine Device-class changes

Only swap when the resolved class actually differs (custom zha_device_factory); otherwise rebuild in place — _init_from_zigpy_device re-reads the new quirk's metadata, so the new entity set is picked up:

def _factory(entry):
    return entry.zha_device_factory if (entry and entry.zha_device_factory) else Device

if _factory(old_entry) is _factory(new_entry):
    await zha_device.async_rebuild_from_zigpy_device(new_zigpy_device)   # same object, entities propagate
else:
    ...  # Device.new — genuine class change

This routes all standard/firmware-version quirk changes onto the working path and confines the swap to genuine base⇄custom-subclass transitions.

The residual class-change swap still needs a paired HA Core change (CLICK TO EXPAND)

For the remaining case — a device-level custom-Device quirk applied or removed via reinterview — the new object is unavoidable (you can't cleanly re-class a live instance), and current HA Core can't follow the swap:

  • ZHADeviceProxy.__init__ binds self.device and subscribes self.device.on_all_events(...) once, and .device is never reassigned anywhere.
  • _async_get_or_create_device_proxy is keyed by IEEE and returns the existing proxy unchanged — no repoint, no re-subscribe.
  • There's no handle_device_reinterviewed; the DeviceFullInitEvent you emit lands on the stale proxy and skips entity rebuild (new_join is False).

So after a genuine class-change swap, HA keeps a proxy bound to the torn-down object, the new device's events never reach it, and entities aren't refreshed. (The HA device-registry entry survives — it's keyed by IEEE — so it's "only" stale entities/events, but it's user-visible.)

The HA side needs:

  1. a repointable ZHADeviceProxy (async_rebind: unsub → set self.device → re-subscribe on_all_events + the sw-version sub),
  2. _async_get_or_create_device_proxy calling it when an existing proxy wraps a different object, and
  3. running the existing _create_entity_metadata + SIGNAL_ADD_ENTITIES path on a swap (today gated behind new_join).

To make the rebind deterministic rather than inferred from DeviceFullInitEvent + an object-identity mismatch, it'd help for ZHA to emit a dedicated DeviceReinterviewedEvent/DeviceReplacedEvent before async_configure/async_initialize. Worth tracking on roadmap#93 as a multi-repo item.

Suggested test

Reinterview a device into a quirk (same Device class) that adds one entity and drops another; assert the corresponding DeviceEntityAddedEvent/DeviceEntityRemovedEvent fire and that re-init keeps _initialized=True (it's load-bearing and implicit — a future reset in teardown would silently reintroduce this bug).

Reference: re-interview event flow today

Event sequence for a re-interview today (CLICK TO EXPAND)

After zigpy fires device_reinterviewed, Gateway._async_device_reinterviewed runs the sequence below. Entity/configure events are emitted on the ZHA Device's stream (consumed by ZHADeviceProxy, subscribed via device.on_all_events); DeviceFullInitEvent is gateway-level (consumed by ZHAGatewayProxy, subscribed via gateway.on_all_events).

# ZHA step Event emitted HA handler → effect
1 async_teardown(emit_entity_events=True) DeviceEntityRemovedEvent (remove=False) × existing entities handle_zha_device_entity_removed_event → soft unload via SIGNAL_REMOVE_ENTITY_*; registry entry kept
2a in-place: _init_from_zigpy_device(new)
2b swap: Device.new(new) + self._devices[ieee] = … — (proxy still bound to the old object)
3 async_configure()emit_reconfigure_done() DeviceConfiguredEvent handle_zha_device_configuredSIGNAL_DEVICE_RECONFIGURE_EVENT (zha_channel_cfg_done), unsticks the reconfigure dialog
4 async_initialize()_add_pending_entities(emit_event=self._initialized) DeviceEntityAddedEvent × new entities — only when _initialized is True handle_zha_device_entity_added_event → append EntityData + SIGNAL_ADD_ENTITIES
5 group.update_entity_subscriptions() group membership signals group handlers refresh subscriptions
6 if all succeeded DeviceFullInitEvent(CONFIGURED) (gateway-level) handle_device_fully_initialized → re-fetches proxy; rebuilds entity metadata only if new_join (False on reinterview)

In-place path (same object, _initialized stays True): steps 1, 3 and 4 all fire on the object HA's proxy is subscribed to, so removals, the dialog-unstick, and the adds all reach HA → entities reconcile correctly.

Swap path (new object, fresh _initialized=False): step 1 still reaches HA (it fires on the old object, before the swap). But steps 3 and 4 fire on the new object — which HA's stale proxy isn't subscribed to — and step 4 is additionally silent (emit_event=False). The only thing that reaches HA is the gateway-level DeviceFullInitEvent (step 6), and it does nothing useful (new_join is False, and it just re-fetches the same stale proxy). Net: removals land, but the dialog-unstick and the new entities don't.

@zigpy-review-bot zigpy-review-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The factory-identity check doesn't fix the common case (v2 QuirkBuilder changes still swap)

Following up on f17cb124: comparing zha_device_factory by identity (new_factory is old_factory) only newly changes behavior (vs the old old_entry is new_entry) for no-factory entries — still the minority. It's effectively a no-op for v2 QuirkBuilder quirks, which keep taking the swap branch and reproduce the original entity-event bug.

Why: QuirkBuilder mints a fresh per-quirk factory — QuirkV2Factory(base, quirk_definition) (zhaquirks/builder/builder.py:1086) — and for any standard v2 quirk base = QuirkV2Device (:1085), i.e. the same class for all of them. So two different v2 quirks hold distinct factory objects of the same class, and new_factory is old_factory is always False. Since the factory is 1:1 with its registry entry, for anything carrying a factory the new check is equivalent to the old old_entry is new_entry. The only transitions it newly sends to in-place are between no-factory entries — which the comparison normalizes to Device, and which include both no quirk and v1/legacy quirks (legacy entries set zha_device_factory=None, zhaquirks/__init__.py:551). That's exactly (and only) what test_reinterview_in_place_events exercises; test_reinterview_replaces_device_when_quirk_changes (:650) uses two QuirkBuilder quirks and asserts new_zha_device is not zha_device, locking the swap in for the common case.

What actually happens today (f17cb124)

Reinterview transition Path HA outcome
No quirk → no quirk (clusters/endpoints change) in-place ✅ add/remove events reconcile
Same v2 quirk, re-interviewed in-place
Different v2 quirk, same QuirkV2Device base (model/FW switch) swap ❌ removals fire from the old object, but adds are silent and the proxy stays stale → new entities never appear
v2 quirk added / removed swap ❌ same as above
No-factory ⇄ no-factory (no quirk, v1/legacy — incl. one v1 → another) in-place ✅ (the one case the fix newly enables)
Same custom ZHA Device subclass, re-interviewed in-place ✅ if the subclass tolerates the _init_from_zigpy_device() rebuild
Custom ZHA Device subclass changes (device-level quirk) swap ❌ (swap genuinely required)

Rows 3–4 are the common QuirkBuilder cases the fix was meant to cover but doesn't.

A class-only comparison is necessary but not sufficient

QuirkV2Device binds its definition at construction — self._quirk_definition = quirk_definition (zhaquirks/builder/device.py:37) — and doesn't override _init_from_zigpy_device, which re-reads the registry entry for quirk_class but never refreshes _quirk_definition. So forcing an in-place rebuild for a changed v2 quirk would keep the old definition → wrong entities/metadata, trading the missing-entity bug for a stale-metadata one. Any fix that just "rebuilds in place when the class is unchanged" has to also re-bind the new definition.

Two viable fixes

  1. Compare the produced class and refresh the definition in place — compare QuirkV2Factory.base (and the directly-used class for custom-Device quirks) instead of factory identity, and have the in-place rebuild re-bind _quirk_definition from the newly-resolved entry. Same-class changes then reconcile via events with the correct definition; only a genuine class change swaps.
  2. Keep the swap, make HA follow it — since QuirkV2Device is definition-bound by design, a fresh object per definition change is arguably the cleaner model, in which case the fix belongs on the HA side: a repointable ZHADeviceProxy (async_rebind) + entity refresh, ideally driven by a dedicated DeviceReinterviewedEvent emitted before configure/initialize (roadmap#93). This covers v2-definition changes and the genuine custom-Device-class swap uniformly, so it's the broader fix.

Test gap

Add a case that reinterviews between two QuirkBuilder quirks (e.g. firmware-version filters) that change the entity set, asserting the add/remove events fire and HA's entities reconcile — the scenario test_reinterview_in_place_events (no-factory entries) doesn't reach, and where the bug actually lives.


Correction: Option 1 can't cover applying or removing a quirk — so Option 2 is the load-bearing fix

The two options above aren't co-equal. A class-comparison fix (Option 1) can only ever keep object identity when the ZHA Device class is unchanged — but applying or removing a quirk is a class change:

  • non-quirked device → Device.new falls through to base Device;
  • v2-quirked device → QuirkV2Factory produces QuirkV2Device (a subclass of Device).

So DeviceQuirkV2Device is a genuine class change, and Option 1 still swaps it:

Transition Class change Under Option 1
v2_A → v2_B (same QuirkV2Device base) no ✅ fixed (in-place + definition refresh)
no quirk → v2 quirk DeviceQuirkV2Device ❌ still swaps
v2 quirk → no quirk QuirkV2DeviceDevice ❌ still swaps
v1 ↔ v2 DeviceQuirkV2Device ❌ still swaps
custom subclass changes yes ❌ still swaps

"No quirk → v2 quirk" is arguably the most common reinterview-driven transition (a firmware-gated quirk starting to match after an OTA, or a device that joined before its quirk existed and is reinterviewed after a zha-device-handlers update) — and Option 1 can't help it.

So Option 2 (make the swap HA-safe: repointable ZHADeviceProxy + entity refresh, ideally via a dedicated DeviceReinterviewedEvent) is required — it's the only thing that fixes applying/removing a quirk and the custom-subclass case. Option 1 is just a partial optimization that keeps v2→v2-same-base on the in-place path; it is not a substitute for Option 2.

@TheJulianJES TheJulianJES changed the title RFC: Migrate quirks out of zigpy and provide direct entity access Migrate quirks out of zigpy and provide direct entity access Jun 24, 2026
@puddly puddly marked this pull request as ready for review June 24, 2026 01:08
Copilot AI review requested due to automatic review settings June 24, 2026 01:08

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

@puddly

puddly commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

The device swap out issue will be handled within Core: we can check device object identity to detect this.

There is a circular dependency between zha-device-handlers and this repo (for CI) so this will have to be merged bypassing checks.

Comment thread tests/data/devices/ubisys-j1-5502.json Outdated
@codecov

codecov Bot commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.29%. Comparing base (a2ddc14) to head (36e9d31).

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #762      +/-   ##
==========================================
- Coverage   97.41%   97.29%   -0.13%     
==========================================
  Files          50       55       +5     
  Lines       10419    10932     +513     
==========================================
+ Hits        10150    10636     +486     
- Misses        269      296      +27     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@TheJulianJES TheJulianJES left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good enough to me 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Further shape Zigbee support by implementing missing features in ZHA while retaining choice

7 participants