[codex] Reduce serializer hot-path allocations #317

Merged
vinitkumar merged 2 commits into master from codex/deep-memory-optimization
Jun 9, 2026
Conversation

@vinitkumar vinitkumar commented Jun 9, 2026

Owner

Summary

  • Reduce pure-Python XML serializer hot-path allocations by streaming nested dict/list output through the shared writer path.
  • Keep typed attribute formatting centralized through make_attrstring() while shallow-copying only when the type value must be overlaid.
  • Localize @attrs / @val normalization in dict element handling and remove the extra skip_attrs / skip_key plumbing.
  • Factor list item base-attribute construction so id handling is shared across scalar, dict, list, date, and null branches.
  • Add regression coverage for typed attrs, invalid-name list metadata with ids, coercible @attrs pairs, and fast-wrapper fallback behavior.
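
The @attrs / @val normalization described above can be illustrated with a minimal sketch. The helper name `split_attrs_and_payload` and its signature are hypothetical and are not taken from json2xml; the sketch only assumes that `@attrs` may be any value `dict()` can coerce and that `@val`, when present, carries the payload.

```python
from typing import Any


def split_attrs_and_payload(
    item: dict[str, Any], inherited_attr: dict[str, Any]
) -> tuple[dict[str, Any], Any]:
    """Hypothetical helper: return (element attributes, payload to serialize)."""
    if "@attrs" in item:
        raw_attrs = item["@attrs"]
        # Accept anything dict() can coerce, such as a list of (key, value) pairs.
        attrs = raw_attrs if isinstance(raw_attrs, dict) else dict(raw_attrs)
        if "@val" in item:
            payload = item["@val"]
        else:
            # No explicit value: the remaining keys become child elements.
            payload = {k: v for k, v in item.items() if k != "@attrs"}
        return attrs, payload
    # No custom attributes: reuse the caller's mapping without copying it.
    return inherited_attr, item.get("@val", item)
```

Doing this once per dict element means the recursive helpers downstream only ever see a plain payload, which is why no skip flags need to travel with it.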

Review Follow-Up

Addressed the Sourcery complexity review by:

  • Replacing the custom make_typed_attrstring() assembly logic with a shallow copy plus make_attrstring().
  • Normalizing @attrs once inside _append_dict2xml_str(), then passing normal raw payloads through _append_rawitem().
  • Removing skip_attrs and skip_key from the recursive append helpers.
  • Restoring _XML_ESCAPE_CHARS.intersection(s) for the XML escaping fast path.
  • Updating the architecture notes so they describe the final centralized attribute behavior.
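
As a rough illustration of the first follow-up item, the sketch below overlays a type value on a shallow copy and hands the result to a single formatter. `format_attrs` is a simplified, hypothetical stand-in for the library's make_attrstring(), and `typed_attrstring` is likewise an illustrative name; the real helpers may escape and order attributes differently.

```python
from html import escape
from typing import Any


def format_attrs(attr: dict[str, Any]) -> str:
    """Simplified stand-in for a centralized attribute formatter (assumption)."""
    attrstring = " ".join(f'{k}="{escape(str(v))}"' for k, v in attr.items())
    return f" {attrstring}" if attrstring else ""


def typed_attrstring(attr: dict[str, Any], xml_type: str) -> str:
    """Overlay a type attribute on a shallow copy, leaving the caller's dict alone."""
    typed_attr = dict(attr)  # shallow copy, made only on the typed path
    typed_attr["type"] = xml_type  # an existing "type" key keeps its position
    return format_attrs(typed_attr)


caller_attrs = {"type": "custom", "id": "a1"}
print(typed_attrstring(caller_attrs, "str"))  # type is replaced in the output only
print(caller_attrs)  # the caller's mapping is unchanged
```

The copy costs one small dict per typed element, but all quoting and ordering rules stay in one formatter.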

Benchmark

Compared branch head 00c0540 against baseline commit 9463457 using Python 3.14.4. Payload construction happens before tracing; each case warms once, then records 5 conversion samples with time.perf_counter() and tracemalloc.

| Case | Baseline mean | New mean | Time change | Baseline peak | New peak | Peak change |
| --- | --- | --- | --- | --- | --- | --- |
| attrs_nested / 8,000 records | 402.787 ms | 328.192 ms | -18.5% | 2,881,956 B | 2,872,548 B | -9,408 B |
| plain_strings / 12,000 records | 687.463 ms | 534.795 ms | -22.2% | 4,796,815 B | 4,787,079 B | -9,736 B |

Peak memory moves modestly because the final XML bytes dominate the traced peak; the main win is removing transient allocations and repeated work along recursive serializer paths.
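
A measurement loop of the kind described above might look like the following sketch. It is not the script used for the numbers in this PR: the function name `measure`, the sample payload, and the choice to time and trace in the same run are all assumptions, and `repr` merely stands in for the serializer call.

```python
import time
import tracemalloc
from statistics import mean
from typing import Any, Callable


def measure(
    convert: Callable[[Any], Any], payload: Any, samples: int = 5
) -> tuple[float, int]:
    """Return (mean seconds, highest traced peak in bytes) over `samples` runs."""
    convert(payload)  # warm-up run, not recorded
    durations: list[float] = []
    peaks: list[int] = []
    for _ in range(samples):
        tracemalloc.start()
        start = time.perf_counter()
        convert(payload)
        durations.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peaks.append(peak)
    return mean(durations), max(peaks)


if __name__ == "__main__":
    # The payload is built before tracing starts, so its memory is not counted.
    records = [{"name": f"item-{i}", "note": "plain text"} for i in range(1000)]
    mean_s, peak_b = measure(repr, records)  # repr stands in for the serializer
    print(f"mean {mean_s * 1000:.3f} ms, peak {peak_b:,} B")
```

Note that tracemalloc adds overhead to every allocation, so timings taken while tracing are only comparable with other timings taken the same way.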

Validation

  • python3 -m ruff check json2xml/dicttoxml.py json2xml/dicttoxml_fast.py tests/test_dicttoxml_unit.py tests/test_dict2xml.py
  • python3 -m pytest --cov=json2xml --cov-report=term-missing --cov-fail-under=100 (401 passed, 100.00%)
  • lat check (passes; reports the existing missing LLM key warning for semantic search)
  • git diff --check

Avoid per-call set intersections in XML escaping, skip attribute dictionary copies on typed scalar and @attrs paths, avoid scalar list item attr dict allocation where metadata is already reusable, and pass namespace defaults through the fast wrapper without materializing an empty mapping.

Benchmarked against 9463457 with 5 warm samples per case:

- attrs_nested: 552.518ms -> 359.204ms mean, peak 16,590,571 -> 16,580,643 bytes
- plain_strings: 800.088ms -> 530.016ms mean, peak 9,538,330 -> 9,528,594 bytes

@vinitkumar vinitkumar marked this pull request as ready for review June 9, 2026 18:33
sourcery-ai Bot commented Jun 9, 2026

Contributor

Reviewer's Guide

Optimizes XML serialization hot paths to reduce per-value allocations: it eliminates unnecessary data structure copies, refines attribute handling (including typed attributes and @attrs), reuses metadata, and aligns the fast-path entry point with the pure-Python implementation. Regression tests and documentation updates lock in the behavior.

Sequence diagram for list item serialization with @attrs and IDs

sequenceDiagram
    participant Client
    participant dicttoxml as dicttoxml
    participant convert_list as _append_convert_list
    participant dict2xml as _append_dict2xml_str
    participant rawitem as _append_rawitem
    participant conv_dict as _append_convert_dict

    Client->>dicttoxml: dicttoxml(obj, ids=True)
    dicttoxml->>convert_list: _append_convert_list(items)
    convert_list->>dict2xml: _append_dict2xml_str(item_dict, ids, attr_type, ...)
    activate dict2xml
    dict2xml->>dict2xml: detect has_custom_attrs
    dict2xml->>dict2xml: val_attr = item["@attrs"] or attr
    dict2xml->>dict2xml: rawitem = item
    dict2xml->>rawitem: _append_rawitem(rawitem, ids, attr_type, ..., skip_attrs=True)
    deactivate dict2xml

    activate rawitem
    rawitem->>rawitem: [rawitem is not scalar]
    rawitem->>conv_dict: _append_convert_dict(rawitem, ids, ..., skip_key="@attrs")
    deactivate rawitem

    activate conv_dict
    conv_dict->>conv_dict: iterate keys, skip "@attrs"
    conv_dict-->>Client: write child elements without copying attrs
    deactivate conv_dict

File-Level Changes

Change Details Files
Optimize XML escaping and attribute string construction to avoid unnecessary allocations while preserving behavior.
  • Remove global frozenset of XML escape chars and replace with direct character membership checks to fast-path strings that need no escaping.
  • Refine make_attrstring to use a generator expression instead of a list comprehension for joining attribute key/value pairs.
  • Introduce make_typed_attrstring helper that emits a type attribute without copying or mutating caller attribute dicts while preserving ordering and handling existing 'type' keys.
json2xml/dicttoxml.py
Improve dict and list serialization hot paths to reuse attribute metadata, skip internal keys when needed, and normalize boolean output.
  • Change @attrs handling in _append_dict2xml_str to reuse caller attribute mappings when safe, accept non-dict coercible pairs, and pass a skip_attrs flag to prevent double emission of @attrs content.
  • Extend _append_rawitem to accept a skip_attrs flag, short-circuiting to _append_convert_dict with a skip_key when @attrs is present and no @val is provided.
  • Modify _append_convert_dict to accept an optional skip_key parameter and skip that key in iteration, avoiding rebuilding dicts without @attrs.
  • Refactor _append_convert_list to avoid constructing a new base attr dict when ids are disabled, reusing precomputed scalar and item_name attribute dicts, and to emit booleans as explicit 'true'/'false' strings.
  • Update convert_kv_valid_name, convert_bool_valid_name, and convert_none_valid_name to use make_typed_attrstring instead of copying and mutating attr dicts, and to standardize boolean text output to 'true'/'false'.
json2xml/dicttoxml.py
Align namespace handling and fast serializer behavior with the pure-Python implementation while reducing temporary allocations.
  • Accumulate XML namespace fragments into a list and join once, instead of repeatedly concatenating namespace_str in dicttoxml.
  • Change dicttoxml_fast.dicttoxml to pass xml_namespaces through unchanged (possibly None) so the fast path mirrors the pure-Python default handling and avoids eagerly materializing an empty dict.
json2xml/dicttoxml.py
json2xml/dicttoxml_fast.py
Add and refine tests to lock in typed attribute semantics, @attrs coercion, and invalid XML name behavior with IDs.
  • Add unit test ensuring valid-name scalar helpers replace type attributes in emitted XML without mutating or reordering caller attribute dictionaries.
  • Add regression test that @attrs accepts any input coercible by dict() (e.g., list of pairs) while preserving legacy behavior and output formatting.
  • Tighten invalid-name list conversion tests to assert correct id emission, name metadata, and single-escaped name attributes for scalar list items.
  • Update LAT test documentation to describe expectations for special attributes, typed attributes, and XML name normalization without double escaping.
tests/test_dicttoxml_unit.py
tests/test_dict2xml.py
lat.md/tests.md
Document hot-path serializer behavior and attribute handling in the architecture notes.
  • Extend architecture docs to mention that dicttoxml hot scalar and @attrs paths avoid copying attribute dictionaries and that public helpers delegate to the same streaming append logic for efficiency.
lat.md/architecture.md
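
The namespace change listed above (collect fragments in a list, join once, and tolerate a `None` mapping) can be sketched as follows. `build_namespace_string` is a hypothetical name, and the sketch covers only plain prefix-to-URI entries; any special handling the library applies to particular prefixes is deliberately left out.

```python
from __future__ import annotations


def build_namespace_string(xml_namespaces: dict[str, str] | None) -> str:
    """Hypothetical helper: collect namespace attributes, then join once."""
    if not xml_namespaces:
        # None and {} behave the same, so callers need not build an empty dict.
        return ""
    fragments: list[str] = []
    for prefix, uri in xml_namespaces.items():
        if prefix == "xmlns":
            fragments.append(f' xmlns="{uri}"')  # default namespace
        else:
            fragments.append(f' xmlns:{prefix}="{uri}"')
    return "".join(fragments)  # single join instead of repeated concatenation
```

Because the function treats `None` like an empty mapping, a wrapper can pass its default through untouched rather than allocating `{}` on every call.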


codecov Bot commented Jun 9, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (9463457) to head (00c0540).

Additional details and impacted files
@@            Coverage Diff            @@
##            master      #317   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            6         6           
  Lines          609       616    +7     
=========================================
+ Hits           609       616    +7     
Flag Coverage Δ
unittests 100.00% <100.00%> (ø)


@sourcery-ai sourcery-ai Bot left a comment

Contributor

Hey - I've found 1 issue

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="json2xml/dicttoxml.py" line_range="165" />
<code_context>
     return f" {attrstring}"


+def make_typed_attrstring(attr: dict[str, Any], xml_type: str) -> str:
+    """Create XML attributes with a type value without copying caller attrs."""
+    if not attr:
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying new helpers and control flow (typed attributes, @attrs handling, list attr construction, and XML escaping) to reuse existing abstractions and avoid scattered special cases.

You can keep the behavioral changes while reducing the new complexity in three focused spots:

---

### 1. `make_typed_attrstring` can be simplified via `make_attrstring`

Right now `make_typed_attrstring` reimplements most of `make_attrstring`, adds branching, and makes attribute ordering harder to reason about.

You can reuse the existing abstraction and keep the “don’t mutate caller attrs” guarantee by shallow-copying:

```python
def make_typed_attrstring(attr: dict[str, Any], xml_type: str) -> str:
    """Create XML attributes with a type value without mutating caller attrs."""
    if not attr:
        return f' type="{xml_type}"'

    typed_attr = dict(attr)  # shallow copy
    typed_attr["type"] = xml_type
    return make_attrstring(typed_attr)
```

Usage sites (e.g. `convert_kv_valid_name`, `convert_bool_valid_name`, `convert_none_valid_name`) remain the same:

```python
attr_string = make_typed_attrstring(attr, get_xml_type(val)) if attr_type else make_attrstring(attr)
```

This removes the custom assembly logic and keeps attribute behavior centralized in `make_attrstring`.

---

### 2. Localize `@attrs` handling instead of `skip_attrs` / `skip_key` plumbing

The current flow:

- `_append_dict2xml_str` computes `has_custom_attrs`, `raw_attrs`, `val_attr`
- Passes `skip_attrs` into `_append_rawitem`
- `_append_rawitem` may call `_append_convert_dict` with `skip_key="@attrs"`

This spreads a single concern across three functions.

You can normalize the dict once in `_append_dict2xml_str` and then call `_append_rawitem`/`_append_convert_dict` without extra flags:

```python
def _append_dict2xml_str(...):
    ...
    has_custom_attrs = "@attrs" in item
    if has_custom_attrs:
        raw_attrs = item["@attrs"]
        val_attr = raw_attrs if isinstance(raw_attrs, dict) else dict(raw_attrs)
        # strip @attrs when it's not the scalar payload
        rawitem = item["@val"] if "@val" in item else {
            k: v for k, v in item.items() if k != "@attrs"
        }
    else:
        val_attr = attr
        rawitem = item.get("@val", item)
    ...

    if parentIsList and list_headers:
        ...
        _append_rawitem(
            output,
            rawitem,
            ids,
            attr_type,
            item_func,
            cdata,
            item_wrap,
            item_name,
            list_headers,
        )
    ...
```

Then `_append_rawitem` and `_append_convert_dict` no longer need `skip_attrs` / `skip_key`:

```python
def _append_rawitem(...):
    if rawitem is None:
        return
    if isinstance(rawitem, bool):
        output.write("true" if rawitem else "false")
    elif isinstance(rawitem, (str, numbers.Number)):
        output.write(escape_xml(str(rawitem)))
    else:
        _append_convert(
            output,
            rawitem,
            ids,
            attr_type,
            item_func,
            cdata,
            item_wrap,
            item_name,
            list_headers=list_headers,
        )

def _append_convert_dict(...):  # remove skip_key
    for key, val in obj.items():
        ...
```

The behavior for `@attrs`/`@val` stays the same, but all the special-casing is localized to a single, easy-to-find place.

---

### 3. Factor out `attr` construction in `_append_convert_list`

The new version duplicates `attr` construction logic in each branch and repeats the `id` handling. You can compute a base attribute once per iteration and layer type-specific attrs on top:

```python
for i, item in enumerate(items):
    base_attr: dict[str, Any] | None = None
    if ids:
        base_attr = {"id": f"{this_id}_{i + 1}"}

    if isinstance(item, bool):
        attr = dict(base_attr) if base_attr else {}
        if item_name_attr:
            attr.update(item_name_attr)
        output.write(convert_bool_valid_name(item_name, item, attr_type, attr))

    elif isinstance(item, (numbers.Number, str)):
        attr = dict(base_attr) if base_attr else {}
        if scalar_key_attr:
            attr.update(scalar_key_attr)
        output.write(
            convert_kv_valid_name(
                key=scalar_key,
                val=item,
                attr_type=attr_type,
                attr=attr,
                cdata=cdata,
            )
        )

    elif hasattr(item, "isoformat"):
        attr = dict(base_attr) if base_attr else {}
        if item_name_attr:
            attr.update(item_name_attr)
        output.write(
            convert_kv_valid_name(
                key=item_name,
                val=item.isoformat(),
                attr_type=attr_type,
                attr=attr,
                cdata=cdata,
            )
        )

    elif isinstance(item, dict):
        attr = {} if not base_attr else dict(base_attr)
        _append_dict2xml_str(..., attr=attr, ...)

    elif isinstance(item, Sequence):
        attr = {} if not base_attr else dict(base_attr)
        _append_list2xml_str(..., attr=attr, ...)

    elif item is None:
        attr = dict(base_attr) if base_attr else {}
        if item_name_attr:
            attr.update(item_name_attr)
        output.write(convert_none_valid_name(item_name, attr_type, attr))
    ...
```

This keeps the per-branch logic focused on type-specific behavior while centralizing the `id`/base-attr handling.
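
One possible way to take the same idea a step further is to move the repeated `dict(base_attr) if base_attr else {}` pattern into a small helper. The name `item_attr` and its parameters are hypothetical and do not appear in json2xml; the id format simply mirrors the `f"{this_id}_{i + 1}"` expression used in the snippet above.

```python
from __future__ import annotations

from typing import Any


def item_attr(
    ids: bool, parent_id: str, index: int, extra: dict[str, Any] | None = None
) -> dict[str, Any]:
    """Hypothetical helper: build one list item's attributes in a single place."""
    # The id, when enabled, is always the first attribute.
    attr: dict[str, Any] = {"id": f"{parent_id}_{index + 1}"} if ids else {}
    if extra:
        attr.update(extra)  # layer type-specific metadata on top; extra is not mutated
    return attr
```

Each branch could then call, for example, `item_attr(ids, this_id, i, item_name_attr)` and keep only its type-specific conversion logic.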

---

### 4. `escape_xml` early-return condition

The previous `_XML_ESCAPE_CHARS.intersection(s)` approach was shorter and more self-documenting than the expanded chain of `not in` checks. You can restore the named constant and keep the optimization:

```python
_XML_ESCAPE_CHARS = frozenset("&\"'<>")

def escape_xml(s: str | numbers.Number) -> str:
    if isinstance(s, str):
        if not _XML_ESCAPE_CHARS.intersection(s):
            return s
        ...
```

This keeps the fast path readable without adding conceptual complexity.
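
For reference, a self-contained version of that fast path could look like the sketch below. The replacement chain in the slow path and the handling of non-string input are assumptions filled in for illustration; only the `_XML_ESCAPE_CHARS.intersection(s)` check comes from the snippet above.

```python
_XML_ESCAPE_CHARS = frozenset("&\"'<>")


def escape_xml(s: object) -> str:
    """Escape XML special characters, returning clean strings unchanged."""
    if not isinstance(s, str):
        return str(s)  # numbers and similar values contain nothing to escape
    if not _XML_ESCAPE_CHARS.intersection(s):
        return s  # fast path: same object back, no new string allocated
    return (
        s.replace("&", "&amp;")  # ampersand first, so later entities survive
        .replace('"', "&quot;")
        .replace("'", "&apos;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )
```

The intersection builds a small temporary set per call, which is the trade-off the review accepts in exchange for a named constant and a single readable condition.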
</issue_to_address>

@vinitkumar vinitkumar merged commit 43543e8 into master Jun 9, 2026
48 checks passed