Fix dataframe_to_mds for non-nullable ArrayType columns by discobot · Pull Request #984 · mosaicml/streaming

discobot · 2026-06-13T13:50:07Z

Description of changes:

dataframe_to_mds rejects array columns declared with containsNull=False, as reported in the issue.

This PR implements the element-type lookup suggested there: map_spark_dtype now normalizes arrays to ArrayType(elementType) before the SPARK_TO_MDS lookup, mirroring the existing DecimalType normalization directly above. Both call sites are covered (user-defined-columns validation and automatic schema inference), and the change is safe because the MDS ndarray:* encodings don't depend on array nullability.
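
A minimal sketch of the lookup problem and the normalization described above. To keep it runnable without Spark, `IntegerType` and `ArrayType` here are hypothetical stand-in dataclasses, not the real `pyspark.sql.types` classes, and the two-entry `SPARK_TO_MDS` table is an illustrative subset. The body of `map_spark_dtype` is an assumption about the shape of the fix, not the code in this PR.

```python
from dataclasses import dataclass


# Stand-ins for pyspark.sql.types (assumption: like the PySpark classes,
# equality and hashing take containsNull into account).
@dataclass(frozen=True)
class IntegerType:
    pass


@dataclass(frozen=True)
class ArrayType:
    elementType: object
    containsNull: bool = True  # PySpark's ArrayType also defaults to True


# Illustrative subset of a Spark-to-MDS table, keyed on default (nullable) arrays.
SPARK_TO_MDS = {
    IntegerType(): 'int32',
    ArrayType(IntegerType()): 'ndarray:int32',
}


def map_spark_dtype(spark_data_type):
    """Map a Spark type to an MDS encoding, ignoring array nullability."""
    if isinstance(spark_data_type, ArrayType):
        # Rebuild the key from the element type only, so containsNull
        # no longer influences the dictionary lookup.
        spark_data_type = ArrayType(spark_data_type.elementType)
    mds_type = SPARK_TO_MDS.get(spark_data_type)
    if mds_type is None:
        raise ValueError(f'{spark_data_type} is not supported by dataframe_to_mds')
    return mds_type


non_nullable = ArrayType(IntegerType(), containsNull=False)

# The raw lookup misses because the key differs in containsNull.
assert non_nullable != ArrayType(IntegerType())
assert SPARK_TO_MDS.get(non_nullable) is None

# After normalization, both variants resolve to the same encoding.
assert map_spark_dtype(non_nullable) == 'ndarray:int32'
assert map_spark_dtype(ArrayType(IntegerType())) == 'ndarray:int32'
```

Normalizing the key inside the lookup function, rather than adding a second table entry per array type, keeps the table the same size and covers every caller of the function.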

Added regression tests: schema inference on a containsNull=False array column, plus end-to-end conversion with both auto-inferred and user-defined columns. They fail without the source change; with it, tests/base/converters/test_dataframe_to_mds.py passes locally (24 passed).

Issue #, if available:

Fixes #870 (dataframe_to_mds fails with non-nullable ArrayType)

Merge Checklist:

Put an x without spaces in the boxes that apply. If you are unsure about any checklist item, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

I have read the contributor guidelines
This is a documentation change or typo fix. If so, skip the rest of this checklist.
I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
I have added tests that prove my fix is effective or that my feature works (if appropriate).
I ran the tests locally to make sure they pass. (check out testing)
I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

Authored with the help of Claude Code.

PySpark ArrayType equality includes containsNull, so the SPARK_TO_MDS lookup missed array columns declared with containsNull=False (e.g. produced by concat/array over non-null columns) and raised 'is not supported by dataframe_to_mds'. Normalize the lookup to the element type only, mirroring the existing DecimalType handling. Adds regression tests covering schema inference and end-to-end conversion for non-nullable arrays.

discobot mentioned this pull request Jun 13, 2026

dataframe_to_mds fails with non-nullable ArrayType #870

Open

discobot wants to merge 1 commit into mosaicml:main from discobot:fix/870-arraytype-containsnull
