Make FHIRPath evaluation safe under Spark ANSI mode by johngrimes · Pull Request #2630 · aehrc/pathling

johngrimes · 2026-06-18T09:40:09Z

Spark 4 enables spark.sql.ansi.enabled by default, which makes invalid or out-of-range casts and numeric overflow raise errors rather than returning empty. In practice that meant a single non-conforming value in real-world FHIR data could abort an entire query. It also left open the concern raised in the issue: consumers who disable ANSI mode to run portable transform SQL on the same session might be unknowingly masking a Pathling incompatibility.

This makes the core fhirpath and encoders evaluation produce identical, correct results regardless of the setting. Failure-prone decimal and Quantity casts now route through a single safe-cast helper built on try_cast, the +/-/* operators use their try_* counterparts so overflow yields empty, and count/isEmpty/toBoolean no longer rely on size(null) (which Spark resolves differently depending on the flag). On failure the result is empty/NULL under both settings, matching FHIRPath and SQL on FHIR semantics, and no production code reads or sets the flag.

A new parameterised test suite evaluates the divergent cases under both settings and asserts identical results, and the test harness now asserts it starts under ANSI-on so the supported default cannot regress unnoticed. The full fhirpath and encoders suites pass with ANSI mode both on and off. The Spark configuration documentation now records that either setting is supported.

Closes #2629.

Spark 4 enables spark.sql.ansi.enabled by default, which causes arithmetic overflow, out-of-range casts, and size(null) to raise exceptions instead of returning NULL/lenient values. Switch the arithmetic operators (add, subtract, multiply) to their try_* counterparts and replace elementCast with elementTryCast (built on per-element transform+try_cast) so a non-conforming value yields NULL rather than aborting the query, independently of the session setting. Guard size(null) explicitly in count/isEmpty/toBoolean so both ANSI settings produce the same result. Apply the same safe-cast treatment to Quantity encoding and decimal normalisation. Add test infrastructure (AnsiTestSupport, AnsiHarnessAssertionTest, AnsiModeIndependenceTest, ColumnRepresentationSafeCastTest) that runs the core FHIRPath suite under both ANSI-on (the Spark 4 default) and ANSI-off, and assert that the harness itself starts under ANSI-on so that regression is caught early.

dbplyr 2.6.0 is incompatible with sparklyr 1.9.4. Its query-fields probe emits standard SQL with double-quoted identifiers (SELECT 0L AS "path"), which the Spark parser rejects, breaking every tbl_spark operation in the R tests. sparklyr declares no upper bound on dbplyr, so CI installed 2.6.0 within hours of its release and the R module started failing. Pin dbplyr to the last working release after installing the dev dependencies, so the version holds regardless of what the package cache restored. Remove once a sparklyr release supports dbplyr 2.6.0. (cherry picked from commit 220d94f)

Clarify that the size(null) guards in count/isEmpty/toBoolean make the result independent of both spark.sql.legacy.sizeOfNull and spark.sql.ansi.enabled, rather than implying ANSI-off alone diverges. Mirror the fail-loud ANSI assertion from the fhirpath harness into the encoders test support so the encoders suite cannot silently run under the wrong mode if a Spark default ever changes. Remove the unused sum() representation and its test: it was the last remaining raw arithmetic operator and had no production callers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

piotrszul

Looks good to me.
I have committed three minor changes.

Clarify that the size(null) guards in count/isEmpty/toBoolean make the
result independent of both spark.sql.legacy.sizeOfNull and
spark.sql.ansi.enabled, rather than implying ANSI-off alone diverges.

Mirror the fail-loud ANSI assertion from the fhirpath harness into the
encoders test support so the encoders suite cannot silently run under
the wrong mode if a Spark default ever changes.

Remove the unused sum() representation and its test: it was the last
remaining raw arithmetic operator and had no production callers.

sonarqubecloud · 2026-06-19T01:46:38Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

This reverts merge commit 2f95b3b, which merged the issue/2629 changes into main by mistake. Those changes target the 9.8.0 release and are being merged into release/9.8.0 instead. Because the merge commit remains in history, promoting release/9.8.0 to main later will not re-introduce these changes automatically. This revert must itself be reverted (or the changes re-applied) at that time.

The dbplyr pin that keeps the R API build working was removed when the accidental merge of #2630 was reverted. Without it, dependency installation pulls dbplyr 2.6.0, whose query-fields probe emits SQL with double-quoted identifiers that the Spark parser rejects, failing every tbl_spark operation in the R tests. Re-add the pin so the R module builds again.

The issue #2629 ANSI-mode safety changes were correctly re-merged into the release branch but were then silently removed when origin/main was merged in, because main still carried a revert of the original #2630 merge. The 3-way merge resolved the reverted side as a deletion, dropping the fix and its tests with no conflict and no test failure to signal the loss. Restore the reviewed fix files to their post-review state so that 9.8.0 ships safe under Spark 4's default ANSI mode: try_* arithmetic operators, elementTryCast for non-conforming casts, guarded size(null) in count/isEmpty/toBoolean, safe-cast Quantity encoding and decimal normalisation, and the dual-mode ANSI test harness.

johngrimes added this to Pathling Jun 18, 2026

github-project-automation Bot moved this to Backlog in Pathling Jun 18, 2026

johngrimes moved this from Backlog to In progress in Pathling Jun 18, 2026

johngrimes marked this pull request as ready for review June 18, 2026 20:41

johngrimes requested a review from piotrszul June 18, 2026 20:43

johngrimes mentioned this pull request Jun 18, 2026

Remove the dbplyr pin once sparklyr supports dbplyr 2.6.0 #2631

Open

piotrszul approved these changes Jun 19, 2026

View reviewed changes

johngrimes merged commit 2f95b3b into main Jun 19, 2026
5 checks passed

github-project-automation Bot moved this from In progress to Done in Pathling Jun 19, 2026

johngrimes deleted the issue/2629 branch June 19, 2026 01:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make FHIRPath evaluation safe under Spark ANSI mode#2630

Make FHIRPath evaluation safe under Spark ANSI mode#2630
johngrimes merged 3 commits into
mainfrom
issue/2629

johngrimes commented Jun 18, 2026

Uh oh!

piotrszul left a comment

Uh oh!

sonarqubecloud Bot commented Jun 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

johngrimes commented Jun 18, 2026

Uh oh!

piotrszul left a comment

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud Bot commented Jun 19, 2026

Quality Gate passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants