v9.8.0 #2632

Merged
johngrimes merged 26 commits into main from release/9.8.0 on Jun 20, 2026
Conversation

@johngrimes

Member

No description provided.

johngrimes and others added 23 commits June 12, 2026 13:00
Add a `pathling` console script to the Python package that surfaces the
library's functionality through a flat, verb-based command tree: data
conversion, SQL on FHIR views, FHIRPath evaluation, bulk export, and
terminology operations over code datasets. The CLI is built with Click and
Rich, resolves configuration from flags then an optional TOML file, and is
installable and runnable via `uv tool` / `uvx`.

Package public names are now exposed lazily so that importing the CLI does
not pull in PySpark, keeping `--help` and `--version` fast. Spark and JVM log
output is suppressed by default and re-enabled with `--verbose`, errors are
rendered as concise messages without Java stack traces, and worker Python is
pinned to the driver interpreter to avoid version mismatches.
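
The lazy exposure of public names described above can be done with a module-level `__getattr__` (PEP 562). The following is a minimal sketch of that technique, not the package's actual code: the `_LAZY_EXPORTS` table and the `make_lazy_getattr` helper are hypothetical, and standard-library modules stand in for the PySpark-backed ones.

```python
import importlib
from typing import Any, Callable, Dict, Tuple

# Hypothetical table of public name -> (module, attribute). Standard-library
# targets stand in for the heavy modules a real package would defer.
_LAZY_EXPORTS: Dict[str, Tuple[str, str]] = {
    "loads": ("json", "loads"),
    "OrderedDict": ("collections", "OrderedDict"),
}


def make_lazy_getattr(exports: Dict[str, Tuple[str, str]]) -> Callable[[str], Any]:
    """Build a PEP 562 style __getattr__ that imports a name on first access."""

    def __getattr__(name: str) -> Any:
        try:
            module_name, attribute = exports[name]
        except KeyError:
            raise AttributeError(f"module has no attribute {name!r}") from None
        # The import happens only here, so code paths that never touch the
        # name never pay for loading its module.
        module = importlib.import_module(module_name)
        return getattr(module, attribute)

    return __getattr__


# In a package's __init__.py this would be assigned to the name __getattr__.
lazy_getattr = make_lazy_getattr(_LAZY_EXPORTS)
```

With this shape, importing the package costs only the table definition; the deferred import runs when a caller first asks for one of the listed names.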
The run command executes user-supplied Python code from a script file,
stdin, or an inline -c option with spark and pathling variables already
bound, reproducing Python interpreter semantics for sys.argv, __main__,
__file__, sys.path, tracebacks, and SystemExit propagation. The console
command opens an interactive IPython session over the same namespace,
preceded by a banner naming the version and in-scope variables.

Both commands reuse the existing configuration resolution and quiet
startup behaviour, and validate usage errors before starting the Spark
session. IPython becomes a runtime dependency of the package.
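
The interpreter semantics that the run command reproduces for inline code can be illustrated roughly as follows. This is a sketch under stated assumptions, not the command's implementation: `run_inline` is a hypothetical helper, it covers only the `-c` case, and the `bound` dictionary stands in for the pre-bound `spark` and `pathling` variables.

```python
import sys
from typing import Any, Dict, List, Optional


def run_inline(code: str, argv: List[str], bound: Optional[Dict[str, Any]] = None) -> int:
    """Execute code roughly as `python -c code` would and return an exit status."""
    # User code sees itself as the main module, plus any pre-bound variables.
    namespace: Dict[str, Any] = {"__name__": "__main__"}
    namespace.update(bound or {})
    saved_argv = sys.argv
    # The interpreter sets argv[0] to "-c" when running inline code.
    sys.argv = ["-c", *argv]
    try:
        exec(compile(code, "<string>", "exec"), namespace)
    except SystemExit as exit_request:
        # Mirror the interpreter: None means success, an int is the status,
        # anything else is printed to stderr with status 1.
        if exit_request.code is None:
            return 0
        if isinstance(exit_request.code, int):
            return exit_request.code
        print(exit_request.code, file=sys.stderr)
        return 1
    finally:
        sys.argv = saved_argv
    return 0
```

Other exceptions are deliberately left to propagate in this sketch; the traceback handling mentioned above is outside its scope.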
Factor the shared --format/-o/--limit/--overwrite output options and the
output-format choice into a single decorator in the render module, so every
command declares its output surface once instead of repeating it. Extract the
Pathling and Delta Spark builder configuration into a single helper reused by
both the context factory and the CLI, removing the duplicated package wiring
that could otherwise drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The CLI now checks the current working directory for pathling.toml
before falling back to the user-level config.toml. Exactly one file
is loaded (no merging): explicit --config > project-local pathling.toml
> user-level config.toml > built-in defaults.

When a project-local file is discovered, a one-line dim notice is
printed naming the file (and the user-level file it overrides, if
present), so the active configuration is never a silent surprise.
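
The precedence order above can be sketched as a single function that returns at most one path. This is an illustration only: `resolve_config_file` is a hypothetical name, and the location of the user-level `config.toml` is passed in because the commit message does not state it.

```python
from pathlib import Path
from typing import Optional


def resolve_config_file(
    explicit: Optional[Path], cwd: Path, user_config: Path
) -> Optional[Path]:
    """Return the single config file to load, or None for built-in defaults."""
    if explicit is not None:
        # An explicit --config path always wins; nothing is merged with it.
        return explicit
    project_local = cwd / "pathling.toml"
    # exists() rather than is_file(): an unreadable candidate is still
    # selected, so the failure surfaces at read time instead of being skipped.
    if project_local.exists():
        return project_local
    if user_config.exists():
        return user_config
    return None
```

Returning a single path, rather than a list of layers, keeps the "exactly one file is loaded" rule explicit in the function's type.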
A config file that is a directory or otherwise unreadable now raises a
CliError naming the file, rather than surfacing a raw OSError or silently
falling back to another config source. This brings the read-failure
behaviour into line with the project-local discovery edge case, and
applies uniformly to explicit, project-local, and user-level files.

Users frequently need to tune Spark settings (executor memory,
shuffle partitions, cloud-storage connectors) without touching
Python code. This change introduces two complementary entry points:

- A `[spark]` table in pathling.toml for persistent per-project
  settings.
- A repeatable `--spark-conf KEY=VALUE` CLI flag for one-off
  overrides on a single invocation; the flag wins over the file.

All keys must begin with `spark.`; values may be strings, integers,
floats, or booleans (coerced to string). Secret references (`@file`)
are supported. The resolved settings are merged with Pathling's
managed defaults (jars.packages, sql.extensions, catalog) rather
than replacing them, so the library continues to function while
user tuning takes effect.

The managed Spark defaults are extracted into a new
`pathling._spark_defaults` module so that `context.py` and the
CLI merge logic share a single source of truth instead of
duplicating the coordinates.
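
A rough sketch of the layering described above: managed defaults first, then the `[spark]` table, then `--spark-conf` pairs. The function names are hypothetical. Two details are assumptions rather than facts from this PR: booleans are rendered as lowercase `true`/`false`, and a user key that collides with a managed key simply overrides it. Secret references (`@file`) are not handled here.

```python
from typing import Dict, Iterable, Mapping, Union

ConfValue = Union[str, int, float, bool]


def _to_spark_string(value: ConfValue) -> str:
    # Assumption: booleans become lowercase "true"/"false".
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def _require_spark_prefix(key: str) -> None:
    if not key.startswith("spark."):
        raise ValueError(f"Spark setting {key!r} must begin with 'spark.'")


def merge_spark_conf(
    managed: Mapping[str, str],
    file_table: Mapping[str, ConfValue],
    cli_pairs: Iterable[str],
) -> Dict[str, str]:
    """Layer file settings, then CLI overrides, on top of managed defaults."""
    merged = dict(managed)  # start from the defaults so they are never dropped
    for key, value in file_table.items():
        _require_spark_prefix(key)
        merged[key] = _to_spark_string(value)
    for pair in cli_pairs:
        key, separator, value = pair.partition("=")
        if not separator:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        _require_spark_prefix(key)
        merged[key] = value  # applied last, so the flag wins over the file
    return merged
```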
Writing Parquet via pandas round-trips data through Python objects,
which silently degrades nested struct/array columns to anonymous lists
and coerces nullable integers to floating point. Collecting to an Arrow
table with toArrow() and writing with pyarrow.parquet preserves column
types faithfully.

Arrow-based columnar transfer is also enabled on the Spark session
(spark.sql.execution.arrow.pyspark.enabled=true) so that toPandas()
calls for CSV/JSON/NDJSON output and interactive console use also
benefit. An explicit --spark-conf value still overrides this default.

Both data source mode and single-resource mode now expose the same
column name ("result") for the evaluated expression output. This
makes the two modes consistent so callers can process the output
uniformly without having to branch on which mode was used.

The companion schema comment and the CLI documentation are updated to
explain the symmetry explicitly.

Replace the driver-side collect-and-write approach (pandas/PyArrow) with
Spark's distributed writers for CSV, NDJSON, and Parquet. By default the
single-partition output is departitioned to the requested path via a
Hadoop FileSystem rename (same-filesystem, no cross-device copy); pass
--no-departition to keep Spark's native directory of part files.

Removes the JSON-array output format (--format json / .json extension),
which was a driver-side-only artefact; a helpful error points users at
NDJSON instead. The row-count confirmation is replaced by a format-and-
path message because Spark's write path returns no row count, and
obtaining one would re-execute terminology UDFs.

Departitioning logic lives in a new departition module and operates
uniformly over local, S3, HDFS, and other Hadoop-compatible destinations.
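
The departitioning step above amounts to promoting the one part file to the requested path and discarding the directory Spark wrote. The sketch below shows the idea on the local filesystem only, using `os.replace` as a stand-in for the Hadoop FileSystem rename; the `departition` function here is hypothetical and is not the code in the new module.

```python
import os
import shutil
from pathlib import Path


def departition(output_dir: Path, target: Path) -> None:
    """Move the single part file in output_dir to target, then remove the directory."""
    # Spark names data files part-*; markers such as _SUCCESS do not match.
    part_files = sorted(output_dir.glob("part-*"))
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file in {output_dir}, found {len(part_files)}"
        )
    # A rename within one filesystem moves the file without copying its bytes.
    os.replace(part_files[0], target)
    shutil.rmtree(output_dir)
```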
The constant was already unreferenced and served no purpose. Remove it
in favour of the format sets that are actually used.

Replace the terse multi-command examples with a single, concrete
end-to-end script that shows how to read NDJSON, project a tabular
view with view(), and summarise the result via Spark SQL. The \b
marker preserves the indented block formatting in the Click help
output. The same example is added to the CLI reference page so
users can see it in the documentation site as well.

Previously the target code had to come from a dataset column
(--other-code-column), requiring users to add a constant column when
testing against a single known concept.

This change mirrors the existing fixed-vs-column pattern used for the
system URI: a new --other-code flag accepts a literal code applied to
every row, while --other-code-column remains available for per-row
comparisons. Exactly one of the two must be supplied; the same mutual-
exclusion rule is applied to --other-system / --other-system-column.
Validation runs before the Spark session is created so usage errors
fail fast.
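
The "exactly one of the two" rule above can be checked with plain Python before any Spark session exists. A minimal sketch: `UsageError` and `require_exactly_one` are hypothetical names, and the real CLI presumably reports the problem through its own usage-error mechanism.

```python
from typing import Optional


class UsageError(ValueError):
    """Raised for invalid flag combinations, before any Spark work starts."""


def require_exactly_one(
    literal_flag: str,
    literal_value: Optional[str],
    column_flag: str,
    column_value: Optional[str],
) -> None:
    """Ensure exactly one of a literal flag and its column counterpart is set."""
    supplied = [value is not None for value in (literal_value, column_value)]
    if all(supplied):
        raise UsageError(f"{literal_flag} and {column_flag} are mutually exclusive")
    if not any(supplied):
        raise UsageError(f"one of {literal_flag} or {column_flag} is required")
```

The same helper would cover both pairs: `--other-code` / `--other-code-column` and `--other-system` / `--other-system-column`.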
Address a cluster of user-story fixes in the Python CLI:

- Auth vs. non-auth failure distinction (FR-001): export only claims
  authentication failed when the exception text looks like an auth
  failure; connection/timeout/5xx errors are surfaced as their true
  root cause.
- Explicit --config must exist (FR-002): a missing explicit config
  path is now a usage error rather than a silent fallback to another
  file's credentials.
- Avoid re-reading config in export (FR-003): CliConfig now carries the
  parsed [bulk-auth] table so export resolves credentials from the
  already-loaded config.
- Auth input without client ID is an error (FR-004): partial auth input
  (token endpoint or secret but no client ID) raises a usage error
  instead of silently falling through to an unauthenticated run.
- Partial terminology auth warns the user (FR-005).
- Connection errors name the configured server URL (FR-011).
- Bare "parse error" no longer misclassified as FHIRPath (FR-012).
- Summary tables no longer trigger per-type row-count Spark jobs
  (FR-013); resource types and output path are reported instead.
- Bundle detection reads only a leading prefix of each file, not the
  full JSON (FR-014).
- convert, view, and fhirpath pass the known resource type to the
  Bundles reader to skip driver-side discovery (FR-015).
- Bundle discovery runs under the progress spinner in convert (FR-016).
- Quiet log4j2 config ships as package data instead of a per-run
  NamedTemporaryFile (FR-017).
- Coding column struct derives field names from the library's own
  schema so the CLI cannot drift from it (FR-018).
- Package __getattr__ now also resolves lazy submodule access.
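
As an illustration of the prefix-only Bundle detection in FR-014, the sketch below reads a bounded number of bytes and looks for a `resourceType` of `Bundle`. The heuristic, the 4096-byte default, and the `looks_like_bundle` name are all assumptions made for this example; the PR conversation does not show the actual check.

```python
import re
from pathlib import Path

# Assumed marker: a Bundle announces its type near the start of the file.
_BUNDLE_MARKER = re.compile(rb'"resourceType"\s*:\s*"Bundle"')


def looks_like_bundle(path: Path, prefix_bytes: int = 4096) -> bool:
    """Guess whether a JSON file is a FHIR Bundle from its leading bytes only."""
    with path.open("rb") as handle:
        # Read a bounded prefix instead of parsing the whole document.
        prefix = handle.read(prefix_bytes)
    return _BUNDLE_MARKER.search(prefix) is not None
```

A check like this trades accuracy for speed: a Bundle whose `resourceType` key appears after the prefix would be missed, which is why it is only a heuristic.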
dbplyr 2.6.0 is incompatible with sparklyr 1.9.4. Its query-fields probe
emits standard SQL with double-quoted identifiers (SELECT 0L AS "path"),
which the Spark parser rejects, breaking every tbl_spark operation in the
R tests. sparklyr declares no upper bound on dbplyr, so CI installed 2.6.0
within hours of its release and the R module started failing.

Pin dbplyr to the last working release after installing the dev
dependencies, so the version holds regardless of what the package cache
restored. Remove once a sparklyr release supports dbplyr 2.6.0.

Bump the org.hl7.fhir.* dependency overrides to 6.9.10 to pick up the fix
for CVE-2026-55471 (XXE in XsltUtilities.saxonTransform via an unhardened
Saxon TransformerFactory).

Suppress newly disclosed Netty CVEs under the existing rationale that Netty
is a provided dependency and is not bundled into the distribution.

The mycila license-maven-plugin scanned the entire project base directory,
so local agent worktrees under .claude (and its .opencode symlink) caused the
license header check to fail. Exclude these tooling directories, consistent
with the existing .github and .mvn exclusions.

Brings the fix for safe FHIRPath evaluation under Spark ANSI mode
(issue #2629) into the 9.8.0 release branch, along with the related
review refactor and the dbplyr 2.5.2 pin.

@johngrimes self-assigned this on Jun 19, 2026
@johngrimes added the release label (Pull request that represents a new release) on Jun 19, 2026
@github-project-automation (bot) moved this to Backlog in Pathling on Jun 19, 2026
@johngrimes moved this from Backlog to In progress in Pathling on Jun 19, 2026

The dbplyr pin that keeps the R API build working was removed when the
accidental merge of #2630 was reverted. Without it, dependency
installation pulls dbplyr 2.6.0, whose query-fields probe emits SQL with
double-quoted identifiers that the Spark parser rejects, failing every
tbl_spark operation in the R tests. Re-add the pin so the R module builds
again.

CVE-2026-42578 affects netty-handler-proxy, which Pathling receives as a
provided transitive dependency via Spark and does not bundle into its
distribution. It belongs to the same batch of Netty CVEs already
suppressed for the 9.8.0 release but was not included at the time.
Suppressing it clears the only new security finding that was failing the
SonarCloud quality gate.

The issue #2629 ANSI-mode safety changes were correctly re-merged into the
release branch but were then silently removed when origin/main was merged
in, because main still carried a revert of the original #2630 merge. The
3-way merge resolved the reverted side as a deletion, dropping the fix and
its tests with no conflict and no test failure to signal the loss.

Restore the reviewed fix files to their post-review state so that 9.8.0
ships safe under Spark 4's default ANSI mode: try_* arithmetic operators,
elementTryCast for non-conforming casts, guarded size(null) in
count/isEmpty/toBoolean, safe-cast Quantity encoding and decimal
normalisation, and the dual-mode ANSI test harness.

@johngrimes deployed to maven-central on June 19, 2026 06:31 with GitHub Actions
@johngrimes merged commit 7da930d into main on Jun 20, 2026
10 checks passed
@johngrimes deleted the release/9.8.0 branch on June 20, 2026 01:28
@github-project-automation (bot) moved this from In progress to Done in Pathling on Jun 20, 2026
Labels

release Pull request that represents a new release

Projects

Status: Done

2 participants