v9.8.0 #2632
Merged
Add a `pathling` console script to the Python package that surfaces the library's functionality through a flat, verb-based command tree: data conversion, SQL on FHIR views, FHIRPath evaluation, bulk export, and terminology operations over code datasets. The CLI is built with Click and Rich, resolves configuration from flags then an optional TOML file, and is installable and runnable via `uv tool` / `uvx`. Package public names are now exposed lazily so that importing the CLI does not pull in PySpark, keeping `--help` and `--version` fast. Spark and JVM log output is suppressed by default and re-enabled with `--verbose`, errors are rendered as concise messages without Java stack traces, and worker Python is pinned to the driver interpreter to avoid version mismatches.
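The lazy exposure of public names described above can be sketched with a module-level `__getattr__` (PEP 562). This is a minimal illustration, not the package's actual code: the `make_lazy_package` helper, the `demo_pkg` name, and the export table are hypothetical, and standard-library modules stand in for the heavy PySpark-backed modules.

```python
import importlib
import types

# Hypothetical export table: public name -> (module, attribute).
# Standard-library modules stand in for heavy imports here.
_LAZY_EXPORTS = {
    "Fraction": ("fractions", "Fraction"),
    "JSONDecoder": ("json", "JSONDecoder"),
}


def make_lazy_package(name, exports):
    """Build a module object whose public names are imported on first access."""
    pkg = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails.
        try:
            module_name, member = exports[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(module_name), member)
        setattr(pkg, attr, value)  # cache so later lookups skip this hook
        return value

    pkg.__getattr__ = __getattr__
    pkg.__all__ = sorted(exports)
    return pkg


demo = make_lazy_package("demo_pkg", _LAZY_EXPORTS)
```

In a real package the `__getattr__` function would simply be defined at the top level of `__init__.py`; the factory is used here only so the sketch runs on its own.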
The run command executes user-supplied Python code from a script file, stdin, or an inline -c option with spark and pathling variables already bound, reproducing Python interpreter semantics for sys.argv, __main__, __file__, sys.path, tracebacks, and SystemExit propagation. The console command opens an interactive IPython session over the same namespace, preceded by a banner naming the version and in-scope variables. Both commands reuse the existing configuration resolution and quiet startup behaviour, and validate usage errors before starting the Spark session. IPython becomes a runtime dependency of the package.
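A rough sketch of how user code might be executed with pre-bound variables while mimicking interpreter semantics. The `run_source` function is hypothetical, and converting `SystemExit` into a return code is a choice made for this sketch; the actual command may propagate it differently. Traceback trimming and `sys.path` handling are not covered.

```python
import sys


def run_source(source, filename, argv, bindings):
    """Execute source as if it were __main__, with extra names pre-bound.

    Returns the exit code the script requested (0 when it finishes normally).
    """
    namespace = {"__name__": "__main__", "__file__": filename}
    namespace.update(bindings)  # e.g. {"spark": ..., "pathling": ...}
    saved_argv = sys.argv
    sys.argv = list(argv)
    try:
        # Compiling with the real filename keeps tracebacks pointing at it.
        exec(compile(source, filename, "exec"), namespace)
    except SystemExit as exc:
        if exc.code is None:
            return 0
        if isinstance(exc.code, int):
            return exc.code
        # Non-integer codes are printed and treated as failure,
        # which is what the interpreter does.
        print(exc.code, file=sys.stderr)
        return 1
    finally:
        sys.argv = saved_argv
    return 0
```

For example, `run_source("print(spark)", "job.py", ["job.py"], {"spark": session})` would run the snippet with `spark` already in scope, where `session` is whatever object the caller has prepared.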
Factor the shared --format/-o/--limit/--overwrite output options and the output-format choice into a single decorator in the render module, so every command declares its output surface once instead of repeating it. Extract the Pathling and Delta Spark builder configuration into a single helper reused by both the context factory and the CLI, removing the duplicated package wiring that could otherwise drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The CLI now checks the current working directory for pathling.toml before falling back to the user-level config.toml. Exactly one file is loaded (no merging): explicit --config > project-local pathling.toml > user-level config.toml > built-in defaults. When a project-local file is discovered, a one-line dim notice is printed naming the file (and the user-level file it overrides, if present), so the active configuration is never a silent surprise.
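The precedence order above can be sketched as a small pure function. The name `resolve_config_path` and its parameters are hypothetical, and the use of `exists()` (so that an unreadable candidate is still selected and fails later at read time) is an assumption of this sketch.

```python
from pathlib import Path
from typing import Optional


def resolve_config_path(
    explicit: Optional[Path], cwd: Path, user_config: Path
) -> Optional[Path]:
    """Pick exactly one config file; None means built-in defaults apply."""
    if explicit is not None:
        # An explicit --config always wins; the caller validates it exists.
        return explicit
    project_local = cwd / "pathling.toml"
    if project_local.exists():
        return project_local
    if user_config.exists():
        return user_config
    return None
```

Because only one path is ever returned, there is no merging step: the caller loads that single file or falls back to defaults.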
A config file that is a directory or otherwise unreadable now raises a CliError naming the file, rather than surfacing a raw OSError or silently falling back to another config source. This brings the read-failure behaviour into line with the project-local discovery edge case, and applies uniformly to explicit, project-local, and user-level files.
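A minimal sketch of the read-failure handling, assuming a `CliError` exception type; the class defined here is a stand-in, and `read_config_text` is a hypothetical helper name. TOML parsing is left out.

```python
from pathlib import Path


class CliError(Exception):
    """Stand-in for the CLI's user-facing error type."""


def read_config_text(path: Path) -> str:
    """Read a config file, turning any OS-level failure into a CliError."""
    try:
        return path.read_text(encoding="utf-8")
    except OSError as exc:
        # Covers directories, permission problems, and files that vanish
        # between discovery and reading.
        reason = exc.strerror or str(exc)
        raise CliError(f"Cannot read config file {path}: {reason}") from exc
```

Applying the same helper to explicit, project-local, and user-level paths is what makes the behaviour uniform across the three sources.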
Users frequently need to tune Spark settings (executor memory, shuffle partitions, cloud-storage connectors) without touching Python code. This change introduces two complementary entry points:

- A `[spark]` table in pathling.toml for persistent per-project settings.
- A repeatable `--spark-conf KEY=VALUE` CLI flag for one-off overrides on a single invocation; the flag wins over the file.

All keys must begin with `spark.`; values may be strings, integers, floats, or booleans (coerced to string). Secret references (`@file`) are supported. The resolved settings are merged with Pathling's managed defaults (jars.packages, sql.extensions, catalog) rather than replacing them, so the library continues to function while user tuning takes effect. The managed Spark defaults are extracted into a new `pathling._spark_defaults` module so that `context.py` and the CLI merge logic share a single source of truth instead of duplicating the coordinates.
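The layering of defaults, file settings, and flag overrides might look roughly like the following. All function names and the placeholder default values are hypothetical; the lowercase rendering of booleans is an assumption; and the sketch does not cover `@file` secret references or any special combining of list-valued keys such as the packages setting.

```python
from typing import Dict, Iterable, Mapping, Tuple, Union

ConfValue = Union[str, int, float, bool]

# Placeholder values; the real managed defaults live in the library.
MANAGED_DEFAULTS: Dict[str, str] = {
    "spark.jars.packages": "<managed by pathling>",
    "spark.sql.extensions": "<managed by pathling>",
}


def _check_key(key: str) -> None:
    if not key.startswith("spark."):
        raise ValueError(f"Spark setting {key!r} must begin with 'spark.'")


def _coerce(value: ConfValue) -> str:
    # bool is checked first because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (str, int, float)):
        return str(value)
    raise ValueError(f"Unsupported Spark setting value: {value!r}")


def parse_flag(item: str) -> Tuple[str, str]:
    """Split one --spark-conf argument of the form KEY=VALUE."""
    key, sep, value = item.partition("=")
    if not sep:
        raise ValueError(f"Expected KEY=VALUE, got {item!r}")
    return key.strip(), value


def resolve_spark_conf(
    file_table: Mapping[str, ConfValue],
    flags: Iterable[str],
    defaults: Mapping[str, str] = MANAGED_DEFAULTS,
) -> Dict[str, str]:
    """Layer settings: managed defaults, then the file, then CLI flags."""
    merged = dict(defaults)
    for key, value in file_table.items():
        _check_key(key)
        merged[key] = _coerce(value)
    for item in flags:
        key, value = parse_flag(item)
        _check_key(key)
        merged[key] = value  # flags are applied last, so they win
    return merged
```

Starting from a copy of the defaults is what keeps the managed keys present when the user only tunes unrelated settings.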
Writing Parquet via pandas round-trips data through Python objects, which silently degrades nested struct/array columns to anonymous lists and coerces nullable integers to floating point. Collecting to an Arrow table with toArrow() and writing with pyarrow.parquet preserves column types faithfully. Arrow-based columnar transfer is also enabled on the Spark session (spark.sql.execution.arrow.pyspark.enabled=true) so that toPandas() calls for CSV/JSON/NDJSON output and interactive console use also benefit. An explicit --spark-conf value still overrides this default.
Both data source mode and single-resource mode now expose the same
column name ("result") for the evaluated expression output. This
makes the two modes consistent so callers can process the output
uniformly without having to branch on which mode was used.
The companion schema comment and the CLI documentation are updated to
explain the symmetry explicitly.
Replace the driver-side collect-and-write approach (pandas/PyArrow) with Spark's distributed writers for CSV, NDJSON, and Parquet. By default the single-partition output is departitioned to the requested path via a Hadoop FileSystem rename (same-filesystem, no cross-device copy); pass --no-departition to keep Spark's native directory of part files. Removes the JSON-array output format (--format json / .json extension), which was a driver-side-only artefact; a helpful error points users at NDJSON instead. The row-count confirmation is replaced by a format-and-path message because Spark's write path returns no row count, and obtaining one would re-execute terminology UDFs. Departitioning logic lives in a new departition module and operates uniformly over local, S3, HDFS, and other Hadoop-compatible destinations.
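To illustrate the departitioning idea only, here is a local-filesystem analogue using the standard library. The actual change uses the Hadoop FileSystem API, which this sketch does not; `departition_local` is a hypothetical name, and the assumption that the data file is the single entry whose name starts with `part-` reflects Spark's usual output layout.

```python
import os
import shutil
from pathlib import Path


def departition_local(spark_dir: Path, target: Path) -> None:
    """Move the single part file out of a Spark output directory.

    Local analogue of a same-filesystem rename; checksum and marker
    files left in the directory are discarded along with it.
    """
    parts = sorted(p for p in spark_dir.iterdir() if p.name.startswith("part-"))
    if len(parts) != 1:
        raise ValueError(
            f"Expected exactly one part file in {spark_dir}, found {len(parts)}"
        )
    # os.replace is a rename, so no data is copied when both paths
    # are on the same filesystem.
    os.replace(parts[0], target)
    shutil.rmtree(spark_dir)
```

The sketch assumes the writer produced its output in a separate staging directory; writing directly to the requested path and then renaming would need an extra temporary name.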
The constant was already unreferenced and served no purpose. Remove it in favour of the format sets that are actually used.
Replace the terse multi-command examples with a single, concrete end-to-end script that shows how to read NDJSON, project a tabular view with view(), and summarise the result via Spark SQL. The \b marker preserves the indented block formatting in the Click help output. The same example is added to the CLI reference page so users can see it in the documentation site as well.
Previously the target code had to come from a dataset column (--other-code-column), requiring users to add a constant column when testing against a single known concept. This change mirrors the existing fixed-vs-column pattern used for the system URI: a new --other-code flag accepts a literal code applied to every row, while --other-code-column remains available for per-row comparisons. Exactly one of the two must be supplied; the same mutual-exclusion rule is applied to --other-system / --other-system-column. Validation runs before the Spark session is created so usage errors fail fast.
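The exactly-one rule can be expressed as a small check that runs before any session is created. `UsageError` and `require_exactly_one` are hypothetical names for this sketch, not the CLI's actual helpers.

```python
from typing import Optional, Tuple


class UsageError(Exception):
    """Stand-in for the CLI's usage error type."""


def require_exactly_one(
    literal: Optional[str],
    column: Optional[str],
    literal_flag: str,
    column_flag: str,
) -> Tuple[str, str]:
    """Return ("literal", value) or ("column", name); reject both or neither."""
    if (literal is None) == (column is None):
        raise UsageError(
            f"Specify exactly one of {literal_flag} or {column_flag}."
        )
    if literal is not None:
        return ("literal", literal)
    return ("column", column)
```

The same function can serve both pairs, for example `require_exactly_one(code, code_column, "--other-code", "--other-code-column")`.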
Address a cluster of user-story fixes in the Python CLI:

- Auth vs. non-auth failure distinction (FR-001): export only claims authentication failed when the exception text looks like an auth failure; connection/timeout/5xx errors are surfaced as their true root cause.
- Explicit --config must exist (FR-002): a missing explicit config path is now a usage error rather than a silent fallback to another file's credentials.
- Avoid re-reading config in export (FR-003): CliConfig now carries the parsed [bulk-auth] table so export resolves credentials from the already-loaded config.
- Auth input without client ID is an error (FR-004): partial auth input (token endpoint or secret but no client ID) raises a usage error instead of silently falling through to an unauthenticated run.
- Partial terminology auth warns the user (FR-005).
- Connection errors name the configured server URL (FR-011).
- Bare "parse error" no longer misclassified as FHIRPath (FR-012).
- Summary tables no longer trigger per-type row-count Spark jobs (FR-013); resource types and output path are reported instead.
- Bundle detection reads only a leading prefix of each file, not the full JSON (FR-014).
- convert, view, and fhirpath pass the known resource type to the Bundles reader to skip driver-side discovery (FR-015).
- Bundle discovery runs under the progress spinner in convert (FR-016).
- Quiet log4j2 config ships as package data instead of a per-run NamedTemporaryFile (FR-017).
- Coding column struct derives field names from the library's own schema so the CLI cannot drift from it (FR-018).
- Package __getattr__ now also resolves lazy submodule access.
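As an illustration of the prefix-based Bundle detection (FR-014), a heuristic along these lines avoids parsing whole files. The function name, the 4096-byte prefix size, and the regular expression are assumptions of this sketch rather than the CLI's actual logic.

```python
import re
from pathlib import Path

# Matches "resourceType": "Bundle" with optional whitespace around the colon.
_BUNDLE_MARKER = re.compile(rb'"resourceType"\s*:\s*"Bundle"')


def looks_like_bundle(path: Path, prefix_bytes: int = 4096) -> bool:
    """Guess whether a JSON file is a FHIR Bundle from its leading bytes.

    Heuristic only: a Bundle whose resourceType key appears after the
    prefix is missed, and a nested Bundle reference inside the prefix
    of another resource would be a false positive.
    """
    with path.open("rb") as handle:
        return _BUNDLE_MARKER.search(handle.read(prefix_bytes)) is not None
```

Reading a bounded prefix keeps the cost per file constant regardless of how large the Bundle is.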
dbplyr 2.6.0 is incompatible with sparklyr 1.9.4. Its query-fields probe emits standard SQL with double-quoted identifiers (SELECT 0L AS "path"), which the Spark parser rejects, breaking every tbl_spark operation in the R tests. sparklyr declares no upper bound on dbplyr, so CI installed 2.6.0 within hours of its release and the R module started failing. Pin dbplyr to the last working release after installing the dev dependencies, so the version holds regardless of what the package cache restored. Remove once a sparklyr release supports dbplyr 2.6.0.
Bump the org.hl7.fhir.* dependency overrides to 6.9.10 to pick up the fix for CVE-2026-55471 (XXE in XsltUtilities.saxonTransform via an unhardened Saxon TransformerFactory). Suppress newly disclosed Netty CVEs under the existing rationale that Netty is a provided dependency and is not bundled into the distribution.
The mycila license-maven-plugin scanned the entire project base directory, so local agent worktrees under .claude (and its .opencode symlink) caused the license header check to fail. Exclude these tooling directories, consistent with the existing .github and .mvn exclusions.
Brings the fix for safe FHIRPath evaluation under Spark ANSI mode (issue #2629) into the 9.8.0 release branch, along with the related review refactor and the dbplyr 2.5.2 pin.
The dbplyr pin that keeps the R API build working was removed when the accidental merge of #2630 was reverted. Without it, dependency installation pulls dbplyr 2.6.0, whose query-fields probe emits SQL with double-quoted identifiers that the Spark parser rejects, failing every tbl_spark operation in the R tests. Re-add the pin so the R module builds again.
CVE-2026-42578 affects netty-handler-proxy, which Pathling receives as a provided transitive dependency via Spark and does not bundle into its distribution. It belongs to the same batch of Netty CVEs already suppressed for the 9.8.0 release but was not included at the time. Suppressing it clears the only new security finding that was failing the SonarCloud quality gate.
The issue #2629 ANSI-mode safety changes were correctly re-merged into the release branch but were then silently removed when origin/main was merged in, because main still carried a revert of the original #2630 merge. The 3-way merge resolved the reverted side as a deletion, dropping the fix and its tests with no conflict and no test failure to signal the loss. Restore the reviewed fix files to their post-review state so that 9.8.0 ships safe under Spark 4's default ANSI mode: try_* arithmetic operators, elementTryCast for non-conforming casts, guarded size(null) in count/isEmpty/toBoolean, safe-cast Quantity encoding and decimal normalisation, and the dual-mode ANSI test harness.
No description provided.