Add command line interface by johngrimes · Pull Request #2628 · aehrc/pathling

johngrimes · 2026-06-12T04:32:20Z

This adds a pathling console script to the Python package, surfacing the library's functionality through a flat, verb-based command tree. It covers data conversion, SQL on FHIR views, FHIRPath evaluation, bulk export, and terminology operations over code datasets. The CLI is built with Click and Rich, resolves configuration from flags then an optional TOML file, and is installable and runnable via uv tool / uvx so its version always matches the library.

The package's public names are now exposed lazily, so importing the CLI doesn't pull in PySpark and --help/--version stay fast. Spark and JVM logging is suppressed by default (and re-enabled with --verbose), errors are rendered as concise messages without Java stack traces, data goes to stdout while progress and errors go to stderr, and the Spark worker interpreter is pinned to the driver to avoid version mismatches.

Commands: convert, view, fhirpath, export, and the terminology operations member-of, translate, subsumes, subsumed-by, display, property-of, and designation.

Tested with unit tests for the parsing, config, rendering, and error-mapping logic, per-command integration tests through Click's runner against the existing Spark and mock-server fixtures, and a subprocess smoke test of the installed entry point. Documentation is added under site/docs/libraries and the Python README.

Add a `pathling` console script to the Python package that surfaces the library's functionality through a flat, verb-based command tree: data conversion, SQL on FHIR views, FHIRPath evaluation, bulk export, and terminology operations over code datasets. The CLI is built with Click and Rich, resolves configuration from flags then an optional TOML file, and is installable and runnable via `uv tool` / `uvx`. Package public names are now exposed lazily so that importing the CLI does not pull in PySpark, keeping `--help` and `--version` fast. Spark and JVM log output is suppressed by default and re-enabled with `--verbose`, errors are rendered as concise messages without Java stack traces, and worker Python is pinned to the driver interpreter to avoid version mismatches.

The run command executes user-supplied Python code from a script file, stdin, or an inline -c option with spark and pathling variables already bound, reproducing Python interpreter semantics for sys.argv, __main__, __file__, sys.path, tracebacks, and SystemExit propagation. The console command opens an interactive IPython session over the same namespace, preceded by a banner naming the version and in-scope variables. Both commands reuse the existing configuration resolution and quiet startup behaviour, and validate usage errors before starting the Spark session. IPython becomes a runtime dependency of the package.

Factor the shared --format/-o/--limit/--overwrite output options and the output-format choice into a single decorator in the render module, so every command declares its output surface once instead of repeating it. Extract the Pathling and Delta Spark builder configuration into a single helper reused by both the context factory and the CLI, removing the duplicated package wiring that could otherwise drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

piotrszul

Reviewed at 7a105921fd. Consolidates three independent passes (two manual + one fresh reviewer with no shared context) plus empirical verification where noted (✅). The CLI is well-structured — clean module separation, fail-fast validation before Spark cold-start, careful stdout/stderr discipline, and genuinely fast --help/--version via the lazy import shim. The items below are prioritized; most are pre-merge fixes, with two enhancement requests at the end.

🔴 Should fix before merge

1. Broad except mislabels every export failure as an auth failure — cli/export.py (try/except around pc.read.bulk(...))
When auth is configured, the catch-all re-raises any failure (server down, timeout, HTTP 500 during kickoff/polling/download) as "Authentication failed … check the client ID, credential, and token endpoint." Misdiagnoses network/server errors as bad credentials. → Narrow the catch to auth/401, or include the unwrapped root cause without asserting it's an auth problem.

2. Mistyped --config <missing> silently falls back to the default config's [bulk-auth] — cli/export.py + cli/config.py
resolve_config nulls config_path when the path doesn't exist; export then does config.config_path or default_config_path(), so --config /typo.toml makes export authenticate with ~/.config/pathling/config.toml's credentials while the rest of the CLI read nothing from the explicit file. Silent, security-relevant. → Error if an explicit --config path doesn't exist; reuse the already-resolved config.

3. Incomplete bulk-auth runs the export unauthenticated ✅ — cli/config.py resolve_bulk_auth
Supplying --client-secret/--private-key-jwk + --token-endpoint but omitting --client-id returns None, so export runs with no auth and fails later with an opaque 401 instead of "you forgot --client-id." → If any bulk-auth flag/secret is given but client-id is missing, raise a usage error.

4. Lazy exports break pathling.<submodule>.X attribute access ✅ — pathling/__init__.py
The PEP 562 shim dropped the eager from .functions import … / from .udfs import …, so import pathling; pathling.udfs.member_of now raises AttributeError (it "heals" only after a lazy name from that submodule, or import pathling.udfs, runs). The documented top-level API (from pathling import member_of) is unaffected, so blast radius depends on downstream submodule-attribute usage — but it's a silent, order-dependent regression of a previously-valid pattern. → In __getattr__, fall back to importing a known submodule and cache it; add those names to __dir__.

🟠 Worth fixing

5. Test suite isn't portable to Python ≥ 3.10 ✅ — tests/cli/conftest.py
CliRunner(mix_stderr=False): the kwarg was removed in Click 8.2.0 (which also dropped Python 3.9). The lockfile resolves Click 8.1.8 for <3.10 and 8.3.1 for ≥3.10. CI stays green because the build venv is pinned to 3.9 (pom.xml -p 3.9, .python-version), but the suite errors at fixture setup on any modern interpreter. Runtime is unaffected (test-only). → Feature-detect: pass mix_stderr=False only if the parameter exists in CliRunner.__init__.

6. friendly_message's connection branch is dead code — cli/main.py / cli/errors.py
The central handler never passes server_url, so "Could not reach the server at " never fires for convert/view/fhirpath; they fall through to the generic "--verbose" message. → Thread the configured tx_server in, or remove the branch.

7. Post-write summaries re-scan output with per-type count() — cli/export.py, cli/convert.py _print_summary
After the read+write pass, the summary calls .read(rt).count() per resource type — N extra full Spark jobs just to print row counts. → Derive counts from the write, or make the summary opt-in.

8. Bundles are parsed twice — cli/io.py
discover_bundle_resource_types json.loads every bundle on the driver (single-threaded) to enumerate types, then pc.read.bundles reparses them in Spark; _looks_like_bundle also full-parses the first JSON file (the XML branch reads only a prefix), and discovery runs outside the progress spinner. → Read a prefix in _looks_like_bundle; stream discovery / run it under the spinner / add --type to skip it; longer-term let the library reader discover types.

9. IPython/rich are now hard runtime deps of the whole library — pyproject.toml
ipython>=8.18.1 (CLI-only, used by console) is pulled into every pip install pathling, even for API-only users. → Move CLI deps to an optional extra (pathling[cli]) and import-guard console.

10. Quiet log4j2 temp file is never removed — cli/session.py _write_quiet_log4j2
NamedTemporaryFile(delete=False) with no cleanup; every non-verbose run leaks one pathling-cli-log4j2-*.properties into the temp dir. → atexit-unlink, or (since the content is constant) write once to a cache path / ship as package data.

11. File output fully materialises on the driver; "stream" wording is inaccurate — cli/render.py
stdout csv/json/ndjson use df.collect() and -o uses df.toPandas() — both buffer the whole result in driver memory, but the docstring and docs say csv/json/ndjson "stream the full result." (The --limit-applies-to-table-only behaviour is by design and correctly documented — not a bug.) → Fix the "stream" wording; optionally use toLocalIterator() for incremental stdout. (See enhancement E-2 below for the scalable-output side of this.)

🔵 Minor / nits

Terminology _validate_columns runs after Spark cold-start (a typo'd --code-column only errors post-startup) — cli/terminology.py.
_categorise keys "fhirpath" off the broad substring "parse error" — cli/errors.py.
Partial terminology auth is silently disabled (no warning) unless both client_id + token_endpoint are set — cli/config.py.
Hand-built Coding struct duplicates the library's Coding schema (drift risk) — cli/terminology.py.
Unrelated README/Databricks-doc markdown reformatting inflates the diff.

🟢 Enhancement requests (not blockers)

E-1. Make the Spark session configurable via the config file (esp. driver memory). There's currently no way to set spark.driver.memory or other Spark options. Suggest a [spark] table in the existing TOML config, threaded into _build_spark_session alongside tx-server/[*-auth], e.g.:

[spark]
"spark.driver.memory" = "8g"

with optional --conf key=value overrides. Note: spark.driver.memory must be applied before the driver JVM launches, so verify the mechanism takes effect in local mode.

E-2. view: support both single-file and partitioned Spark output. view -o is currently always single-file (toPandas()); offer a --single-file/--partitioned (or --coalesce N) choice that routes partitioned writes through the Spark writer for the formats that support it (Parquet/Delta/CSV/JSON), as convert already does. This is the scalable counterpart to #11.

Assessment

Merge with fixes. Strongest items are the error/auth-correctness bugs (#1–#3) and the API regression (#4); #5 unblocks contributors on modern Python. The rest are quality/perf improvements and two enhancement requests.

_{🤖 Review assembled with Claude Code assistance, with findings verified against the code where marked ✅.}

The CLI now checks the current working directory for pathling.toml before falling back to the user-level config.toml. Exactly one file is loaded (no merging): explicit --config > project-local pathling.toml > user-level config.toml > built-in defaults. When a project-local file is discovered, a one-line dim notice is printed naming the file (and the user-level file it overrides, if present), so the active configuration is never a silent surprise.

A config file that is a directory or otherwise unreadable now raises a CliError naming the file, rather than surfacing a raw OSError or silently falling back to another config source. This brings the read-failure behaviour into line with the project-local discovery edge case, and applies uniformly to explicit, project-local, and user-level files.

Users frequently need to tune Spark settings (executor memory, shuffle partitions, cloud-storage connectors) without touching Python code. This change introduces two complementary entry points: - A `[spark]` table in pathling.toml for persistent per-project settings. - A repeatable `--spark-conf KEY=VALUE` CLI flag for one-off overrides on a single invocation; the flag wins over the file. All keys must begin with `spark.`; values may be strings, integers, floats, or booleans (coerced to string). Secret references (`@file`) are supported. The resolved settings are merged with Pathling's managed defaults (jars.packages, sql.extensions, catalog) rather than replacing them, so the library continues to function while user tuning takes effect. The managed Spark defaults are extracted into a new `pathling._spark_defaults` module so that `context.py` and the CLI merge logic share a single source of truth instead of duplicating the coordinates.

Writing Parquet via pandas round-trips data through Python objects, which silently degrades nested struct/array columns to anonymous lists and coerces nullable integers to floating point. Collecting to an Arrow table with toArrow() and writing with pyarrow.parquet preserves column types faithfully. Arrow-based columnar transfer is also enabled on the Spark session (spark.sql.execution.arrow.pyspark.enabled=true) so that toPandas() calls for CSV/JSON/NDJSON output and interactive console use also benefit. An explicit --spark-conf value still overrides this default.

Both data source mode and single-resource mode now expose the same column name ("result") for the evaluated expression output. This makes the two modes consistent so callers can process the output uniformly without having to branch on which mode was used. The companion schema comment and the CLI documentation are updated to explain the symmetry explicitly.

Replace the driver-side collect-and-write approach (pandas/PyArrow) with Spark's distributed writers for CSV, NDJSON, and Parquet. By default the single-partition output is departitioned to the requested path via a Hadoop FileSystem rename (same-filesystem, no cross-device copy); pass --no-departition to keep Spark's native directory of part files. Removes the JSON-array output format (--format json / .json extension), which was a driver-side-only artefact; a helpful error points users at NDJSON instead. The row-count confirmation is replaced by a format-and- path message because Spark's write path returns no row count, and obtaining one would re-execute terminology UDFs. Departitioning logic lives in a new departition module and operates uniformly over local, S3, HDFS, and other Hadoop-compatible destinations.

The constant was already unreferenced and served no purpose. Removing it in favour of the format sets that are actually used.

Replace the terse multi-command examples with a single, concrete end-to-end script that shows how to read NDJSON, project a tabular view with view(), and summarise the result via Spark SQL. The \b marker preserves the indented block formatting in the Click help output. The same example is added to the CLI reference page so users can see it in the documentation site as well.

Previously the target code had to come from a dataset column (--other-code-column), requiring users to add a constant column when testing against a single known concept. This change mirrors the existing fixed-vs-column pattern used for the system URI: a new --other-code flag accepts a literal code applied to every row, while --other-code-column remains available for per-row comparisons. Exactly one of the two must be supplied; the same mutual- exclusion rule is applied to --other-system / --other-system-column. Validation runs before the Spark session is created so usage errors fail fast.

Address a cluster of user-story fixes in the Python CLI: - Auth vs. non-auth failure distinction (FR-001): export only claims authentication failed when the exception text looks like an auth failure; connection/timeout/5xx errors are surfaced as their true root cause. - Explicit --config must exist (FR-002): a missing explicit config path is now a usage error rather than a silent fallback to another file's credentials. - Avoid re-reading config in export (FR-003): CliConfig now carries the parsed [bulk-auth] table so export resolves credentials from the already-loaded config. - Auth input without client ID is an error (FR-004): partial auth input (token endpoint or secret but no client ID) raises a usage error instead of silently falling through to an unauthenticated run. - Partial terminology auth warns the user (FR-005). - Connection errors name the configured server URL (FR-011). - Bare "parse error" no longer misclassified as FHIRPath (FR-012). - Summary tables no longer trigger per-type row-count Spark jobs (FR-013); resource types and output path are reported instead. - Bundle detection reads only a leading prefix of each file, not the full JSON (FR-014). - convert, view, and fhirpath pass the known resource type to the Bundles reader to skip driver-side discovery (FR-015). - Bundle discovery runs under the progress spinner in convert (FR-016). - Quiet log4j2 config ships as package data instead of a per-run NamedTemporaryFile (FR-017). - Coding column struct derives field names from the library's own schema so the CLI cannot drift from it (FR-018). - Package __getattr__ now also resolves lazy submodule access.

dbplyr 2.6.0 is incompatible with sparklyr 1.9.4. Its query-fields probe emits standard SQL with double-quoted identifiers (SELECT 0L AS "path"), which the Spark parser rejects, breaking every tbl_spark operation in the R tests. sparklyr declares no upper bound on dbplyr, so CI installed 2.6.0 within hours of its release and the R module started failing. Pin dbplyr to the last working release after installing the dev dependencies, so the version holds regardless of what the package cache restored. Remove once a sparklyr release supports dbplyr 2.6.0.

sonarqubecloud · 2026-06-18T03:48:56Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

piotrszul

Looks good to me.

johngrimes added 2 commits June 12, 2026 13:00

johngrimes marked this pull request as ready for review June 12, 2026 08:37

johngrimes assigned piotrszul Jun 12, 2026

johngrimes requested a review from piotrszul June 12, 2026 08:37

johngrimes added new feature New feature or request python Pull requests that update Python code labels Jun 12, 2026

johngrimes added this to Pathling Jun 12, 2026

github-project-automation Bot moved this to Backlog in Pathling Jun 12, 2026

johngrimes moved this from Backlog to In progress in Pathling Jun 12, 2026

piotrszul reviewed Jun 15, 2026

View reviewed changes

johngrimes added 15 commits June 17, 2026 11:57

Bump lodash and lodash-es to 4.18.1

d712f69

refactor: Remove unused FILE_FORMATS constant

3e983e7

The constant was already unreferenced and served no purpose. Removing it in favour of the format sets that are actually used.

Add reference doc links to CLI command help

e40e438

Rename --coding-version option to --system-version

c9d7ec7

Reduce default stdout table row limit to 50

2bb72d1

johngrimes requested a review from piotrszul June 18, 2026 07:03

piotrszul approved these changes Jun 18, 2026

View reviewed changes

johngrimes moved this from In progress to Done in Pathling Jun 19, 2026

johngrimes merged commit 220d94f into main Jun 20, 2026
5 checks passed

johngrimes deleted the cli branch June 20, 2026 01:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add command line interface#2628

Add command line interface#2628
johngrimes merged 18 commits into
mainfrom
cli

johngrimes commented Jun 12, 2026

Uh oh!

piotrszul left a comment

Uh oh!

sonarqubecloud Bot commented Jun 18, 2026

Uh oh!

piotrszul left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

johngrimes commented Jun 12, 2026

Uh oh!

piotrszul left a comment

Choose a reason for hiding this comment

🔴 Should fix before merge

🟠 Worth fixing

🔵 Minor / nits

🟢 Enhancement requests (not blockers)

Assessment

Uh oh!

sonarqubecloud Bot commented Jun 18, 2026

Quality Gate passed

Uh oh!

piotrszul left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants