Skip to content

Add command line interface#2628

Merged
johngrimes merged 18 commits into
mainfrom
cli
Jun 20, 2026
Merged

Add command line interface#2628
johngrimes merged 18 commits into
mainfrom
cli

Conversation

@johngrimes

Copy link
Copy Markdown
Member

This adds a pathling console script to the Python package, surfacing the library's functionality through a flat, verb-based command tree. It covers data conversion, SQL on FHIR views, FHIRPath evaluation, bulk export, and terminology operations over code datasets. The CLI is built with Click and Rich, resolves configuration from flags then an optional TOML file, and is installable and runnable via uv tool / uvx so its version always matches the library.

The package's public names are now exposed lazily, so importing the CLI doesn't pull in PySpark and --help/--version stay fast. Spark and JVM logging is suppressed by default (and re-enabled with --verbose), errors are rendered as concise messages without Java stack traces, data goes to stdout while progress and errors go to stderr, and the Spark worker interpreter is pinned to the driver to avoid version mismatches.

Commands: convert, view, fhirpath, export, and the terminology operations member-of, translate, subsumes, subsumed-by, display, property-of, and designation.

Tested with unit tests for the parsing, config, rendering, and error-mapping logic, per-command integration tests through Click's runner against the existing Spark and mock-server fixtures, and a subprocess smoke test of the installed entry point. Documentation is added under site/docs/libraries and the Python README.

Add a `pathling` console script to the Python package that surfaces the
library's functionality through a flat, verb-based command tree: data
conversion, SQL on FHIR views, FHIRPath evaluation, bulk export, and
terminology operations over code datasets. The CLI is built with Click and
Rich, resolves configuration from flags then an optional TOML file, and is
installable and runnable via `uv tool` / `uvx`.

Package public names are now exposed lazily so that importing the CLI does
not pull in PySpark, keeping `--help` and `--version` fast. Spark and JVM log
output is suppressed by default and re-enabled with `--verbose`, errors are
rendered as concise messages without Java stack traces, and worker Python is
pinned to the driver interpreter to avoid version mismatches.
The run command executes user-supplied Python code from a script file,
stdin, or an inline -c option with spark and pathling variables already
bound, reproducing Python interpreter semantics for sys.argv, __main__,
__file__, sys.path, tracebacks, and SystemExit propagation. The console
command opens an interactive IPython session over the same namespace,
preceded by a banner naming the version and in-scope variables.

Both commands reuse the existing configuration resolution and quiet
startup behaviour, and validate usage errors before starting the Spark
session. IPython becomes a runtime dependency of the package.
@johngrimes johngrimes marked this pull request as ready for review June 12, 2026 08:37
@johngrimes johngrimes requested a review from piotrszul June 12, 2026 08:37
@johngrimes johngrimes added new feature New feature or request python Pull requests that update Python code labels Jun 12, 2026
@github-project-automation github-project-automation Bot moved this to Backlog in Pathling Jun 12, 2026
@johngrimes johngrimes moved this from Backlog to In progress in Pathling Jun 12, 2026
Factor the shared --format/-o/--limit/--overwrite output options and the
output-format choice into a single decorator in the render module, so every
command declares its output surface once instead of repeating it. Extract the
Pathling and Delta Spark builder configuration into a single helper reused by
both the context factory and the CLI, removing the duplicated package wiring
that could otherwise drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@piotrszul piotrszul left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed at 7a105921fd. Consolidates three independent passes (two manual + one fresh reviewer with no shared context) plus empirical verification where noted (✅). The CLI is well-structured — clean module separation, fail-fast validation before Spark cold-start, careful stdout/stderr discipline, and genuinely fast --help/--version via the lazy import shim. The items below are prioritized; most are pre-merge fixes, with two enhancement requests at the end.

🔴 Should fix before merge

1. Broad except mislabels every export failure as an auth failurecli/export.py (try/except around pc.read.bulk(...))
When auth is configured, the catch-all re-raises any failure (server down, timeout, HTTP 500 during kickoff/polling/download) as "Authentication failed … check the client ID, credential, and token endpoint." Misdiagnoses network/server errors as bad credentials. → Narrow the catch to auth/401, or include the unwrapped root cause without asserting it's an auth problem.

2. Mistyped --config <missing> silently falls back to the default config's [bulk-auth]cli/export.py + cli/config.py
resolve_config nulls config_path when the path doesn't exist; export then does config.config_path or default_config_path(), so --config /typo.toml makes export authenticate with ~/.config/pathling/config.toml's credentials while the rest of the CLI read nothing from the explicit file. Silent, security-relevant. → Error if an explicit --config path doesn't exist; reuse the already-resolved config.

3. Incomplete bulk-auth runs the export unauthenticated ✅ — cli/config.py resolve_bulk_auth
Supplying --client-secret/--private-key-jwk + --token-endpoint but omitting --client-id returns None, so export runs with no auth and fails later with an opaque 401 instead of "you forgot --client-id." → If any bulk-auth flag/secret is given but client-id is missing, raise a usage error.

4. Lazy exports break pathling.<submodule>.X attribute access ✅ — pathling/__init__.py
The PEP 562 shim dropped the eager from .functions import … / from .udfs import …, so import pathling; pathling.udfs.member_of now raises AttributeError (it "heals" only after a lazy name from that submodule, or import pathling.udfs, runs). The documented top-level API (from pathling import member_of) is unaffected, so blast radius depends on downstream submodule-attribute usage — but it's a silent, order-dependent regression of a previously-valid pattern. → In __getattr__, fall back to importing a known submodule and cache it; add those names to __dir__.

🟠 Worth fixing

5. Test suite isn't portable to Python ≥ 3.10 ✅ — tests/cli/conftest.py
CliRunner(mix_stderr=False): the kwarg was removed in Click 8.2.0 (which also dropped Python 3.9). The lockfile resolves Click 8.1.8 for <3.10 and 8.3.1 for ≥3.10. CI stays green because the build venv is pinned to 3.9 (pom.xml -p 3.9, .python-version), but the suite errors at fixture setup on any modern interpreter. Runtime is unaffected (test-only). → Feature-detect: pass mix_stderr=False only if the parameter exists in CliRunner.__init__.

6. friendly_message's connection branch is dead codecli/main.py / cli/errors.py
The central handler never passes server_url, so "Could not reach the server at " never fires for convert/view/fhirpath; they fall through to the generic "--verbose" message. → Thread the configured tx_server in, or remove the branch.

7. Post-write summaries re-scan output with per-type count()cli/export.py, cli/convert.py _print_summary
After the read+write pass, the summary calls .read(rt).count() per resource type — N extra full Spark jobs just to print row counts. → Derive counts from the write, or make the summary opt-in.

8. Bundles are parsed twicecli/io.py
discover_bundle_resource_types json.loads every bundle on the driver (single-threaded) to enumerate types, then pc.read.bundles reparses them in Spark; _looks_like_bundle also full-parses the first JSON file (the XML branch reads only a prefix), and discovery runs outside the progress spinner. → Read a prefix in _looks_like_bundle; stream discovery / run it under the spinner / add --type to skip it; longer-term let the library reader discover types.

9. IPython/rich are now hard runtime deps of the whole librarypyproject.toml
ipython>=8.18.1 (CLI-only, used by console) is pulled into every pip install pathling, even for API-only users. → Move CLI deps to an optional extra (pathling[cli]) and import-guard console.

10. Quiet log4j2 temp file is never removedcli/session.py _write_quiet_log4j2
NamedTemporaryFile(delete=False) with no cleanup; every non-verbose run leaks one pathling-cli-log4j2-*.properties into the temp dir. → atexit-unlink, or (since the content is constant) write once to a cache path / ship as package data.

11. File output fully materialises on the driver; "stream" wording is inaccuratecli/render.py
stdout csv/json/ndjson use df.collect() and -o uses df.toPandas() — both buffer the whole result in driver memory, but the docstring and docs say csv/json/ndjson "stream the full result." (The --limit-applies-to-table-only behaviour is by design and correctly documented — not a bug.) → Fix the "stream" wording; optionally use toLocalIterator() for incremental stdout. (See enhancement E-2 below for the scalable-output side of this.)

🔵 Minor / nits

  • Terminology _validate_columns runs after Spark cold-start (a typo'd --code-column only errors post-startup) — cli/terminology.py.
  • _categorise keys "fhirpath" off the broad substring "parse error"cli/errors.py.
  • Partial terminology auth is silently disabled (no warning) unless both client_id + token_endpoint are set — cli/config.py.
  • Hand-built Coding struct duplicates the library's Coding schema (drift risk) — cli/terminology.py.
  • Unrelated README/Databricks-doc markdown reformatting inflates the diff.

🟢 Enhancement requests (not blockers)

E-1. Make the Spark session configurable via the config file (esp. driver memory). There's currently no way to set spark.driver.memory or other Spark options. Suggest a [spark] table in the existing TOML config, threaded into _build_spark_session alongside tx-server/[*-auth], e.g.:

[spark]
"spark.driver.memory" = "8g"

with optional --conf key=value overrides. Note: spark.driver.memory must be applied before the driver JVM launches, so verify the mechanism takes effect in local mode.

E-2. view: support both single-file and partitioned Spark output. view -o is currently always single-file (toPandas()); offer a --single-file/--partitioned (or --coalesce N) choice that routes partitioned writes through the Spark writer for the formats that support it (Parquet/Delta/CSV/JSON), as convert already does. This is the scalable counterpart to #11.

Assessment

Merge with fixes. Strongest items are the error/auth-correctness bugs (#1#3) and the API regression (#4); #5 unblocks contributors on modern Python. The rest are quality/perf improvements and two enhancement requests.


🤖 Review assembled with Claude Code assistance, with findings verified against the code where marked ✅.

The CLI now checks the current working directory for pathling.toml
before falling back to the user-level config.toml. Exactly one file
is loaded (no merging): explicit --config > project-local pathling.toml
> user-level config.toml > built-in defaults.

When a project-local file is discovered, a one-line dim notice is
printed naming the file (and the user-level file it overrides, if
present), so the active configuration is never a silent surprise.
A config file that is a directory or otherwise unreadable now raises a
CliError naming the file, rather than surfacing a raw OSError or silently
falling back to another config source. This brings the read-failure
behaviour into line with the project-local discovery edge case, and
applies uniformly to explicit, project-local, and user-level files.
Users frequently need to tune Spark settings (executor memory,
shuffle partitions, cloud-storage connectors) without touching
Python code. This change introduces two complementary entry points:

- A `[spark]` table in pathling.toml for persistent per-project
  settings.
- A repeatable `--spark-conf KEY=VALUE` CLI flag for one-off
  overrides on a single invocation; the flag wins over the file.

All keys must begin with `spark.`; values may be strings, integers,
floats, or booleans (coerced to string). Secret references (`@file`)
are supported. The resolved settings are merged with Pathling's
managed defaults (jars.packages, sql.extensions, catalog) rather
than replacing them, so the library continues to function while
user tuning takes effect.

The managed Spark defaults are extracted into a new
`pathling._spark_defaults` module so that `context.py` and the
CLI merge logic share a single source of truth instead of
duplicating the coordinates.
Writing Parquet via pandas round-trips data through Python objects,
which silently degrades nested struct/array columns to anonymous lists
and coerces nullable integers to floating point. Collecting to an Arrow
table with toArrow() and writing with pyarrow.parquet preserves column
types faithfully.

Arrow-based columnar transfer is also enabled on the Spark session
(spark.sql.execution.arrow.pyspark.enabled=true) so that toPandas()
calls for CSV/JSON/NDJSON output and interactive console use also
benefit. An explicit --spark-conf value still overrides this default.
Both data source mode and single-resource mode now expose the same
column name ("result") for the evaluated expression output.  This
makes the two modes consistent so callers can process the output
uniformly without having to branch on which mode was used.

The companion schema comment and the CLI documentation are updated to
explain the symmetry explicitly.
Replace the driver-side collect-and-write approach (pandas/PyArrow) with
Spark's distributed writers for CSV, NDJSON, and Parquet. By default the
single-partition output is departitioned to the requested path via a
Hadoop FileSystem rename (same-filesystem, no cross-device copy); pass
--no-departition to keep Spark's native directory of part files.

Removes the JSON-array output format (--format json / .json extension),
which was a driver-side-only artefact; a helpful error points users at
NDJSON instead. The row-count confirmation is replaced by a format-and-
path message because Spark's write path returns no row count, and
obtaining one would re-execute terminology UDFs.

Departitioning logic lives in a new departition module and operates
uniformly over local, S3, HDFS, and other Hadoop-compatible destinations.
The constant was already unreferenced and served no purpose. Removing it
in favour of the format sets that are actually used.
Replace the terse multi-command examples with a single, concrete
end-to-end script that shows how to read NDJSON, project a tabular
view with view(), and summarise the result via Spark SQL. The \b
marker preserves the indented block formatting in the Click help
output. The same example is added to the CLI reference page so
users can see it in the documentation site as well.
Previously the target code had to come from a dataset column
(--other-code-column), requiring users to add a constant column when
testing against a single known concept.

This change mirrors the existing fixed-vs-column pattern used for the
system URI: a new --other-code flag accepts a literal code applied to
every row, while --other-code-column remains available for per-row
comparisons. Exactly one of the two must be supplied; the same mutual-
exclusion rule is applied to --other-system / --other-system-column.
Validation runs before the Spark session is created so usage errors
fail fast.
Address a cluster of user-story fixes in the Python CLI:

- Auth vs. non-auth failure distinction (FR-001): export only claims
  authentication failed when the exception text looks like an auth
  failure; connection/timeout/5xx errors are surfaced as their true
  root cause.
- Explicit --config must exist (FR-002): a missing explicit config
  path is now a usage error rather than a silent fallback to another
  file's credentials.
- Avoid re-reading config in export (FR-003): CliConfig now carries the
  parsed [bulk-auth] table so export resolves credentials from the
  already-loaded config.
- Auth input without client ID is an error (FR-004): partial auth input
  (token endpoint or secret but no client ID) raises a usage error
  instead of silently falling through to an unauthenticated run.
- Partial terminology auth warns the user (FR-005).
- Connection errors name the configured server URL (FR-011).
- Bare "parse error" no longer misclassified as FHIRPath (FR-012).
- Summary tables no longer trigger per-type row-count Spark jobs
  (FR-013); resource types and output path are reported instead.
- Bundle detection reads only a leading prefix of each file, not the
  full JSON (FR-014).
- convert, view, and fhirpath pass the known resource type to the
  Bundles reader to skip driver-side discovery (FR-015).
- Bundle discovery runs under the progress spinner in convert (FR-016).
- Quiet log4j2 config ships as package data instead of a per-run
  NamedTemporaryFile (FR-017).
- Coding column struct derives field names from the library's own
  schema so the CLI cannot drift from it (FR-018).
- Package __getattr__ now also resolves lazy submodule access.
dbplyr 2.6.0 is incompatible with sparklyr 1.9.4. Its query-fields probe
emits standard SQL with double-quoted identifiers (SELECT 0L AS "path"),
which the Spark parser rejects, breaking every tbl_spark operation in the
R tests. sparklyr declares no upper bound on dbplyr, so CI installed 2.6.0
within hours of its release and the R module started failing.

Pin dbplyr to the last working release after installing the dev
dependencies, so the version holds regardless of what the package cache
restored. Remove once a sparklyr release supports dbplyr 2.6.0.
@sonarqubecloud

Copy link
Copy Markdown

@johngrimes johngrimes requested a review from piotrszul June 18, 2026 07:03

@piotrszul piotrszul left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@johngrimes johngrimes moved this from In progress to Done in Pathling Jun 19, 2026
@johngrimes johngrimes merged commit 220d94f into main Jun 20, 2026
5 checks passed
@johngrimes johngrimes deleted the cli branch June 20, 2026 01:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

new feature New feature or request python Pull requests that update Python code

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants