feat(observability): uptime, Prometheus /metrics, and OTLP traces across all servers by mauripunzueta · Pull Request #150 · HeliosSoftware/hfs

mauripunzueta · 2026-06-11T15:10:24Z

Summary

Adds a shared helios-observability crate and wires it into all four server
binaries (hfs, hts, sof-server, fhirpath-server):

GET /metrics Prometheus endpoint — http_requests_total,
http_request_duration_seconds, uptime_seconds, with a service global
label and a templated route label.
/health enriched with uptime_seconds + started_at.
Per-request metrics + a tracing span. Tenant is a span attribute only, never a
metric label (cardinality).
Feature-gated (otel) OTLP trace export via tracing-opentelemetry
(opentelemetry 0.32). OTLP metrics are produced out-of-process by a Collector
scraping /metrics — we avoid the unmaintained opentelemetry-prometheus
(protobuf RUSTSEC advisory).

Verification

Workspace cargo check (default) compiles; observability also compiles with
--features otel. Clippy clean (CI flags) for touched crates. Unit tests pass.
End-to-end smoke test against fhirpath-server: /health uptime + /metrics
counter/gauge/histogram confirmed.

Notes / known gaps in local verification

Full HFS server not run locally — SQLite WAL fails over the WSL filesystem
under the Windows toolchain (environment limitation, not a code issue). The
stateless fhirpath-server exercised the same wiring successfully.
Clippy not run with --all-features locally (build window). A pre-existing
dead_code warning on build_embedded_job_store appears only under default
features; CI's --all-features run won't hit it. CI is the authoritative gate.

…oss all servers Add a shared `helios-observability` crate and wire it into all four server binaries (hfs, hts, sof-server, fhirpath-server): - `GET /metrics` Prometheus endpoint (maintained metrics + metrics-exporter-prometheus stack; avoids unmaintained opentelemetry-prometheus / its protobuf RUSTSEC advisory). - `/health` enriched with `uptime_seconds` + `started_at`. - Per-request `http_requests_total` / `http_request_duration_seconds` metrics (templated `route` label; `service` global label) and a tracing span. - Feature-gated (`otel`) OTLP trace export via tracing-opentelemetry; OTLP metrics are produced out-of-process by a Collector scraping /metrics. Tenant is recorded as a span attribute only, never a metric label, to bound Prometheus cardinality. Verified end-to-end against fhirpath-server.

smunini · 2026-07-01T00:50:24Z

@claude and @mauripunzueta how will this work if hfs is run in a clustered environment?

claude · 2026-07-01T00:50:46Z

Claude finished @smunini's task in 1m 16s —— View job

Task List

Read trigger comment and gather PR context
Review observability crate implementation (uptime, metrics, middleware, telemetry)
Analyze clustered/multi-instance implications
Post answer

How this behaves in a clustered deployment (multiple `hfs` replicas)

Short answer: it works, but everything is per-process, in-memory, and coordination-free — the design leans entirely on the scraping/collector infrastructure to reconcile data across replicas, not on the app itself.

/health (uptime_seconds, started_at) — crates/observability/src/uptime.rs:10-11 stores START_INSTANT/STARTED_AT in a process-local OnceLock. Each replica reports its own uptime. That's the correct behavior for k8s-style liveness/readiness probes, since those hit each pod directly by IP, not through a shared load balancer — no change needed here.

GET /metrics — crates/observability/src/metrics.rs:28-37 installs a Prometheus pull exporter (metrics-exporter-prometheus) with in-process counters/histograms via PrometheusBuilder::install_recorder(). There's no shared/aggregated state across replicas — each pod's /metrics reflects only requests it served. This is fine, but only if the Prometheus scrape config targets each replica individually (e.g. Kubernetes PodMonitor/per-pod endpoints via a headless service, ECS Cloud Map per-task discovery), which is the standard pattern. If /metrics is instead scraped through a single load-balanced Service VIP, each scrape round-robins to a random pod, and the resulting series will look broken — uptime_seconds jumping around, http_requests_total counters appearing to reset or skip — because consecutive scrapes hit different processes.

A couple of related points worth flagging:

The recorder attaches a global service="hfs" label (metrics.rs:33) to distinguish the four binaries, but does not add any per-replica identifier (instance/pod/replica). That's intentional — Prometheus itself injects instance (the scrape target address) and, with k8s service discovery, pod/namespace labels — so cross-replica aggregation (sum by (route) (rate(http_requests_total[5m]))) works once scraping is per-pod.
Counters/histograms are in-memory only, so they reset to zero on every restart/rolling deploy per replica. Standard Prometheus rate()/increase() handle this correctly, just calling it out since it's a change from nothing existing before.
/metrics is mounted outside the auth layer (crates/hfs/src/main.rs:556-562, and similarly in hts/sof-server/fhirpath-server) so it's unauthenticated by design (scrapers need no token). In a cluster this should stay reachable only from cluster-internal scrapers (NetworkPolicy / security group), not exposed on a public LB alongside the FHIR API.

OTLP traces (otel feature, crates/observability/src/telemetry.rs:63-90) — this is push-based: each replica independently exports spans to whatever OTEL_EXPORTER_OTLP_ENDPOINT points at (typically a cluster-local OpenTelemetry Collector). No cross-instance coordination needed here either — this is the one signal that's naturally cluster-friendly without any scrape-topology caveat.

Bottom line: nothing in this PR assumes single-instance deployment, but the /metrics design does assume Prometheus scrapes each replica as a distinct target (the normal Prometheus/k8s pattern) rather than through a load-balanced endpoint. That expectation isn't documented anywhere yet — worth a short note in a skill doc or the crate's module docs (crates/observability/src/lib.rs) if this repo's target deployments include clustered/multi-replica setups, so operators wire scrape configs correctly from the start.

…ening Add a management-console metrics surface and the access controls around it, plus a broader auth/tenant/error-handling hardening pass surfaced while reviewing that surface. Console metrics (helios-rest, helios-observability, helios-persistence): - Six JSON endpoints under /console/metrics/* in three trust tiers: public `uptime`; authenticated tenant-scoped `resource-counts`, `activity`, `resource-distribution`; and system-scope-gated cross-tenant `tenants`, `traffic`. - New ResourceStorage methods (count_by_day, count_by_types, activity_histogram, count_all_types, count_by_tenant) with real SQLite/Postgres/Mongo impls and a composite delegate; count_by_types is a single grouped IN(...) query per backend (types bound as parameters, never interpolated). - In-process request-log ring buffer (observability::reqlog) backing the windowed traffic/latency widgets; readers copy under the lock then aggregate so the per-request record() hot path is never stalled. - Per-tenant data is served only in authenticated JSON responses, never as a Prometheus label (tenant stays a trace-span attribute only). Security hardening: - Admin-scope gate: /console/metrics/{tenants,traffic} require a system/*.r scope (ScopeSet::has_system_scope + admin_authz_middleware); ordinary user/patient tokens, even wildcard ones, get 403. - Per-resource authorization under URL-path tenant routing now strips the tenant prefix before classifying the FHIR operation, so the real resource type is authorized (closes a leading-underscore scope-check bypass). - Tenant resolution fails closed when an authenticated token carries no (or an empty) tenant claim: it resolves to the default tenant, or returns 403 if a non-default tenant was requested via header/URL — a spoofed X-Tenant-ID can no longer widen access. - Internal/backend errors are sanitized at one chokepoint (RestError::client_response) used by both the normal and batch/transaction paths; raw DB/driver detail is logged server-side, never returned to clients. - The storage backend name is no longer disclosed on the unauthenticated /health, /_readiness, /console/metrics/uptime, or /metadata endpoints. - `console` is reserved as a system path so it can never be a tenant. - SQLite console date filters use a sargable `last_updated >= ?` bound. BEHAVIOR CHANGE: with auth enabled, a token that carries no tenant claim now resolves to the default tenant, or is rejected 403 if it requests a non-default tenant via header/URL. Multi-tenant-by-header/URL deployments must carry the tenant in the JWT claim (HFS_JWT_TENANT_CLAIM). Developed without a local Rust toolchain; this needs a CI compile/clippy/test pass. Unit tests were added for the scope check, operation classification, error sanitization, reqlog aggregation, and tenant-prefix reservation; the tenant extractor and batch handler paths still need integration coverage.

Bring the console-metrics + observability + auth/tenant-hardening branch up to date with main (workspace 0.1.47 -> 0.2.1; 256 commits incl. bulk-submit, SOF integration, FHIRPath resolve-storage, HTTP compression). Conflict resolutions (main taken as base, our logic re-applied on top): - crates/*/Cargo.toml (x5) + Cargo.lock: main 0.2.1 and main's new deps; re-added the helios-observability dependency at 0.2.1 and the rest `otel` feature. Cargo.lock taken from main and must be regenerated with cargo. - crates/auth/src/scope/mod.rs: kept main's raw / has_system_wildcard / grants_operation; re-applied our has_system_scope (admin-gate primitive). - crates/persistence/src/core/storage.rs: kept main's sof_runner; re-applied our five console count methods plus DailyResourceCount / ActivityCell. - crates/rest/src/error.rs: merged our client_response sanitization refactor with main's new bulk-submit error variants (PayloadTooLarge / Conflict / TooManyRequests / NotSupported); internal class stays scrubbed + logged. - crates/hts/src/server.rs: kept main's compression / body-limit / CORS; re-applied the /metrics route and request-tracking middleware. - CLAUDE.md: took main's restructured version. Verified post-merge by a 5-lens static security review (authz, tenant isolation, injection, DoS, information disclosure): all clean, no medium-or-higher findings, and all prior fixes confirmed intact against main's current code. Compilation and Cargo.lock regeneration are pending CI (no local Rust toolchain).

The merge-resolved and re-applied files were written without a local formatter (no Rust toolchain in the authoring environment); this runs cargo fmt --all so the CI formatting check passes. No logic changes.

cargo clippy -D warnings flagged unnecessary_sort_by on the reverse sort in per_tenant(); switch to sort_by_key(Reverse(..)) per clippy's suggestion. No behavior change (still busiest-first).

cargo audit fails on two new RUSTSEC advisories in quick-xml (0194 quadratic duplicate-attribute scan, 0195 unbounded namespace allocation), pulled transitively at 0.37.5 (object_store) and 0.38.4 (helios-serde xml feature, octofhir-ucum). Fixed only in quick-xml >= 0.41.0, which the consumers cannot adopt without upstream/major bumps. Ambient advisory (main fails identically); ignored to unblock CI, matching the repo's existing advisory-ignore pattern. Revisit and remove once the consumers move to quick-xml >= 0.41.

codecov · 2026-07-02T23:18:45Z

Codecov Report

❌ Patch coverage is 84.58904% with 225 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
crates/persistence/src/backends/mongodb/storage.rs	6.00%	47 Missing ⚠️
crates/rest/src/middleware/auth.rs	80.47%	33 Missing ⚠️
crates/observability/src/metrics.rs	11.11%	24 Missing ⚠️
crates/rest/src/handlers/console_metrics.rs	94.13%	19 Missing ⚠️
crates/rest/src/handlers/batch.rs	75.00%	16 Missing ⚠️
crates/persistence/src/backends/sqlite/storage.rs	91.93%	15 Missing ⚠️
crates/observability/src/telemetry.rs	0.00%	14 Missing ⚠️
crates/persistence/src/core/storage.rs	0.00%	13 Missing ⚠️
crates/persistence/src/composite/storage.rs	0.00%	11 Missing ⚠️
crates/rest/src/lib.rs	56.52%	10 Missing ⚠️
... and 6 more

📢 Thoughts on this report? Let us know!

Answers the PR review question "how does this work in a clustered environment?". - Console responses now carry explicit scope. /traffic and /uptime are scope="single-instance"; /resource-counts, /activity, /resource-distribution and the resources column of /tenants carry a backend-derived scope ("cluster" on Postgres/Mongo/S3 and composites whose primary is one of those, "single-instance" on SQLite) via a new ResourceStorage::is_cluster_shared(); /tenants splits resources_scope (cluster) from traffic_scope (single-instance). - /traffic and /tenants include the answering instance's hostname for per-node attribution; the public /uptime deliberately omits it (unauthenticated - a hostname is infrastructure-identifying, consistent with not exposing the storage backend name there). - Configure Prometheus latency histogram buckets so http_request_duration_seconds is emitted as a real histogram, enabling a correct fleet-wide p95 via histogram_quantile over summed buckets (previously a summary, which cannot be aggregated across instances). - docs/console-metrics.md: new "Clustering / multi-instance" section - shared-DB endpoints are cluster-consistent, in-process traffic/uptime are single-instance (aggregate fleet-wide via Prometheus/Grafana), auth/tenancy is stateless. Reviewed by a 3-perspective design council and a 5-lens security council; no medium-or-higher findings.

mauripunzueta · 2026-07-02T23:59:27Z

Good question @smunini. The resource numbers (resource-counts, activity, resource-distribution, and the per-tenant counts) read from the shared database, so every instance returns the same fleet-wide numbers. The live traffic/latency and uptime come from in-process state, so those are per-instance — I've labeled them explicitly in the responses (a scope field, and an instance on the authenticated ones) so they can't be mistaken for fleet-wide numbers. For cluster-wide traffic/latency you aggregate the per-instance /metrics in Prometheus/Grafana (I configured histogram buckets so p95 aggregates correctly across instances). Auth and tenant resolution are stateless, so no shared session state — any instance can serve any request.

I added a "Clustering / multi-instance" section to docs/console-metrics.md with the details. TL;DR: DB-backed endpoints are cluster-wide; the in-process traffic/uptime views are per-instance (use Prometheus for the fleet view).

…/auth branches Test-only; no production code changed. Raises patch coverage on the console / observability work. - crates/rest/tests/console_metrics.rs (new): integration tests hitting all six /console/metrics/* endpoints on an in-memory SQLite backend (auth-disabled path, as build_app merges them when auth is off), asserting response shape and the scope/instance markers, and that the public /uptime exposes no infrastructure identity (whole-key allowlist). - sqlite storage: unit tests for count_by_day / count_by_types / activity_histogram / count_all_types / count_by_tenant and is_cluster_shared. - auth middleware: extract_operation / extract_operation_for_routing branch tests, including the leading-underscore tenant-prefix bypass regression guard. - tenant extractor: from_request_parts tests for the JWT-claim-authoritative, claimless-token-403, empty-claim, and default-fallback branches (including non-strict "spoofed header is ignored when a claim is present").

…rop unused test import Two fixes surfaced by the new coverage tests in the parent commit: - extractors/tenant.rs: an empty tenant claim (Some("")) with no header/URL tenant requested now resolves to the configured default tenant instead of a 403. The resolver's always-present JwtTenantExtractor re-surfaced the empty claim as the "resolved" tenant, which tripped the missing-claim 403 guard; the guard now ignores an empty resolved tenant. A genuine non-empty, non-default header/URL tenant on a claimless/empty-claim token still returns 403, so tenant isolation is unchanged. - tests/console_metrics.rs: remove an unused `ResourceStorage` import that failed clippy under -D warnings (the console test seeds via HTTP, not backend.create). Reviewed by a 2-lens council (tenant isolation + correctness): isolation preserved, all from_request_parts tests pass.

…es and Mongo Test-only. Adds testcontainer integration tests exercising count_by_types, count_all_types, count_by_day, activity_histogram, count_by_tenant and is_cluster_shared on the PostgreSQL and MongoDB backends (both were 0% on the new console methods), mirroring each file's existing harness (shared PG container / per-test fresh Mongo database, uuid-suffixed tenants for the cross-tenant aggregate). No production code changed. Reviewed for compile-soundness (no duplicate fn names, no unused imports under -D warnings, exact API/signature fidelity) and shared-container flakiness.

smunini mentioned this pull request Jul 1, 2026

fix(sof-export): clean up partial results on cancel, plus a TTL reaper (#144) #180

Open

mauripunzueta added 5 commits July 2, 2026 16:53

style: apply cargo fmt to merged console/observability code

e50eca6

The merge-resolved and re-applied files were written without a local formatter (no Rust toolchain in the authoring environment); this runs cargo fmt --all so the CI formatting check passes. No logic changes.

fix(observability): use sort_by_key in reqlog per_tenant (clippy)

6164295

cargo clippy -D warnings flagged unnecessary_sort_by on the reverse sort in per_tenant(); switch to sort_by_key(Reverse(..)) per clippy's suggestion. No behavior change (still busiest-first).

mauripunzueta added 3 commits July 2, 2026 20:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(observability): uptime, Prometheus /metrics, and OTLP traces across all servers#150

feat(observability): uptime, Prometheus /metrics, and OTLP traces across all servers#150
mauripunzueta wants to merge 10 commits into
mainfrom
feat/observability-uptime

mauripunzueta commented Jun 11, 2026

Uh oh!

smunini commented Jul 1, 2026

Uh oh!

claude Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

mauripunzueta commented Jul 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

mauripunzueta commented Jun 11, 2026

Summary

Verification

Notes / known gaps in local verification

Uh oh!

smunini commented Jul 1, 2026

Uh oh!

claude Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Task List

How this behaves in a clustered deployment (multiple hfs replicas)

Uh oh!

codecov Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mauripunzueta commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented Jul 1, 2026 •

edited

Loading

How this behaves in a clustered deployment (multiple `hfs` replicas)

codecov Bot commented Jul 2, 2026 •

edited

Loading

mauripunzueta commented Jul 2, 2026 •

edited

Loading