feat(observability): uptime, Prometheus /metrics, and OTLP traces across all servers#150
feat(observability): uptime, Prometheus /metrics, and OTLP traces across all servers#150mauripunzueta wants to merge 10 commits into
Conversation
…oss all servers Add a shared `helios-observability` crate and wire it into all four server binaries (hfs, hts, sof-server, fhirpath-server): - `GET /metrics` Prometheus endpoint (maintained metrics + metrics-exporter-prometheus stack; avoids unmaintained opentelemetry-prometheus / its protobuf RUSTSEC advisory). - `/health` enriched with `uptime_seconds` + `started_at`. - Per-request `http_requests_total` / `http_request_duration_seconds` metrics (templated `route` label; `service` global label) and a tracing span. - Feature-gated (`otel`) OTLP trace export via tracing-opentelemetry; OTLP metrics are produced out-of-process by a Collector scraping /metrics. Tenant is recorded as a span attribute only, never a metric label, to bound Prometheus cardinality. Verified end-to-end against fhirpath-server.
|
@claude and @mauripunzueta how will this work if hfs is run in a clustered environment? |
|
Claude finished @smunini's task in 1m 16s —— View job Task List
How this behaves in a clustered deployment (multiple
|
…ening
Add a management-console metrics surface and the access controls around it,
plus a broader auth/tenant/error-handling hardening pass surfaced while
reviewing that surface.
Console metrics (helios-rest, helios-observability, helios-persistence):
- Six JSON endpoints under /console/metrics/* in three trust tiers: public
`uptime`; authenticated tenant-scoped `resource-counts`, `activity`,
`resource-distribution`; and system-scope-gated cross-tenant `tenants`,
`traffic`.
- New ResourceStorage methods (count_by_day, count_by_types, activity_histogram,
count_all_types, count_by_tenant) with real SQLite/Postgres/Mongo impls and a
composite delegate; count_by_types is a single grouped IN(...) query per
backend (types bound as parameters, never interpolated).
- In-process request-log ring buffer (observability::reqlog) backing the
windowed traffic/latency widgets; readers copy under the lock then aggregate
so the per-request record() hot path is never stalled.
- Per-tenant data is served only in authenticated JSON responses, never as a
Prometheus label (tenant stays a trace-span attribute only).
Security hardening:
- Admin-scope gate: /console/metrics/{tenants,traffic} require a system/*.r scope
(ScopeSet::has_system_scope + admin_authz_middleware); ordinary user/patient
tokens, even wildcard ones, get 403.
- Per-resource authorization under URL-path tenant routing now strips the tenant
prefix before classifying the FHIR operation, so the real resource type is
authorized (closes a leading-underscore scope-check bypass).
- Tenant resolution fails closed when an authenticated token carries no (or an
empty) tenant claim: it resolves to the default tenant, or returns 403 if a
non-default tenant was requested via header/URL — a spoofed X-Tenant-ID can no
longer widen access.
- Internal/backend errors are sanitized at one chokepoint
(RestError::client_response) used by both the normal and batch/transaction
paths; raw DB/driver detail is logged server-side, never returned to clients.
- The storage backend name is no longer disclosed on the unauthenticated
/health, /_readiness, /console/metrics/uptime, or /metadata endpoints.
- `console` is reserved as a system path so it can never be a tenant.
- SQLite console date filters use a sargable `last_updated >= ?` bound.
BEHAVIOR CHANGE: with auth enabled, a token that carries no tenant claim now
resolves to the default tenant, or is rejected 403 if it requests a non-default
tenant via header/URL. Multi-tenant-by-header/URL deployments must carry the
tenant in the JWT claim (HFS_JWT_TENANT_CLAIM).
Developed without a local Rust toolchain; this needs a CI compile/clippy/test
pass. Unit tests were added for the scope check, operation classification, error
sanitization, reqlog aggregation, and tenant-prefix reservation; the tenant
extractor and batch handler paths still need integration coverage.
Bring the console-metrics + observability + auth/tenant-hardening branch up to date with main (workspace 0.1.47 -> 0.2.1; 256 commits incl. bulk-submit, SOF integration, FHIRPath resolve-storage, HTTP compression). Conflict resolutions (main taken as base, our logic re-applied on top): - crates/*/Cargo.toml (x5) + Cargo.lock: main 0.2.1 and main's new deps; re-added the helios-observability dependency at 0.2.1 and the rest `otel` feature. Cargo.lock taken from main and must be regenerated with cargo. - crates/auth/src/scope/mod.rs: kept main's raw / has_system_wildcard / grants_operation; re-applied our has_system_scope (admin-gate primitive). - crates/persistence/src/core/storage.rs: kept main's sof_runner; re-applied our five console count methods plus DailyResourceCount / ActivityCell. - crates/rest/src/error.rs: merged our client_response sanitization refactor with main's new bulk-submit error variants (PayloadTooLarge / Conflict / TooManyRequests / NotSupported); internal class stays scrubbed + logged. - crates/hts/src/server.rs: kept main's compression / body-limit / CORS; re-applied the /metrics route and request-tracking middleware. - CLAUDE.md: took main's restructured version. Verified post-merge by a 5-lens static security review (authz, tenant isolation, injection, DoS, information disclosure): all clean, no medium-or-higher findings, and all prior fixes confirmed intact against main's current code. Compilation and Cargo.lock regeneration are pending CI (no local Rust toolchain).
The merge-resolved and re-applied files were written without a local formatter (no Rust toolchain in the authoring environment); this runs cargo fmt --all so the CI formatting check passes. No logic changes.
cargo clippy -D warnings flagged unnecessary_sort_by on the reverse sort in per_tenant(); switch to sort_by_key(Reverse(..)) per clippy's suggestion. No behavior change (still busiest-first).
cargo audit fails on two new RUSTSEC advisories in quick-xml (0194 quadratic duplicate-attribute scan, 0195 unbounded namespace allocation), pulled transitively at 0.37.5 (object_store) and 0.38.4 (helios-serde xml feature, octofhir-ucum). Fixed only in quick-xml >= 0.41.0, which the consumers cannot adopt without upstream/major bumps. Ambient advisory (main fails identically); ignored to unblock CI, matching the repo's existing advisory-ignore pattern. Revisit and remove once the consumers move to quick-xml >= 0.41.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Answers the PR review question "how does this work in a clustered environment?".
- Console responses now carry explicit scope. /traffic and /uptime are
scope="single-instance"; /resource-counts, /activity, /resource-distribution
and the resources column of /tenants carry a backend-derived scope
("cluster" on Postgres/Mongo/S3 and composites whose primary is one of those,
"single-instance" on SQLite) via a new ResourceStorage::is_cluster_shared();
/tenants splits resources_scope (cluster) from traffic_scope (single-instance).
- /traffic and /tenants include the answering instance's hostname for per-node
attribution; the public /uptime deliberately omits it (unauthenticated -
a hostname is infrastructure-identifying, consistent with not exposing the
storage backend name there).
- Configure Prometheus latency histogram buckets so http_request_duration_seconds
is emitted as a real histogram, enabling a correct fleet-wide p95 via
histogram_quantile over summed buckets (previously a summary, which cannot be
aggregated across instances).
- docs/console-metrics.md: new "Clustering / multi-instance" section - shared-DB
endpoints are cluster-consistent, in-process traffic/uptime are single-instance
(aggregate fleet-wide via Prometheus/Grafana), auth/tenancy is stateless.
Reviewed by a 3-perspective design council and a 5-lens security council; no
medium-or-higher findings.
|
Good question @smunini. The resource numbers ( I added a "Clustering / multi-instance" section to |
…/auth branches Test-only; no production code changed. Raises patch coverage on the console / observability work. - crates/rest/tests/console_metrics.rs (new): integration tests hitting all six /console/metrics/* endpoints on an in-memory SQLite backend (auth-disabled path, as build_app merges them when auth is off), asserting response shape and the scope/instance markers, and that the public /uptime exposes no infrastructure identity (whole-key allowlist). - sqlite storage: unit tests for count_by_day / count_by_types / activity_histogram / count_all_types / count_by_tenant and is_cluster_shared. - auth middleware: extract_operation / extract_operation_for_routing branch tests, including the leading-underscore tenant-prefix bypass regression guard. - tenant extractor: from_request_parts tests for the JWT-claim-authoritative, claimless-token-403, empty-claim, and default-fallback branches (including non-strict "spoofed header is ignored when a claim is present").
…rop unused test import
Two fixes surfaced by the new coverage tests in the parent commit:
- extractors/tenant.rs: an empty tenant claim (Some("")) with no header/URL
tenant requested now resolves to the configured default tenant instead of a
403. The resolver's always-present JwtTenantExtractor re-surfaced the empty
claim as the "resolved" tenant, which tripped the missing-claim 403 guard; the
guard now ignores an empty resolved tenant. A genuine non-empty, non-default
header/URL tenant on a claimless/empty-claim token still returns 403, so tenant
isolation is unchanged.
- tests/console_metrics.rs: remove an unused `ResourceStorage` import that failed
clippy under -D warnings (the console test seeds via HTTP, not backend.create).
Reviewed by a 2-lens council (tenant isolation + correctness): isolation
preserved, all from_request_parts tests pass.
…es and Mongo Test-only. Adds testcontainer integration tests exercising count_by_types, count_all_types, count_by_day, activity_histogram, count_by_tenant and is_cluster_shared on the PostgreSQL and MongoDB backends (both were 0% on the new console methods), mirroring each file's existing harness (shared PG container / per-test fresh Mongo database, uuid-suffixed tenants for the cross-tenant aggregate). No production code changed. Reviewed for compile-soundness (no duplicate fn names, no unused imports under -D warnings, exact API/signature fidelity) and shared-container flakiness.
Summary
Adds a shared
helios-observabilitycrate and wires it into all four serverbinaries (
hfs,hts,sof-server,fhirpath-server):GET /metricsPrometheus endpoint —http_requests_total,http_request_duration_seconds,uptime_seconds, with aservicegloballabel and a templated
routelabel./healthenriched withuptime_seconds+started_at.metric label (cardinality).
otel) OTLP trace export viatracing-opentelemetry(opentelemetry 0.32). OTLP metrics are produced out-of-process by a Collector
scraping
/metrics— we avoid the unmaintainedopentelemetry-prometheus(protobuf RUSTSEC advisory).
Verification
cargo check(default) compiles; observability also compiles with--features otel. Clippy clean (CI flags) for touched crates. Unit tests pass.fhirpath-server:/healthuptime +/metricscounter/gauge/histogram confirmed.
Notes / known gaps in local verification
under the Windows toolchain (environment limitation, not a code issue). The
stateless
fhirpath-serverexercised the same wiring successfully.--all-featureslocally (build window). A pre-existingdead_codewarning onbuild_embedded_job_storeappears only under defaultfeatures; CI's
--all-featuresrun won't hit it. CI is the authoritative gate.