Skip to content

feat(observability): uptime, Prometheus /metrics, and OTLP traces across all servers#150

Open
mauripunzueta wants to merge 10 commits into
mainfrom
feat/observability-uptime
Open

feat(observability): uptime, Prometheus /metrics, and OTLP traces across all servers#150
mauripunzueta wants to merge 10 commits into
mainfrom
feat/observability-uptime

Conversation

@mauripunzueta

Copy link
Copy Markdown
Contributor

Summary

Adds a shared helios-observability crate and wires it into all four server
binaries (hfs, hts, sof-server, fhirpath-server):

  • GET /metrics Prometheus endpoint — http_requests_total,
    http_request_duration_seconds, uptime_seconds, with a service global
    label and a templated route label.
  • /health enriched with uptime_seconds + started_at.
  • Per-request metrics + a tracing span. Tenant is a span attribute only, never a
    metric label (cardinality).
  • Feature-gated (otel) OTLP trace export via tracing-opentelemetry
    (opentelemetry 0.32). OTLP metrics are produced out-of-process by a Collector
    scraping /metrics — we avoid the unmaintained opentelemetry-prometheus
    (protobuf RUSTSEC advisory).

Verification

  • Workspace cargo check (default) compiles; observability also compiles with
    --features otel. Clippy clean (CI flags) for touched crates. Unit tests pass.
  • End-to-end smoke test against fhirpath-server: /health uptime + /metrics
    counter/gauge/histogram confirmed.

Notes / known gaps in local verification

  1. Full HFS server not run locally — SQLite WAL fails over the WSL filesystem
    under the Windows toolchain (environment limitation, not a code issue). The
    stateless fhirpath-server exercised the same wiring successfully.
  2. Clippy not run with --all-features locally (build window). A pre-existing
    dead_code warning on build_embedded_job_store appears only under default
    features; CI's --all-features run won't hit it. CI is the authoritative gate.

…oss all servers

Add a shared `helios-observability` crate and wire it into all four server
binaries (hfs, hts, sof-server, fhirpath-server):

- `GET /metrics` Prometheus endpoint (maintained metrics +
  metrics-exporter-prometheus stack; avoids unmaintained
  opentelemetry-prometheus / its protobuf RUSTSEC advisory).
- `/health` enriched with `uptime_seconds` + `started_at`.
- Per-request `http_requests_total` / `http_request_duration_seconds` metrics
  (templated `route` label; `service` global label) and a tracing span.
- Feature-gated (`otel`) OTLP trace export via tracing-opentelemetry; OTLP
  metrics are produced out-of-process by a Collector scraping /metrics.

Tenant is recorded as a span attribute only, never a metric label, to bound
Prometheus cardinality. Verified end-to-end against fhirpath-server.
@smunini

smunini commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@claude and @mauripunzueta how will this work if hfs is run in a clustered environment?

@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown

Claude finished @smunini's task in 1m 16s —— View job


Task List

  • Read trigger comment and gather PR context
  • Review observability crate implementation (uptime, metrics, middleware, telemetry)
  • Analyze clustered/multi-instance implications
  • Post answer

How this behaves in a clustered deployment (multiple hfs replicas)

Short answer: it works, but everything is per-process, in-memory, and coordination-free — the design leans entirely on the scraping/collector infrastructure to reconcile data across replicas, not on the app itself.

/health (uptime_seconds, started_at)crates/observability/src/uptime.rs:10-11 stores START_INSTANT/STARTED_AT in a process-local OnceLock. Each replica reports its own uptime. That's the correct behavior for k8s-style liveness/readiness probes, since those hit each pod directly by IP, not through a shared load balancer — no change needed here.

GET /metricscrates/observability/src/metrics.rs:28-37 installs a Prometheus pull exporter (metrics-exporter-prometheus) with in-process counters/histograms via PrometheusBuilder::install_recorder(). There's no shared/aggregated state across replicas — each pod's /metrics reflects only requests it served. This is fine, but only if the Prometheus scrape config targets each replica individually (e.g. Kubernetes PodMonitor/per-pod endpoints via a headless service, ECS Cloud Map per-task discovery), which is the standard pattern. If /metrics is instead scraped through a single load-balanced Service VIP, each scrape round-robins to a random pod, and the resulting series will look broken — uptime_seconds jumping around, http_requests_total counters appearing to reset or skip — because consecutive scrapes hit different processes.

A couple of related points worth flagging:

  • The recorder attaches a global service="hfs" label (metrics.rs:33) to distinguish the four binaries, but does not add any per-replica identifier (instance/pod/replica). That's intentional — Prometheus itself injects instance (the scrape target address) and, with k8s service discovery, pod/namespace labels — so cross-replica aggregation (sum by (route) (rate(http_requests_total[5m]))) works once scraping is per-pod.
  • Counters/histograms are in-memory only, so they reset to zero on every restart/rolling deploy per replica. Standard Prometheus rate()/increase() handle this correctly, just calling it out since it's a change from nothing existing before.
  • /metrics is mounted outside the auth layer (crates/hfs/src/main.rs:556-562, and similarly in hts/sof-server/fhirpath-server) so it's unauthenticated by design (scrapers need no token). In a cluster this should stay reachable only from cluster-internal scrapers (NetworkPolicy / security group), not exposed on a public LB alongside the FHIR API.

OTLP traces (otel feature, crates/observability/src/telemetry.rs:63-90) — this is push-based: each replica independently exports spans to whatever OTEL_EXPORTER_OTLP_ENDPOINT points at (typically a cluster-local OpenTelemetry Collector). No cross-instance coordination needed here either — this is the one signal that's naturally cluster-friendly without any scrape-topology caveat.

Bottom line: nothing in this PR assumes single-instance deployment, but the /metrics design does assume Prometheus scrapes each replica as a distinct target (the normal Prometheus/k8s pattern) rather than through a load-balanced endpoint. That expectation isn't documented anywhere yet — worth a short note in a skill doc or the crate's module docs (crates/observability/src/lib.rs) if this repo's target deployments include clustered/multi-replica setups, so operators wire scrape configs correctly from the start.

…ening

Add a management-console metrics surface and the access controls around it,
plus a broader auth/tenant/error-handling hardening pass surfaced while
reviewing that surface.

Console metrics (helios-rest, helios-observability, helios-persistence):
- Six JSON endpoints under /console/metrics/* in three trust tiers: public
  `uptime`; authenticated tenant-scoped `resource-counts`, `activity`,
  `resource-distribution`; and system-scope-gated cross-tenant `tenants`,
  `traffic`.
- New ResourceStorage methods (count_by_day, count_by_types, activity_histogram,
  count_all_types, count_by_tenant) with real SQLite/Postgres/Mongo impls and a
  composite delegate; count_by_types is a single grouped IN(...) query per
  backend (types bound as parameters, never interpolated).
- In-process request-log ring buffer (observability::reqlog) backing the
  windowed traffic/latency widgets; readers copy under the lock then aggregate
  so the per-request record() hot path is never stalled.
- Per-tenant data is served only in authenticated JSON responses, never as a
  Prometheus label (tenant stays a trace-span attribute only).

Security hardening:
- Admin-scope gate: /console/metrics/{tenants,traffic} require a system/*.r scope
  (ScopeSet::has_system_scope + admin_authz_middleware); ordinary user/patient
  tokens, even wildcard ones, get 403.
- Per-resource authorization under URL-path tenant routing now strips the tenant
  prefix before classifying the FHIR operation, so the real resource type is
  authorized (closes a leading-underscore scope-check bypass).
- Tenant resolution fails closed when an authenticated token carries no (or an
  empty) tenant claim: it resolves to the default tenant, or returns 403 if a
  non-default tenant was requested via header/URL — a spoofed X-Tenant-ID can no
  longer widen access.
- Internal/backend errors are sanitized at one chokepoint
  (RestError::client_response) used by both the normal and batch/transaction
  paths; raw DB/driver detail is logged server-side, never returned to clients.
- The storage backend name is no longer disclosed on the unauthenticated
  /health, /_readiness, /console/metrics/uptime, or /metadata endpoints.
- `console` is reserved as a system path so it can never be a tenant.
- SQLite console date filters use a sargable `last_updated >= ?` bound.

BEHAVIOR CHANGE: with auth enabled, a token that carries no tenant claim now
resolves to the default tenant, or is rejected 403 if it requests a non-default
tenant via header/URL. Multi-tenant-by-header/URL deployments must carry the
tenant in the JWT claim (HFS_JWT_TENANT_CLAIM).

Developed without a local Rust toolchain; this needs a CI compile/clippy/test
pass. Unit tests were added for the scope check, operation classification, error
sanitization, reqlog aggregation, and tenant-prefix reservation; the tenant
extractor and batch handler paths still need integration coverage.
Bring the console-metrics + observability + auth/tenant-hardening branch up to
date with main (workspace 0.1.47 -> 0.2.1; 256 commits incl. bulk-submit, SOF
integration, FHIRPath resolve-storage, HTTP compression).

Conflict resolutions (main taken as base, our logic re-applied on top):
- crates/*/Cargo.toml (x5) + Cargo.lock: main 0.2.1 and main's new deps; re-added
  the helios-observability dependency at 0.2.1 and the rest `otel` feature.
  Cargo.lock taken from main and must be regenerated with cargo.
- crates/auth/src/scope/mod.rs: kept main's raw / has_system_wildcard /
  grants_operation; re-applied our has_system_scope (admin-gate primitive).
- crates/persistence/src/core/storage.rs: kept main's sof_runner; re-applied our
  five console count methods plus DailyResourceCount / ActivityCell.
- crates/rest/src/error.rs: merged our client_response sanitization refactor with
  main's new bulk-submit error variants (PayloadTooLarge / Conflict /
  TooManyRequests / NotSupported); internal class stays scrubbed + logged.
- crates/hts/src/server.rs: kept main's compression / body-limit / CORS; re-applied
  the /metrics route and request-tracking middleware.
- CLAUDE.md: took main's restructured version.

Verified post-merge by a 5-lens static security review (authz, tenant isolation,
injection, DoS, information disclosure): all clean, no medium-or-higher findings,
and all prior fixes confirmed intact against main's current code. Compilation and
Cargo.lock regeneration are pending CI (no local Rust toolchain).
The merge-resolved and re-applied files were written without a local formatter
(no Rust toolchain in the authoring environment); this runs cargo fmt --all so
the CI formatting check passes. No logic changes.
cargo clippy -D warnings flagged unnecessary_sort_by on the reverse sort in
per_tenant(); switch to sort_by_key(Reverse(..)) per clippy's suggestion. No
behavior change (still busiest-first).
cargo audit fails on two new RUSTSEC advisories in quick-xml (0194 quadratic
duplicate-attribute scan, 0195 unbounded namespace allocation), pulled
transitively at 0.37.5 (object_store) and 0.38.4 (helios-serde xml feature,
octofhir-ucum). Fixed only in quick-xml >= 0.41.0, which the consumers cannot
adopt without upstream/major bumps. Ambient advisory (main fails identically);
ignored to unblock CI, matching the repo's existing advisory-ignore pattern.
Revisit and remove once the consumers move to quick-xml >= 0.41.
Answers the PR review question "how does this work in a clustered environment?".

- Console responses now carry explicit scope. /traffic and /uptime are
  scope="single-instance"; /resource-counts, /activity, /resource-distribution
  and the resources column of /tenants carry a backend-derived scope
  ("cluster" on Postgres/Mongo/S3 and composites whose primary is one of those,
  "single-instance" on SQLite) via a new ResourceStorage::is_cluster_shared();
  /tenants splits resources_scope (cluster) from traffic_scope (single-instance).
- /traffic and /tenants include the answering instance's hostname for per-node
  attribution; the public /uptime deliberately omits it (unauthenticated -
  a hostname is infrastructure-identifying, consistent with not exposing the
  storage backend name there).
- Configure Prometheus latency histogram buckets so http_request_duration_seconds
  is emitted as a real histogram, enabling a correct fleet-wide p95 via
  histogram_quantile over summed buckets (previously a summary, which cannot be
  aggregated across instances).
- docs/console-metrics.md: new "Clustering / multi-instance" section - shared-DB
  endpoints are cluster-consistent, in-process traffic/uptime are single-instance
  (aggregate fleet-wide via Prometheus/Grafana), auth/tenancy is stateless.

Reviewed by a 3-perspective design council and a 5-lens security council; no
medium-or-higher findings.
@mauripunzueta

mauripunzueta commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

Good question @smunini. The resource numbers (resource-counts, activity, resource-distribution, and the per-tenant counts) read from the shared database, so every instance returns the same fleet-wide numbers. The live traffic/latency and uptime come from in-process state, so those are per-instance — I've labeled them explicitly in the responses (a scope field, and an instance on the authenticated ones) so they can't be mistaken for fleet-wide numbers. For cluster-wide traffic/latency you aggregate the per-instance /metrics in Prometheus/Grafana (I configured histogram buckets so p95 aggregates correctly across instances). Auth and tenant resolution are stateless, so no shared session state — any instance can serve any request.

I added a "Clustering / multi-instance" section to docs/console-metrics.md with the details. TL;DR: DB-backed endpoints are cluster-wide; the in-process traffic/uptime views are per-instance (use Prometheus for the fleet view).

…/auth branches

Test-only; no production code changed. Raises patch coverage on the console /
observability work.

- crates/rest/tests/console_metrics.rs (new): integration tests hitting all six
  /console/metrics/* endpoints on an in-memory SQLite backend (auth-disabled
  path, as build_app merges them when auth is off), asserting response shape and
  the scope/instance markers, and that the public /uptime exposes no
  infrastructure identity (whole-key allowlist).
- sqlite storage: unit tests for count_by_day / count_by_types /
  activity_histogram / count_all_types / count_by_tenant and is_cluster_shared.
- auth middleware: extract_operation / extract_operation_for_routing branch
  tests, including the leading-underscore tenant-prefix bypass regression guard.
- tenant extractor: from_request_parts tests for the JWT-claim-authoritative,
  claimless-token-403, empty-claim, and default-fallback branches (including
  non-strict "spoofed header is ignored when a claim is present").
…rop unused test import

Two fixes surfaced by the new coverage tests in the parent commit:

- extractors/tenant.rs: an empty tenant claim (Some("")) with no header/URL
  tenant requested now resolves to the configured default tenant instead of a
  403. The resolver's always-present JwtTenantExtractor re-surfaced the empty
  claim as the "resolved" tenant, which tripped the missing-claim 403 guard; the
  guard now ignores an empty resolved tenant. A genuine non-empty, non-default
  header/URL tenant on a claimless/empty-claim token still returns 403, so tenant
  isolation is unchanged.
- tests/console_metrics.rs: remove an unused `ResourceStorage` import that failed
  clippy under -D warnings (the console test seeds via HTTP, not backend.create).

Reviewed by a 2-lens council (tenant isolation + correctness): isolation
preserved, all from_request_parts tests pass.
…es and Mongo

Test-only. Adds testcontainer integration tests exercising count_by_types,
count_all_types, count_by_day, activity_histogram, count_by_tenant and
is_cluster_shared on the PostgreSQL and MongoDB backends (both were 0% on the
new console methods), mirroring each file's existing harness (shared PG
container / per-test fresh Mongo database, uuid-suffixed tenants for the
cross-tenant aggregate). No production code changed.

Reviewed for compile-soundness (no duplicate fn names, no unused imports under
-D warnings, exact API/signature fidelity) and shared-container flakiness.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants