Skip to content

feat: universe platform QOL#2

Draft
raisedadead wants to merge 5 commits into
mainfrom
feat/readyz
Draft

feat: universe platform QOL#2
raisedadead wants to merge 5 commits into
mainfrom
feat/readyz

Conversation

@raisedadead
Copy link
Copy Markdown
Member

No description provided.

`/healthz` stays an unconditional liveness probe so k8s can tell
the process is alive. `/readyz` is the new readiness probe: it
runs `Health.Ping(ctx)` against the Valkey backend (3s timeout)
then `R2.HasPrefix(ctx, _artemis_readyz_probe_no_match)` against
the bucket (another 3s) and returns either `200 {"ready":true}` or
`503` with a `valkey_unreachable` / `r2_unreachable` code tag in
the body. Valkey probe runs first so an outage there short-circuits
before paying the R2 round-trip — verified by a regression test
that asserts the R2 mock was never called when the Health probe
errors.

A new `RegistryHealth` interface scopes the contract to a single
`Ping(ctx) error` method. `*valkey.Store` already satisfies it so
the only wiring change in cmd/artemis is one extra struct field.
Five handler tests cover the matrix (both up, Valkey down, R2
down, ordering).

Chart-side `readinessProbe` wiring lands in the infra repo in the
next commit; CHANGELOG records the feature under Unreleased.
New /metrics endpoint exposes four prometheus counters:

  artemis_registry_refresh_failures_total
      run() refresh path in registry/valkey/Reader; an optional
      OnRefreshError hook decouples the registry package from
      metrics — cmd/artemis sets it to inc the counter.
  artemis_alias_drift_total
      SitePromote + SiteRollback CAS body-pin mismatches (409).
  artemis_promote_legacy_bare_total
      Empty-body POST /api/site/{site}/promote — surfaces clients
      that have not adopted CAS. Confirms when the v0.3.0 flip to
      require body-pin is safe.
  artemis_upstream_error_total{op}
      Every writeUpstreamError invocation, labelled by op tag
      (r2.put.alias, valkey.register, …). Cross-dependency
      reliability surface.

`handler.SetMetrics` installs a package-level handle so
writeUpstreamError doesn't need a per-call ref; SitePromote /
SiteRollback use h.Metrics directly via Handlers struct.

AccessLog gains a `/healthz` + `/readyz` + `/metrics` skip-map so
k8s probes (and prometheus scrapes) stop flooding the structured
log. server.New() now takes a prometheus.Gatherer; pass nil to
disable the endpoint in tests.

Verified: `go test ./...` = 258/258 green across 10 packages
(3 new tests cover counter registration, writeUpstreamError
increment, AccessLog skip set). `go vet` clean. `go build` clean.
Two defensive fixes surfaced by the pre-push reviewer:

  - `handler.SetMetrics` previously wrote `pkgMetrics` without any
    synchronization. The contract is "call once at startup before
    serving traffic," but nothing enforced it; a second call from a
    rebuild path or a misconfigured test could race against the
    request-side read inside `writeUpstreamError`. Wrap the
    initialization in `sync.Once`. First caller wins; subsequent
    calls are no-ops. sync.Once's happens-before guarantee covers the
    read side without an explicit mutex.

  - `Reader.invokeOnRefreshError` wraps the consumer-supplied
    callback in a panic-recovering shim. A panicking callback in the
    refresh path would have killed the `run()` goroutine and frozen
    the registry snapshot indefinitely; recover() keeps the
    stale-read mode intact and emits a structured log entry naming
    the original refresh error so the operator notices the broken
    callback.

  - Fixed the misleading comment block on `Handlers.Metrics` —
    AccessLog does NOT reach for `pkgMetrics`, only
    `writeUpstreamError` does.

Verified: `go test ./...` 9/9 packages pass; `go vet ./...` clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant