Isolate per-connection failures in graph reconciler #422

Merged
ArnobKumarSaha merged 1 commit into master from fix-graph-fault-isolation
Jun 17, 2026
@ArnobKumarSaha

Problem

The graph adjacency cache (objGraph) is built per object by the reconciler in pkg/graph/reconciler.go. ListConnectedObjectIDs evaluated every connection in rd.Spec.Connections, and on the first error other than NotFound or NoMatch it returned nil, err, aborting the whole object. The reconciler then returned without calling objGraph.Update, so the object got zero edges.

Observed in the wild: a PostgreSQL 11 instance whose ui.kubedb.com query view runs a pg_stat_statements query using total_exec_time (a column that only exists in PG ≥ 13). The query fails with pq: column "total_exec_time" does not exist, which poisoned the entire graph build — offshoot, auth_secret, catalog, and the working views were all dropped, even though they resolve purely from the K8s API and have nothing to do with the SQL view.

Fix

  • ListConnectedObjectIDs: a failing connection is now skipped (error wrapped with src -> target context and accumulated), the remaining connections still resolve, and the function returns (edges, utilerrors.NewAggregate(errs)).
  • Reconcile:
    • Discovery errors keep their existing semantics — don't overwrite the graph, requeue fast (500ms). Detected via the new anyDiscoveryError helper (the k8s aggregate type implements Is but not As, so IsDiscoveryError can't see through it).
    • For any other failure, the partial graph is persisted via objGraph.Update first, then the error is returned for exponential-backoff retry.

Net: one bad connection no longer wipes an object's whole graph.

Out of scope

The kubedb-side query bug (total_exec_time vs total_time for PG < 13) lives in a different repo and is unchanged.

Test

  • go build ./pkg/graph/..., go vet, gofmt clean
  • go test ./pkg/graph/... passes

A single connection's runtime error (e.g. a live query against a
workload) aborted the whole graph build for an object, so unrelated
edges (offshoot, auth_secret, catalog, ...) were dropped and never
persisted.

ListConnectedObjectIDs now skips a failing connection and returns the
edges that did resolve along with an aggregated error. The reconciler
persists the partial result, retries discovery errors fast, and returns
other errors for exponential-backoff retry.

Signed-off-by: Arnob kumar saha <arnob@appscode.com>
@ArnobKumarSaha ArnobKumarSaha requested a review from tamalsaha June 17, 2026 09:23
@ArnobKumarSaha ArnobKumarSaha merged commit 2b2e235 into master Jun 17, 2026
4 checks passed
@ArnobKumarSaha ArnobKumarSaha deleted the fix-graph-fault-isolation branch June 17, 2026 09:24