docs: add gateway list api rfc #225
Draft: albertywu wants to merge 1 commit into main from wua/list-api-rfc
+227
−0
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,227 @@ | ||
| # Gateway List API | ||
|
|
||
| Design notes for a gateway `List` API that powers a queue-scoped UX for | ||
| observing SubmitQueue requests over a time window. | ||
|
|
||
| This document captures **design decisions and rationale only**. | ||
|
|
||
| ## Problem | ||
|
|
||
| Users need to inspect what happened in a queue during a time window: which | ||
| requests were still running, which reached a terminal state, and what useful | ||
| state each request is currently in. The existing `Status` API answers that | ||
| question for one `sqid`; the UX needs the same gateway-owned view, but across a | ||
| queue and time range. | ||
|
|
||
| The gateway owns the request log. The orchestrator may emit request-log events, | ||
| but it does not persist or read the log. `List` should preserve that ownership | ||
| boundary and must not read orchestrator-owned working tables. | ||
|
|
||
| ## API Shape | ||
|
|
||
| `List` is a read-only gateway RPC, named tersely to match `Land`, `Cancel`, and | ||
| `Status`. | ||
|
|
||
| At a high level: | ||
|
|
||
| - **Input** — queue name, time window, optional status filters, and pagination | ||
| cursor. | ||
| - **Output** — a page of request summaries and the next cursor. | ||
|
|
||
| Each request summary should include: | ||
|
|
||
| - `sqid` | ||
| - queue | ||
| - current customer-facing status | ||
| - change URIs submitted with the request | ||
| - last error, if any | ||
| - display/debug metadata | ||
| - time the request entered SubmitQueue | ||
| - time the visible state last changed | ||
| - time the request completed, if terminal | ||
| - whether the request is terminal | ||
|
|
||
| The summary intentionally exposes gateway/user-facing lifecycle information, not | ||
| orchestrator implementation details such as batch IDs, internal request states, | ||
| or speculation-tree structure. | ||
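The input and output described above could be sketched as plain data types. This is only an illustration: the type names, field names, the default page size of 50, and the choice to derive `terminal` from the completion time are all assumptions, not part of the RFC.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ListRequest:
    """Hypothetical input shape for List."""
    queue: str
    window_start: datetime           # T1, inclusive
    window_end: datetime             # T2, exclusive
    statuses: tuple[str, ...] = ()   # empty means "no status filter"
    cursor: Optional[str] = None     # opaque pagination cursor
    page_size: int = 50              # assumed default; the RFC only says "modest"


@dataclass(frozen=True)
class RequestSummary:
    """Hypothetical summary row; gateway/user-facing fields only."""
    sqid: str
    queue: str
    status: str                      # current customer-facing status
    change_uris: tuple[str, ...]
    last_error: Optional[str]
    metadata: dict[str, str]         # display/debug metadata
    started_at: datetime             # time the request entered SubmitQueue
    updated_at: datetime             # time the visible state last changed
    completed_at: Optional[datetime] # set only when terminal

    @property
    def terminal(self) -> bool:
        # Assumption: a request is terminal exactly when it has a completion time.
        return self.completed_at is not None


@dataclass(frozen=True)
class ListResponse:
    """Hypothetical output shape: one page plus the next cursor."""
    summaries: tuple[RequestSummary, ...]
    next_cursor: Optional[str]
```

Note that nothing in `RequestSummary` refers to batch IDs, internal request states, or speculation-tree structure.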
|
|
||
| ## Time Window Semantics | ||
|
|
||
| The time window is a lifecycle-overlap filter, not a "started during this | ||
| window" filter. | ||
|
|
||
| A request belongs in `[T1, T2)` when it was active at any point in that interval: | ||
|
|
||
| - it started before `T2`, and | ||
| - it either has not completed, or completed at or after `T1`. | ||
|
|
||
| This is the behavior the UX needs for questions like "what was running between | ||
| 10:00 and 10:30?" A request that began at 09:55 and completed at 10:05 should | ||
| appear. A request that began at 10:20 and is still running should appear. A | ||
| request that completed at 09:59 should not. | ||
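The overlap rule can be written as a single predicate. The sketch below is a minimal illustration; the function and parameter names are hypothetical, and `completed_at=None` is assumed to mean "has not completed".

```python
from datetime import datetime
from typing import Optional


def overlaps_window(
    started_at: datetime,
    completed_at: Optional[datetime],
    t1: datetime,
    t2: datetime,
) -> bool:
    """True when the request was active at some point in [t1, t2)."""
    started_before_end = started_at < t2
    still_active_at_start = completed_at is None or completed_at >= t1
    return started_before_end and still_active_at_start
```

Applied to the three examples above with a 10:00 to 10:30 window, the predicate accepts the 09:55 to 10:05 request and the still-running 10:20 request, and rejects the request that completed at 09:59.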
|
|
||
| `List` returns the request's **current** reconciled status at read time for rows | ||
| that match the window. It is not a historical "status as of T2" API. A historical | ||
| snapshot API would be a different product shape and should be designed | ||
| separately if needed. | ||
|
|
||
| ## Status Filtering | ||
|
|
||
| `List` should support filtering by the same customer-facing status strings that | ||
| `Status` returns: examples include `accepted`, `validating`, `building`, | ||
| `landing`, `landed`, `error`, `cancelling`, and `cancelled`. | ||
|
|
||
| This keeps the API stable at the same abstraction level as `Status`. Clients do | ||
| not need to learn an internal enum or translate orchestrator state-machine | ||
| values into display states. | ||
|
|
||
| The status filter applies to the request's **current** reconciled status after | ||
| the queue/time-window match has been computed. It does not mean "requests that | ||
| ever had this status during the window." That historical event query belongs | ||
| with a timeline/debug API, not the queue summary list. | ||
|
|
||
| The filter should accept multiple statuses so the UX can ask for groups such as | ||
| "currently active" or "terminal outcomes" without making separate RPC calls. The | ||
| server should validate status strings against the public status vocabulary it | ||
| can emit; unknown statuses are caller errors rather than silent misses. | ||
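A minimal sketch of that validation follows. The vocabulary below contains only the statuses named as examples in this section, so the real set may be larger; the exception and function names are hypothetical.

```python
from typing import Iterable

# Assumption: only the example statuses listed in this section.
PUBLIC_STATUSES = frozenset({
    "accepted", "validating", "building", "landing",
    "landed", "error", "cancelling", "cancelled",
})


class InvalidStatusFilter(ValueError):
    """Raised when the caller asks for a status the gateway never emits."""


def validate_status_filter(statuses: Iterable[str]) -> frozenset[str]:
    """Return the de-duplicated filter, or raise on any unknown status."""
    requested = frozenset(statuses)
    unknown = requested - PUBLIC_STATUSES
    if unknown:
        raise InvalidStatusFilter(f"unknown statuses: {sorted(unknown)}")
    return requested
```

An empty result can then be treated as "no filter", and a multi-status filter such as `["landed", "error", "cancelled"]` is served in one call.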
|
|
||
| ## Read Model | ||
|
|
||
| Serving `List` directly from the append-only request log would force the gateway | ||
| to scan and reconcile many log rows per request. That is the wrong shape for a | ||
| queue dashboard. | ||
|
|
||
| The gateway should maintain a request-summary read model derived from the | ||
| request log. Every request-log write updates two gateway-owned views: | ||
|
|
||
| - the immutable request log, used for audit/debug history and point | ||
| reconciliation; | ||
| - the mutable request summary, used for bounded queue/time-window listing. | ||
|
|
||
| The summary row is a materialized current view of the same state that `Status` | ||
| would report. `Status` may continue reading and reconciling from the log during | ||
| rollout; the important invariant is that both views use the same reconciliation | ||
| rules. | ||
|
|
||
| This is deliberately a query store, unlike the mostly key-oriented stores used | ||
| by the pipeline. Its boundary should be page-in/page-out: queue, time window, | ||
| statuses, cursor, and limit in; rows plus next cursor out. The backend owns the | ||
| indexing strategy for lifecycle overlap. For SQL, avoid an unindexed open-ended | ||
| OR by representing "still running" with an index-friendly sentinel completion | ||
| time or by splitting active and completed scans. | ||
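To illustrate the sentinel option, here is a small sketch using SQLite from the Python standard library. The table name, column names, index, and sentinel value are all hypothetical; timestamps are assumed to be integer epoch milliseconds. Because a running request stores the sentinel instead of NULL, the overlap filter needs no `OR completed_at IS NULL` branch.

```python
import sqlite3

# Hypothetical "still running" completion time, far beyond any real timestamp.
ACTIVE_SENTINEL = 2**62


def open_store() -> sqlite3.Connection:
    """Create an in-memory summary table with an overlap-friendly index."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE request_summary (
               sqid         TEXT PRIMARY KEY,
               queue        TEXT NOT NULL,
               status       TEXT NOT NULL,
               started_at   INTEGER NOT NULL,
               completed_at INTEGER NOT NULL  -- ACTIVE_SENTINEL while running
           )"""
    )
    db.execute(
        "CREATE INDEX summary_overlap "
        "ON request_summary (queue, completed_at, started_at)"
    )
    return db


def list_overlapping(db: sqlite3.Connection, queue: str, t1: int, t2: int, limit: int) -> list[str]:
    """Return sqids active at some point in [t1, t2), oldest start first."""
    rows = db.execute(
        "SELECT sqid FROM request_summary "
        "WHERE queue = ? AND completed_at >= ? AND started_at < ? "
        "ORDER BY started_at, sqid LIMIT ?",
        (queue, t1, t2, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

The alternative named above, splitting active and completed scans, would instead run two range queries and merge the results; which one performs better depends on the backend and is left open here.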
|
|
||
| Every request-log persistence path must update this read model through the same | ||
| helper: direct gateway writes such as `Land` and `Cancel`, plus the gateway log | ||
| sink that persists orchestrator-emitted events. The invariant is | ||
| `RequestLogStore.Insert` paired with a guarded summary upsert, not best-effort | ||
| ad hoc updates at each call site. | ||
|
|
||
| Request-log events should carry `queue` as first-class data. The log sink only | ||
| receives the log event, so relying on `sqid` parsing would make the read model | ||
| depend on an ID-format convention. Legacy backfills may parse queue from `sqid` | ||
| as a fallback, but new events should be queue-attributable at the source. | ||
|
|
||
| ## Change URIs | ||
|
|
||
| Request summaries should include the change URIs submitted with the request. The | ||
| UX needs them to make each row recognizable and actionable without an additional | ||
| lookup. | ||
|
|
||
| To support this cleanly, the gateway must capture change URIs at request | ||
| acceptance time. `Land` already receives the change set before handing work to | ||
| the orchestrator, so it is the right boundary to persist that display data into | ||
| the gateway-owned request log and summary read model. | ||
|
|
||
| This should not be implemented by joining from `List` into orchestrator-owned | ||
| request tables. That would break the service ownership model and couple a UX | ||
| read path to pipeline internals. | ||
|
|
||
| For existing requests, change URIs are available only if they can be recovered | ||
| from gateway-owned data. If old request-log entries do not contain them, the | ||
| backfill can still build summaries, but those older rows will have empty change | ||
| URIs unless a separate one-time migration from an authoritative source is | ||
| accepted explicitly. | ||
|
|
||
| ## Reconciliation | ||
|
|
||
| Request-log timestamps are useful for display and broad ordering, but they are | ||
| not always the strongest signal for "current state." Some log entries reflect | ||
| informational progress, while others reflect versioned request-state changes. | ||
|
|
||
| `Status` reconciles by reading all request-log rows at once. The summary must do | ||
| the equivalent incrementally, one incoming event at a time. Each update is a | ||
| guarded merge between the stored winner and the incoming log record, never a | ||
| blind last-write-wins overwrite. | ||
|
|
||
| The summary should persist enough comparison state to make that decision: | ||
| winning status, winning request version, winning timestamp, and whether the | ||
| winner is a versioned terminal state. The incoming event replaces the stored | ||
| winner only when it would have won in the full-log reconciliation: | ||
|
|
||
| - terminal request-state records with a request version are authoritative; | ||
| - among versioned terminal records, the highest request version wins, with | ||
| timestamp as a tie-breaker; | ||
| - if no terminal versioned winner exists yet, the newest log timestamp wins. | ||
|
|
||
| When the winning state is terminal, the summary records a completion time. When | ||
| the winning state is non-terminal, completion time is empty and the request is | ||
| considered active for future time-window overlap. | ||
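The guarded merge can be sketched as a pure function over the stored winner and one incoming record. Everything here is illustrative: the `Record` type, the set of terminal statuses, integer timestamps, and the choice to keep the stored winner on an exact tie are assumptions rather than decisions made in this RFC.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: which public statuses count as terminal.
TERMINAL_STATUSES = frozenset({"landed", "error", "cancelled"})


@dataclass(frozen=True)
class Record:
    status: str
    timestamp: int
    version: Optional[int] = None  # set only on versioned request-state records

    @property
    def versioned_terminal(self) -> bool:
        return self.version is not None and self.status in TERMINAL_STATUSES


def merge(stored: Optional[Record], incoming: Record) -> Record:
    """Return the record that full-log reconciliation would have picked."""
    if stored is None:
        return incoming
    if stored.versioned_terminal and incoming.versioned_terminal:
        # Highest request version wins; timestamp breaks ties.
        newer = (incoming.version, incoming.timestamp) > (stored.version, stored.timestamp)
        return incoming if newer else stored
    if stored.versioned_terminal:
        return stored    # authoritative winner is never displaced by other records
    if incoming.versioned_terminal:
        return incoming  # first authoritative record takes over
    # No versioned terminal winner yet: newest log timestamp wins.
    return incoming if incoming.timestamp > stored.timestamp else stored


def completion_time(winner: Record) -> Optional[int]:
    """Completion time is set only when the winning state is terminal."""
    return winner.timestamp if winner.status in TERMINAL_STATUSES else None
```

A useful property to test for in a real implementation is order independence: folding the same set of events through `merge` in any arrival order should produce the same winner, since the log sink may deliver events out of order.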
|
|
||
| ## Pagination | ||
|
|
||
| `List` should be cursor-paginated. Offset pagination is the wrong fit because the | ||
| underlying set changes while users page through it. | ||
|
|
||
| The cursor should be opaque to clients and tied to the original query shape: | ||
| queue, time window, status filter, and the last row seen. Reusing a cursor with a | ||
| different queue, time window, or status filter should be rejected. | ||
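One way to tie a cursor to its query shape is to embed a fingerprint of the query next to the position of the last row seen. The sketch below assumes integer timestamps and a `(started_at, sqid)` sort key; the encoding and all names are hypothetical. It is not tamper-proof: a production cursor would likely also be signed or encrypted.

```python
import base64
import hashlib
import json
from typing import Iterable


def _query_fingerprint(queue: str, t1: int, t2: int, statuses: Iterable[str]) -> str:
    """Hash the query shape so a cursor can be matched against it later."""
    shape = json.dumps([queue, t1, t2, sorted(statuses)], separators=(",", ":"))
    return hashlib.sha256(shape.encode()).hexdigest()


def encode_cursor(queue: str, t1: int, t2: int, statuses: Iterable[str],
                  last_started_at: int, last_sqid: str) -> str:
    """Build an opaque cursor from the query shape and the last row seen."""
    payload = {
        "q": _query_fingerprint(queue, t1, t2, statuses),
        "after": [last_started_at, last_sqid],
    }
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()


def decode_cursor(cursor: str, queue: str, t1: int, t2: int,
                  statuses: Iterable[str]) -> tuple[int, str]:
    """Return the resume position, or raise ValueError on mismatch."""
    try:
        payload = json.loads(base64.urlsafe_b64decode(cursor.encode()))
        fingerprint, after = payload["q"], payload["after"]
        position = (after[0], after[1])
    except (ValueError, KeyError, TypeError, IndexError) as exc:
        raise ValueError("malformed cursor") from exc
    if fingerprint != _query_fingerprint(queue, t1, t2, statuses):
        raise ValueError("cursor does not match this query")
    return position
```

The next page can then be read as "rows whose `(started_at, sqid)` sorts after the returned position", which stays stable even if rows are inserted or updated between pages.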
|
|
||
| Default page size should be modest. The API should cap page size so a single UX | ||
| request cannot force an unbounded queue scan. | ||
|
|
||
| ## Retention | ||
|
|
||
| The first retention target is 30 days after completion. Non-terminal requests | ||
| must never be purged by age alone; a request that started 40 days ago and is | ||
| still running must appear in a current overlap query. | ||
|
|
||
| Terminal summaries and detailed logs can expire 30 days after completion. | ||
| Detailed logs may have a separate policy later only if the UX no longer needs | ||
| timeline/debug information for the same period. | ||
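The retention rule reduces to a check on completion time only, never on start time. A minimal sketch, with hypothetical names and `completed_at=None` assumed to mean non-terminal:

```python
from datetime import datetime, timedelta
from typing import Optional

RETENTION = timedelta(days=30)  # first retention target from this section


def is_purgeable(completed_at: Optional[datetime], now: datetime) -> bool:
    """A summary may be purged only once it has been terminal for 30 days."""
    if completed_at is None:
        return False  # non-terminal requests are never purged by age alone
    return now - completed_at >= RETENTION
```

Because the function never looks at the start time, a request that started 40 days ago and is still running is kept.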
|
|
||
| ## Flow | ||
|
|
||
| ``` | ||
| ┌────────────────────────────────────────────┐ | ||
| │ gateway:Land / gateway:Cancel / log sink │ | ||
| │ persist request-log event │ | ||
| │ update request summary │ | ||
| └──────────────────────────┬─────────────────┘ | ||
| │ | ||
| ▼ | ||
| ┌────────────────────────────────────────────┐ | ||
| │ gateway:List │ | ||
| │ validate queue + time window + statuses │ | ||
| │ read summaries by lifecycle/status match │ | ||
| │ return page of current request summaries │ | ||
| └────────────────────────────────────────────┘ | ||
| ``` | ||
|
|
||
| ## Why Not Reuse `Status` | ||
|
|
||
| `Status` is a point lookup: one `sqid`, one current answer. Keeping it narrow | ||
| makes it cheap and predictable for polling and integrations. | ||
|
|
||
| `List` is a collection query: one queue, one time window, many request summaries. | ||
| It needs pagination, time filtering, optional status filtering, and a read model | ||
| shaped for queue UX. Those semantics do not belong in `Status`. | ||
|
|
||
| ## Why Not Return Timelines | ||
|
|
||
| Timelines are useful for debugging, but they are not part of the first `List` | ||
| shape. Returning per-request histories in every list row would make page cost | ||
| scale with both the number of requests and the number of events per request. | ||
|
|
||
| The first API should return summaries only. If the UX later needs row expansion, | ||
| add a dedicated timeline/debug API that reads the append-only request log for one | ||
| `sqid`. | ||
Do we need to add sorting as well? Generally FIFO/time-based ordering would be fine, but we often need to know, or say, what is at the head of the queue, and a time-based listing may not reflect that clearly.