Skip to content

session-server: keep restart auto-create, but make delete stick#35

Open
DavidBellamy wants to merge 1 commit into
prodfrom
session-delete-tombstone
Open

session-server: keep restart auto-create, but make delete stick#35
DavidBellamy wants to merge 1 commit into
prodfrom
session-delete-tombstone

Conversation

@DavidBellamy

@DavidBellamy DavidBellamy commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

Why this PR exists

prod already does half of radixark#888: it auto-creates a session on first access. That half gives router-restart tolerance, but on its own it has a bug: a deleted session comes back to life. A GET or chat for a deleted id silently makes a fresh one, so DELETE no longer sticks. The auto-create half also left a session test asserting the old "unknown -> 404" contract, which now contradicts the route.

This PR adds the other half: tombstone explicitly-deleted ids so delete stays deleted, while still auto-creating ids the server has never seen.

Already on prod (the radixark#888 subset — not re-added here)

Pinned to LLM360/miles@prod at e69b51a:

Not on prod (what this PR adds)

What this PR changes

  • remove_session records the id in a bounded set (the most recent 10k deletes).
  • get_or_create_session and the GET /sessions/{id} route return 404 for a tombstoned (deleted) id, instead of recreating it or returning empty.
  • An id the server has never seen still auto-creates / returns empty, so restart tolerance is preserved.

Tests (reconciled)

  • test_get_session_not_found -> test_get_unknown_session_returns_empty: asserts the intended empty-200 for a never-seen id, fixing the prod contradiction.
  • Adds test_get_deleted_session_returns_404 for the tombstone behavior.
  • The 4 race tests (test_delete_can_proceed_while_chat_is_mid_proxy, test_chat_during_delete_returns_404, test_chat_after_delete_returns_404, test_rapid_create_chat_delete_cycles) pass unchanged.
  • Verified in the miles container: full session suite (test_sessions.py, test_session_pretokenized_e2e.py, test_session_race_conditions.py) is green (56 passed), with the race file green on a repeat run.

@DavidBellamy DavidBellamy requested a review from a team June 6, 2026 05:11
@DavidBellamy DavidBellamy force-pushed the session-delete-tombstone branch from d8ee8e4 to f2ebcd4 Compare June 6, 2026 05:20
@flukeskywalker

Copy link
Copy Markdown

Looks good! A couple of questions:

  1. What are the situations in our setup when a chat or GET request for a deleted id might arrive? What happens currently in those situations after the session gets resurrected?
  2. Depending on 1, under for which rollout batch sizes would 10k be too low as the number of recently deleted IDs to track?

@flukeskywalker

Copy link
Copy Markdown

@nightlessbaron I think we should merge this, but want to make sure you saw this on the off-chance that this has implications for the request abort issues you were noticing.

@DavidBellamy

Copy link
Copy Markdown
Collaborator Author

Looks good! A couple of questions:

  1. What are the situations in our setup when a chat or GET request for a deleted id might arrive? What happens currently in those situations after the session gets resurrected?
  2. Depending on 1, under for which rollout batch sizes would 10k be too low as the number of recently deleted IDs to track?
  1. It's never on purpose. Each conversation gets a fresh, unique ID, so a request for a deleted one is always a leftover retry. The most common cause: the agent retries a message after the conversation was already wrapped up and deleted. Before this fix, that retry quietly created a ghost empty conversation that just sat around wasting space for 2 hours. This fix makes it cleanly fail instead.
  2. The list only needs to remember an ID long enough for stray retries to stop coming (a couple minutes). 10k is plenty unless you're running so many conversations at once that you delete more than 10,000 within that short window, roughly when batch size times samples-per-prompt gets above ~10k. Will we hit this?

PR radixark#888 added auto-create-on-missing for router-restart tolerance, but it
also resurrected explicitly deleted sessions: a chat or GET for a deleted id
silently created a fresh one, so "delete" no longer held. That left four
session tests asserting the old "deleted -> 404" contract failing.

Tombstone explicitly deleted ids (bounded to the most recent 10k): a deleted
id now returns 404 on both the chat and GET paths, while an id the server has
never seen still auto-creates, so restart tolerance is preserved.

Update test_get_session_not_found -> test_get_unknown_session_returns_empty
(asserts the intended empty-200 for a never-seen id) and add
test_get_deleted_session_returns_404 for the tombstone behavior.
@flukeskywalker flukeskywalker force-pushed the session-delete-tombstone branch from f2ebcd4 to 7e548d8 Compare June 12, 2026 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants