Skip to content

feat(ipc): unify connection handles into poll selector ids (server push)#244

Merged
singaraiona merged 2 commits into
masterfrom
feat/unified-ipc-handles
Jun 12, 2026
Merged

feat(ipc): unify connection handles into poll selector ids (server push)#244
singaraiona merged 2 commits into
masterfrom
feat/unified-ipc-handles

Conversation

@singaraiona

Copy link
Copy Markdown
Collaborator

Problem

A user hit this on 2.1.0:

‣ (set .ipc.on.open (fn [h] (.ipc.post h "(+ 1 1")))
lambda
‣ ipc: .ipc.on.open hook raised an error (handle=2)

Posting on the handle a hook receives always failed — inside the hook and later with a stored handle. Root cause: two disjoint handle namespaces. .ipc.open returned an index into a private client fd table (g_client_fds[]), while hooks received the inbound connection's poll selector id. .ipc.send/.ipc.post/.ipc.close only consulted the client table, so server-side handles never resolved — and since both namespaces are small integers, a server with its own outbound connections could even deliver to the wrong peer silently.

Change

A handle is now one thing: the poll selector id of an IPC connection, inbound or outbound.

  • ray_ipc_connect does the blocking handshake/auth as before, then hands the socket to the active poll with the same read pump inbound connections use, returning the selector id. The client fd table is deleted.
  • .ipc.send / .ipc.post / .ipc.close resolve any handle in the active poll (thread-local poll of the dispatching hook/eval, else the runtime's main poll). Hooks can push through — or close — the handles they receive, kdb-style (neg[h] analog). Resolution validates the target is a handshake-complete IPC connection, so a stray integer can't write to the listener, stdin, or a mid-handshake socket.
  • .ipc.send now pumps the connection while waiting for its response: interleaved pushed asyncs are dispatched, and sync requests from the peer are answered — full duplex. This also fixes a latent corruption bug: the old response reader never checked hdr.msgtype, so a pushed frame would have been consumed as the next sync response.
  • ipc_read_payload detaches the payload buffer and advances the state machine before evaluating, making nested round-trips on the same connection re-entrancy-safe; RESP frames are deposited for the waiting sender instead of being evaluated.
  • Lifecycle hooks remain inbound-only, preserving the documented on.open/on.close 1:1 pairing.

Behaviour notes

  • Wire format unchanged; old and new peers interoperate.
  • .ipc.open handle values shift (selector ids instead of table indices) — they were always opaque.
  • Idle interactive/server processes dispatch pushes via the poll loop; script-mode clients drain them during their next .ipc.send wait.
  • Selector ids are reused after close (like kdb fds): don't cache handles past on.close. A generation tag could harden this later.
  • A corrupt compressed request now produces a null response instead of leaving the sync caller hanging.

Tests

  • New ipc/server_push test: push from inside .ipc.on.open and push later through a stored handle, delivered across a real round-trip.
  • Test fixtures (test_ipc.c, test_store.c, test_repl.c, .rfl runner) now publish a runtime poll, mirroring main.c.
  • syscmd_coverage.rfl updated: its .sys.listen → nyi assertions relied on the test runner having no poll; valid ports now genuinely bind, so it asserts the success and double-bind io branches instead.
  • Full suite under ASan/UBSan: 3375 of 3377 passed, 2 skipped, 0 failed.
  • Verified end-to-end with two processes: hook-installed push and out-of-hook push both delivered and evaluated client-side.

Fixes the reported async-IPC-from-server bug.

A handle was two different things depending on which side of the wire
allocated it: .ipc.open returned an index into a private client fd
table, while server-side hooks received the inbound connection's poll
selector id. The two namespaces looked identical (small integers) but
only the client table was writable, so (.ipc.post h ...) inside
.ipc.on.open failed with an io error — and worse, could silently hit
the WRONG outbound peer when the numbers aliased.

Now a handle is the poll selector id of a connection, both directions:

- ray_ipc_connect registers the outbound socket in the active poll
  (thread-local poll of the dispatching hook/eval, else the runtime's
  main poll) with the same read pump inbound conns use, and returns
  the selector id. The g_client_fds table is gone.
- .ipc.send/.ipc.post/.ipc.close resolve any handle the same way, so
  server code can push through (or close) the very handles its hooks
  receive — kdb-style. Resolution validates the selector is a
  handshake-complete IPC conn; stray integers can't write to the
  listener, stdin, or a mid-handshake socket.
- .ipc.send pumps the connection's rx machinery while waiting for its
  RESP, dispatching interleaved frames (pushed asyncs are evaluated,
  peer sync requests answered). This also fixes the latent bug where
  the response reader never checked hdr.msgtype and would have
  consumed a pushed frame as the next sync response.
- ipc_read_payload detaches the payload buffer and advances the state
  machine BEFORE evaluating, making nested round-trips on the same
  connection safe; RESP frames are deposited for the waiting sender.
- Lifecycle hooks stay inbound-only: outbound conns never fire
  on.open/on.close, preserving the documented 1:1 pairing.

Test fixtures (test_ipc/test_store/test_repl/.rfl runner) now publish
a runtime poll, mirroring main.c. syscmd_coverage.rfl's listen
assertions relied on the runner having no poll ("nyi" branch); valid
ports now genuinely bind, so it asserts the success and double-bind io
branches instead. New ipc/server_push test covers push-from-hook and
push-outside-hook end to end.

Known limits (unchanged in spirit): selector ids are reused after
close, so handles must not be cached past on.close; sends assume the
single poll thread.
…ls onto it

The agreed state model is two exact points: ray_runtime_t for the
process, ray_vm_t per thread. The code had drifted from it in two ways:

1. __VM was not per-thread. The bytecode executor allocated a fresh
   zeroed VM per compiled-lambda invocation, published it as __VM, and
   set __VM = NULL on exit — never restoring the runtime's VM. Anything
   stored on the VM neither survived into a nested compiled call nor
   existed between calls, which is exactly why ~10 ad-hoc _Thread_local
   statics had accumulated across eval.c, env.c, and ipc.c.

2. Two divergent ray_vm_t definitions (lang/eval.h vs core/runtime.h)
   were type-punned through the extern; runtime.h's extra fields
   (err/nfo/trace/raise_val/scope) were a half-finished mirror of the
   eval/env thread-locals.

Now:

- ONE ray_vm_t (core/runtime.h): a persistent per-thread object holding
  error/debug state, eval dispatch context (depth, restricted), IPC
  dispatch context (handle/poll), and the lexical scope stack. Hot
  fields are packed into the first 32 bytes (ray_sys_alloc data starts
  at page+32, so offsets 0..31 share one cache line); the cold err
  buffer and the big scope array sit at the tail.
- The bytecode executor allocates a private ray_exec_t (lang/eval.h:
  the interpreter stacks only) per invocation and never touches __VM.
- Migrated onto the VM: eval_depth, g_eval_nfo, g_error_trace,
  g_eval_restricted, __raise_val (eval.c); scope_stack/scope_depth
  (env.c); the IPC dispatch handle/poll (ipc.c). ray_eval and the
  executor guard entry with "no VM bound to this thread".
- ray_vm_init() is the only sanctioned way to create a VM — memset
  alone would leave ipc_handle at 0, a valid handle; the no-dispatch
  sentinel is -1. Runtime + all test fixtures use it.
- Deliberately left module-local: per-thread caches/PRNGs and
  qsort-comparator plumbing (affine_sum/count_compare caches, sf_* sym
  cache, hnsw/guid RNG state, dsort/group-emit/serde-depth scratch) —
  implementation details with no cross-subsystem reach.

Eval-heavy release benchmark (fib 27 + 1M map-lambda calls + 2000x
depth-300 recursion): ~2-3% faster than master — the executor no longer
zeroes the old VM's err+scope tail on every invocation.
@singaraiona singaraiona merged commit 11a54dd into master Jun 12, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant