feat(ipc): unify connection handles into poll selector ids (server push)#244
Merged
Conversation
A handle was two different things depending on which side of the wire
allocated it: .ipc.open returned an index into a private client fd
table, while server-side hooks received the inbound connection's poll
selector id. The two namespaces looked identical (small integers) but
only the client table was writable, so (.ipc.post h ...) inside
.ipc.on.open failed with an io error — and worse, could silently hit
the WRONG outbound peer when the numbers aliased.
Now a handle is the poll selector id of a connection, both directions:
- ray_ipc_connect registers the outbound socket in the active poll
(thread-local poll of the dispatching hook/eval, else the runtime's
main poll) with the same read pump inbound conns use, and returns
the selector id. The g_client_fds table is gone.
- .ipc.send/.ipc.post/.ipc.close resolve any handle the same way, so
server code can push through (or close) the very handles its hooks
receive — kdb-style. Resolution validates the selector is a
handshake-complete IPC conn; stray integers can't write to the
listener, stdin, or a mid-handshake socket.
- .ipc.send pumps the connection's rx machinery while waiting for its
RESP, dispatching interleaved frames (pushed asyncs are evaluated,
peer sync requests answered). This also fixes the latent bug where
the response reader never checked hdr.msgtype and would have
consumed a pushed frame as the next sync response.
- ipc_read_payload detaches the payload buffer and advances the state
machine BEFORE evaluating, making nested round-trips on the same
connection safe; RESP frames are deposited for the waiting sender.
- Lifecycle hooks stay inbound-only: outbound conns never fire
on.open/on.close, preserving the documented 1:1 pairing.
Test fixtures (test_ipc/test_store/test_repl/.rfl runner) now publish
a runtime poll, mirroring main.c. syscmd_coverage.rfl's listen
assertions relied on the runner having no poll ("nyi" branch); valid
ports now genuinely bind, so it asserts the success and double-bind io
branches instead. New ipc/server_push test covers push-from-hook and
push-outside-hook end to end.
Known limits (unchanged in spirit): selector ids are reused after
close, so handles must not be cached past on.close; sends assume the
single poll thread.
…ls onto it The agreed state model is two exact points: ray_runtime_t for the process, ray_vm_t per thread. The code had drifted from it in two ways: 1. __VM was not per-thread. The bytecode executor allocated a fresh zeroed VM per compiled-lambda invocation, published it as __VM, and set __VM = NULL on exit — never restoring the runtime's VM. Anything stored on the VM neither survived into a nested compiled call nor existed between calls, which is exactly why ~10 ad-hoc _Thread_local statics had accumulated across eval.c, env.c, and ipc.c. 2. Two divergent ray_vm_t definitions (lang/eval.h vs core/runtime.h) were type-punned through the extern; runtime.h's extra fields (err/nfo/trace/raise_val/scope) were a half-finished mirror of the eval/env thread-locals. Now: - ONE ray_vm_t (core/runtime.h): a persistent per-thread object holding error/debug state, eval dispatch context (depth, restricted), IPC dispatch context (handle/poll), and the lexical scope stack. Hot fields are packed into the first 32 bytes (ray_sys_alloc data starts at page+32, so offsets 0..31 share one cache line); the cold err buffer and the big scope array sit at the tail. - The bytecode executor allocates a private ray_exec_t (lang/eval.h: the interpreter stacks only) per invocation and never touches __VM. - Migrated onto the VM: eval_depth, g_eval_nfo, g_error_trace, g_eval_restricted, __raise_val (eval.c); scope_stack/scope_depth (env.c); the IPC dispatch handle/poll (ipc.c). ray_eval and the executor guard entry with "no VM bound to this thread". - ray_vm_init() is the only sanctioned way to create a VM — memset alone would leave ipc_handle at 0, a valid handle; the no-dispatch sentinel is -1. Runtime + all test fixtures use it. - Deliberately left module-local: per-thread caches/PRNGs and qsort-comparator plumbing (affine_sum/count_compare caches, sf_* sym cache, hnsw/guid RNG state, dsort/group-emit/serde-depth scratch) — implementation details with no cross-subsystem reach. Eval-heavy release benchmark (fib 27 + 1M map-lambda calls + 2000x depth-300 recursion): ~2-3% faster than master — the executor no longer zeroes the old VM's err+scope tail on every invocation.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
A user hit this on
2.1.0:Posting on the handle a hook receives always failed — inside the hook and later with a stored handle. Root cause: two disjoint handle namespaces.
.ipc.openreturned an index into a private client fd table (g_client_fds[]), while hooks received the inbound connection's poll selector id..ipc.send/.ipc.post/.ipc.closeonly consulted the client table, so server-side handles never resolved — and since both namespaces are small integers, a server with its own outbound connections could even deliver to the wrong peer silently.Change
A handle is now one thing: the poll selector id of an IPC connection, inbound or outbound.
ray_ipc_connectdoes the blocking handshake/auth as before, then hands the socket to the active poll with the same read pump inbound connections use, returning the selector id. The client fd table is deleted..ipc.send/.ipc.post/.ipc.closeresolve any handle in the active poll (thread-local poll of the dispatching hook/eval, else the runtime's main poll). Hooks can push through — or close — the handles they receive, kdb-style (neg[h]analog). Resolution validates the target is a handshake-complete IPC connection, so a stray integer can't write to the listener, stdin, or a mid-handshake socket..ipc.sendnow pumps the connection while waiting for its response: interleaved pushed asyncs are dispatched, and sync requests from the peer are answered — full duplex. This also fixes a latent corruption bug: the old response reader never checkedhdr.msgtype, so a pushed frame would have been consumed as the next sync response.ipc_read_payloaddetaches the payload buffer and advances the state machine before evaluating, making nested round-trips on the same connection re-entrancy-safe;RESPframes are deposited for the waiting sender instead of being evaluated.on.open/on.close1:1 pairing.Behaviour notes
.ipc.openhandle values shift (selector ids instead of table indices) — they were always opaque..ipc.sendwait.on.close. A generation tag could harden this later.Tests
ipc/server_pushtest: push from inside.ipc.on.openand push later through a stored handle, delivered across a real round-trip.test_ipc.c,test_store.c,test_repl.c,.rflrunner) now publish a runtime poll, mirroringmain.c.syscmd_coverage.rflupdated: its.sys.listen → nyiassertions relied on the test runner having no poll; valid ports now genuinely bind, so it asserts the success and double-bindiobranches instead.Fixes the reported async-IPC-from-server bug.