Skip to content

Fix signaling races during peer connection setup#364

Open
erikdubbelboer wants to merge 1 commit into
mainfrom
lobby-join-race-conditions
Open

Fix signaling races during peer connection setup#364
erikdubbelboer wants to merge 1 commit into
mainfrom
lobby-join-race-conditions

Conversation

@erikdubbelboer

@erikdubbelboer erikdubbelboer commented Jun 6, 2026

Copy link
Copy Markdown
Member

Serialize incoming signaling messages so WebRTC descriptions and candidates are handled in order. Queue ICE candidates until a remote description is available, ignore stale answers that no longer match the current signaling state, and avoid creating duplicate Peer instances while async TURN credential lookup is in flight.

On the server, have JoinLobby return the peers that existed before the join and only request connections to that snapshot. This avoids racing newly joined peers against each other while the lobby membership changes.

Credential request responses can still resolve immediately because _addPeer may wait for them while handling a queued connect packet.

There are no new tests as these race conditions aren't easy to replicate with simple tests. I do have a test framework I used to find them, but this runs for several minutes and uses multiple headless Chrome browsers. I'm cleaning up this code for a separate pull request.

Serialize incoming signaling messages so WebRTC descriptions and candidates are
handled in order. Queue ICE candidates until a remote description is available,
ignore stale answers that no longer match the current signaling state, and avoid
creating duplicate Peer instances while async TURN credential lookup is in
flight.

On the server, have JoinLobby return the peers that existed before the join and
only request connections to that snapshot. This avoids racing newly joined peers
against each other while the lobby membership changes.

Credential request responses can still resolve immediately because _addPeer may
wait for them while handling a queued connect packet.
Comment thread lib/peer.ts
private makingOffer: boolean = false
private ignoreOffer: boolean = false
private isSettingRemoteAnswerPending: boolean = false
private readonly pendingCandidates: RTCIceCandidate[] = []

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private readonly pendingCandidates: RTCIceCandidate[] = []
private readonly pendingRemoteCandidates: RTCIceCandidate[] = []

I would always include local/remote prefixes for these kinds of state.

@erikdubbelboer

Copy link
Copy Markdown
Member Author

Issue 1: New lobby joiners could both initiate the same peer connection

Before the change:

  1. Peer A is already in a lobby.
  2. Peers B and C join that lobby at nearly the same time.
  3. JoinLobby itself is serialized correctly with SELECT ... FOR UPDATE, so the database update is safe:
    • B appends itself.
    • C appends itself after B.
  4. But after joining, HandleJoinPacket called GetLobby.
  5. GetLobby happens after the transaction is committed, outside the join lock.
  6. So B might join when only A existed, but then call GetLobby after C has also joined.
  7. B now sees [A, B, C] and requests a connection to both A and C.
  8. C also sees [A, B, C] and requests a connection to both A and B.
  9. Now both B and C can initiate a connection to each other.

That matters because RequestConnection assigns roles per initiated pair: the joiner gets one connect packet with polite: true, and the existing peer gets one with polite: false. If both peers initiate, each side can receive conflicting duplicate connect packets for the same peer ID.

The fix is that JoinLobby now returns the peer list captured under the row lock, before appending the new peer. Then HandleJoinPacket only calls RequestConnection for that snapshot.

@erikdubbelboer

Copy link
Copy Markdown
Member Author

Issue 2: Client signaling handlers could interleave and corrupt WebRTC setup

Before the change, every websocket message started handleSignalingMessage(...) independently. The websocket delivers messages in order, but the handler is async, so message handling could overlap.

A bad sequence looked like this:

  1. Client receives a connect packet for peer B.
  2. connect handling calls _addPeer.
  3. _addPeer waits for TURN credentials before constructing the Peer.
  4. While that await is still pending, more websocket messages arrive.
  5. A description or candidate message for B starts handling concurrently.
  6. Depending on timing, the peer might not exist yet, the remote description might not be set yet, or two concurrent connect handlers might both try to create a peer for the same ID.

That led to a few concrete failure modes:

  1. ICE candidates could be applied before setRemoteDescription finished. Browsers reject that with a remote-description-null or invalid-state error.
  2. Duplicate connect packets could race through _addPeer; both would see no peer before awaiting credentials, then both could create Peer instances. The later one overwrote the map entry, while the older one could still have event handlers and pending WebRTC work.
  3. Delayed answers could be applied after the connection was no longer in have-local-offer. setRemoteDescription(answer) is only valid while there is a matching local offer, so stale answers could throw.
  4. Once the message queue was added, a new self-dependency appeared: connect waits for _addPeer, _addPeer waits for TURN credentials, and the credential response comes through the same websocket queue. Without the immediate credential-response bypass, the response would sit behind the connect handler that is waiting for it.

The fix addresses those pieces together:

  1. Incoming signaling messages are serialized through messageQueue, so a description handler can finish before the next candidate handler runs.
  2. ICE candidates are still queued inside Peer if they arrive before remoteDescription, then drained after setRemoteDescription.
  3. _addPeer checks this.peers.has(id) before and after credential lookup, so a peer created while another _addPeer was awaiting credentials prevents duplicate construction.
  4. Stale answers are ignored unless the connection is actually in have-local-offer.
    Credential request responses resolve immediately instead of waiting for the signaling queue, because they are needed to unblock queued connect handling.

@koenbollen

Copy link
Copy Markdown
Collaborator

As discussed in voice. Let's try and find a metric in the analytics that can verify that this change is going to be a positive one.

@erikdubbelboer

Copy link
Copy Markdown
Member Author

@koenbollen

For issue 1 we can track the number of double connection attempts. When both peers join the lobby at the same time they both will initiate the bi-directional connections which results in 4 attempt events instead of the usual 2.

WITH attempts AS (
  SELECT
    DATE(time) AS day,
    version,
    game,
    lobby,
    peer,
    (SELECT value FROM UNNEST(data) WHERE key = "target") AS target
  FROM poki-insights-production.raw_dynamic.netlib_analytics
  WHERE DATE(time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    AND category = "rtc"
    AND action = "attempt"
),
pairs AS (
  SELECT
    day,
    version,
    game,
    lobby,
    LEAST(peer, target) AS p1,
    GREATEST(peer, target) AS p2,
    COUNT(*) AS attempt_events
  FROM attempts
  WHERE target IS NOT NULL AND peer != target
  GROUP BY ALL
)
SELECT
  day,
  version,
  COUNT(*) AS attempted_pairs,
  COUNTIF(attempt_events > 2) AS duplicate_attempt_pairs
FROM pairs
GROUP BY day, version
ORDER BY day DESC, version ASC
Screenshot 2026-06-09 at 04 52 33

@erikdubbelboer

Copy link
Copy Markdown
Member Author

And for issue 2 we should be able to see an decrease in the amount of errors we track:

WITH events AS (
  SELECT
    DATE(time) AS day,
    version,
    category,
    action,
    peer,
    COALESCE((SELECT value FROM UNNEST(data) WHERE key = "error"), "") AS error_payload
  FROM poki-insights-production.raw_dynamic.netlib_analytics
  WHERE DATE(time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    AND ((category = "signaling" AND action = "error") OR (category = "rtc" AND action = "attempt"))
),
daily AS (
  SELECT
    day,
    version,

    COUNTIF(category = "rtc" AND action = "attempt") AS rtc_attempts,

    COUNTIF(
      category = "signaling"
      AND action = "error"
      AND REGEXP_CONTAINS(error_payload, r"addIceCandidate.*remote description was null|The remote description was null")
    ) AS remote_description_null_errors,

    COUNTIF(
      category = "signaling"
      AND action = "error"
      AND REGEXP_CONTAINS(error_payload, r"setRemoteDescription.*Called in wrong state|Failed to set remote answer sdp: Called in wrong state")
    ) AS wrong_state_answer_errors,

    COUNTIF(
      category = "signaling"
      AND action = "error"
      AND REGEXP_CONTAINS(error_payload, r"setRemoteDescription.*SSL role|set SSL role")
    ) AS ssl_role_errors,

    COUNTIF(
      category = "signaling"
      AND action = "error"
      AND REGEXP_CONTAINS(error_payload, r"addIceCandidate.*Error processing ICE candidate")
    ) AS ice_candidate_processing_errors,

    COUNTIF(
      category = "signaling"
      AND action = "error"
      AND REGEXP_CONTAINS(
        error_payload,
        r"addIceCandidate.*remote description was null|The remote description was null|setRemoteDescription.*Called in wrong state|Failed to set remote answer sdp: Called in wrong state|setRemoteDescription.*SSL role|set SSL role|addIceCandidate.*Error processing ICE candidate"
      )
    ) AS setup_race_errors
  FROM events
  GROUP BY ALL
)
SELECT
  day,
  version,
  rtc_attempts,
  setup_race_errors,
  remote_description_null_errors,
  wrong_state_answer_errors,
  ssl_role_errors,
  ice_candidate_processing_errors
FROM daily
WHERE rtc_attempts > 0
ORDER BY day DESC, rtc_attempts DESC;
Screenshot 2026-06-09 at 04 59 22

@koenbollen

Copy link
Copy Markdown
Collaborator

Nice work, with that I think we can set a game live with this. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants