fix: fall back to list_assignments when KeepAliveAssignment RPC fails with auth error #60

Closed
merlinran wants to merge 1 commit into googlecolab:main from merlinran:fix/disconnect

@merlinran

Contributor

Summary

  • The KeepAliveAssignment gRPC-web RPC at colab.pa.googleapis.com requires serviceusage.services.use on the consumer project set via X-Goog-User-Project. OAuth2 and ADC user credentials outside a browser typically lack this permission, causing every keep-alive request to fail with 403 USER_PROJECT_DENIED. The keep-alive daemon then exits after 2 consecutive failures (60 seconds), and the Colab backend idles the VM ~7-8 minutes later.
  • This PR catches the auth-related error codes (USER_PROJECT_DENIED, CONSUMER_INVALID, SERVICE_DISABLED) and falls back to list_assignments on the Colab web frontend — which authenticates with the user's OAuth/ADC token alone and doubles as an implicit keep-alive.
  • If the endpoint is absent from the assignments response, a synthetic 404 is raised so the daemon treats it as terminal and exits cleanly.
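The fallback described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `KeepAliveError`, `keep_alive_once`, and the two injected callables (`rpc_keep_alive`, `list_assignments`) are hypothetical names, and the real client's error type and call signatures may differ. Only the three reason codes and the synthetic 404 come from the summary.

```python
from typing import Callable, Iterable

# Reason codes named in the summary that trigger the fallback path.
AUTH_FALLBACK_REASONS = frozenset(
    {"USER_PROJECT_DENIED", "CONSUMER_INVALID", "SERVICE_DISABLED"}
)


class KeepAliveError(Exception):
    """Hypothetical error carrying an HTTP-style status and a reason code."""

    def __init__(self, status: int, reason: str) -> None:
        super().__init__(f"{status} {reason}")
        self.status = status
        self.reason = reason


def keep_alive_once(
    endpoint: str,
    rpc_keep_alive: Callable[[str], None],
    list_assignments: Callable[[], Iterable[str]],
) -> str:
    """Send one keep-alive and return which path succeeded: "rpc" or "fallback".

    Both callables are stand-ins (assumptions) for the real client calls.
    """
    try:
        rpc_keep_alive(endpoint)
        return "rpc"
    except KeepAliveError as err:
        # Anything other than the known auth-related reasons propagates unchanged.
        if err.reason not in AUTH_FALLBACK_REASONS:
            raise

    # Fallback: listing assignments is assumed to double as an implicit keep-alive.
    active_endpoints = set(list_assignments())
    if endpoint not in active_endpoints:
        # Synthetic 404 so the caller can treat the missing assignment as terminal.
        raise KeepAliveError(404, "NOT_FOUND")
    return "fallback"
```

A caller such as the keep-alive daemon would treat a raised 404 as terminal and stop, while a returned `"fallback"` counts as a successful keep-alive and resets any consecutive-failure counter.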

Test plan

  • Created a session, confirmed keep_alive_started appears without keep_alive_error in history
  • Waited 65+ seconds, confirmed no subsequent keep-alive errors and daemon still alive via colab status
  • All 224 existing tests pass

… with auth error

The KeepAliveAssignment gRPC-web RPC requires the caller to hold
serviceusage.services.use on Colab's project (1014160490159), which
OAuth2/ADC user credentials outside a browser typically lack. When
the RPC returns USER_PROJECT_DENIED, CONSUMER_INVALID, or
SERVICE_DISABLED, fall back to list_assignments — a call to the
Colab web frontend that uses the same OAuth/ADC token and doubles
as an implicit keep-alive. If the endpoint is absent from the
assignments response, raise a synthetic 404 so the daemon treats
it as a terminal error and exits cleanly.
@Sunwood-ai-labs


I tested this PR branch locally against a real Colab session, and it looks promising.

Environment / checkout:

Validation:

  • Created a real Colab session with the PR branch
  • Keep-alive daemon started successfully
  • Ran a 30-minute live soak
  • Checked every 5 minutes at 300s, 600s, 900s, 1200s, 1500s, and 1800s
  • At every checkpoint, the daemon was still alive
  • KEEP error count stayed 0
  • Final log only showed session_created and KEEP: started, with no KEEP: error
  • Cleaned up with colab stop
  • Final colab sessions returned: No active sessions found on server

Result:
This PR branch passed a 30-minute live keep-alive validation in my environment. That crosses the several-minute failure/prune window I was seeing before.

I would still treat this as branch-level validation, not a released-package fix yet, but it is strong positive evidence for the approach.

@teeler

teeler commented Jun 15, 2026

Contributor

Clever but no thank you ;)

there is an actual keep-alive call elsewhere, we don't need to overload list-assignments.

#61

Also, please see https://github.com/googlecolab/google-colab-cli/blob/main/CONTRIBUTING.md - especially on topics like this. I'd much rather have a short discussion and notify folks that we're already working on it rather than have folks spend brain effort (or tokens) on problems that we're already tackling.

@teeler teeler closed this Jun 15, 2026
@merlinran

Contributor Author

Thank you for the reminder and the fix! This was an accident: I needed a quick fix for myself, and the agent opened the PR by following instructions I had given it for another case.


3 participants