ci: trigger dogfood eval pipeline on run-evals PR label #428

Closed: calvarjorge wants to merge 4 commits into main from jorge_calvar/eval_trigger

@calvarjorge

Contributor

What

Adds a GitHub Actions workflow that launches the dogfood eval pipeline (job 398185277057549) for a PR when the run-evals label is present, and re-launches it on every new commit while the label stays on.

How it works

  • Trigger: pull_request with types: [labeled, synchronize].
    • labeled + label is run-evals → run.
    • synchronize (new commit) + PR already has run-evals → run.
  • Commit: passes github.event.pull_request.head.sha (the real PR head commit, never the synthetic merge commit) as the appkit_ref job param, so the pipeline can pull the code. Also sets prompt_preset=custom-pr and tags=appkit_pr:<number>.
  • Latest wins: concurrency with cancel-in-progress: true (grouped per PR) guarantees the sticky ⏳ Eval running comment always reflects the most recently triggered commit, even if an earlier run's job finishes first.
  • Comment: a sticky comment (.github/scripts/upsert-eval-comment.cjs) links the evals-monitor PR page (/prs/appkit/<number>) and the triggered job run (run_id from the run-now response — no extra API call).
  • Auth: OAuth M2M as the apps-mcp-evals-runner service principal. Credentials are scoped to the trigger step only, so the PR-authored comment script never sees them.
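
The trigger and commit rules above can be sketched as two small functions. This is a minimal illustration, not the workflow's actual code: `shouldRunEvals` and `buildJobParameters` are hypothetical names, and the input is assumed to follow the shape of GitHub's `pull_request` webhook payload.

```javascript
// Sketch only: helper names are hypothetical. `event` is assumed to be a
// GitHub pull_request webhook payload (action, label, pull_request.*).
const EVAL_LABEL = "run-evals";

function shouldRunEvals(event) {
  const labels = (event.pull_request.labels || []).map((l) => l.name);
  if (event.action === "labeled") {
    // Only the run-evals label itself starts a run, not any other label.
    return event.label != null && event.label.name === EVAL_LABEL;
  }
  if (event.action === "synchronize") {
    // A new commit re-triggers only while the label is still on the PR.
    return labels.includes(EVAL_LABEL);
  }
  return false;
}

function buildJobParameters(event) {
  return {
    // Real PR head commit, not the synthetic merge commit.
    appkit_ref: event.pull_request.head.sha,
    prompt_preset: "custom-pr",
    tags: `appkit_pr:${event.pull_request.number}`,
  };
}
```

In the workflow itself the same decision would more likely be written as an `if:` expression on the job; the function form just makes the two cases easy to read and test.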

Security notes

  • Uses pull_request (not pull_request_target): repo secrets are withheld from fork PRs, so an external contributor cannot exfiltrate the credentials even by editing the workflow. Adding the run-evals label also requires triage or write access to the repo.
  • Action SHAs are pinned.

Required setup before this works

  1. Create the run-evals label in the repo.
  2. Generate an OAuth secret on the apps-mcp-evals-runner SP and add two repo secrets:
    • EVALS_DATABRICKS_CLIENT_ID_DOGFOOD
    • EVALS_DATABRICKS_CLIENT_SECRET_DOGFOOD
  3. Grant the SP CAN MANAGE RUN on job 398185277057549.
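
To show where the two secrets and the job permission come into play, here is a rough sketch of the two HTTP requests involved, assuming the trigger step talks to the Databricks REST API directly (the workflow may well use the Databricks CLI instead). `buildTokenRequest` and `buildRunNowRequest` are hypothetical helpers that only construct request descriptions; the host is supplied by the caller, and the `/oidc/v1/token` and `/api/2.1/jobs/run-now` paths are the workspace-level OAuth M2M and Jobs endpoints as I understand them.

```javascript
// Sketch only: hypothetical helpers that describe the requests without sending them.

// OAuth M2M: exchange the service principal's client id/secret for an access token.
function buildTokenRequest(host, clientId, clientSecret) {
  const basic = Buffer.from(`${clientId}:${clientSecret}`).toString("base64");
  return {
    url: `${host}/oidc/v1/token`,
    method: "POST",
    headers: {
      Authorization: `Basic ${basic}`,
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body: new URLSearchParams({
      grant_type: "client_credentials",
      scope: "all-apis",
    }).toString(),
  };
}

// Jobs run-now: needs CAN MANAGE RUN on the job; the response carries run_id.
function buildRunNowRequest(host, accessToken, jobId, jobParameters) {
  return {
    url: `${host}/api/2.1/jobs/run-now`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${accessToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ job_id: jobId, job_parameters: jobParameters }),
  };
}
```

Each returned object could be passed to `fetch(req.url, req)`. Keeping the builders free of network calls means the client secret only needs to be present in the step that actually sends the token request.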

Testing

Because this is a pull_request-triggered workflow, it runs the PR branch's version of the workflow, so it can be exercised on this PR once the label and secrets exist.

This pull request and its description were written by Isaac.

Add a GitHub Actions workflow that launches the dogfood eval pipeline
(job 398185277057549) when the `run-evals` label is added to a PR, and
re-launches it on each new commit while the label stays on. The real PR
head commit is passed as `appkit_ref` so the pipeline can pull the code;
`prompt_preset=custom-pr` and `tags=appkit_pr:<number>` are also set.

Authenticates as the apps-mcp-evals-runner service principal via OAuth
M2M, and posts a sticky "Eval running" comment linking the evals-monitor
PR page and the triggered job run. Comment logic lives in
.github/scripts/upsert-eval-comment.cjs.

Co-authored-by: Isaac
Signed-off-by: Jorge Calvar <jorge.calvar@databricks.com>
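
The sticky-comment behaviour described in the commit message above can be sketched as an upsert keyed on a hidden marker. This is an illustration under assumptions, not the contents of upsert-eval-comment.cjs: the `MARKER` string, the `client` interface, and both function names are hypothetical, and the monitor and run URLs are passed in by the caller.

```javascript
// Sketch only: MARKER, `client`, and the function names are hypothetical.
const MARKER = "<!-- eval-trigger-comment -->";

// Render the comment body; the marker lets later runs find this comment again.
function renderEvalComment({ sha, monitorUrl, runUrl }) {
  return [
    MARKER,
    `⏳ Eval running for commit ${sha}`,
    `Evals monitor: ${monitorUrl}`,
    `Job run: ${runUrl}`,
  ].join("\n");
}

// `client` is assumed to expose listComments(), createComment(body) and
// updateComment(id, body) for a single PR.
async function upsertStickyComment(client, body) {
  const comments = await client.listComments();
  const existing = comments.find(
    (c) => typeof c.body === "string" && c.body.includes(MARKER)
  );
  if (existing) {
    // Update in place so the PR keeps exactly one eval comment.
    await client.updateComment(existing.id, body);
    return { action: "updated", id: existing.id };
  }
  const created = await client.createComment(body);
  return { action: "created", id: created.id };
}
```

With Octokit, the three client methods would map onto `issues.listComments`, `issues.createComment`, and `issues.updateComment`. Injecting the client keeps the logic testable without a GitHub token.
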
@calvarjorge added the run-evals label (Run the dogfood eval pipeline on this PR) on Jun 9, 2026
Probe dogfood reachability + workspace OIDC discovery and run a forced
oauth-m2m authenticated call with debug logging, to pin down the "cannot
configure default credentials" failure. To be reverted once auth works.

Co-authored-by: Isaac
Signed-off-by: Jorge Calvar <jorge.calvar@databricks.com>
Allow manual runs to probe staging connectivity from an arbitrary runner
group (runner_group/runner_labels inputs), since dogfood.staging blocks
the default databricks-protected-runner-group at the network edge. A bare
dispatch runs only the diagnostic; pass pr_number to also trigger the job
and post the comment.

Co-authored-by: Isaac
Signed-off-by: Jorge Calvar <jorge.calvar@databricks.com>
The databricks-protected-runner-group's egress to internal Databricks
hosts is gated by the GitHub OIDC identity. Without `id-token: write` the
egress proxy returns 403 "RBAC: access denied" for every request (incl.
anonymous curl to dogfood.staging), which is what broke OAuth M2M. All
other Databricks workflows in this repo set this permission.

Also revert the temporary manual-dispatch/configurable-runner testing
scaffolding; back to label/synchronize on databricks-protected-runner-group.

Co-authored-by: Isaac
Signed-off-by: Jorge Calvar <jorge.calvar@databricks.com>
@calvarjorge

Contributor Author

Closing for now: GitHub Actions runners can't reach dogfood.staging (it sits behind the network perimeter), so we can't trigger the eval job directly from CI. Keeping the branch jorge_calvar/eval_trigger in case we revisit this with a staging-capable runner or an inverted trigger. Replacing this with a lightweight approach: post a link to the evals-monitor PR page, where the eval can be started.
