Add Harbor/Terminal-Bench integration via Daytona and local Docker by doxav · Pull Request #15 · AgentOpt/Trace-Bench

doxav · 2026-05-08T13:50:15Z

This PR introduces support for Terminal-Bench tasks via the Harbor CLI. Because Terminal-Bench evaluates an agent's ability to execute shell commands and modify system states, these tasks require strict sandboxing to prevent agents from inadvertently damaging the host system or accessing sensitive local files.

When to use the Daytona service?
By default, running these benchmarks in cloud notebook environments (like Google Colab) would require nested virtualization or a local Docker daemon, neither of which is available on Google Colab.
We integrated Daytona (harbor_env: daytona) as a remote execution backend because:

=> No local Docker daemon: Sandboxed environments are offloaded securely to a remote service.
=> Security & Isolation: Agents perform potentially destructive system operations safely away from the orchestrator.
=> Consistency: Runs are reproducible regardless of where the trace_bench script is actually executing.
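The choice between the two backends can be sketched in Python. This is only an illustration of the reasoning above, not code from this PR: the function name `choose_harbor_env` is hypothetical, and the `DAYTONA_API_KEY` environment variable name is an assumption about how the Daytona key is supplied.

```python
import os
import shutil
from typing import Mapping, Optional


def choose_harbor_env(env: Optional[Mapping[str, str]] = None,
                      has_docker: Optional[bool] = None) -> str:
    """Pick a value for harbor_env: "docker" or "daytona".

    Hypothetical helper, not part of trace_bench or Harbor.
    """
    env = os.environ if env is None else env
    if has_docker is None:
        # Only checks that a docker CLI is on PATH; it does not
        # verify that the Docker daemon is actually running.
        has_docker = shutil.which("docker") is not None

    if has_docker:
        return "docker"
    # Assumed variable name for the Daytona API key.
    if env.get("DAYTONA_API_KEY"):
        return "daytona"
    raise RuntimeError(
        "No docker CLI found and no Daytona API key set; "
        "cannot select a Harbor execution backend."
    )
```

On a Colab-like machine with no Docker and a key set, `choose_harbor_env()` would return `"daytona"`; on a workstation with Docker installed it would return `"docker"`.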

How to run locally (without Daytona)?
If you are running trace_bench locally on a machine with Docker installed, you do not need a Daytona API key or service. You can simply switch the Harbor environment to use the local Docker daemon.

To do this, configure harbor_env: docker instead of daytona in your evaluation configuration:

tasks:
  - id: terminal_bench:regex-log
    eval_kwargs:
      harbor_dataset: terminal-bench@2.0
      # Use "docker" for local execution, or "daytona" for cloud execution
      harbor_env: docker
      harbor_model: openrouter/openai/gpt-4o-mini

When local execution is triggered, Harbor will build and spin up isolated Docker containers directly on your machine.
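Switching backends only changes the harbor_env value, so the same task entry can be reused for both. The sketch below mirrors the configuration above as a plain Python dict; the helper `with_harbor_env` and the `ALLOWED_HARBOR_ENVS` set are hypothetical and only cover the two values mentioned in this PR (Harbor itself may accept others).

```python
from copy import deepcopy

# Only the two backends discussed in this PR; an assumption, not Harbor's full list.
ALLOWED_HARBOR_ENVS = {"docker", "daytona"}


def with_harbor_env(task: dict, harbor_env: str) -> dict:
    """Return a copy of a task entry with eval_kwargs.harbor_env replaced."""
    if harbor_env not in ALLOWED_HARBOR_ENVS:
        raise ValueError(
            f"harbor_env must be one of {sorted(ALLOWED_HARBOR_ENVS)}, got {harbor_env!r}"
        )
    updated = deepcopy(task)  # leave the original entry untouched
    updated.setdefault("eval_kwargs", {})["harbor_env"] = harbor_env
    return updated


# Same fields as the configuration shown above.
local_task = {
    "id": "terminal_bench:regex-log",
    "eval_kwargs": {
        "harbor_dataset": "terminal-bench@2.0",
        "harbor_env": "docker",
        "harbor_model": "openrouter/openai/gpt-4o-mini",
    },
}

cloud_task = with_harbor_env(local_task, "daytona")
```

Validating the value up front is a design choice for this sketch: a typo in harbor_env then fails immediately rather than after a sandbox launch has been attempted.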

allenanie · 2026-05-08T16:21:32Z

This is so amazing -- I've been thinking if we should process all our data into Harbor format!

allenanie · 2026-05-08T16:35:12Z

Also I'm happy to help integrate modal/daytona completely into all tasks

doxav · 2026-06-10T20:30:56Z

@allenanie have you tested it? Would you rather extend it to all tasks first, or can we merge this and do that in a second step?

allenanie · 2026-06-11T19:51:29Z

Interesting -- you mean run all other tasks in Daytona docker?

That's actually kinda nice -- depending on how hard/easy it is to do -- if you think it will take more systematic refactoring, and you've fully onboarded/tested the TB2 -- then let's merge it?

doxav · 2026-06-11T20:31:13Z

@allenanie Yes, I think there may have been a small misunderstanding. I didn’t mean this PR runs all Trace-Bench tasks inside Daytona. What it currently does is limited to the Harbor / Terminal-Bench 2.0 path: it lets us choose the Harbor execution backend, either local Docker or hosted Daytona, which is useful for Colab and other environments without Docker, and for scaling isolated TB2 runs.

Extending Daytona/Modal to all Trace-Bench tasks sounds useful, but I think that is a separate remote-execution layer rather than a small extension of this PR. We would need to decide whether to wrap whole Trace-Bench jobs in Daytona/Modal, or translate more tasks into Harbor-compatible tasks, and that likely touches runner/config/artifacts/secrets/concurrency.

So my preference is: merge this PR if you are able to test it (it worked on my side, both locally and on Colab), then open a follow-up PR/issue for the broader “remote execution for all tasks” design. Happy to work with you on that next step.

doxav · 2026-06-12T19:04:36Z

@copilot resolve the merge conflicts in this pull request

feat: add harbor/terminal_bench integration along with initial tests

398ace7

doxav force-pushed the harbor branch from d6a9c19 to 398ace7 · June 13, 2026 10:25

doxav merged commit 2bc13c6 into AgentOpt:main Jun 13, 2026
1 check passed
