Add Harbor/Terminal-Bench integration via Daytona and local Docker #15
Conversation
This is so amazing -- I've been thinking if we should process all our data into Harbor format!
Also, I'm happy to help integrate Modal/Daytona completely into all tasks.
@allenanie have you tested it? Do you want to extend it to all tasks first, or can we merge this and do that in a second step?
Interesting -- you mean run all other tasks in Daytona docker? That's actually kinda nice -- depending on how hard/easy it is to do -- if you think it will take more systematic refactoring, and you've fully onboarded/tested the TB2 -- then let's merge it?
@allenanie Yes, I think there may have been a small misunderstanding. I didn’t mean this PR runs all Trace-Bench tasks inside Daytona. What it currently does is limited to the Harbor / Terminal-Bench 2.0 path: it lets us choose the Harbor execution backend, either local Docker or hosted Daytona, which is useful for Colab and other environments without Docker, and for scaling isolated TB2 runs. Extending Daytona/Modal to all Trace-Bench tasks sounds useful, but I think that is a separate remote-execution layer rather than a small extension of this PR. We would need to decide whether to wrap whole Trace-Bench jobs in Daytona/Modal or translate more tasks into Harbor-compatible tasks, and that likely touches runner/config/artifacts/secrets/concurrency. So my preference is: merge this PR if you could test it (it worked on my side locally and on Colab), then open a follow-up PR/issue for the broader “remote execution for all tasks” design. Happy to work with you on that next step.
@copilot resolve the merge conflicts in this pull request |
This PR introduces support for Terminal-Bench tasks via the Harbor CLI. Because Terminal-Bench evaluates an agent's ability to execute shell commands and modify system states, these tasks require strict sandboxing to prevent agents from inadvertently damaging the host system or accessing sensitive local files.
When to use the Daytona service?
By default, running these benchmarks in cloud notebook environments (such as Google Colab) would require nested virtualization or a local Docker daemon, neither of which is available on Google Colab.
We integrated Daytona (harbor_env: daytona) as a remote execution backend because:
=> No local Docker: It lets us offload the sandboxed environments securely without needing a local Docker daemon.
=> Security & Isolation: Agents perform potentially destructive system operations safely away from the orchestrator.
=> Consistency: It ensures reproducibility across runs regardless of where the trace_bench script actually executes.
How to run locally (without Daytona)?
If you are running trace_bench locally on a machine with Docker installed, you do not need a Daytona API key or service. You can simply switch the Harbor environment to use the local Docker daemon.
To do this, set harbor_env: docker instead of harbor_env: daytona in your evaluation configuration.
When local execution is triggered, Harbor will build and spin up isolated Docker containers directly on your machine.
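The choice between the two backends described above can be sketched as a small helper. This is only an illustration, not code from this PR: the function name `choose_harbor_env` and the environment variable name `DAYTONA_API_KEY` are assumptions, and only the two `harbor_env` values (`docker` and `daytona`) come from the description.

```python
import os
import shutil
from typing import Mapping, Optional


def choose_harbor_env(
    docker_path: Optional[str],
    environ: Mapping[str, str],
    api_key_var: str = "DAYTONA_API_KEY",  # assumed name, not confirmed by this PR
) -> str:
    """Suggest a value for harbor_env: "docker" or "daytona" (hypothetical helper)."""
    if docker_path:
        # A docker binary on PATH suggests local execution is possible.
        # It does not prove that the Docker daemon is actually running.
        return "docker"
    if environ.get(api_key_var):
        # No local Docker, but a Daytona API key is set: use the hosted backend.
        return "daytona"
    raise RuntimeError(
        f"No docker binary found and {api_key_var} is not set; "
        "install Docker or provide a Daytona API key."
    )


if __name__ == "__main__":
    # shutil.which returns the path to the executable, or None if it is absent.
    print(choose_harbor_env(shutil.which("docker"), os.environ))
```

On a machine with Docker installed this would suggest `docker`; in a notebook without Docker but with a key configured it would suggest `daytona`. The resulting value would still need to be written into the evaluation configuration by hand.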