eval: resume-stack fixes (G18 Pinggy tunnel) + listener pre-download wedge fix by richardzhuang0412 · Pull Request #31 · open-thoughts/OpenThoughts-Agent

richardzhuang0412 · 2026-06-22T08:14:16Z

Follow-on eval-stack fixes landed after PR #27 merged, validated in production on JSC Jupiter.

Changes

G18 — resume-branch Pinggy tunnel (unified_eval_harbor.sbatch): the sbatch resume branch patched the Pinggy URL into trial configs but never started the tunnel (the tunnel block was fresh-fire-only), so Cat-3 installed/swe-agent resumes hit [Errno 111] Connection refused on every LLM call → 100% trial failure. Tunnel-start added to the resume branch, gated on EVAL_PINGGY_URL (terminus-2 resumes + all fresh fires untouched → zero blast radius). G18b additionally exports the tunnel env so sandbox agents reach vLLM.
Listener pre-download wedge (unified_eval_listener.py): the per-model HF pre-download wrapped snapshot_download in a with ThreadPoolExecutor() ... result(timeout=300) block; when the timeout fired, the with exit's shutdown(wait=True) re-blocked indefinitely on the still-running (un-killable) download thread — a single slow download hung an entire fire for hours (observed: 4 of 24 jobs submitted before wedging). Now drives the executor manually with shutdown(wait=False), and passes etag_timeout=120 + max_workers=8 so slow HF metadata fetches make progress instead of stalling on the default 10s serial etag retries.
Resume tooling + Cat-3 SERA: resume_chunked.py passes through --resume-error-threshold; check_progress.py API-mode banner parsing + resume detection; minimal baseline config gains the SERA Pattern A per-model entries; swe-agent harbor config tweaks.

Validation

G18 fix validated in production (574135/574136 SERA swe-agent resumes jumped from ~547 to 1499/1500 trials post-fix; clean Pinggy logs).
Pre-download fix addresses the 2026-06 ablation-batch wedge (manual parallel-predownload with the same etag_timeout/max_workers recovered it).

🤖 Generated with Claude Code

Preserve accumulated eval working-tree fixes so a fresh clone / pull doesn't lose them (notably the uncommitted G18 patch). - unified_eval_harbor.sbatch: G18 — start the Pinggy tunnel in the RESUME branch (gated on EVAL_PINGGY_URL/TOKEN, 6x connectivity-verify + FATAL-on-fail) so Cat-3 installed-agent (swe-agent) resumes don't fail 100% with Errno 111. Plus G10/G12 api_base port + per-trial config.json Pinggy/port rewrite on resume. - resume_chunked.py: orchestrator passthrough for --pinggy-url/--pinggy-token/ --config-yaml/--agent-parser; --max-resume-count plumbing. - unified_eval_listener.py: resume detection + API-mode model-name passthrough. - check_progress.py: resume tagging + API-mode banner parsing. - baseline_model_configs_minimal.yaml: SERA Pattern A per-model entries. - dcagent_eval_config_swe_agent.yaml: exclude_exceptions parity alignment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…h vLLM The G18 fix started the Pinggy tunnel on resume but never exported the OPENAI_BASE_URL/OPENAI_API_BASE env that the sandboxed agent reads with priority over the localhost api_base in config.json. The fresh-fire path exports these (~L1123-1165); the resume branch didn't — so swe-agent/openhands fell back to localhost:<port> inside the remote Daytona sandbox and every post-resume LLM call failed with [Errno 111], yielding empty patches / reward 0 on EVERY post-resume trial (pre-resume ~50% -> post ~0%). Fix: after the resume-branch tunnel verify, export TUNNEL_URL + OPENAI_API_KEY/ OPENAI_API_BASE/OPENAI_BASE_URL (+ openhands LLM_BASE_URL, mini-swe-agent MSWEA_*/HOSTED_VLLM_API_BASE; agent read from saved config.json). Gated inside the EVAL_PINGGY_URL/EVAL_PINGGY_TOKEN block -> no-op for terminus-2. Verified 2026-06-09 (refire 672921-672926): pinggy TC 54-68 (vs broken TC=2), post-resume accuracy back to pre-resume levels (~50% swebench, ~33% aider), no Errno 111. Note: the (resume) Connectivity verified check is a false positive (compute-node only) — confirm via TC-climbing + non-zero reward. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The per-model HF pre-download wrapped snapshot_download in a `with ThreadPoolExecutor() ... future.result(timeout=300)` block. When the 300s timeout fired, exiting the `with` called shutdown(wait=True), which re-blocked indefinitely on the still-running (un-killable) download thread — so a single slow/large download hung the entire fire for hours (observed on the 2026-06 ablation batch: stalled on one model at 3/4 shards, only 4 of 24 jobs ever submitted). Fixes: - Drive the executor manually and shutdown(wait=False) in finally, so the wall timeout actually unblocks the loop. - Pass etag_timeout=120 + max_workers=8 to snapshot_download so slow HF metadata/file fetches make progress instead of stalling on the default 10s serial etag retries (matches the manual parallel-predownload workaround that recovered the wedge). - Bump the hard cap 300s -> 600s to accommodate large 32B checkpoints. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces several enhancements to the evaluation pipeline, including parallelized log parsing in check_progress.py, updated model configurations, support for Pinggy tunnel setup and environment propagation during job resumes in unified_eval_harbor.sbatch, and robust model pre-downloading with hard timeouts in unified_eval_listener.py. The review feedback identifies several critical improvement opportunities, such as explicitly importing concurrent.futures to avoid a NameError, handling potentially unbound variables under set -u in bash scripts, avoiding bash syntax errors when checking PATCHED_TRIAL_COUNT, and using Python's built-in json module instead of yaml to parse JSON configurations.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-22T08:16:35Z

+    if fire_log_files:
+        fi_workers = min(32, max(4, len(fire_log_files)))
+        with concurrent.futures.ThreadPoolExecutor(max_workers=fi_workers) as fi_exec:


The code uses concurrent.futures.ThreadPoolExecutor but concurrent or concurrent.futures might not be imported in this file. To prevent a potential NameError at runtime, please import concurrent.futures explicitly.

Suggested change

if fire_log_files:

fi_workers = min(32, max(4, len(fire_log_files)))

with concurrent.futures.ThreadPoolExecutor(max_workers=fi_workers) as fi_exec:

if fire_log_files:

import concurrent.futures

fi_workers = min(32, max(4, len(fire_log_files)))

with concurrent.futures.ThreadPoolExecutor(max_workers=fi_workers) as fi_exec:

Thanks — false positive: concurrent.futures is already imported at module level (line 13), so there's no NameError risk here. No change needed.

gemini-code-assist · 2026-06-22T08:16:35Z

+    if [ -n "${EVAL_PINGGY_URL:-}" ]; then
+        echo "Patching api_base Pinggy URL in config.json: -> https://${EVAL_PINGGY_URL}/v1"
+        sed -i.bak-pinggy "s|\"api_base\": \"https://[a-zA-Z0-9]\+\.a\.pinggy\.link/v1\"|\"api_base\": \"https://${EVAL_PINGGY_URL}/v1\"|g" "$EXISTING_JOB_DIR/config.json"
+        if [ "$PATCHED_TRIAL_COUNT" -gt 0 ]; then
+            echo "Patching Pinggy URL in $PATCHED_TRIAL_COUNT per-trial config.json files: -> https://${EVAL_PINGGY_URL}/v1"
+            find "$EXISTING_JOB_DIR" -mindepth 2 -maxdepth 2 -name config.json -print0 2>/dev/null | \
+                xargs -0 -r sed -i.bak-pinggy "s|\"api_base\": \"https://[a-zA-Z0-9]\+\.a\.pinggy\.link/v1\"|\"api_base\": \"https://${EVAL_PINGGY_URL}/v1\"|g"
+        fi
+    fi


The variable PATCHED_TRIAL_COUNT is used in the conditional check [ "$PATCHED_TRIAL_COUNT" -gt 0 ]. If this variable is unset or empty, it will cause a bash syntax error (integer expression expected). Since xargs -0 -r already safely handles empty inputs without executing the command, the conditional check is redundant. Additionally, the regular expression [a-zA-Z0-9]\+\.a\.pinggy\.link can be made more robust by using [^/]\+\.pinggy\.link to handle any Pinggy subdomain format (including those with hyphens or without the .a. subdomain).

Suggested change

if [ -n "${EVAL_PINGGY_URL:-}" ]; then

echo "Patching api_base Pinggy URL in config.json: -> https://${EVAL_PINGGY_URL}/v1"

sed -i.bak-pinggy "s|\"api_base\": \"https://[a-zA-Z0-9]\+\.a\.pinggy\.link/v1\"|\"api_base\": \"https://${EVAL_PINGGY_URL}/v1\"|g" "$EXISTING_JOB_DIR/config.json"

if [ "$PATCHED_TRIAL_COUNT" -gt 0 ]; then

echo "Patching Pinggy URL in $PATCHED_TRIAL_COUNT per-trial config.json files: -> https://${EVAL_PINGGY_URL}/v1"

find "$EXISTING_JOB_DIR" -mindepth 2 -maxdepth 2 -name config.json -print0 2>/dev/null | \

xargs -0 -r sed -i.bak-pinggy "s|\"api_base\": \"https://[a-zA-Z0-9]\+\.a\.pinggy\.link/v1\"|\"api_base\": \"https://${EVAL_PINGGY_URL}/v1\"|g"

fi

fi

if [ -n "${EVAL_PINGGY_URL:-}" ]; then

echo "Patching api_base Pinggy URL in config.json: -> https://${EVAL_PINGGY_URL}/v1"

sed -i.bak-pinggy "s|\"api_base\": \"https://[^/]\+\.pinggy\.link/v1\"|\"api_base\": \"https://${EVAL_PINGGY_URL}/v1\"|g" "$EXISTING_JOB_DIR/config.json"

echo "Patching Pinggy URL in per-trial config.json files: -> https://${EVAL_PINGGY_URL}/v1"

find "$EXISTING_JOB_DIR" -mindepth 2 -maxdepth 2 -name config.json -print0 2>/dev/null | \

xargs -0 -r sed -i.bak-pinggy "s|\"api_base\": \"https://[^/]\+\.pinggy\.link/v1\"|\"api_base\": \"https://${EVAL_PINGGY_URL}/v1\"|g"

fi

False positive on the syntax-error concern: PATCHED_TRIAL_COUNT is assigned just above (line 855) via find ... | wc -l, which always emits an integer (0 when none match), so the -gt 0 test is never run against an unset/empty value. On the regex: all our Pinggy persistent subdomains are <alnum>.a.pinggy.link (no hyphens), so the current pattern matches; broadening isn't needed for the URLs we issue. Leaving as-is.

gemini-code-assist · 2026-06-22T08:16:35Z

+        if [ -x "$PROXYCHAINS_BIN" ] && [ -n "${PROXYCHAINS_CONF_FILE:-}" ]; then
+            _PINGGY_PROXY_PREFIX="$PROXYCHAINS_BIN -f $PROXYCHAINS_CONF_FILE "
+        fi


If set -u (nounset) is enabled in the bash environment, referencing $PROXYCHAINS_BIN directly when it is unset or empty will cause the script to crash. Use ${PROXYCHAINS_BIN:-} to safely default to an empty string, consistent with how ${PROXYCHAINS_CONF_FILE:-} is handled in the same line.

Suggested change

if [ -x "$PROXYCHAINS_BIN" ] && [ -n "${PROXYCHAINS_CONF_FILE:-}" ]; then

_PINGGY_PROXY_PREFIX="$PROXYCHAINS_BIN -f $PROXYCHAINS_CONF_FILE "

fi

if [ -x "${PROXYCHAINS_BIN:-}" ] && [ -n "${PROXYCHAINS_CONF_FILE:-}" ]; then

_PINGGY_PROXY_PREFIX="$PROXYCHAINS_BIN -f $PROXYCHAINS_CONF_FILE "

fi

False positive: this script runs under set -eo pipefail (line 45), not set -u, so an unset var expands to empty rather than crashing. PROXYCHAINS_BIN is also always defined (line 251, ${EVAL_PROXYCHAINS_BIN:-}). The existing call sites (lines 340, 373) reference "$PROXYCHAINS_BIN" the same way, so this stays consistent with the file.

gemini-code-assist · 2026-06-22T08:16:35Z

+        export OPENAI_API_KEY="fake_key"
+        export OPENAI_API_BASE="${TUNNEL_URL}/v1"
+        export OPENAI_BASE_URL="${TUNNEL_URL}/v1"
+        _RESUME_AGENT_NAME=$(${PYTHON_BIN:-python3} -c "import yaml;c=yaml.safe_load(open('$EXISTING_JOB_DIR/config.json'));a=(c.get('agents') or [{}]);print((a[0] or {}).get('name',''))" 2>/dev/null || echo "")


The script uses yaml.safe_load to parse config.json. Since config.json is a standard JSON file, using Python's built-in json module is much more robust as it is guaranteed to be available in any Python environment, whereas yaml (PyYAML) is a third-party dependency that might not be installed.

Suggested change

_RESUME_AGENT_NAME=$(${PYTHON_BIN:-python3} -c "import yaml;c=yaml.safe_load(open('$EXISTING_JOB_DIR/config.json'));a=(c.get('agents') or [{}]);print((a[0] or {}).get('name',''))" 2>/dev/null || echo "")

_RESUME_AGENT_NAME=$(${PYTHON_BIN:-python3} -c "import json;c=json.load(open('$EXISTING_JOB_DIR/config.json'));a=(c.get('agents') or [{}]);print((a[0] or {}).get('name',''))" 2>/dev/null || echo "")

Minor/no-op in practice: PyYAML is present in the eval conda env and is already used to parse these configs elsewhere in this same script (lines 965, 1019), and YAML is a superset of JSON so yaml.safe_load reads config.json correctly. Switching only this line to json would make it inconsistent with those two existing reads. Happy to convert all three to json as a separate tidy if preferred, but leaving as-is for now.

gemini-code-assist · 2026-06-22T08:16:35Z

+            if [ -x "$PROXYCHAINS_BIN" ] && [ -n "${PROXYCHAINS_CONF_FILE:-}" ]; then
+                _PINGGY_PROXY_PREFIX="$PROXYCHAINS_BIN -f $PROXYCHAINS_CONF_FILE "
+            fi


If set -u (nounset) is enabled, referencing $PROXYCHAINS_BIN directly when it is unset will cause a crash. Use ${PROXYCHAINS_BIN:-} to safely default to an empty string.

Suggested change

if [ -x "$PROXYCHAINS_BIN" ] && [ -n "${PROXYCHAINS_CONF_FILE:-}" ]; then

_PINGGY_PROXY_PREFIX="$PROXYCHAINS_BIN -f $PROXYCHAINS_CONF_FILE "

fi

if [ -x "${PROXYCHAINS_BIN:-}" ] && [ -n "${PROXYCHAINS_CONF_FILE:-}" ]; then

_PINGGY_PROXY_PREFIX="$PROXYCHAINS_BIN -f $PROXYCHAINS_CONF_FILE "

fi

Same as the line-890 thread: the script is set -eo pipefail (not set -u) and PROXYCHAINS_BIN is always defined (line 251), so no crash risk. Kept consistent with the existing [ -x "$PROXYCHAINS_BIN" ] call sites.

richardzhuang0412 and others added 3 commits June 22, 2026 10:13

gemini-code-assist Bot reviewed Jun 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

eval: resume-stack fixes (G18 Pinggy tunnel) + listener pre-download wedge fix#31

eval: resume-stack fixes (G18 Pinggy tunnel) + listener pre-download wedge fix#31
richardzhuang0412 wants to merge 3 commits into
mainfrom
richard/eval-resume-stack-followups

richardzhuang0412 commented Jun 22, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 22, 2026

Uh oh!

richardzhuang0412 Jun 22, 2026

Uh oh!

gemini-code-assist Bot Jun 22, 2026

Uh oh!

richardzhuang0412 Jun 22, 2026

Uh oh!

gemini-code-assist Bot Jun 22, 2026

Uh oh!

richardzhuang0412 Jun 22, 2026

Uh oh!

gemini-code-assist Bot Jun 22, 2026

Uh oh!

richardzhuang0412 Jun 22, 2026

Uh oh!

gemini-code-assist Bot Jun 22, 2026

Uh oh!

richardzhuang0412 Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	_RESUME_AGENT_NAME=$(${PYTHON_BIN:-python3} -c "import yaml;c=yaml.safe_load(open('$EXISTING_JOB_DIR/config.json'));a=(c.get('agents') or [{}]);print((a[0] or {}).get('name',''))" 2>/dev/null \|\| echo "")
	_RESUME_AGENT_NAME=$(${PYTHON_BIN:-python3} -c "import json;c=json.load(open('$EXISTING_JOB_DIR/config.json'));a=(c.get('agents') or [{}]);print((a[0] or {}).get('name',''))" 2>/dev/null \|\| echo "")

Conversation

richardzhuang0412 commented Jun 22, 2026

Changes

Validation

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

richardzhuang0412 Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

richardzhuang0412 Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

richardzhuang0412 Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

richardzhuang0412 Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

richardzhuang0412 Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant