eval: resume-stack fixes (G18 Pinggy tunnel) + listener pre-download wedge fix#31
eval: resume-stack fixes (G18 Pinggy tunnel) + listener pre-download wedge fix#31richardzhuang0412 wants to merge 3 commits into
Conversation
Preserve accumulated eval working-tree fixes so a fresh clone / pull doesn't lose them (notably the uncommitted G18 patch). - unified_eval_harbor.sbatch: G18 — start the Pinggy tunnel in the RESUME branch (gated on EVAL_PINGGY_URL/TOKEN, 6x connectivity-verify + FATAL-on-fail) so Cat-3 installed-agent (swe-agent) resumes don't fail 100% with Errno 111. Plus G10/G12 api_base port + per-trial config.json Pinggy/port rewrite on resume. - resume_chunked.py: orchestrator passthrough for --pinggy-url/--pinggy-token/ --config-yaml/--agent-parser; --max-resume-count plumbing. - unified_eval_listener.py: resume detection + API-mode model-name passthrough. - check_progress.py: resume tagging + API-mode banner parsing. - baseline_model_configs_minimal.yaml: SERA Pattern A per-model entries. - dcagent_eval_config_swe_agent.yaml: exclude_exceptions parity alignment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…h vLLM The G18 fix started the Pinggy tunnel on resume but never exported the OPENAI_BASE_URL/OPENAI_API_BASE env that the sandboxed agent reads with priority over the localhost api_base in config.json. The fresh-fire path exports these (~L1123-1165); the resume branch didn't — so swe-agent/openhands fell back to localhost:<port> inside the remote Daytona sandbox and every post-resume LLM call failed with [Errno 111], yielding empty patches / reward 0 on EVERY post-resume trial (pre-resume ~50% -> post ~0%). Fix: after the resume-branch tunnel verify, export TUNNEL_URL + OPENAI_API_KEY/ OPENAI_API_BASE/OPENAI_BASE_URL (+ openhands LLM_BASE_URL, mini-swe-agent MSWEA_*/HOSTED_VLLM_API_BASE; agent read from saved config.json). Gated inside the EVAL_PINGGY_URL/EVAL_PINGGY_TOKEN block -> no-op for terminus-2. Verified 2026-06-09 (refire 672921-672926): pinggy TC 54-68 (vs broken TC=2), post-resume accuracy back to pre-resume levels (~50% swebench, ~33% aider), no Errno 111. Note: the (resume) Connectivity verified check is a false positive (compute-node only) — confirm via TC-climbing + non-zero reward. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The per-model HF pre-download wrapped snapshot_download in a `with ThreadPoolExecutor() ... future.result(timeout=300)` block. When the 300s timeout fired, exiting the `with` called shutdown(wait=True), which re-blocked indefinitely on the still-running (un-killable) download thread — so a single slow/large download hung the entire fire for hours (observed on the 2026-06 ablation batch: stalled on one model at 3/4 shards, only 4 of 24 jobs ever submitted). Fixes: - Drive the executor manually and shutdown(wait=False) in finally, so the wall timeout actually unblocks the loop. - Pass etag_timeout=120 + max_workers=8 to snapshot_download so slow HF metadata/file fetches make progress instead of stalling on the default 10s serial etag retries (matches the manual parallel-predownload workaround that recovered the wedge). - Bump the hard cap 300s -> 600s to accommodate large 32B checkpoints. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces several enhancements to the evaluation pipeline, including parallelized log parsing in check_progress.py, updated model configurations, support for Pinggy tunnel setup and environment propagation during job resumes in unified_eval_harbor.sbatch, and robust model pre-downloading with hard timeouts in unified_eval_listener.py. The review feedback identifies several critical improvement opportunities, such as explicitly importing concurrent.futures to avoid a NameError, handling potentially unbound variables under set -u in bash scripts, avoiding bash syntax errors when checking PATCHED_TRIAL_COUNT, and using Python's built-in json module instead of yaml to parse JSON configurations.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if fire_log_files: | ||
| fi_workers = min(32, max(4, len(fire_log_files))) | ||
| with concurrent.futures.ThreadPoolExecutor(max_workers=fi_workers) as fi_exec: |
There was a problem hiding this comment.
The code uses concurrent.futures.ThreadPoolExecutor but concurrent or concurrent.futures might not be imported in this file. To prevent a potential NameError at runtime, please import concurrent.futures explicitly.
| if fire_log_files: | |
| fi_workers = min(32, max(4, len(fire_log_files))) | |
| with concurrent.futures.ThreadPoolExecutor(max_workers=fi_workers) as fi_exec: | |
| if fire_log_files: | |
| import concurrent.futures | |
| fi_workers = min(32, max(4, len(fire_log_files))) | |
| with concurrent.futures.ThreadPoolExecutor(max_workers=fi_workers) as fi_exec: |
There was a problem hiding this comment.
Thanks — false positive: concurrent.futures is already imported at module level (line 13), so there's no NameError risk here. No change needed.
| if [ -n "${EVAL_PINGGY_URL:-}" ]; then | ||
| echo "Patching api_base Pinggy URL in config.json: -> https://${EVAL_PINGGY_URL}/v1" | ||
| sed -i.bak-pinggy "s|\"api_base\": \"https://[a-zA-Z0-9]\+\.a\.pinggy\.link/v1\"|\"api_base\": \"https://${EVAL_PINGGY_URL}/v1\"|g" "$EXISTING_JOB_DIR/config.json" | ||
| if [ "$PATCHED_TRIAL_COUNT" -gt 0 ]; then | ||
| echo "Patching Pinggy URL in $PATCHED_TRIAL_COUNT per-trial config.json files: -> https://${EVAL_PINGGY_URL}/v1" | ||
| find "$EXISTING_JOB_DIR" -mindepth 2 -maxdepth 2 -name config.json -print0 2>/dev/null | \ | ||
| xargs -0 -r sed -i.bak-pinggy "s|\"api_base\": \"https://[a-zA-Z0-9]\+\.a\.pinggy\.link/v1\"|\"api_base\": \"https://${EVAL_PINGGY_URL}/v1\"|g" | ||
| fi | ||
| fi |
There was a problem hiding this comment.
The variable PATCHED_TRIAL_COUNT is used in the conditional check [ "$PATCHED_TRIAL_COUNT" -gt 0 ]. If this variable is unset or empty, it will cause a bash syntax error (integer expression expected). Since xargs -0 -r already safely handles empty inputs without executing the command, the conditional check is redundant. Additionally, the regular expression [a-zA-Z0-9]\+\.a\.pinggy\.link can be made more robust by using [^/]\+\.pinggy\.link to handle any Pinggy subdomain format (including those with hyphens or without the .a. subdomain).
| if [ -n "${EVAL_PINGGY_URL:-}" ]; then | |
| echo "Patching api_base Pinggy URL in config.json: -> https://${EVAL_PINGGY_URL}/v1" | |
| sed -i.bak-pinggy "s|\"api_base\": \"https://[a-zA-Z0-9]\+\.a\.pinggy\.link/v1\"|\"api_base\": \"https://${EVAL_PINGGY_URL}/v1\"|g" "$EXISTING_JOB_DIR/config.json" | |
| if [ "$PATCHED_TRIAL_COUNT" -gt 0 ]; then | |
| echo "Patching Pinggy URL in $PATCHED_TRIAL_COUNT per-trial config.json files: -> https://${EVAL_PINGGY_URL}/v1" | |
| find "$EXISTING_JOB_DIR" -mindepth 2 -maxdepth 2 -name config.json -print0 2>/dev/null | \ | |
| xargs -0 -r sed -i.bak-pinggy "s|\"api_base\": \"https://[a-zA-Z0-9]\+\.a\.pinggy\.link/v1\"|\"api_base\": \"https://${EVAL_PINGGY_URL}/v1\"|g" | |
| fi | |
| fi | |
| if [ -n "${EVAL_PINGGY_URL:-}" ]; then | |
| echo "Patching api_base Pinggy URL in config.json: -> https://${EVAL_PINGGY_URL}/v1" | |
| sed -i.bak-pinggy "s|\"api_base\": \"https://[^/]\+\.pinggy\.link/v1\"|\"api_base\": \"https://${EVAL_PINGGY_URL}/v1\"|g" "$EXISTING_JOB_DIR/config.json" | |
| echo "Patching Pinggy URL in per-trial config.json files: -> https://${EVAL_PINGGY_URL}/v1" | |
| find "$EXISTING_JOB_DIR" -mindepth 2 -maxdepth 2 -name config.json -print0 2>/dev/null | \ | |
| xargs -0 -r sed -i.bak-pinggy "s|\"api_base\": \"https://[^/]\+\.pinggy\.link/v1\"|\"api_base\": \"https://${EVAL_PINGGY_URL}/v1\"|g" | |
| fi |
There was a problem hiding this comment.
False positive on the syntax-error concern: PATCHED_TRIAL_COUNT is assigned just above (line 855) via find ... | wc -l, which always emits an integer (0 when none match), so the -gt 0 test is never run against an unset/empty value. On the regex: all our Pinggy persistent subdomains are <alnum>.a.pinggy.link (no hyphens), so the current pattern matches; broadening isn't needed for the URLs we issue. Leaving as-is.
| if [ -x "$PROXYCHAINS_BIN" ] && [ -n "${PROXYCHAINS_CONF_FILE:-}" ]; then | ||
| _PINGGY_PROXY_PREFIX="$PROXYCHAINS_BIN -f $PROXYCHAINS_CONF_FILE " | ||
| fi |
There was a problem hiding this comment.
If set -u (nounset) is enabled in the bash environment, referencing $PROXYCHAINS_BIN directly when it is unset or empty will cause the script to crash. Use ${PROXYCHAINS_BIN:-} to safely default to an empty string, consistent with how ${PROXYCHAINS_CONF_FILE:-} is handled in the same line.
| if [ -x "$PROXYCHAINS_BIN" ] && [ -n "${PROXYCHAINS_CONF_FILE:-}" ]; then | |
| _PINGGY_PROXY_PREFIX="$PROXYCHAINS_BIN -f $PROXYCHAINS_CONF_FILE " | |
| fi | |
| if [ -x "${PROXYCHAINS_BIN:-}" ] && [ -n "${PROXYCHAINS_CONF_FILE:-}" ]; then | |
| _PINGGY_PROXY_PREFIX="$PROXYCHAINS_BIN -f $PROXYCHAINS_CONF_FILE " | |
| fi |
There was a problem hiding this comment.
False positive: this script runs under set -eo pipefail (line 45), not set -u, so an unset var expands to empty rather than crashing. PROXYCHAINS_BIN is also always defined (line 251, ${EVAL_PROXYCHAINS_BIN:-}). The existing call sites (lines 340, 373) reference "$PROXYCHAINS_BIN" the same way, so this stays consistent with the file.
| export OPENAI_API_KEY="fake_key" | ||
| export OPENAI_API_BASE="${TUNNEL_URL}/v1" | ||
| export OPENAI_BASE_URL="${TUNNEL_URL}/v1" | ||
| _RESUME_AGENT_NAME=$(${PYTHON_BIN:-python3} -c "import yaml;c=yaml.safe_load(open('$EXISTING_JOB_DIR/config.json'));a=(c.get('agents') or [{}]);print((a[0] or {}).get('name',''))" 2>/dev/null || echo "") |
There was a problem hiding this comment.
The script uses yaml.safe_load to parse config.json. Since config.json is a standard JSON file, using Python's built-in json module is much more robust as it is guaranteed to be available in any Python environment, whereas yaml (PyYAML) is a third-party dependency that might not be installed.
| _RESUME_AGENT_NAME=$(${PYTHON_BIN:-python3} -c "import yaml;c=yaml.safe_load(open('$EXISTING_JOB_DIR/config.json'));a=(c.get('agents') or [{}]);print((a[0] or {}).get('name',''))" 2>/dev/null || echo "") | |
| _RESUME_AGENT_NAME=$(${PYTHON_BIN:-python3} -c "import json;c=json.load(open('$EXISTING_JOB_DIR/config.json'));a=(c.get('agents') or [{}]);print((a[0] or {}).get('name',''))" 2>/dev/null || echo "") |
There was a problem hiding this comment.
Minor/no-op in practice: PyYAML is present in the eval conda env and is already used to parse these configs elsewhere in this same script (lines 965, 1019), and YAML is a superset of JSON so yaml.safe_load reads config.json correctly. Switching only this line to json would make it inconsistent with those two existing reads. Happy to convert all three to json as a separate tidy if preferred, but leaving as-is for now.
| if [ -x "$PROXYCHAINS_BIN" ] && [ -n "${PROXYCHAINS_CONF_FILE:-}" ]; then | ||
| _PINGGY_PROXY_PREFIX="$PROXYCHAINS_BIN -f $PROXYCHAINS_CONF_FILE " | ||
| fi |
There was a problem hiding this comment.
If set -u (nounset) is enabled, referencing $PROXYCHAINS_BIN directly when it is unset will cause a crash. Use ${PROXYCHAINS_BIN:-} to safely default to an empty string.
| if [ -x "$PROXYCHAINS_BIN" ] && [ -n "${PROXYCHAINS_CONF_FILE:-}" ]; then | |
| _PINGGY_PROXY_PREFIX="$PROXYCHAINS_BIN -f $PROXYCHAINS_CONF_FILE " | |
| fi | |
| if [ -x "${PROXYCHAINS_BIN:-}" ] && [ -n "${PROXYCHAINS_CONF_FILE:-}" ]; then | |
| _PINGGY_PROXY_PREFIX="$PROXYCHAINS_BIN -f $PROXYCHAINS_CONF_FILE " | |
| fi |
There was a problem hiding this comment.
Same as the line-890 thread: the script is set -eo pipefail (not set -u) and PROXYCHAINS_BIN is always defined (line 251), so no crash risk. Kept consistent with the existing [ -x "$PROXYCHAINS_BIN" ] call sites.
Follow-on eval-stack fixes landed after PR #27 merged, validated in production on JSC Jupiter.
Changes
unified_eval_harbor.sbatch): the sbatch resume branch patched the Pinggy URL into trial configs but never started the tunnel (the tunnel block was fresh-fire-only), so Cat-3 installed/swe-agent resumes hit[Errno 111] Connection refusedon every LLM call → 100% trial failure. Tunnel-start added to the resume branch, gated onEVAL_PINGGY_URL(terminus-2 resumes + all fresh fires untouched → zero blast radius). G18b additionally exports the tunnel env so sandbox agents reach vLLM.unified_eval_listener.py): the per-model HF pre-download wrappedsnapshot_downloadin awith ThreadPoolExecutor() ... result(timeout=300)block; when the timeout fired, thewithexit'sshutdown(wait=True)re-blocked indefinitely on the still-running (un-killable) download thread — a single slow download hung an entire fire for hours (observed: 4 of 24 jobs submitted before wedging). Now drives the executor manually withshutdown(wait=False), and passesetag_timeout=120+max_workers=8so slow HF metadata fetches make progress instead of stalling on the default 10s serial etag retries.resume_chunked.pypasses through--resume-error-threshold;check_progress.pyAPI-mode banner parsing + resume detection; minimal baseline config gains the SERA Pattern A per-model entries; swe-agent harbor config tweaks.Validation
🤖 Generated with Claude Code