Release: mcpmark verified — pinned versions + stabilized standard verifiers by zjwu0522 · Pull Request #264 · eval-sys/mcpmark

zjwu0522 · 2026-06-12T09:54:10Z

MCPMark Verified

MCPMark Verified is a stabilized version of MCPMark's standard task set. The tasks are unchanged. What changed is that every environment is pinned to a fixed server version, and every verification script has been reviewed and tightened so that a correct solution passes and an incorrect one fails, consistently across runs and over time.

This PR promotes the pin-all-versions integration branch to main for the release.

Pinned environments

| Environment | Pinned server |
| --- | --- |
| Filesystem | `@modelcontextprotocol/server-filesystem@2025.12.18` |
| GitHub | `ghcr.io/github/github-mcp-server:v0.15.0` |
| Notion | `@notionhq/notion-mcp-server@1.9.1` |
| Playwright | `@playwright/mcp@0.0.68` |
| Postgres | `postgres-mcp==0.3.0` |

The evaluation harness is pinned as well (model call parameters, reasoning-effort handling, and the agent loop), so the model under test is the only variable across runs.
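As an illustration of what "pinned" means for the servers listed in this section, the sketch below checks that each server spec names one exact version rather than a floating tag or a version range. Only the spec strings come from this PR; the `PINNED_SERVERS` mapping, the regular expression, and the helper names are hypothetical and are not taken from the MCPMark code base.

```python
import re

# Spec strings copied from this PR; the dict itself is a hypothetical layout.
PINNED_SERVERS = {
    "filesystem": "@modelcontextprotocol/server-filesystem@2025.12.18",
    "github": "ghcr.io/github/github-mcp-server:v0.15.0",
    "notion": "@notionhq/notion-mcp-server@1.9.1",
    "playwright": "@playwright/mcp@0.0.68",
    "postgres": "postgres-mcp==0.3.0",
}

# Assumption: an exact pin ends with a separator (npm "@", Docker ":", pip "==")
# followed by a dotted numeric version, optionally prefixed with "v".
_EXACT_PIN = re.compile(r"(?:@|:|==)v?\d+(?:\.\d+)+$")


def is_exact_pin(spec: str) -> bool:
    """Return True if the spec ends in one concrete version number."""
    return bool(_EXACT_PIN.search(spec))


def unpinned(servers: dict) -> list:
    """Return the sorted names of servers whose spec is not an exact pin."""
    return sorted(name for name, spec in servers.items() if not is_exact_pin(spec))
```

A check of this kind could run in CI so that a floating spec such as `@playwright/mcp@latest` or `postgres-mcp>=0.3.0` is flagged before it changes benchmark results.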

Verifier changes

All 127 standard tasks were reviewed. Fixes fall into two categories.

Major — a verifier or its fixture was rebuilt:

Postgres dba_vector_analysis: the 500-line vector fixture is inlined into setup and the verifier rewritten for deterministic state.
Playwright extraction_table: regenerated the reference data and rewrote the extraction checks.
WebArena search_filtering_operations and the shopping-admin analytics tasks (fitness_promotion_strategy, marketing_customer_analysis, sales_inventory_analysis, customer_segmentation_setup): reworked logic and clarified descriptions.
Notion work_history_addition, hyperfocus_analysis_report, quarterly_review_dashboard: overhauled verifiers and descriptions on the most error-prone pages.

Minor — targeted robustness fixes: GitHub (PR-title-aware squash detection, case-insensitive matching, pinned ESLint v8), Postgres (tighter acceptance conditions and role cleanup), Filesystem (clarified descriptions), and smart-quote normalization across WebArena.
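To make two of the minor fixes concrete, here is a minimal sketch of smart-quote normalization combined with case-insensitive matching. It is not the code from the verify scripts: the helper names are hypothetical, and applying both normalizations in a single function is an assumption made for brevity (the PR applies them to different task families).

```python
# Map typographic ("smart") quotes to their ASCII equivalents.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize(text: str) -> str:
    """Replace smart quotes, trim surrounding whitespace, and fold case."""
    return text.translate(_QUOTE_MAP).strip().casefold()


def answers_match(expected: str, actual: str) -> bool:
    """Compare two strings after normalization (hypothetical helper)."""
    return normalize(expected) == normalize(actual)
```

With this normalization, an answer that differs from the reference only in quote style or letter case is accepted, while any other textual difference still fails.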

Models / reasoning effort

Added a public gpt-5.5 model entry, plus xhigh and max reasoning-effort levels.
LiteLLM config: added enforcer_mode, think_mode, max_tokens, and temperature settings.
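The sketch below shows one way a harness might validate a reasoning-effort level before building model call parameters. It is an assumption-laden illustration, not the harness code: `ModelCallConfig`, `REASONING_EFFORTS`, and `to_kwargs` are hypothetical names, and the `low`/`medium`/`high` levels are assumed to exist alongside the `xhigh` and `max` levels this PR adds. `enforcer_mode` and `think_mode` are omitted because their semantics are not described here.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# "xhigh" and "max" come from this PR; the lower levels are an assumption.
REASONING_EFFORTS = ("low", "medium", "high", "xhigh", "max")


@dataclass(frozen=True)
class ModelCallConfig:
    """Hypothetical container for per-model call parameters."""

    model: str
    reasoning_effort: Optional[str] = None
    max_tokens: Optional[int] = None
    temperature: Optional[float] = None

    def __post_init__(self) -> None:
        # Reject unknown effort levels early instead of at request time.
        effort = self.reasoning_effort
        if effort is not None and effort not in REASONING_EFFORTS:
            raise ValueError(
                f"unknown reasoning effort {effort!r}; expected one of {REASONING_EFFORTS}"
            )

    def to_kwargs(self) -> dict:
        """Return only the parameters that were explicitly set."""
        return {key: value for key, value in asdict(self).items() if value is not None}
```

For example, `ModelCallConfig("gpt-5.5", reasoning_effort="max").to_kwargs()` yields a dictionary containing only the model name and the effort level, so unset parameters fall back to whatever defaults the caller applies.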

Rolled-up PRs

#252, #255 (github) · #254 (playwright) · #260 (notion) · #262 (postgres) · #261 (reasoning effort) · #263 (github legacy_name)

🤖 Generated with Claude Code

Co-authored-by: xyliugo <liuxiangyan6@gmail.com>
Co-authored-by: dulingxiao <lxdu0314@gmail.com>

Pin filesystem (@2025.12.18), postgres (0.3.0), and playwright (0.0.68) versions. Also pin notion @1.9.1 in base_agent.py for consistency with mcpmark_agent.py. GitHub (v0.15.0) and notion were already pinned in #246. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

xyliugo

lgtm

zjwu0522 and others added 12 commits April 14, 2026 09:34

fix: refine 7 filesystem standard tasks with clarified descriptions and verify scripts

118f0ea

fix: correct directory name from desktop_2 to desktop in project_management task

8f96327

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

feat: add enforcer_mode, think_mode, max_tokens and temperature to LiteLLM config

3ef687d

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: exempt mcpmark-cicd repo from private-only restriction

175f75f

mcpmark-cicd needs to be public for GitHub Actions workflows to work. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: correct verification scripts for standard GitHub tasks (#252)

84d68e2

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: stabilize verification scripts for 5 standard GitHub tasks (#255)

833a4a5

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tighten standard playwright_webarena verify scripts + smart-quote normalization (#254)

6cdd3a3

Stabilize standard Notion verify scripts + descriptions (#260)

62f0c12

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix: tighten standard PostgreSQL task verifiers (#262)

d03f504

Co-authored-by: dulingxiao <lxdu0314@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat: allow 'max' reasoning effort (#261)

8067263

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix: case-insensitive match for missing-semester find_legacy_name (#263)

dfa8f98

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

xyliugo approved these changes Jun 12, 2026

zjwu0522 and others added 2 commits June 12, 2026 10:20

docs: announce MCPMark Verified in README

1da6109

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge main into pin-all-versions (bring #258 docs)

7bbea1d

# Conflicts: # README.md

zjwu0522 merged commit 84faaca into main Jun 12, 2026
2 checks passed

zjwu0522 deleted the pin-all-versions branch June 12, 2026 10:42

zjwu0522 mentioned this pull request Jun 12, 2026

docs: mark MCPMark Verified as the default task set #265

Merged

zjwu0522 merged 14 commits into main from pin-all-versions

3 participants
