Frontier-Eng: Large-Scale Engineering Optimization Benchmark for AI Agents

Frontier-Eng is a benchmark designed to evaluate the ability of AI Agents to solve open-ended optimization problems in real-world engineering domains.

Unlike existing benchmarks that focus on Computer Science (CS) or purely abstract mathematical problems, Frontier-Eng focuses on engineering challenges with actual economic benefits and physical constraints. It is expected to cover multiple fields such as aerospace, civil engineering, EDA, bioengineering, and more.

🎯 Motivation

Current AI4Research evaluation systems have the following limitations:

Limited Evaluation Methods: Most adopt 0/1 binary evaluation or closed-interval rubrics, failing to effectively measure an Agent's ability to perform iterative optimization through interaction in an open world.
Domain Limitations: Existing benchmarks are mostly confined to the CS domain (e.g., code generation) or abstract real problems into mathematical ones, stripping away real-world complexity and preventing Agents from utilizing rich external knowledge and tools.
Metric Bias: Traditional metrics focus on a model's average performance, whereas for engineering optimization problems we should focus more on the Peak Performance a model can achieve on a single problem through exploration.

Frontier-Eng aims to evaluate the ability of Agents to solve problems with practical value across a wide range of engineering disciplines by providing rich context and tool support.
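
The difference between average and peak performance mentioned under Metric Bias can be illustrated with a minimal sketch. The function name `summarize_runs` and the scores below are hypothetical and are not part of the Frontier-Eng codebase; the sketch only assumes that an Agent makes several attempts at one task and that each attempt receives a numeric score.

```python
from statistics import mean


def summarize_runs(scores_by_problem, higher_is_better=True):
    """Return the mean and the peak score per problem from repeated attempts."""
    pick_best = max if higher_is_better else min
    return {
        name: {"mean": mean(scores), "peak": pick_best(scores)}
        for name, scores in scores_by_problem.items()
    }


# Hypothetical scores from five attempts on a single task (illustrative only).
runs = {"MannedLunarLanding": [0.42, 0.55, 0.61, 0.48, 0.90]}
summary = summarize_runs(runs)
print(summary)
```

In this made-up example the mean hides the one strong attempt, while the peak reports the best design the Agent actually found, which is the value an engineer would take forward.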

🤝 Contribution Guidelines

We need the power of the community to expand the coverage of the Benchmark. We welcome the submission of new engineering problems via Pull Requests (PR). If you wish to contribute, please follow the standards and processes below:

Sample Requirements

Reality Gap: The problem must stay close to reality and account for real-world influencing factors; it must not be purely abstract mathematics.
Economic Value: The problem should have clear engineering or economic value upon solution.
Verifiability: Must provide an executable verification program (Docker preferred) capable of completing the evaluation within an acceptable time.
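
As a rough illustration of the Verifiability requirement, a verification program might check physical constraints first and then report a continuous score instead of a pass/fail flag. The sketch below is an assumption, not the interface used by existing tasks: the `evaluate` function, the `objective` and `constraints` fields, and the `touchdown_speed` limit are all hypothetical names chosen for the example.

```python
def evaluate(candidate, limits):
    """Score one candidate design; infeasible candidates receive no score.

    candidate: {"objective": float, "constraints": {name: value}}  (hypothetical schema)
    limits:    {name: upper_bound}                                  (hypothetical schema)
    """
    # A constraint that is missing from the candidate counts as violated.
    violations = [
        name
        for name, upper in limits.items()
        if candidate["constraints"].get(name, float("inf")) > upper
    ]
    if violations:
        return {"valid": False, "score": None, "violations": violations}
    # Report a continuous objective value rather than a 0/1 result.
    return {"valid": True, "score": float(candidate["objective"]), "violations": []}


# Illustrative values only.
limits = {"touchdown_speed": 2.0}
feasible = evaluate({"objective": 812.5, "constraints": {"touchdown_speed": 1.8}}, limits)
infeasible = evaluate({"objective": 640.0, "constraints": {"touchdown_speed": 3.5}}, limits)
print(feasible)
print(infeasible)
```

Separating feasibility from the objective keeps a physically invalid design from ever outranking a valid one, whatever its raw objective value.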

Submission Format

Each Task should contain the following file structure:

<Domain_Name>/                       # Level 1 Directory: Domain Name (e.g., Astrodynamics)
├── README.md                        # [Required] Domain Overview (Default entry, EN or CN): Background & sub-task index
├── README_zh-CN.md                  # [Optional] Domain Overview (Chinese version. Used only if README.md is in English)
├── <Task_Name_A>/                   # Level 2 Directory: Specific Task Name (e.g., MannedLunarLanding)
│   ├── README.md                    # [Required] Navigation Doc: File structure, how to run & quick start
│   ├── README_zh-CN.md              # [Optional] Navigation Doc (Chinese version)
│   ├── Task.md                      # [Required] Task Detail Doc: Core doc including background, physical model, I/O definitions
│   ├── Task_zh-CN.md                # [Optional] Task Detail Doc (Chinese version)
│   ├── references/                  # References Directory
│   │   ├── constants.json           # Physical constants, simulation parameters, etc.
│   │   └── manuals.pdf              # Domain knowledge manual, physical equations, or constraints docs
│   ├── verification/                # Verification & Scoring System
│   │   ├── evaluator.py             # [Core] Scoring script entry point
│   │   ├── requirements.txt         # Dependencies required for the scoring environment
│   │   └── docker/                  # Environment containerization configuration
│   │       └── Dockerfile           # Ensures consistency of the evaluation environment
│   └── baseline/                    # [Optional] Baseline Solution / Example Code
│       ├── solution.py              # Reference code implementation
│       └── result_log.txt           # Execution log or scoring result of the reference code
└── <Task_Name_B>/                   # Another task under this domain
    └── ...

The above directory structure serves only as a reference template. Contributors may adjust the file organization to suit their task, provided that all core elements (e.g., background, input/output, evaluation metrics) are included. There are also no restrictions on the programming language or format of the verification code.
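
For contributors who do follow the reference template, a small pre-submission check along the following lines may help. It is a sketch under that assumption only: `missing_task_files` and `REQUIRED_TASK_FILES` are hypothetical names, and the list covers just the task-level entries marked [Required] or [Core] in the tree above.

```python
import tempfile
from pathlib import Path

# Task-level entries marked [Required] or [Core] in the reference template.
REQUIRED_TASK_FILES = ("README.md", "Task.md", "verification/evaluator.py")


def missing_task_files(task_dir):
    """Return the template files that are absent from task_dir."""
    root = Path(task_dir)
    return [rel for rel in REQUIRED_TASK_FILES if not (root / rel).is_file()]


# Demo on a throwaway directory that contains only README.md.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "Astrodynamics" / "MannedLunarLanding"
    (task / "verification").mkdir(parents=True)
    (task / "README.md").write_text("# Quick start\n")
    demo_missing = missing_task_files(task)

print(demo_missing)
```

A task that organizes its files differently would need its own list of expected paths, since the template itself is not mandatory.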

Contribution Process

We adopt the standard GitHub collaboration flow:

1. Fork this Repository: Click the "Fork" button in the top right corner to copy the project to your GitHub account.
2. Create Branch:
   - Clone your Fork locally.
   - Create a new branch for development, recommended naming format: feat/<Domain>/<TaskName> (e.g., feat/Astrodynamics/MarsLanding).
3. Add/Modify Content:
   - Add your engineering problem files following the submission format above.
   - Ensure all necessary explanatory documentation and verification code are included.
4. Local Test: Run evaluator.py or build the Docker image to ensure the evaluation logic is correct and runs normally.
5. Submit Pull Request (PR):
   - Push changes to your remote Fork.
   - Initiate a Pull Request to the main branch of this repository.
   - PR Description: Please briefly explain the background, source, and how to run the verification code for the Task.
6. Code Review:
   - Agent Review: After submitting the PR, an AI Agent will first conduct an automated preliminary review (including code standards, basic logic verification, etc.) and may propose modifications directly in the PR.
   - Maintainer Review: After the Agent review passes, maintainers will conduct a final re-check. Once confirmed correct, your contribution will be merged.
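
The recommended branch naming format from the Create Branch step can be generated and checked with a short helper. This is only a sketch: `branch_name` is a hypothetical function, and restricting the domain and task names to letters, digits, and underscores is an assumption made for the example, not a rule stated by the project.

```python
import re

# Assumed character set for <Domain> and <TaskName>; the project does not specify one.
BRANCH_PATTERN = re.compile(r"^feat/[A-Za-z0-9_]+/[A-Za-z0-9_]+$")


def branch_name(domain, task_name):
    """Build a branch name in the recommended feat/<Domain>/<TaskName> format."""
    name = f"feat/{domain}/{task_name}"
    if not BRANCH_PATTERN.match(name):
        raise ValueError(f"unexpected characters in branch name: {name!r}")
    return name


print(branch_name("Astrodynamics", "MarsLanding"))
```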

💡 If this is your first contribution or you have questions about the directory structure, feel free to submit an Issue for discussion first.

📊 Task Progress & Planning

The table below lists the current coverage of domain tasks in the Benchmark. We welcome not only code contributions but also ideas for challenging new engineering problems from the community.

| Domain | Task Name | Status | Maintainer/Contributor | Remarks |
| --- | --- | --- | --- | --- |
| Astrodynamics | `MannedLunarLanding` | Completed | @jdp22 | Lunar soft landing trajectory optimization |
| Electronic Design Automation | `Integration Physical Design Optimization` | Under Development | @ahydchh | Chip Macrocell Layout Optimization |
| Kernel Engineering | `MLA` | Basically Completed, Awaiting Verification | @ahydchh | Kernel Engineering |
| Kernel Engineering | `TriMul` | Basically Completed, Awaiting Verification | @ahydchh | Kernel Engineering |
| Single Cell Analysis | `denoising` | Basically Completed, Awaiting Verification | @ahydchh | Single Cell Analysis |
| Single Cell Analysis | `perturbation_prediction` | Completed | - | OpenProblems perturbation response prediction (NeurIPS 2023 scPerturb) |

💡 Have an idea for a new engineering problem? Even if you cannot provide complete verification code for now, we highly welcome you to share good Task concepts! Please create an Issue detailing the real-world background and engineering value of the problem. After discussion and confirmation, we will add it to the table above to rally community power to solve it together.

🧪 Evaluation Framework

An initial integration of some evaluation algorithms with the benchmarks has been implemented. The core implementation is located in ./frontier_eval. For usage instructions, see the Evaluation README.

💬 Join the Community

Welcome to our developer community! Whether you want to discuss new engineering problem concepts, find task collaborators, or get help with technical issues during your contribution, you can always reach us in the group.

🟢 Feishu (Lark): Click here to join our Feishu discussion group
🔜 Discord / Slack: (Preparing, coming soon...)
