Frontier-Eng is a benchmark designed to evaluate the ability of AI Agents to solve open-ended optimization problems in real-world engineering domains.
Unlike existing benchmarks that focus on Computer Science (CS) or purely abstract mathematical problems, Frontier-Eng focuses on engineering challenges with actual economic benefits and physical constraints. It is expected to cover multiple fields such as aerospace, civil engineering, EDA, bioengineering, and more.
Current AI4Research evaluation systems have the following limitations:
- Limited Evaluation Methods: Most adopt 0/1 binary evaluation or closed-interval rubrics, failing to effectively measure an Agent's ability to perform iterative optimization through interaction in an open world.
- Domain Limitations: Existing benchmarks are mostly confined to the CS domain (e.g., code generation) or reduce real problems to highly abstract math problems, stripping away real-world complexity and preventing Agents from utilizing rich external knowledge and tools.
- Metric Bias: Traditional computational metrics focus on a model's average performance, whereas for engineering optimization problems we should focus more on the Peak Performance a model can achieve on a single problem through exploration mechanisms.
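The difference between average and peak performance can be made concrete with a small sketch. The scores below are invented for illustration, higher is assumed to be better, and `mean_score` and `peak_score` are hypothetical helper names rather than part of Frontier-Eng's evaluation code.

```python
from statistics import mean


def mean_score(scores):
    """Average score over all attempts on one problem."""
    return mean(scores)


def peak_score(scores):
    """Best score reached on one problem (higher is assumed better)."""
    return max(scores)


# Illustrative (made-up) scores from five exploration attempts on one problem.
attempts = [0.25, 0.5, 0.25, 1.0, 0.5]

print(mean_score(attempts))  # 0.5
print(peak_score(attempts))  # 1.0
```

An average-based metric would report 0.5 here, while a peak-based metric credits the single best design the Agent found, which is usually the one an engineer would actually ship.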
Frontier-Eng aims to evaluate the ability of Agents to solve problems with practical value across a wide range of engineering disciplines by providing rich context and tool support.
We need the power of the community to expand the coverage of the Benchmark. We welcome the submission of new engineering problems via Pull Requests (PR). If you wish to contribute, please follow the standards and processes below:
- Reality Gap: Must be close to reality, considering real-world influencing factors, not purely abstract mathematics.
- Economic Value: The problem should have clear engineering or economic value upon solution.
- Verifiability: Must provide an executable verification program (Docker preferred) capable of completing the evaluation within an acceptable time.
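As a rough illustration of these criteria, a verification program might check physical constraints and return a continuous score rather than a pass/fail verdict. The sketch below is a toy example under stated assumptions: the `evaluate` function, the solution fields (`fuel_used`, `touchdown_speed`), and the constants are all hypothetical and do not describe the interface of any existing Frontier-Eng task.

```python
import json
import time


def evaluate(solution, constants, time_limit_s=60.0):
    """Score a candidate solution for a toy problem (hypothetical interface).

    Returns a feasibility flag and a continuous score instead of a
    0/1 pass/fail verdict.
    """
    start = time.monotonic()
    fuel_used = float(solution["fuel_used"])
    touchdown_speed = float(solution["touchdown_speed"])

    # Physical constraint: hard limit on touchdown speed.
    feasible = touchdown_speed <= constants["max_touchdown_speed"]

    # Objective: less fuel is better; infeasible solutions score zero.
    if feasible:
        score = max(0.0, 1.0 - fuel_used / constants["fuel_budget"])
    else:
        score = 0.0

    # Report whether the evaluation itself stayed within the time budget.
    elapsed = time.monotonic() - start
    return {"feasible": feasible, "score": score, "within_time": elapsed <= time_limit_s}


# Made-up constants and candidate solution, for illustration only.
constants = {"max_touchdown_speed": 2.0, "fuel_budget": 1000.0}
result = evaluate({"fuel_used": 750.0, "touchdown_speed": 1.5}, constants)
print(json.dumps(result))
```

Returning a structured result with a continuous score lets an Agent compare successive designs during iterative optimization, which a binary verdict cannot support.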
Each Task should contain the following file structure:
<Domain_Name>/                       # Level 1 Directory: Domain Name (e.g., Astrodynamics)
├── README.md                        # [Required] Domain Overview (Default entry, EN or CN): Background & sub-task index
├── README_zh-CN.md                  # [Optional] Domain Overview (Chinese version. Used only if README.md is in English)
├── <Task_Name_A>/                   # Level 2 Directory: Specific Task Name (e.g., MannedLunarLanding)
│   ├── README.md                    # [Required] Navigation Doc: File structure, how to run & quick start
│   ├── README_zh-CN.md              # [Optional] Navigation Doc (Chinese version)
│   ├── Task.md                      # [Required] Task Detail Doc: Core doc including background, physical model, I/O definitions
│   ├── Task_zh-CN.md                # [Optional] Task Detail Doc (Chinese version)
│   ├── references/                  # References Directory
│   │   ├── constants.json           # Physical constants, simulation parameters, etc.
│   │   └── manuals.pdf              # Domain knowledge manual, physical equations, or constraints docs
│   ├── verification/                # Verification & Scoring System
│   │   ├── evaluator.py             # [Core] Scoring script entry point
│   │   ├── requirements.txt         # Dependencies required for the scoring environment
│   │   └── docker/                  # Environment containerization configuration
│   │       └── Dockerfile           # Ensures consistency of the evaluation environment
│   └── baseline/                    # [Optional] Baseline Solution / Example Code
│       ├── solution.py              # Reference code implementation
│       └── result_log.txt           # Execution log or scoring result of the reference code
└── <Task_Name_B>/                   # Another task under this domain
    └── ...
The above directory structure serves only as a reference template. Contributors may adjust the file organization based on specific circumstances, provided that all core elements (e.g., background, input/output, evaluation metrics) are included. Additionally, there are no restrictions on the programming language and format of the verification code.
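Before opening a PR, a contributor who follows the reference template unchanged could sanity-check a task directory with a few lines of Python. This is only a sketch: `missing_entries` and the `REQUIRED` list are hypothetical, and treating `verification/` as required is an assumption drawn from the Verifiability criterion, not a rule enforced by the repository.

```python
from pathlib import Path
import tempfile

# Entries marked [Required] in the reference template, plus verification/
# (assumed required because of the Verifiability criterion).
# Adjust this list if a task deviates from the template.
REQUIRED = ["README.md", "Task.md", "verification"]


def missing_entries(task_dir):
    """Return the required files/directories that are absent from task_dir."""
    task_dir = Path(task_dir)
    return [name for name in REQUIRED if not (task_dir / name).exists()]


# Demo on a throwaway directory where only part of the layout exists.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "Astrodynamics" / "MannedLunarLanding"
    (task / "verification").mkdir(parents=True)
    (task / "README.md").write_text("# MannedLunarLanding\n")
    demo_missing = missing_entries(task)
    print(demo_missing)  # ['Task.md']
```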
We adopt the standard GitHub collaboration flow:
- Fork this Repository: Click the "Fork" button in the top right corner to copy the project to your GitHub account.
- Create Branch:
  - Clone your Fork locally.
  - Create a new branch for development, recommended naming format: `feat/<Domain>/<TaskName>` (e.g., `feat/Astrodynamics/MarsLanding`).
- Add/Modify Content:
  - Add your engineering problem files following the submission format above.
  - Ensure all necessary explanatory documentation and verification code are included.
- Local Test: Run `evaluator.py` or build the Docker image to ensure the evaluation logic is correct and runs normally.
- Submit Pull Request (PR):
  - Push changes to your remote Fork.
  - Initiate a Pull Request to the `main` branch of this repository.
  - PR Description: Please briefly explain the background, source, and how to run the verification code for the Task.
- Code Review:
  - Agent Review: After submitting the PR, an AI Agent will first conduct an automated preliminary review (including code standards, basic logic verification, etc.) and may propose modifications directly in the PR.
  - Maintainer Review: After the Agent review passes, maintainers will conduct a final re-check. Once confirmed correct, your contribution will be merged.
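The recommended branch naming format can be checked mechanically. The snippet below is a minimal sketch: `is_recommended_branch` is a hypothetical helper, and the allowed character set (letters, digits, underscores) is an assumption, since the guidelines give only the `feat/<Domain>/<TaskName>` shape.

```python
import re

# Assumed character set for <Domain> and <TaskName>: letters, digits, underscores.
BRANCH_PATTERN = re.compile(r"feat/[A-Za-z0-9_]+/[A-Za-z0-9_]+")


def is_recommended_branch(name):
    """Return True if name matches feat/<Domain>/<TaskName>."""
    return BRANCH_PATTERN.fullmatch(name) is not None


print(is_recommended_branch("feat/Astrodynamics/MarsLanding"))  # True
print(is_recommended_branch("fix-typo"))                        # False
```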
💡 If this is your first contribution or you have questions about the directory structure, feel free to submit an Issue for discussion first.
The table below lists the current coverage of domain tasks in the Benchmark. We welcome not only code contributions but also ideas for challenging new engineering problems from the community.
| Domain | Task Name | Status | Maintainer/Contributor | Remarks |
|---|---|---|---|---|
| Astrodynamics | MannedLunarLanding | Completed | @jdp22 | Lunar soft landing trajectory optimization |
| Electronic Design Automation | Integration Physical Design Optimization | Under Development | @ahydchh | Chip Macrocell Layout Optimization |
| Kernel Engineering | MLA | Basically Completed, Awaiting Verification | @ahydchh | Kernel Engineering |
| Kernel Engineering | TriMul | Basically Completed, Awaiting Verification | @ahydchh | Kernel Engineering |
| Single Cell Analysis | denoising | Basically Completed, Awaiting Verification | @ahydchh | Single Cell Analysis |
| Single Cell Analysis | perturbation_prediction | Completed | — | OpenProblems perturbation response prediction (NeurIPS 2023 scPerturb) |
💡 Have an idea for a new engineering problem? Even if you cannot provide complete verification code for now, we highly welcome you to share good Task concepts! Please create an Issue detailing the real-world background and engineering value of the problem. After discussion and confirmation, we will add it to the table above to rally community power to solve it together.
An initial integration between some evaluation algorithms and benchmarks has been implemented. The core implementation is located in ./frontier_eval. For usage instructions, see the Evaluation README.
Welcome to our developer community! Whether you want to discuss new engineering problem concepts, find task collaborators, or encounter technical issues during your contribution, you can always communicate with us in the group.
- 🟢 Feishu (Lark): Click here to join our Feishu discussion group
- 🔜 Discord / Slack: (Preparing, coming soon...)