
LeiDQ/Frontier-Engineering

 
 


Frontier-Eng: Large-Scale Engineering Optimization Benchmark for AI Agents

English | 简体中文

Frontier-Eng is a benchmark designed to evaluate the ability of AI Agents to solve open-ended optimization problems in real-world engineering domains.

Unlike existing benchmarks that focus on Computer Science (CS) or purely abstract mathematical problems, Frontier-Eng targets engineering challenges with real economic value and physical constraints. It is intended to cover multiple fields, including aerospace, civil engineering, EDA, and bioengineering.

🎯 Motivation

Current AI4Research evaluation systems have the following limitations:

  1. Limited Evaluation Methods: Most adopt 0/1 binary evaluation or closed-interval rubrics, which fail to measure an Agent's ability to optimize iteratively through interaction in an open world.
  2. Domain Limitations: Existing benchmarks are mostly confined to the CS domain (e.g., code generation) or reduce real problems to highly abstract math problems, stripping away real-world complexity and preventing Agents from using rich external knowledge and tools.
  3. Metric Bias: Traditional metrics focus on a model's average performance, whereas for engineering optimization problems the more relevant quantity is the Peak Performance a model can reach on a single problem through exploration.
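The difference between average and peak performance in point 3 can be illustrated with a minimal sketch. The problem names and scores below are hypothetical, higher scores are assumed to be better, and this is not the benchmark's actual scoring code.

```python
from statistics import mean


def summarize_scores(scores_per_problem):
    """Return (mean of per-problem averages, mean of per-problem peaks)."""
    average = mean(mean(runs) for runs in scores_per_problem.values())
    peak = mean(max(runs) for runs in scores_per_problem.values())
    return average, peak


# Hypothetical scores from three exploration attempts per problem.
scores = {
    "problem_a": [0.25, 0.5, 0.75],
    "problem_b": [0.5, 0.5, 1.0],
}
avg_score, peak_score = summarize_scores(scores)
print(f"average={avg_score:.3f} peak={peak_score:.3f}")
```

An average-based metric penalizes the early, weaker attempts, while the peak-based view rewards the best design the Agent eventually finds on each problem.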

Frontier-Eng aims to evaluate the ability of Agents to solve problems with practical value across a wide range of engineering disciplines by providing rich context and tool support.

🤝 Contribution Guidelines

We need the power of the community to expand the coverage of the Benchmark. We welcome the submission of new engineering problems via Pull Requests (PR). If you wish to contribute, please follow the standards and processes below:

Sample Requirements

  1. Reality Gap: The problem must stay close to reality and account for real-world influencing factors, rather than being purely abstract mathematics.
  2. Economic Value: The problem should have clear engineering or economic value once solved.
  3. Verifiability: The submission must provide an executable verification program (Docker preferred) that completes the evaluation within an acceptable time.
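One way to check the "acceptable time" part of the Verifiability requirement is to run the verification program under a time limit. The sketch below is an assumption about how that could be done with the standard library: the helper name `run_with_time_limit`, the 30-second budget, and the stand-in command are all hypothetical and not part of this repository.

```python
import subprocess
import sys


def run_with_time_limit(cmd, timeout_s):
    """Run a verification command; return (ok, output_or_reason)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The evaluator exceeded the time budget.
        return False, "timeout"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


# Stand-in command that prints a score; a real task would invoke its own evaluator.
ok, output = run_with_time_limit([sys.executable, "-c", "print(0.42)"], timeout_s=30)
print(ok, output)
```

The same wrapper could call a Docker-based evaluator instead, as long as the command exits with a nonzero status on failure.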

Submission Format

Each Task should contain the following file structure:

<Domain_Name>/                       # Level 1 Directory: Domain Name (e.g., Astrodynamics)
├── README.md                        # [Required] Domain Overview (Default entry, EN or CN): Background & sub-task index
├── README_zh-CN.md                  # [Optional] Domain Overview (Chinese version. Used only if README.md is in English)
├── <Task_Name_A>/                   # Level 2 Directory: Specific Task Name (e.g., MannedLunarLanding)
│   ├── README.md                    # [Required] Navigation Doc: File structure, how to run & quick start
│   ├── README_zh-CN.md              # [Optional] Navigation Doc (Chinese version)
│   ├── Task.md                      # [Required] Task Detail Doc: Core doc including background, physical model, I/O definitions
│   ├── Task_zh-CN.md                # [Optional] Task Detail Doc (Chinese version)
│   ├── references/                  # References Directory
│   │   ├── constants.json           # Physical constants, simulation parameters, etc.
│   │   └── manuals.pdf              # Domain knowledge manual, physical equations, or constraints docs
│   ├── verification/                # Verification & Scoring System
│   │   ├── evaluator.py             # [Core] Scoring script entry point
│   │   ├── requirements.txt         # Dependencies required for the scoring environment
│   │   └── docker/                  # Environment containerization configuration
│   │       └── Dockerfile           # Ensures consistency of the evaluation environment
│   └── baseline/                    # [Optional] Baseline Solution / Example Code
│       ├── solution.py              # Reference code implementation
│       └── result_log.txt           # Execution log or scoring result of the reference code
└── <Task_Name_B>/                   # Another task under this domain
    └── ...

The directory structure above is only a reference template. Contributors may adjust the file organization to suit their task, provided that all core elements (e.g., background, input/output, evaluation metrics) are included. There are also no restrictions on the programming language or format of the verification code.
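Before opening a PR, a contributor could check a domain directory against the files marked [Required] in the template. The following is a minimal sketch under that assumption: the function name `missing_required_files` is hypothetical, and because the template is only a reference, a reported gap is a prompt to double-check rather than a hard failure.

```python
from pathlib import Path

# Files marked [Required] in the reference template above.
REQUIRED_DOMAIN_FILES = ["README.md"]
REQUIRED_TASK_FILES = ["README.md", "Task.md"]


def missing_required_files(domain_dir):
    """List required files absent from a domain directory and its task subdirectories."""
    domain = Path(domain_dir)
    missing = [name for name in REQUIRED_DOMAIN_FILES if not (domain / name).is_file()]
    # Every subdirectory of the domain is treated as a task directory.
    for task in sorted(p for p in domain.iterdir() if p.is_dir()):
        for name in REQUIRED_TASK_FILES:
            if not (task / name).is_file():
                missing.append(f"{task.name}/{name}")
    return missing
```

For example, `missing_required_files("Astrodynamics")` would return an empty list if the domain README and each task's `README.md` and `Task.md` are present.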

Contribution Process

We adopt the standard GitHub collaboration flow:

  1. Fork this Repository: Click the "Fork" button in the top right corner to copy the project to your GitHub account.
  2. Create Branch:
     • Clone your Fork locally.
     • Create a new branch for development, recommended naming format: feat/<Domain>/<TaskName> (e.g., feat/Astrodynamics/MarsLanding).
  3. Add/Modify Content:
     • Add your engineering problem files following the submission format above.
     • Ensure all necessary explanatory documentation and verification code are included.
  4. Local Test: Run evaluator.py or build the Docker image to ensure the evaluation logic is correct and runs normally.
  5. Submit Pull Request (PR):
     • Push changes to your remote Fork.
     • Initiate a Pull Request to the main branch of this repository.
     • PR Description: Please briefly explain the background, source, and how to run the verification code for the Task.
  6. Code Review:
     • Agent Review: After submitting the PR, an AI Agent will first conduct an automated preliminary review (including code standards, basic logic verification, etc.) and may propose modifications directly in the PR.
     • Maintainer Review: After the Agent review passes, maintainers will conduct a final re-check. Once confirmed correct, your contribution will be merged.
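The recommended branch naming format from the Create Branch step can be checked with a small helper. This is a sketch only: the allowed character set (letters, digits, underscores, hyphens) is an assumption, since the guidelines specify just the feat/<Domain>/<TaskName> shape, and the function name is hypothetical.

```python
import re

# Assumed character set; the guidelines only fix the "feat/<Domain>/<TaskName>" shape.
BRANCH_PATTERN = re.compile(r"feat/[A-Za-z0-9_-]+/[A-Za-z0-9_-]+")


def is_recommended_branch_name(name):
    """Return True if the branch name matches feat/<Domain>/<TaskName>."""
    return BRANCH_PATTERN.fullmatch(name) is not None


print(is_recommended_branch_name("feat/Astrodynamics/MarsLanding"))
```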

💡 If this is your first contribution or you have questions about the directory structure, feel free to submit an Issue for discussion first.

📊 Task Progress & Planning

The table below lists the current coverage of domain tasks in the Benchmark. We welcome not only code contributions but also ideas for challenging new engineering problems from the community.

| Domain | Task Name | Status | Maintainer/Contributor | Remarks |
| --- | --- | --- | --- | --- |
| Astrodynamics | MannedLunarLanding | Completed | @jdp22 | Lunar soft landing trajectory optimization |
| Electronic Design Automation | Integration Physical Design Optimization | Under Development | @ahydchh | Chip Macrocell Layout Optimization |
| Kernel Engineering | MLA | Basically Completed, Awaiting Verification | @ahydchh | Kernel Engineering |
| Kernel Engineering | TriMul | Basically Completed, Awaiting Verification | @ahydchh | Kernel Engineering |
| Single Cell Analysis | denoising | Basically Completed, Awaiting Verification | @ahydchh | Single Cell Analysis |
| Single Cell Analysis | perturbation_prediction | Completed | | OpenProblems perturbation response prediction (NeurIPS 2023 scPerturb) |

💡 Have an idea for a new engineering problem? Even if you cannot provide complete verification code yet, we warmly welcome good Task concepts. Please create an Issue describing the real-world background and engineering value of the problem. After discussion and confirmation, we will add it to the table above so the community can work on it together.

🧪 Evaluation Framework

An initial integration of several evaluation algorithms with the benchmark tasks has been implemented. The core implementation is located in ./frontier_eval. For usage instructions, see the Evaluation README.

💬 Join the Community

Welcome to our developer community! Whether you want to discuss new engineering problem concepts, find task collaborators, or get help with technical issues during your contribution, you can always reach us in the group.
