DanaResearchGroup/arcbench

arcbench

arcbench builds a diverse benchmark dataset of elementary gas-phase reactions for ARC/RMG workflows.

The target dataset is about 500 reactions:

  • about 100 experimental anchor reactions from RMG kinetics libraries
  • about 400 generated reactions from stochastic molecule generation
  • only reactions from the benchmark target set of ARC-supported RMG families
  • balanced for both chemical diversity and reaction-family diversity

The output is a JSON file containing Reaction SMILES, source/provenance, reaction family, spin multiplicities, and a string representation of the RMG reaction object. Each run also writes a sidecar audit summary documenting family coverage, candidate rejection counts, and selection behavior.

Intent

The project is a reaction curation pipeline, not just a random reaction generator.

It tries to answer:

Can we automatically build a chemically broad reaction benchmark that still represents the mechanistic reaction space ARC is expected to handle?

That means generated reactions must pass multiple gates:

  • molecule constraints: small molecules, allowed elements, allowed spin states
  • reaction constraints: unimolecular or bimolecular elementary reactions
  • ARC compatibility: reaction family must be in the benchmark target support set
  • duplicate control: canonical Reaction SMILES are used to avoid repeats
  • selection quality: final picks balance DRFP chemical diversity with reaction-family coverage

Benchmark Family Contract

ARC_SUPPORTED_FAMILIES in module1_db.py is intentionally hardcoded.

This is a benchmark support contract, not a live import from ARC main. The list includes families expected from ARC's pending linear/linear_ts support work, so it can be broader than the currently merged ARC ts_adapters_by_rmg_family mapping.

The audit summary records this explicitly under family_support_contract.

For paper methods, describe the family scope as:

The target family set was defined manually to match the ARC family-support
surface intended for this benchmark, including pending linear/linear_ts family
support. Coverage was audited against this target set and against the loaded
RMG-database snapshot.

Repository Layout

main.py            Pipeline orchestration, checkpointing, CLI, final output
module1_db.py      RMG database loading, constraints, anchors, family metadata
module2_gen.py     MindlessGen/RDKit molecule and radical generation
module3_react.py   RMG Species conversion and react() enumeration wrapper
module4_select.py  DRFP/PCA diversity selection with family-aware balancing
spec.md            Original project specification
agent.md           Development guidance used while creating the code
submit.sh          PBS cluster submission example

Logic Layers

Layer 1: Database, Constraints, And Anchors

Implemented in module1_db.py.

This layer defines which reactions are usable.

Global constraints:

  • allowed elements: C, H, O, N, Cl, S
  • maximum heavy atoms per molecule: 12
  • allowed multiplicities: singlet, doublet, triplet
  • allowed reaction shapes: 1 -> 1, 1 -> 2, 2 -> 1, 2 -> 2
  • allowed families: ARC_SUPPORTED_FAMILIES
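
The constraints above can be sketched as a plain-Python check. This is a minimal illustration under stated assumptions, not the code in module1_db.py: the Molecule dataclass and the molecule_ok and reaction_ok helpers are hypothetical stand-ins, and the real pipeline works on RMG objects rather than lists of element symbols.

```python
from dataclasses import dataclass

ALLOWED_ELEMENTS = {"C", "H", "O", "N", "Cl", "S"}
MAX_HEAVY_ATOMS = 12
ALLOWED_MULTIPLICITIES = {1, 2, 3}  # singlet, doublet, triplet
ALLOWED_SHAPES = {(1, 1), (1, 2), (2, 1), (2, 2)}


@dataclass
class Molecule:
    """Hypothetical stand-in for a species: element symbols plus multiplicity."""
    atoms: list
    multiplicity: int


def molecule_ok(mol):
    # Heavy atoms are all non-hydrogen atoms.
    heavy = sum(1 for a in mol.atoms if a != "H")
    return (
        set(mol.atoms) <= ALLOWED_ELEMENTS
        and heavy <= MAX_HEAVY_ATOMS
        and mol.multiplicity in ALLOWED_MULTIPLICITIES
    )


def reaction_ok(reactants, products, family, supported_families):
    # A reaction passes only if its shape, family, and every species pass.
    shape = (len(reactants), len(products))
    return (
        shape in ALLOWED_SHAPES
        and family in supported_families
        and all(molecule_ok(m) for m in reactants + products)
    )


methyl = Molecule(["C", "H", "H", "H"], 2)
o2 = Molecule(["O", "O"], 3)
ch3oo = Molecule(["C", "H", "H", "H", "O", "O"], 2)
supported = {"R_Addition_MultipleBond"}  # illustrative subset only
accepted = reaction_ok([methyl, o2], [ch3oo], "R_Addition_MultipleBond", supported)
```

Products are checked with the same rule as reactants, so a reaction is rejected if any species on either side breaks a constraint.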

The pipeline loads the RMG database, discovers kinetics libraries, drops superseded versionN libraries, and loads only ARC-supported families that are present in the local RMG-database checkout.

Anchor extraction works by:

  1. scanning RMG kinetics library reactions
  2. filtering by molecule and reaction constraints
  3. reclassifying library reactions into kinetics families
  4. grouping and deduplicating by Reaction SMILES
  5. selecting diverse anchor reactions within each family
  6. selecting the final anchor set with family-seeded MaxMin picking

This creates the initial experimental portion of the dataset.
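
Step 6 can be sketched as follows. The README does not spell out what "family-seeded" means in module1_db.py, so this sketch assumes the simplest reading: seed the selection with the first reaction seen from each family, then add the point farthest from the selected set until the target size is reached. The function name and the 2D coordinates are hypothetical; the pipeline measures distance on DRFP fingerprints.

```python
import math


def family_seeded_maxmin(points, families, n_pick):
    """Return indices: one seed per family first, then greedy MaxMin picks."""
    selected = []
    seen = set()
    for i, fam in enumerate(families):
        if fam not in seen and len(selected) < n_pick:
            seen.add(fam)
            selected.append(i)

    remaining = [i for i in range(len(points)) if i not in selected]
    while remaining and len(selected) < n_pick:
        # MaxMin: take the candidate whose nearest selected neighbor is farthest.
        best = max(
            remaining,
            key=lambda i: min(math.dist(points[i], points[j]) for j in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy coordinates standing in for fingerprint space.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (10.0, 0.0), (0.2, 0.1)]
families = ["A", "A", "B", "A", "A"]
picked = family_seeded_maxmin(points, families, n_pick=3)
```

Seeding by family first guarantees that a family with only a few, chemically similar reactions is still represented before distance-based picks begin.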

Important functions:

  • load_rmg_database(...)
  • load_rmg_database_and_extract_anchors(...)
  • reaction_meets_constraints(...)
  • get_reaction_smiles(...)
  • reaction_metadata(...)
  • targeted_family_backfill(...)

Layer 2: Molecule And Radical Generation

Implemented in module2_gen.py.

This layer creates candidate reactant molecules.

The primary generator is MindlessGen. It randomly chooses an allowed elemental composition, asks MindlessGen for a molecule, converts XYZ coordinates into an RDKit molecule, infers bonds, and returns SMILES.

If MindlessGen or bond perception fails, the code falls back to lightweight RDKit fragment growth from simple seed molecules.

Radicals are generated by adding explicit hydrogens, removing sampled H atoms, and assigning one radical electron to the neighboring heavy atom.

module2_gen.py also has a long-running worker mode:

python module2_gen.py --worker

main.py starts this worker once and communicates with it over JSON lines. This keeps MindlessGen isolated in its own Python environment and avoids restarting the generator for every batch.
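
A JSON-lines worker loop of this kind might look like the sketch below. The message fields (n_molecules, seed, ok, result, error) and the serve and fake_generator names are assumptions made for illustration; the actual protocol in module2_gen.py may differ. Streams are passed in explicitly so the loop can be exercised without starting a subprocess.

```python
import io
import json


def serve(requests, responses, handler):
    """Read one JSON request per line and write one JSON response per line."""
    for line in requests:
        line = line.strip()
        if not line:
            continue
        try:
            reply = {"ok": True, "result": handler(json.loads(line))}
        except Exception as exc:
            # Report the failure but keep the worker alive for the next request.
            reply = {"ok": False, "error": str(exc)}
        responses.write(json.dumps(reply) + "\n")
        responses.flush()


def fake_generator(request):
    # Hypothetical handler: returns placeholder SMILES instead of calling MindlessGen.
    return ["C"] * request["n_molecules"]


# In a real worker these would be sys.stdin and sys.stdout.
stdin = io.StringIO('{"n_molecules": 2, "seed": 42}\n{"bad": true}\n')
stdout = io.StringIO()
serve(stdin, stdout, fake_generator)
replies = [json.loads(text) for text in stdout.getvalue().splitlines()]
```

One line per message keeps framing trivial: the parent process can write a request, read exactly one line back, and never needs to restart the worker after a failed batch.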

Important functions:

  • generate_random_molecule(...)
  • generate_radicals(...)
  • get_batch_of_molecules(...)
  • run_worker(...)

Layer 3: RMG Reaction Enumeration

Implemented in module3_react.py.

This layer converts generated SMILES into RMG reactions.

The flow is:

  1. convert SMILES to rmgpy.molecule.Molecule
  2. generate resonance structures when possible
  3. wrap the molecule in rmgpy.species.Species
  4. create all unimolecular reactant tuples
  5. create all bimolecular reactant tuples, including A + A
  6. call rmgpy.rmg.react.react(...)
  7. deduplicate generated reactions by Reaction SMILES

Reaction enumeration is chunked. If one chunk fails, the code retries that chunk one tuple at a time so a single bad species does not kill the entire iteration.
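
The tuple construction and the chunk-then-retry behavior can be sketched without RMG. In this sketch react_fn is a hypothetical stand-in for the RMG call, the chunk size is arbitrary, and species are plain strings; only the control flow is meant to mirror the description above.

```python
from itertools import combinations_with_replacement


def reactant_tuples(species):
    """All unimolecular tuples plus all unordered pairs, including A + A."""
    uni = [(s,) for s in species]
    bi = list(combinations_with_replacement(species, 2))
    return uni + bi


def react_in_chunks(tuples, react_fn, chunk_size=50):
    reactions, failed = [], []
    for start in range(0, len(tuples), chunk_size):
        chunk = tuples[start:start + chunk_size]
        try:
            reactions.extend(react_fn(chunk))
        except Exception:
            # Retry one tuple at a time so one bad species only costs its own tuples.
            for tup in chunk:
                try:
                    reactions.extend(react_fn([tup]))
                except Exception:
                    failed.append(tup)
    return reactions, failed


def fake_react(chunk):
    # Stand-in for the real enumeration call: fails on any tuple containing "BAD".
    if any("BAD" in tup for tup in chunk):
        raise RuntimeError("bad species")
    return [".".join(tup) for tup in chunk]


tuples = reactant_tuples(["A", "B", "BAD"])
reactions, failed = react_in_chunks(tuples, fake_react, chunk_size=4)
```

combinations_with_replacement yields each unordered pair once, so A + B is not enumerated a second time as B + A, while A + A is still included.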

Every generated reaction is then filtered with the same benchmark constraints used for anchors before it can enter the candidate pool. This prevents generated products from slipping past molecule size, element, multiplicity, arity, or family-support checks.

Important functions:

  • smiles_to_species(...)
  • enumerate_reactions(...)

Layer 4: Family-Aware Diversity Selection

Implemented in module4_select.py.

This layer decides which generated reactions are kept.

The selector uses DRFP fingerprints and optional PCA to measure chemical reaction-space distance, but it does not rely on chemical distance alone.

The final selector is family-aware because pure DRFP diversity can over-select one broad reaction family. For example, if H_Abstraction contributes many chemically varied reactions, a pure MaxMin picker could choose hundreds of H_Abstraction reactions while underrepresenting other mechanisms.

The current selection policy is:

  1. Treat existing anchors as already selected.
  2. Try to reach min_per_family for every family that has candidates.
  3. Fill remaining slots by greedy DRFP/PCA distance from the selected set.
  4. Dynamically penalize families that are already common.
  5. Avoid adding to families above max_family_fraction, unless every remaining candidate is capped and the selector must relax the cap to finish.

The dynamic score is conceptually:

score = distance_to_selected * family_balance_weight

where distance_to_selected rewards chemical novelty and family_balance_weight decreases as a family becomes overrepresented.
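
One greedy step of this policy could look like the sketch below. The README gives the score only conceptually, so the weight 1 / (1 + family_count) ** alpha is an assumed functional form, not the formula in module4_select.py. The pick_next name, the dict-based candidates, and the 2D coordinates are likewise hypothetical.

```python
import math


def family_balance_weight(family_count, alpha):
    # Assumed form: shrinks as the family grows; alpha = 0 disables the penalty.
    return 1.0 / (1.0 + family_count) ** alpha


def pick_next(candidates, selected, family_counts, target_size,
              alpha=0.75, max_family_fraction=0.15):
    cap = max(1, round(max_family_fraction * target_size))
    uncapped = [c for c in candidates if family_counts.get(c["family"], 0) < cap]
    # Relax the cap only when every remaining candidate is in a capped family.
    pool = uncapped or candidates

    def score(c):
        distance = min(math.dist(c["xy"], s["xy"]) for s in selected)
        count = family_counts.get(c["family"], 0)
        return distance * family_balance_weight(count, alpha)

    return max(pool, key=score)


selected = [{"family": "H_Abstraction", "xy": (0.0, 0.0)}]
counts = {"H_Abstraction": 3}
far_common = {"family": "H_Abstraction", "xy": (4.0, 0.0)}
near_rare = {"family": "Disproportionation", "xy": (2.0, 0.0)}
choice = pick_next([far_common, near_rare], selected, counts, target_size=100)
```

In this toy case the H_Abstraction candidate is farther away but belongs to a family that already has three picks, so the less distant candidate from an unrepresented family wins. Setting alpha to 0 reverses the choice.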

This gives two kinds of diversity:

  • chemical diversity: selected reactions are spread out in DRFP/PCA space
  • reaction-space diversity: selected reactions are distributed across RMG reaction families

Future improvement: family plus topology stratification may be useful, e.g. balancing family + arity or family + multiplicity pattern. The current policy keeps that as an audit dimension rather than a hard selection rule.

Important functions:

  • encode_with_cache(...)
  • select_diverse_reactions(...)
  • select_family_aware_diverse_reactions(...)

select_diverse_reactions(...) is the older SMILES-only picker and is kept for compatibility. The main pipeline uses select_family_aware_diverse_reactions(...).

Layer 5: Pipeline Orchestration

Implemented in main.py.

The main pipeline is:

load RMG database
extract experimental anchors
start MindlessGen worker
generate molecules/radicals
convert to RMG Species
enumerate reactions with RMG react()
deduplicate reactions
accumulate candidate pool
run final family-aware diversity selection
backfill missing/underfilled families from curated seeds
write JSON output
write audit summary

The generated reactions are not accepted batch-by-batch. Instead, the pipeline accumulates a large candidate pool first, then performs one final global selection. This avoids early batches dominating the output.

Checkpoints are written to:

<output_file>.ckpt.json

Use --resume to continue from an existing checkpoint.
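
A checkpoint round trip with this naming scheme might be sketched as below. Only the .ckpt.json suffix comes from this README; the helper names, the checkpoint contents, and the write-then-rename step are assumptions rather than a description of main.py.

```python
import json
import os
import tempfile


def checkpoint_path(output_file):
    return output_file + ".ckpt.json"


def save_checkpoint(output_file, state):
    path = checkpoint_path(output_file)
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    # Rename over the old file so an interrupted write never leaves a partial checkpoint.
    os.replace(tmp, path)


def load_checkpoint(output_file):
    path = checkpoint_path(output_file)
    if not os.path.exists(path):
        return None  # nothing to resume from
    with open(path) as fh:
        return json.load(fh)


with tempfile.TemporaryDirectory() as tmp_dir:
    out = os.path.join(tmp_dir, "out.json")
    before = load_checkpoint(out)
    # Hypothetical state: iteration counter plus the candidate pool so far.
    save_checkpoint(out, {"iteration": 5, "candidates": ["[CH3].O=O>>CO[O]"]})
    resumed = load_checkpoint(out)
    ckpt_name = os.path.basename(checkpoint_path(out))
```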

Selection And Coverage Knobs

The final selector exposes three reaction-family balance controls:

--min-per-family 3
--max-family-fraction 0.15
--family-penalty-alpha 0.75

--min-per-family sets the number of reactions the selector tries to reach for each family, counting anchors that are already selected. After final selection, targeted backfill also tries to lift underfilled families to this count using curated seed reactions.

--max-family-fraction caps the share of the dataset that any one family may take during the final fill. With a 500-reaction dataset and a value of 0.15, the cap is about 75 reactions per family (500 × 0.15). The cap is relaxed only when every remaining candidate belongs to a family that is already capped.

--family-penalty-alpha controls how strongly common families are penalized. Higher values push the final dataset toward flatter family balance. Lower values let DRFP chemical diversity dominate more strongly.

Recommended starting values:

--min-per-family 3 \
--max-family-fraction 0.15 \
--family-penalty-alpha 0.75

For stricter family balance:

--min-per-family 5 \
--max-family-fraction 0.10 \
--family-penalty-alpha 1.25

For more chemistry-first selection:

--min-per-family 2 \
--max-family-fraction 0.25 \
--family-penalty-alpha 0.25

Audit Summary

Each run writes:

<output_file>.summary.json

The summary is designed to make the methods section of a paper defensible. It includes:

  • dataset size and candidate-pool size
  • seed and reproducibility policy
  • hardcoded family-support contract metadata
  • selection parameters
  • selection audit details:
    • minimum-coverage picks
    • weighted-diversity picks
    • cap-relaxed picks
    • hard family cap
    • selected family counts
  • generated-reaction rejection counters:
    • constraint rejects
    • duplicate rejects
    • SMILES conversion rejects
  • final family counts
  • source counts
  • arity counts
  • multiplicity-pattern counts
  • missing target families
  • target families absent from the loaded RMG database
  • families still below --min-per-family

This is the main artifact to inspect before claiming the dataset covers the ARC benchmark target surface.

Running Locally

Install the Python dependencies in an environment that can import RDKit, DRFP, scikit-learn, and MindlessGen:

uv sync

RMG-Py is usually installed separately, often in a conda environment. The main process must be able to import rmgpy.

The first RMG-database load writes a pickled cache to .arcbench_cache/rmg_database/. Later runs reuse that cache only when their inputs hash to the same key. The key covers the selected RMG-database families and libraries, kinetics and thermo depositories, thermo groups, forbidden structures, loader options, Python version, RMG-Py version, and database git identity. Set ARCBENCH_RMG_CACHE_DIR=/path/to/cache to share a cache location, or ARCBENCH_RMG_DISABLE_CACHE=1 to force a fresh load.
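
A cache key of this kind can be sketched by hashing a canonical JSON payload. The field names, the choice of SHA-256, and the helper names below are assumptions, and only a subset of the inputs listed above is shown. The two environment variables are the ones named above; treating the value "1" as the only disabling value is also an assumption.

```python
import hashlib
import json
import sys


def cache_key(families, libraries, loader_options, rmg_version, db_git_id):
    payload = {
        "families": sorted(families),    # sorted so ordering does not change the key
        "libraries": sorted(libraries),
        "loader_options": loader_options,
        "python": list(sys.version_info[:3]),
        "rmg_py": rmg_version,
        "db_git": db_git_id,
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def cache_dir(env):
    # env is a mapping such as os.environ.
    if env.get("ARCBENCH_RMG_DISABLE_CACHE") == "1":
        return None
    return env.get("ARCBENCH_RMG_CACHE_DIR", ".arcbench_cache/rmg_database")


# Placeholder values for illustration only.
key_a = cache_key(["H_Abstraction", "Disproportionation"], ["lib"], {"x": 1}, "v-placeholder", "abc")
key_b = cache_key(["Disproportionation", "H_Abstraction"], ["lib"], {"x": 1}, "v-placeholder", "abc")
key_c = cache_key(["H_Abstraction", "Disproportionation"], ["lib"], {"x": 1}, "v-placeholder", "def")
```

Because the database git identity is part of the payload, checking out a different RMG-database commit produces a new key and therefore a fresh load.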

Run:

python main.py /path/to/RMG-database out.json \
  --cpus 8 \
  --batch-size 30 \
  --seed 42 \
  --checkpoint-every 5 \
  --anchors 100

By default, the MindlessGen worker is launched with:

uv run --python 3.12 python

Override this if MindlessGen lives in another environment:

python main.py /path/to/RMG-database out.json \
  --mindlessgen-cmd "conda run -n mindlessgen --no-capture-output python"

or:

export MINDLESSGEN_CMD="conda run -n mindlessgen --no-capture-output python"
python main.py /path/to/RMG-database out.json

Reproducibility

Use --seed for deterministic run-level seeding:

python main.py /path/to/RMG-database out.json --seed 42

When --seed is set, the main process seeds Python and NumPy, and each MindlessGen worker request receives a deterministic per-iteration seed:

worker_seed = run_seed + iteration * 1000003

Within each worker request, base-molecule generation also receives deterministic per-molecule seeds so multiprocessing scheduling should not change the intended random sequence.
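
The seeding scheme can be illustrated as follows. The worker_seed formula is the one given above. The per-molecule derivation (request seed plus molecule index) and the draw helper are assumptions made for this sketch, since the README does not state how per-molecule seeds are computed.

```python
import random

SEED_STRIDE = 1000003


def worker_seed(run_seed, iteration):
    return run_seed + iteration * SEED_STRIDE


def molecule_seeds(request_seed, n_molecules):
    # Assumed derivation: one fixed seed per molecule index within the request.
    return [request_seed + i for i in range(n_molecules)]


def draw(seed):
    # Each molecule gets its own generator, so results do not depend on call order.
    return random.Random(seed).random()


seeds = molecule_seeds(worker_seed(42, 3), 4)
in_order = {s: draw(s) for s in seeds}
shuffled_order = {s: draw(s) for s in reversed(seeds)}
```

Giving each molecule its own seeded generator, instead of sharing one global random state, is what makes the result independent of the order in which worker processes happen to pick up tasks.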

For paper reproduction, treat the generated artifacts as canonical:

out.json
out.json.summary.json

Archive both with the paper or benchmark release. The seed improves repeatability, but the versions of MindlessGen, RDKit, RMG-Py, and other dependencies, as well as the RMG-database snapshot, can still change the exact generated molecules and reactions. The fixed benchmark should therefore be the archived output dataset plus audit summary, not just the command line.

Running On PBS

submit.sh is an example PBS job script. It:

  • requests 60 CPUs and 100 GB memory
  • activates an RMG conda environment
  • sets MINDLESSGEN_CMD to a separate MindlessGen environment
  • writes live logs through tee to job.out and job.err
  • runs main.py against an RMG-database checkout

Submit with:

qsub submit.sh

Output Schema

Each output entry is a JSON object similar to:

{
  "smiles": "[CH3].O=O>>CO[O]",
  "source": "mindless_novel",
  "rxn_obj": "CH3 + O2 <=> CH3OO",
  "multiplicity_reactants": [2, 3],
  "multiplicity_products": [2],
  "family": "R_Addition_MultipleBond",
  "is_anchor": false,
  "kinetics_source": null
}

Sources:

  • rmg_library: experimental/library anchor from RMG
  • mindless_novel: generated molecule/radical followed by RMG enumeration
  • arc_seed: targeted backfill from curated family seed SMILES
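
A quick consistency check over the output might look like this sketch. It assumes the top level of the output file is a JSON list of entries shaped like the example above. The summarize helper is hypothetical and is not part of the pipeline.

```python
import json
from collections import Counter

KNOWN_SOURCES = {"rmg_library", "mindless_novel", "arc_seed"}


def summarize(entries):
    by_family, by_source, by_arity = Counter(), Counter(), Counter()
    for entry in entries:
        if entry["source"] not in KNOWN_SOURCES:
            raise ValueError("unknown source: " + entry["source"])
        reactants, products = entry["smiles"].split(">>")
        # Species within one side of a Reaction SMILES are separated by ".".
        arity = "{}->{}".format(len(reactants.split(".")), len(products.split(".")))
        by_family[entry["family"]] += 1
        by_source[entry["source"]] += 1
        by_arity[arity] += 1
    return {"by_family": by_family, "by_source": by_source, "by_arity": by_arity}


# The example entry from this README, wrapped in a list.
sample = """[{
  "smiles": "[CH3].O=O>>CO[O]",
  "source": "mindless_novel",
  "rxn_obj": "CH3 + O2 <=> CH3OO",
  "multiplicity_reactants": [2, 3],
  "multiplicity_products": [2],
  "family": "R_Addition_MultipleBond",
  "is_anchor": false,
  "kinetics_source": null
}]"""
summary = summarize(json.loads(sample))
```

For a real run, replace the sample string with the contents of the output file and compare the counts with the family, source, and arity counts in the audit summary.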

Caches And Logs

The pipeline may create:

rmg.log                 RMG warnings/errors redirected away from job.err
.drfp_cache.pkl         persisted DRFP fingerprint cache
<output>.ckpt.json      resumable checkpoint
<output>.summary.json   family/diversity/coverage audit
job.out / job.err       PBS run logs when using submit.sh

Practical Notes

The final dataset can slightly exceed 500 reactions if targeted family backfill adds reactions for missing or underfilled families after the main selection.

Family coverage depends on the local RMG-database checkout and the curated seed coverage. Some target families may not exist in a given database snapshot, and some may not have available library anchors or seed reactions. These cases are reported in the audit summary rather than hidden.
