arcbench builds a diverse benchmark dataset of elementary gas-phase
reactions for ARC/RMG workflows.
The target dataset is about 500 reactions:
- about 100 experimental anchor reactions from RMG kinetics libraries
- about 400 generated reactions from stochastic molecule generation
- only reactions from the benchmark target set of ARC-supported RMG families
- balanced for both chemical diversity and reaction-family diversity
The output is a JSON file containing Reaction SMILES, source/provenance, reaction family, spin multiplicities, and a string representation of the RMG reaction object. Each run also writes a sidecar audit summary documenting family coverage, candidate rejection counts, and selection behavior.
The project is a reaction curation pipeline, not just a random reaction generator.
It tries to answer:
Can we automatically build a chemically broad reaction benchmark that still represents the mechanistic reaction space ARC is expected to handle?
That means generated reactions must pass multiple gates:
- molecule constraints: small molecules, allowed elements, allowed spin states
- reaction constraints: unimolecular or bimolecular elementary reactions
- ARC compatibility: reaction family must be in the benchmark target support set
- duplicate control: canonical Reaction SMILES are used to avoid repeats
- selection quality: final picks balance DRFP chemical diversity with reaction-family coverage
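The duplicate-control gate can be illustrated with a minimal sketch. It assumes every component SMILES is already canonical (for example as emitted by RDKit or RMG) and only normalizes the order of components on each side. `reaction_key` and `deduplicate` are hypothetical helpers for illustration, not the project's own functions.

```python
def reaction_key(reaction_smiles: str) -> str:
    """Return an order-independent key for 'A.B>>C.D' style Reaction SMILES.

    Assumption: each component SMILES is already canonical; this helper
    only sorts the components on each side of '>>'.
    """
    reactants, products = reaction_smiles.split(">>")
    left = ".".join(sorted(reactants.split(".")))
    right = ".".join(sorted(products.split(".")))
    return f"{left}>>{right}"


def deduplicate(reaction_smiles_list):
    """Keep the first occurrence of each reaction, preserving input order."""
    seen = set()
    unique = []
    for rxn in reaction_smiles_list:
        key = reaction_key(rxn)
        if key not in seen:
            seen.add(key)
            unique.append(rxn)
    return unique


# The first two entries are the same reaction with reactants swapped.
candidates = ["[CH3].O=O>>CO[O]", "O=O.[CH3]>>CO[O]", "[H].C>>[CH3].[HH]"]
unique = deduplicate(candidates)
```

Sorting components makes `A + B` and `B + A` collapse to one key; whether forward and reverse directions should also collapse is a separate policy choice that this sketch does not make.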
ARC_SUPPORTED_FAMILIES in module1_db.py is intentionally hardcoded.
This is a benchmark support contract, not a live import from ARC main. The list
includes families expected from ARC's pending linear/linear_ts support work, so
it can be broader than the currently merged ARC ts_adapters_by_rmg_family
mapping.
The audit summary records this explicitly under family_support_contract.
For paper methods, describe the family scope as:
The target family set was defined manually to match the ARC family-support
surface intended for this benchmark, including pending linear/linear_ts family
support. Coverage was audited against this target set and against the loaded
RMG-database snapshot.
main.py Pipeline orchestration, checkpointing, CLI, final output
module1_db.py RMG database loading, constraints, anchors, family metadata
module2_gen.py MindlessGen/RDKit molecule and radical generation
module3_react.py RMG Species conversion and react() enumeration wrapper
module4_select.py DRFP/PCA diversity selection with family-aware balancing
spec.md Original project specification
agent.md Development guidance used while creating the code
submit.sh PBS cluster submission example
Implemented in module1_db.py.
This layer defines which reactions are usable.
Global constraints:
- allowed elements: C, H, O, N, Cl, S
- maximum heavy atoms per molecule: 12
- allowed multiplicities: singlet, doublet, triplet
- allowed reaction shapes: 1 -> 1, 1 -> 2, 2 -> 1, 2 -> 2
- allowed families: ARC_SUPPORTED_FAMILIES
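As a rough illustration of how the global constraints combine, the sketch below checks a reaction described by plain element-count dictionaries. The data representation and the helper names `molecule_ok` and `reaction_ok` are assumptions made for this example; the real checks in module1_db.py operate on RMG objects.

```python
ALLOWED_ELEMENTS = {"C", "H", "O", "N", "Cl", "S"}
MAX_HEAVY_ATOMS = 12
ALLOWED_MULTIPLICITIES = {1, 2, 3}  # singlet, doublet, triplet
ALLOWED_SHAPES = {(1, 1), (1, 2), (2, 1), (2, 2)}


def molecule_ok(element_counts: dict, multiplicity: int) -> bool:
    """Check elements, heavy-atom count, and spin multiplicity."""
    if not set(element_counts) <= ALLOWED_ELEMENTS:
        return False
    heavy = sum(n for el, n in element_counts.items() if el != "H")
    return heavy <= MAX_HEAVY_ATOMS and multiplicity in ALLOWED_MULTIPLICITIES


def reaction_ok(reactants, products, family, supported_families) -> bool:
    """reactants/products are lists of (element_counts, multiplicity) pairs."""
    if (len(reactants), len(products)) not in ALLOWED_SHAPES:
        return False
    if family not in supported_families:
        return False
    return all(molecule_ok(c, m) for c, m in list(reactants) + list(products))


methyl = ({"C": 1, "H": 3}, 2)
oxygen = ({"O": 2}, 3)
peroxy = ({"C": 1, "H": 3, "O": 2}, 2)
supported = {"R_Addition_MultipleBond"}  # stand-in for ARC_SUPPORTED_FAMILIES
accepted = reaction_ok([methyl, oxygen], [peroxy], "R_Addition_MultipleBond", supported)
```

Checking products as well as reactants matters: a small reactant pair can still yield a product that breaks the size or element limits.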
The pipeline loads the RMG database, discovers kinetics libraries, drops
superseded versionN libraries, and loads only ARC-supported families that are
present in the local RMG-database checkout.
Anchor extraction works by:
- scanning RMG kinetics library reactions
- filtering by molecule and reaction constraints
- reclassifying library reactions into kinetics families
- grouping and deduplicating by Reaction SMILES
- selecting diverse anchor reactions within each family
- selecting the final anchor set with family-seeded MaxMin picking
This creates the initial experimental portion of the dataset.
Important functions:
- load_rmg_database(...)
- load_rmg_database_and_extract_anchors(...)
- reaction_meets_constraints(...)
- get_reaction_smiles(...)
- reaction_metadata(...)
- targeted_family_backfill(...)
Implemented in module2_gen.py.
This layer creates candidate reactant molecules.
The primary generator is MindlessGen. It randomly chooses an allowed elemental composition, asks MindlessGen for a molecule, converts XYZ coordinates into an RDKit molecule, infers bonds, and returns SMILES.
If MindlessGen or bond perception fails, the code falls back to lightweight RDKit fragment growth from simple seed molecules.
Radicals are generated by adding explicit hydrogens, removing sampled H atoms, and assigning one radical electron to the neighboring heavy atom.
module2_gen.py also has a long-running worker mode:
python module2_gen.py --worker
main.py starts this worker once and communicates with it over JSON lines. This
keeps MindlessGen isolated in its own Python environment and avoids restarting
the generator for every batch.
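The JSON-lines exchange can be pictured with the following minimal sketch. The message fields (`cmd`, `n`, `ok`, `result`, `error`) and the `serve` function are hypothetical and chosen only for illustration; the actual protocol used by `run_worker(...)` may differ.

```python
import io
import json


def serve(requests, responses, handler):
    """Read one JSON request per line and write one JSON response per line."""
    for line in requests:
        line = line.strip()
        if not line:
            continue
        request = json.loads(line)
        if request.get("cmd") == "shutdown":
            break
        try:
            reply = {"ok": True, "result": handler(request)}
        except Exception as exc:  # report the failure instead of killing the worker
            reply = {"ok": False, "error": str(exc)}
        responses.write(json.dumps(reply) + "\n")
        responses.flush()  # the parent process waits on each line


def fake_generator(request):
    # Placeholder for molecule generation; returns dummy SMILES.
    return ["C"] * request["n"]


# In-memory streams stand in for the worker's stdin/stdout pipes.
stdin = io.StringIO('{"cmd": "generate", "n": 2}\n{"cmd": "shutdown"}\n')
stdout = io.StringIO()
serve(stdin, stdout, fake_generator)
replies = [json.loads(line) for line in stdout.getvalue().splitlines()]
```

One line per message keeps framing trivial, and flushing after every reply avoids the parent blocking on a buffered pipe.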
Important functions:
- generate_random_molecule(...)
- generate_radicals(...)
- get_batch_of_molecules(...)
- run_worker(...)
Implemented in module3_react.py.
This layer converts generated SMILES into RMG reactions.
The flow is:
- convert SMILES to rmgpy.molecule.Molecule
- generate resonance structures when possible
- wrap the molecule in rmgpy.species.Species
- create all unimolecular reactant tuples
- create all bimolecular reactant tuples, including A + A
- call rmgpy.rmg.react.react(...)
- deduplicate generated reactions by Reaction SMILES
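The reactant-tuple steps in the flow above can be sketched with the standard library alone. Strings stand in for RMG Species objects here, and `reactant_tuples` is a hypothetical helper name.

```python
from itertools import combinations_with_replacement


def reactant_tuples(species):
    """Build every unimolecular and unordered bimolecular reactant tuple."""
    unimolecular = [(s,) for s in species]
    # combinations_with_replacement includes self-pairs such as A + A
    # and yields each unordered pair exactly once.
    bimolecular = list(combinations_with_replacement(species, 2))
    return unimolecular + bimolecular


tuples = reactant_tuples(["[CH3]", "O=O", "C"])
```

For n species this produces n unimolecular tuples plus n(n + 1)/2 bimolecular tuples, so the batch size directly controls how much work each react() call receives.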
Reaction enumeration is chunked. If one chunk fails, the code retries that chunk one tuple at a time so a single bad species does not kill the entire iteration.
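A minimal sketch of this chunk-then-retry pattern follows. `react_fn` is a stand-in for the RMG enumeration call, and the function name and default chunk size are assumptions for illustration.

```python
def enumerate_in_chunks(reactant_tuples, react_fn, chunk_size=50):
    """Run react_fn on chunks; on failure, retry that chunk tuple by tuple."""
    reactions, failed = [], []
    for start in range(0, len(reactant_tuples), chunk_size):
        chunk = reactant_tuples[start:start + chunk_size]
        try:
            reactions.extend(react_fn(chunk))
        except Exception:
            # Isolate the offending tuple(s) so the rest of the chunk survives.
            for item in chunk:
                try:
                    reactions.extend(react_fn([item]))
                except Exception:
                    failed.append(item)
    return reactions, failed


def toy_react(chunk):
    # Toy enumerator that fails whenever a chunk contains the species "bad".
    if any("bad" in reactants for reactants in chunk):
        raise ValueError("cannot react")
    return ["+".join(reactants) + " -> products" for reactants in chunk]


reactions, failed = enumerate_in_chunks(
    [("A",), ("bad",), ("A", "B")], toy_react, chunk_size=2
)
```

The fast path costs one call per chunk; the slow per-tuple path is taken only for chunks that actually fail.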
Every generated reaction is then filtered with the same benchmark constraints used for anchors before it can enter the candidate pool. This prevents generated products from slipping past molecule size, element, multiplicity, arity, or family-support checks.
Important functions:
- smiles_to_species(...)
- enumerate_reactions(...)
Implemented in module4_select.py.
This layer decides which generated reactions are kept.
The selector uses DRFP fingerprints and optional PCA to measure chemical reaction-space distance, but it does not rely on chemical distance alone.
The final selector is family-aware because pure DRFP diversity can over-select
one broad reaction family. For example, if H_Abstraction contributes many
chemically varied reactions, a pure MaxMin picker could choose hundreds of
H_Abstraction reactions while underrepresenting other mechanisms.
The current selection policy is:
- Treat existing anchors as already selected.
- Try to reach min_per_family for every family that has candidates.
- Fill remaining slots by greedy DRFP/PCA distance from the selected set.
- Dynamically penalize families that are already common.
- Avoid adding to families above max_family_fraction, unless every remaining candidate is capped and the selector must relax the cap to finish.
The dynamic score is conceptually:
score = distance_to_selected * family_balance_weight
where distance_to_selected rewards chemical novelty and
family_balance_weight decreases as a family becomes overrepresented.
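The sketch below shows one way such a score could drive a greedy fill. The weight formula `(1 + family_count) ** -alpha`, the cap rounding, and the function name are assumptions made for illustration and are not taken from module4_select.py; the minimum-per-family pass is omitted, and small tuples stand in for DRFP/PCA vectors.

```python
from collections import Counter
from math import dist


def family_aware_select(candidates, anchors, n_total,
                        alpha=0.75, max_family_fraction=0.15):
    """Greedy fill over (vector, family) pairs; anchors count as selected."""
    selected = list(anchors)
    pool = list(candidates)
    counts = Counter(family for _, family in selected)
    cap = max(1, int(max_family_fraction * n_total))

    def score(candidate):
        vector, family = candidate
        distance_to_selected = min(
            (dist(vector, v) for v, _ in selected), default=1.0
        )
        # Assumed penalty form: shrinks as the family grows more common.
        family_balance_weight = (1 + counts[family]) ** -alpha
        return distance_to_selected * family_balance_weight

    while pool and len(selected) < n_total:
        uncapped = [c for c in pool if counts[c[1]] < cap]
        best = max(uncapped or pool, key=score)  # relax cap if all are capped
        pool.remove(best)
        selected.append(best)
        counts[best[1]] += 1
    return selected


anchors = [((0.0, 0.0), "H_Abstraction")]
candidates = [((3.0, 0.0), "H_Abstraction"), ((2.0, 0.0), "Disproportionation")]
balanced = family_aware_select(candidates, anchors, 2, alpha=1.0, max_family_fraction=1.0)
pure = family_aware_select(candidates, anchors, 2, alpha=0.0, max_family_fraction=1.0)
```

With `alpha=1.0` the nearer Disproportionation candidate wins because H_Abstraction is already represented; with `alpha=0.0` the score reduces to pure distance and the farther H_Abstraction candidate is chosen.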
This gives two kinds of diversity:
- chemical diversity: selected reactions are spread out in DRFP/PCA space
- reaction-space diversity: selected reactions are distributed across RMG reaction families
Future improvement: family plus topology stratification may be useful, e.g.
balancing family + arity or family + multiplicity pattern. The current
policy keeps that as an audit dimension rather than a hard selection rule.
Important functions:
- encode_with_cache(...)
- select_diverse_reactions(...)
- select_family_aware_diverse_reactions(...)
select_diverse_reactions(...) is the older SMILES-only picker and is kept for
compatibility. The main pipeline uses
select_family_aware_diverse_reactions(...).
Implemented in main.py.
The main pipeline is:
load RMG database
extract experimental anchors
start MindlessGen worker
generate molecules/radicals
convert to RMG Species
enumerate reactions with RMG react()
deduplicate reactions
accumulate candidate pool
run final family-aware diversity selection
backfill missing/underfilled families from curated seeds
write JSON output
write audit summary
The generated reactions are not accepted batch-by-batch. Instead, the pipeline accumulates a large candidate pool first, then performs one final global selection. This avoids early batches dominating the output.
Checkpoints are written to:
<output_file>.ckpt.json
Use --resume to continue from an existing checkpoint.
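A minimal sketch of checkpoint write and resume is shown below. The checkpoint fields (`iteration`, `candidates`) and the helper names are hypothetical; only the `<output_file>.ckpt.json` naming comes from this README.

```python
import json
import os
import tempfile


def checkpoint_path(output_file):
    return f"{output_file}.ckpt.json"


def save_checkpoint(output_file, state):
    """Write to a temp file first so a crash never leaves a half-written checkpoint."""
    path = checkpoint_path(output_file)
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)  # atomic rename on the same filesystem


def load_checkpoint(output_file, resume):
    """Return saved state when resuming, otherwise a fresh empty state."""
    path = checkpoint_path(output_file)
    if resume and os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    return {"iteration": 0, "candidates": []}


with tempfile.TemporaryDirectory() as workdir:
    out = os.path.join(workdir, "out.json")
    save_checkpoint(out, {"iteration": 5, "candidates": ["[CH3].O=O>>CO[O]"]})
    resumed = load_checkpoint(out, resume=True)
    fresh = load_checkpoint(out, resume=False)
```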
The final selector exposes three reaction-family balance controls:
--min-per-family 3
--max-family-fraction 0.15
--family-penalty-alpha 0.75
--min-per-family controls minimum attempted family representation, counting
anchors that are already selected. After final selection, targeted backfill also
tries to lift underfilled families to this count using curated seed reactions.
--max-family-fraction controls the cap for any one family during final fill.
For a 500-reaction dataset and a fraction of 0.15, the cap is about 75 reactions per family (0.15 x 500).
The cap can relax only if all remaining candidates are already capped.
--family-penalty-alpha controls how strongly common families are penalized.
Higher values push the final dataset toward flatter family balance. Lower
values let DRFP chemical diversity dominate more strongly.
Recommended starting values:
--min-per-family 3 \
--max-family-fraction 0.15 \
--family-penalty-alpha 0.75
For stricter family balance:
--min-per-family 5 \
--max-family-fraction 0.10 \
--family-penalty-alpha 1.25
For more chemistry-first selection:
--min-per-family 2 \
--max-family-fraction 0.25 \
--family-penalty-alpha 0.25
Each run writes:
<output_file>.summary.json
The summary is designed to support paper-methods defensibility. It includes:
- dataset size and candidate-pool size
- seed and reproducibility policy
- hardcoded family-support contract metadata
- selection parameters
- selection audit details:
  - minimum-coverage picks
  - weighted-diversity picks
  - cap-relaxed picks
  - hard family cap
  - selected family counts
- generated-reaction rejection counters:
  - constraint rejects
  - duplicate rejects
  - SMILES conversion rejects
- final family counts
- source counts
- arity counts
- multiplicity-pattern counts
- missing target families
- target families absent from the loaded RMG database
- families still below --min-per-family
This is the main artifact to inspect before claiming the dataset covers the ARC benchmark target surface.
Install the Python dependencies in an environment that can import RDKit, DRFP, scikit-learn, and MindlessGen:
uv sync
RMG-Py is usually installed separately, often in a conda environment. The main
process must be able to import rmgpy.
The first RMG-database load writes a pickled cache to
.arcbench_cache/rmg_database/. Later runs reuse that cache when the selected
RMG-database families/libraries, kinetics and thermo depositories, thermo
groups, forbidden structures, loader options, Python version, RMG-Py version,
and database git identity hash to the same key. Set
ARCBENCH_RMG_CACHE_DIR=/path/to/cache to share a cache location, or
ARCBENCH_RMG_DISABLE_CACHE=1 to force a fresh load.
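The idea of hashing the load inputs to one cache key can be sketched as follows. The field names and the use of SHA-256 over sorted JSON are assumptions for illustration; the actual key derivation in the project may differ.

```python
import hashlib
import json


def cache_key(inputs: dict) -> str:
    """Hash a dict of cache-relevant inputs into a stable hex key."""
    # sort_keys makes the key independent of dict insertion order.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder values only; real runs would record actual versions and hashes.
base = {
    "families": ["H_Abstraction", "R_Addition_MultipleBond"],
    "libraries": ["example_library"],
    "python_version": "<python-version>",
    "rmgpy_version": "<rmgpy-version>",
    "database_git": "<commit-hash>",
}
same = dict(reversed(list(base.items())))  # same content, different order
changed = {**base, "database_git": "<other-commit-hash>"}

key_base = cache_key(base)
key_same = cache_key(same)
key_changed = cache_key(changed)
```

Any change to one input, such as a new database commit, yields a different key, so a stale cache is skipped rather than silently reused.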
Run:
python main.py /path/to/RMG-database out.json \
--cpus 8 \
--batch-size 30 \
--seed 42 \
--checkpoint-every 5 \
--anchors 100
By default, the MindlessGen worker is launched with:
uv run --python 3.12 python
Override this if MindlessGen lives in another environment:
python main.py /path/to/RMG-database out.json \
--mindlessgen-cmd "conda run -n mindlessgen --no-capture-output python"
or:
export MINDLESSGEN_CMD="conda run -n mindlessgen --no-capture-output python"
python main.py /path/to/RMG-database out.json
Use --seed for deterministic run-level seeding:
python main.py /path/to/RMG-database out.json --seed 42
When --seed is set, the main process seeds Python and NumPy, and each
MindlessGen worker request receives a deterministic per-iteration seed:
worker_seed = run_seed + iteration * 1000003
Within each worker request, base-molecule generation also receives deterministic per-molecule seeds so multiprocessing scheduling should not change the intended random sequence.
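The seeding scheme can be sketched as below. The `worker_seed` formula is the one stated above; the per-molecule derivation in `molecule_seeds` is an assumption for illustration, since this README does not specify how those seeds are computed.

```python
import random

SEED_STRIDE = 1000003  # multiplier from the worker_seed formula above


def worker_seed(run_seed: int, iteration: int) -> int:
    """Deterministic seed for one worker request."""
    return run_seed + iteration * SEED_STRIDE


def molecule_seeds(request_seed: int, n_molecules: int) -> list:
    """Assumed scheme: draw one seed per molecule from a request-seeded RNG."""
    rng = random.Random(request_seed)
    return [rng.randrange(2**31) for _ in range(n_molecules)]


seed_iter0 = worker_seed(42, 0)
seed_iter3 = worker_seed(42, 3)
first_draw = molecule_seeds(seed_iter3, 4)
second_draw = molecule_seeds(seed_iter3, 4)
```

Because every molecule gets its own seed up front, the order in which worker processes finish does not affect which seed a given molecule slot receives.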
For paper reproduction, treat the generated artifacts as canonical:
out.json
out.json.summary.json
Archive both with the paper or benchmark release. The seed improves repeatability, but the versions of MindlessGen, RDKit, RMG-Py, and other dependencies, as well as the RMG-database snapshot, can still affect the exact generated molecules and reactions. The fixed benchmark should therefore be the archived output dataset plus audit summary, not just the command line.
submit.sh is an example PBS job script. It:
- requests 60 CPUs and 100 GB memory
- activates an RMG conda environment
- sets MINDLESSGEN_CMD to a separate MindlessGen environment
- writes live logs through tee to job.out and job.err
- runs main.py against an RMG-database checkout
Submit with:
qsub submit.sh
Each output entry is a JSON object similar to:
{
"smiles": "[CH3].O=O>>CO[O]",
"source": "mindless_novel",
"rxn_obj": "CH3 + O2 <=> CH3OO",
"multiplicity_reactants": [2, 3],
"multiplicity_products": [2],
"family": "R_Addition_MultipleBond",
"is_anchor": false,
"kinetics_source": null
}
Sources:
- rmg_library: experimental/library anchor from RMG
- mindless_novel: generated molecule/radical followed by RMG enumeration
- arc_seed: targeted backfill from curated family seed SMILES
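A quick way to inspect an output file is sketched below. It assumes the file is a top-level JSON array of entries shaped like the example above, which this README does not state explicitly; `summarize` is a hypothetical helper, and a temporary file stands in for a real out.json.

```python
import json
import os
import tempfile
from collections import Counter


def summarize(path):
    """Count entries by family and source (assumes a top-level JSON array)."""
    with open(path, encoding="utf-8") as fh:
        entries = json.load(fh)
    return {
        "n": len(entries),
        "by_family": Counter(e["family"] for e in entries),
        "by_source": Counter(e["source"] for e in entries),
        "anchors": sum(1 for e in entries if e["is_anchor"]),
    }


sample = [{
    "smiles": "[CH3].O=O>>CO[O]",
    "source": "mindless_novel",
    "rxn_obj": "CH3 + O2 <=> CH3OO",
    "multiplicity_reactants": [2, 3],
    "multiplicity_products": [2],
    "family": "R_Addition_MultipleBond",
    "is_anchor": False,
    "kinetics_source": None,
}]

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "out.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(sample, fh)
    stats = summarize(path)
```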
The pipeline may create:
rmg.log RMG warnings/errors redirected away from job.err
.drfp_cache.pkl persisted DRFP fingerprint cache
<output>.ckpt.json resumable checkpoint
<output>.summary.json family/diversity/coverage audit
job.out / job.err PBS run logs when using submit.sh
The final dataset can slightly exceed 500 reactions if targeted family backfill adds reactions for missing or underfilled families after the main selection.
Family coverage depends on the local RMG-database checkout and the curated seed coverage. Some target families may not exist in a given database snapshot, and some may not have available library anchors or seed reactions. These cases are reported in the audit summary rather than hidden.