1. Introduction
Chemical reaction selection and design play a key role in drug and
material synthesis.1 The synthesis conditions
(temperature, pressure, solvents, etc.), reaction time and product yield
can be greatly improved by selecting an appropriate chemical
reaction pathway. Therefore, the design of evaluation indicators for
Computer Aided Synthesis Planning (CASP) has evolved in recent
years.2-7 CASP evaluation indicators are mainly
divided into two categories: expert knowledge-based evaluation
indicators8,9 and synthesis
complexity/accessibility-based evaluation
indicators.10-17 For expert knowledge-based evaluation
indicators, the ranking of synthesis results is determined by
experts.8,9 Although methods of this kind have a high
confidence level, they still suffer from ambiguity and a lack of
objectivity, so they are difficult to apply in retrosynthesis tasks
and cannot provide objective guidance on synthesis
routes.6 For synthesis complexity/accessibility-based
evaluation indicators, the feasibility of synthesis is quantified from
molecular structures and the reaction relationship between reactants and
products.10-17 Although these indicators eliminate the ambiguity
and objectivity problems, they still cannot account for the influence
of reaction agents and conditions.
Here, a brief overview of existing synthesis
complexity/accessibility-based evaluation indicators is given.
SAscore13 is based on an analysis of Extended Connectivity Fingerprint
(ECFP)18 fragments obtained from the compounds of the PubChem
database.19 Each fragment is assigned a numerical score according to
its frequency of occurrence. After combining the fragment score with the
penalty for complexity and the bonus for symmetry, SAscore is able to measure
compound synthesis accessibility on a high-throughput scale. SAscore is
widely used in guiding synthesis directions in
retrosynthesis.20,21 Based on the assumption that the
complexity of the reactants is lower than that of the products, a
data-driven metric, SCscore14, was designed to describe real syntheses.
Trained on 22 million reactant-product pairs from the
Reaxys22 database, SCscore is able to describe the
complexity of the synthetic route.4 Although this
evaluation metric differs from synthetic accessibility metrics, it takes
Morgan fingerprints as input and can also be used as a guide for
retrosynthesis. SYBA15 is a fragment-based method
for the rapid classification of the synthesis difficulty of organic
compounds. It uses a Bernoulli naïve Bayes classifier to assign SYBA score
contributions to individual fragments based on their frequencies in
databases of easy-to-synthesize (ES) and hard-to-synthesize (HS) molecules. Although it
can be used to quickly rank large molecular datasets for high-throughput
screening or molecular design, it still cannot compete with more
sophisticated synthetic path reconstruction methods that enable the
incorporation of other factors.23 RAscore and GASA are
evaluation metrics that use a similar method to assess retrosynthetic
accessibility.16,17 Both use Machine Learning (ML) to
estimate the probability that a molecule is retrosynthetically
accessible. The data-driven models of RAscore and GASA were trained
on ES or HS labels generated by multistep retrosynthetic planning
algorithms such as Retro*24 and
AiZynthFinder.25 Although these evaluation
metrics are able to clearly determine the difficulty of molecular
synthesis, they still cannot account for the impact of reaction
agents.
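To make the fragment-based idea concrete, the following is a minimal sketch of a SYBA-style Bernoulli naïve Bayes score, not the published implementation. The fragment labels, the toy ES/HS sets, the function names and the Laplace smoothing constant are all assumptions made for illustration; in practice the fragments would come from a fingerprinting step such as ECFP, and the prior term is omitted here because the two toy classes are balanced.

```python
import math
from collections import Counter


def fit_fragment_scores(easy, hard, alpha=1.0):
    """Estimate per-fragment log-odds contributions from ES and HS molecules.

    easy, hard: lists of fragment collections, one per molecule.
    alpha: Laplace smoothing constant (an assumption of this sketch).
    Returns {fragment: (contribution_if_present, contribution_if_absent)}.
    """
    easy = [set(mol) for mol in easy]
    hard = [set(mol) for mol in hard]
    vocab = set().union(*easy, *hard)
    es_counts = Counter(f for mol in easy for f in mol)
    hs_counts = Counter(f for mol in hard for f in mol)
    scores = {}
    for f in vocab:
        # Smoothed probability that a molecule of each class contains f.
        p_es = (es_counts[f] + alpha) / (len(easy) + 2 * alpha)
        p_hs = (hs_counts[f] + alpha) / (len(hard) + 2 * alpha)
        scores[f] = (
            math.log(p_es / p_hs),
            math.log((1 - p_es) / (1 - p_hs)),
        )
    return scores


def score_molecule(fragments, scores):
    """Sum the contributions of every known fragment (present or absent).

    Fragments that never occurred in the training sets are ignored.
    """
    fragments = set(fragments)
    return sum(
        present if f in fragments else absent
        for f, (present, absent) in scores.items()
    )


# Hypothetical fragment labels; real ones would be fingerprint identifiers.
easy_set = [{"benzene", "acid"}, {"benzene", "amine"}, {"acid", "amine"}]
hard_set = [{"spiro", "bridged"}, {"spiro", "benzene"}, {"bridged", "macrocycle"}]

scores = fit_fragment_scores(easy_set, hard_set)
easy_score = score_molecule({"benzene", "amine"}, scores)
hard_score = score_molecule({"spiro", "bridged"}, scores)
print(easy_score, hard_score)
```

A positive total indicates that the fragment pattern looks more like the ES set, and a negative total that it looks more like the HS set. Because the score depends only on the fragments of a single molecule, nothing in it reflects reagents or conditions, which is the limitation noted above.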
With the development of ML, Graph Neural Networks (GNNs) have been increasingly
used in chemistry. In addition to predicting molecular thermodynamic
properties on datasets such as QM9,26-29 they have
also been used in molecular generation,30,31 reinforcement learning for
molecular design,32 molecular representation learning33-37 and reaction
yield prediction38 in recent years. For molecular
representation learning, the SMILES Contrastive LeaRning (SMICLR)
framework, which embraces multimodal molecular data, was proposed. It
jointly trains a graph encoder and a SMILES encoder to perform
contrastive learning. Through data augmentation on graphs and SMILES
sequences, the SMICLR model successfully reduced the prediction error for
the energetic and electronic properties of the QM9
dataset.33 MolCLR is a self-supervised learning
framework that applies graph data augmentation and contrastive
learning to a large unlabeled molecular database to learn
molecular representations. Benefiting from this pre-training,
MolCLR achieves state-of-the-art results
on several challenging benchmarks after fine-tuning.34 GeomGCL designs a novel geometric graph contrastive scheme to enable
collaborative supervision between 2D and 3D molecular graph geometric
views, aiming to improve model generalization ability on molecular graph
classification and regression.35 MoCL is a contrastive
learning framework which utilizes domain knowledge at both local and
global levels to learn molecular representations. By replacing valid
substructures with bioisosteres that share similar properties, MoCL
achieves accurate prediction of molecular properties, providing a
suitable and powerful augmentation method for molecular
graphs.36 KCL builds a knowledge graph data
augmentation module by using fundamental chemical attributes to connect
atoms that are not directly connected by bonds.37 Using
a double MPNN model, KCL obtained superior performance against
state-of-the-art baselines on eight molecular datasets in extensive
experiments, demonstrating the feasibility of the framework
for molecular representation learning. In summary, contrastive learning
methods show better performance on molecular property prediction, which
indicates that contrastive learning helps the model
extract more informative features and improves the prediction of molecular
properties.
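The contrastive objective that such frameworks build on can be sketched as follows. This is a minimal, dependency-free illustration of the normalized temperature-scaled cross-entropy (NT-Xent) loss, a widely used contrastive loss; whether each method cited above uses exactly this form is not stated here, and the function names, toy embeddings and temperature value are assumptions. Each molecule contributes two embeddings (for example, two augmented views, or a graph view and a SMILES view); the loss pulls the paired embeddings together and pushes all other embeddings in the batch away.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nt_xent(view_a, view_b, temperature=0.5):
    """Mean NT-Xent loss over a batch of paired embeddings.

    view_a[i] and view_b[i] are two views of the same molecule
    (the positive pair); every other embedding in the batch is a negative.
    """
    z = list(view_a) + list(view_b)  # 2N embeddings in total
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same molecule
        logits = [
            cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i
        ]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # -log( exp(pos) / sum_k exp(logit_k) )
        total += math.log(sum(math.exp(l) for l in logits)) - pos_logit
    return total / (2 * n)


# Toy 2-D embeddings: two molecules, two views each.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = nt_xent(anchors, [[0.9, 0.1], [0.1, 0.9]])   # views agree
shuffled = nt_xent(anchors, [[0.1, 0.9], [0.9, 0.1]])  # views mismatched
print(aligned, shuffled)
```

The loss is lower when the two views of each molecule are close in embedding space than when they are mismatched, which is the signal an encoder is trained to exploit during pre-training.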
In this work, we transfer the approach used to generate molecular synthesis
accessibility scores to reaction superiority, and design a total
atom-atom mapping algorithm for reactions to complete the atomic mapping
relationships in the chemical reaction database. Using reaction
descriptors constructed from the reaction mapping relationships and
reaction reagents, a chemical reaction representation learning model is
built through contrastive learning. After fine-tuning the
model on a binary classification task for determining reaction
superiority, a reaction superiority score (RSscore) is generated to
evaluate the superiority of chemical reactions and is further applied to
reaction evaluation and synthesis route analysis.