2.3 Construction of reaction superiority classification model
The reaction classification model is constructed to determine reaction
superiority for reaction pathway design and selection. Under the
reaction superiority constraints described above, most of the reaction
data are unlabeled, so a model trained only on labeled data may
generalize poorly to unknown reactions. It is therefore necessary to use
unsupervised methods that extract the differences between reaction graph
structures for representation learning. In this work, the unlabeled data
are used to construct the reaction superiority classification model
through pre-training followed by fine-tuning, to improve both prediction
accuracy and generalization ability.
2.3.1 Reaction graph data augmentation algorithm
As the model structure in Fig. 1(c) shows, data augmentation plays an
important role in contrastive learning and directly affects model
training. Most traditional graph data augmentation algorithms [34]
destroy the reaction or molecule structure and the rationality of the
reaction representation [36]. A data augmentation algorithm is therefore
proposed that generates reactions with similar properties, under the
assumption that the steric effect of carbon atoms not directly connected
to the reacting atoms can be ignored. Its main operation is adding or
deleting carbon atoms that are not directly connected to the reacting
atoms. Combining this operation with feature masking augmentation
increases sample diversity without compromising the rationality of the
reaction representation. The algorithm is divided into three steps.
Step 1: Carbon addition and subtraction. Under the assumption that the
steric effect of carbon atoms not directly connected to the reacting
atoms can be ignored, adding straight chains of fewer than 3 carbon
atoms to non-reactive atoms, or deleting them, does not affect the
reaction. After the manipulable atomic nodes are identified, carbon
atoms are added to or removed from the reactant and product molecules
using RDKit [41], generating novel molecular structures.
Step 2: Mapping of reaction atom indices. The carbon and hydrogen atoms
in a newly added straight chain have no atom-mapping index. The mapping
relationships of the newly added atoms are therefore supplemented based
on the neighbor information of the modified atoms, ensuring a correct
mapping relationship among the reaction atoms.
Step 3: Molecule and atom mapping index standardization. After the
mapping indices of all atoms in the reaction are completed, the molecule
and atom numbering is normalized to make the resulting reaction SMARTS
representation more rigorous. By reordering all mapping indices and
renormalizing the molecule SMARTS, a standardized SMARTS representation
of the reaction is generated, which facilitates the generation of
augmented reaction graphs for contrastive learning.
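
As a minimal sketch of the three steps above, the following Python code
works on a toy graph rather than on RDKit molecules. The `Mol` class and
the functions `editable_sites`, `add_chain` and `renumber` are
hypothetical names introduced for illustration only. Valence checks,
hydrogen atoms, carbon deletion and SMARTS generation are omitted, and
atoms are keyed directly by their atom-map number, which is a
simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Mol:
    """Toy molecular graph (hypothetical stand-in for an RDKit molecule)."""
    elements: dict = field(default_factory=dict)  # atom-map number -> element
    bonds: set = field(default_factory=set)       # frozenset({map_a, map_b})


def editable_sites(mol, reacting):
    """Atoms outside the reaction centre; a chain may be attached to these."""
    return sorted(m for m in mol.elements if m not in reacting)


def add_chain(reactant, product, site, length):
    """Steps 1 and 2: attach the same straight carbon chain at `site` on both
    sides and give every new carbon a fresh atom-map number shared by both."""
    if not 1 <= length < 3:
        raise ValueError("chain must contain 1 or 2 carbon atoms")
    next_map = max(set(reactant.elements) | set(product.elements)) + 1
    prev = site
    for new_map in range(next_map, next_map + length):
        for mol in (reactant, product):
            mol.elements[new_map] = "C"
            mol.bonds.add(frozenset((prev, new_map)))
        prev = new_map


def renumber(reactant, product):
    """Step 3: relabel atom-map numbers as 1..N, identically on both sides."""
    old = sorted(set(reactant.elements) | set(product.elements))
    new = {m: i for i, m in enumerate(old, start=1)}

    def relabel(mol):
        return Mol({new[m]: e for m, e in mol.elements.items()},
                   {frozenset(new[m] for m in b) for b in mol.bonds})

    return relabel(reactant), relabel(product)


# Toy reaction: the C20-O30 bond breaks, so atoms 20 and 30 are reacting.
reacting_atoms = {20, 30}
reactant = Mol({10: "C", 20: "C", 30: "O"},
               {frozenset((10, 20)), frozenset((20, 30))})
product = Mol({10: "C", 20: "C", 30: "O"}, {frozenset((10, 20))})

site = editable_sites(reactant, reacting_atoms)[0]  # atom 10
add_chain(reactant, product, site, length=2)
aug_reactant, aug_product = renumber(reactant, product)
```

The new carbons are bonded only to a non-reacting atom, so the reaction
centre is untouched, and the shared map numbers keep the reactant and
product sides aligned before the final renumbering.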
2.3.2 Pre-training model with contrastive learning
In the pre-training process, a contrastive learning model is used to
train the initial parameters of the backbone model on a large amount of
unlabeled reaction data. Negative sample-based contrastive methods have
often been applied in molecular contrastive learning models in recent
years [33-37]. Here, a contrastive learning model with the SimCLR [42]
structure is used. The SimCLR structure contains three parts, namely
the encoder, the decoder and the loss function.
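
The loss function of the original SimCLR framework is the normalized
temperature-scaled cross entropy (NT-Xent) loss. A minimal pure-Python
sketch is given below, assuming that this loss is used unchanged here.
The function names, the temperature value of 0.5 and the toy embeddings
are illustrative assumptions, not values taken from this work.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(z_a, z_b, temperature=0.5):
    """NT-Xent loss over a batch of paired views.

    z_a[i] and z_b[i] are the embeddings of two augmented views of the same
    reaction i; every other embedding in the batch acts as a negative sample.
    """
    z = list(z_a) + list(z_b)
    n = len(z_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same sample
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # -log( exp(pos) / sum_k exp(logit_k) )
        total += math.log(sum(math.exp(l) for l in logits)) - pos_logit
    return total / (2 * n)


# Two reactions, two views each (toy 2-dimensional embeddings).
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[1.0, 0.1], [0.1, 1.0]]
loss = nt_xent(views_a, views_b)
```

The loss decreases as the two views of the same reaction move closer
together and away from the views of other reactions, which is the signal
that trains the encoder during pre-training.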
The encoder is built from the backbone model designed for extracting
features from reaction graphs. To maximize the representational
capability of the GNN for graph classification, the Graph Isomorphism
Network with Edge features (GINE) [43] implemented in DGL [44] is
selected as the backbone propagation module for contrastive learning.
The module uses an MLP as the node feature updating function, includes
the AIR residual [45], and takes the reaction summary node feature as
the readout, so that the reaction summary node feature provides a unique
representation of each chemical reaction. The formula for GINE message
passing is shown in Eq. (1).