2.3 Construction of reaction superiority classification model

The reaction classification model is constructed to determine reaction superiority for reaction pathway design and selection. Under the above reaction superiority constraints, most reaction data are unlabeled, so a model trained only on labeled data may generalize poorly to unknown reactions. It is therefore necessary to use unsupervised methods to extract structural differences between reaction graphs for representation learning. In this work, the unlabeled data are used to construct a reaction superiority classification model through a pre-training and fine-tuning approach, enhancing both prediction accuracy and generalization ability.

2.3.1 Reaction graph data augmentation algorithm

According to the model structure in Fig. 1(c), data augmentation plays an important role in contrastive learning and directly affects model training. Since most traditional graph data augmentation algorithms34 would destroy the reaction or molecule structure and the rationality of the reaction representation36, a data augmentation algorithm is proposed that generates reactions with similar properties under the assumption that carbon atoms not directly connected to the reacting atoms can be ignored (i.e., their steric effects are neglected). The main operation of this algorithm is adding or deleting carbon atoms that are not directly connected to the reacting atoms. Combining this method with feature-masking augmentation increases the diversity of samples without compromising the rationality of the reaction representation. The algorithm consists of three steps.
Step 1: Carbon addition and subtraction. Under the assumption that the steric effect of carbon atoms not directly connected to reacting atoms can be ignored, adding or deleting straight chains of fewer than three carbon atoms at non-reacting atoms is assumed not to affect the reaction. After the manipulatable atomic nodes are identified, carbon atoms are added to or removed from the reactant and product molecules using RDKit41, generating novel molecular structures.
Step 2: Mapping of reaction atom indices. Since the carbon and hydrogen atoms in a newly added straight chain have no atom-map index, the mapping relationships of the newly added atoms are supplemented based on the neighbor information of the modified atoms, ensuring a correct mapping relationship among the reaction atoms.
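Step 2 can be illustrated with a minimal sketch, assuming the augmentation added the same chain to both the reactant and product sides so the k-th unmapped atom on each side corresponds to the same new atom; `assign_new_maps` is a hypothetical function name, and `None` marks an atom without a map index.

```python
def assign_new_maps(reactant_maps, product_maps):
    """Give each newly added atom (map index None) a fresh atom-map
    number, shared between the reactant and product sides.
    Assumes unmapped atoms appear in corresponding order."""
    next_map = 1 + max(
        (m for m in reactant_maps + product_maps if m is not None),
        default=0)
    r_new = [i for i, m in enumerate(reactant_maps) if m is None]
    p_new = [i for i, m in enumerate(product_maps) if m is None]
    assert len(r_new) == len(p_new), "sides must gain the same atoms"
    r, p = list(reactant_maps), list(product_maps)
    for ri, pi in zip(r_new, p_new):
        r[ri] = p[pi] = next_map
        next_map += 1
    return r, p

# Two new atoms were appended on each side; they receive maps 3 and 4.
r, p = assign_new_maps([1, 2, None, None], [2, 1, None, None])
```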
Step 3: Molecule and atom-map index standardization. After all atom-map indices in the reaction are completed, the molecular and atomic numberings are normalized to make the resulting reaction SMARTS representation more rigorous. By reordering all map indices and re-normalizing the molecular SMARTS, a standardized SMARTS representation of the reaction is generated, facilitating the generation of augmented reaction graphs for contrastive learning.
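The map-index reordering in Step 3 can be sketched as follows, assuming atom maps appear in the SMARTS as `:n]`; `renumber_atom_maps` is a hypothetical helper introduced only for illustration, and the full standardization in the paper also re-normalizes the molecular SMARTS itself.

```python
import re

def renumber_atom_maps(rxn_smarts: str) -> str:
    """Renumber atom-map labels (:n) to consecutive integers 1..k in
    order of first appearance, so equivalent augmented reactions share
    one standardized SMARTS form."""
    mapping = {}
    def repl(match):
        old = match.group(1)
        if old not in mapping:
            mapping[old] = str(len(mapping) + 1)
        return ":" + mapping[old] + "]"
    return re.sub(r":(\d+)\]", repl, rxn_smarts)

std = renumber_atom_maps("[CH3:7][OH:12]>>[CH3:7][O-:12]")
# maps 7 and 12 become 1 and 2 on both sides of the reaction
```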

2.3.2 Pre-training model with contrastive learning

In the pre-training process, contrastive learning is used to train the initial parameters of the backbone model on a large amount of unlabeled reaction data. Negative-sample-based contrastive methods have often been applied to molecular contrastive learning in recent years33-37. Here, the SimCLR42 contrastive learning structure is used. The SimCLR structure contains three parts: the encoder, the decoder, and the loss function.
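The loss-function part of SimCLR is the NT-Xent objective, which pulls the two augmented views of the same reaction together while pushing apart views of different reactions. A pure-Python toy sketch (the real model operates on batched tensors; `nt_xent` and `cosine` are names introduced here for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nt_xent(z, tau=0.5):
    """NT-Xent loss over 2N embeddings, where z[2i] and z[2i+1] are
    the two augmented views of reaction i; all other embeddings in
    the batch serve as negatives."""
    n = len(z)
    loss = 0.0
    for i in range(n):
        j = i + 1 if i % 2 == 0 else i - 1  # index of the positive view
        denom = sum(math.exp(cosine(z[i], z[k]) / tau)
                    for k in range(n) if k != i)
        loss += -math.log(math.exp(cosine(z[i], z[j]) / tau) / denom)
    return loss / n

# Two perfectly aligned pairs: positives coincide, negatives are
# orthogonal, so the loss is well below the uniform value log(3).
loss = nt_xent([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
```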
The encoder part is constructed from the backbone model designed for extracting features from reaction graphs. To maximize the characterization capability of the GNN for graph classification, the Graph Isomorphism Network with Edge features (GINE)43 implemented in DGL44 is selected as the backbone propagation module for contrastive learning. With an MLP as the node-feature update function, the inclusion of the AIR residual45, and the reaction summary-node feature as the readout function, the summary-node feature provides a unique representation of each chemical reaction. The formula for GINE message passing is shown in Eq. (1).
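The neighborhood aggregation underlying GINE can be sketched in pure Python. This is a simplified illustration only, with a single scalar feature per node, a ReLU-clamped edge-augmented message, and the MLP update omitted; `gine_aggregate` is a hypothetical name, and the actual model uses the DGL implementation.

```python
def gine_aggregate(h, edges, eps=0.0):
    """One GINE aggregation step (before the MLP update):
    a_v = (1 + eps) * h_v + sum over in-neighbors u of ReLU(h_u + e_uv).
    h: node id -> feature vector (list); edges: (u, v, edge_feature)."""
    relu = lambda x: [max(0.0, xi) for xi in x]
    agg = {v: [(1 + eps) * x for x in hv] for v, hv in h.items()}
    for u, v, e in edges:
        msg = relu([hu + ei for hu, ei in zip(h[u], e)])
        agg[v] = [a + m for a, m in zip(agg[v], msg)]
    return agg

# Two nodes, two directed edges; the negative edge feature on the
# second edge suppresses its message entirely via the ReLU.
h = {0: [1.0], 1: [2.0]}
edges = [(0, 1, [0.5]), (1, 0, [-3.0])]
agg = gine_aggregate(h, edges)
```

Adding edge features inside the message (rather than ignoring them, as in plain GIN) is what lets the backbone distinguish reactions that differ only in bond types.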