Fig. 1: Overview of the RSscore model framework

2.2 Dataset preparation

The USPTO dataset in the ORD database39 is selected as the data source for generating the dataset. Since some reactions in the ORD database lack complete atom-mapping information (an example of known atom matching, the hydroxyl functional group and hydrogen, is shown in blue in Fig. 1(a)), the Reaction Total Atom-Atom Mapping (RTAAM) algorithm is designed to complete the atom-mapping relationships. Moreover, there is no clear boundary between superior and inferior reactions, so label generation and label smoothing are used for subsequent model training.

2.2.1 Reaction Total Atom-Atom Mapping (RTAAM) algorithm

In the dataset, unimportant products are not always recorded, so not all atoms in a reaction carry a mapping index. To complete the missing products and the atom-mapping relationships, the RTAAM algorithm is developed (Fig. 1(a)). The RTAAM algorithm can be divided into three steps.
Step 1: Known atom mapping. The SMARTS of the missing product is determined during reaction completion from the transfer relationships of the atoms. In the ORD database, known atom-mapping relationships are already present in the reaction SMARTS, so atomic remapping can be performed once these known relationships have been identified. Where atom-mapping relationships are absent, rxnmapper40 is used to generate a reasonable mapping so that the known atoms in the reaction can be matched.
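As a minimal illustration of the known-atom identification in Step 1, the sketch below reads the atom-map numbers that already exist in a mapped reaction string and reports which reactant atoms have no counterpart among the recorded products. It is not the RTAAM implementation: the function names `map_numbers` and `mapping_gaps` are hypothetical, the input is assumed to follow the `reactants>agents>products` layout, and only bracket atoms with an explicit `:n` map number are considered.

```python
import re

# Captures the atom-map number of a bracket atom, e.g. "[CH3:1]" -> "1".
# Bracket atoms without a map number (e.g. "[Br-]") are not matched.
MAP_RE = re.compile(r"\[[^\]]*?:(\d+)\]")


def map_numbers(side):
    """Return the set of atom-map numbers found on one side of a reaction."""
    return {int(n) for n in MAP_RE.findall(side)}


def mapping_gaps(rxn):
    """Compare map numbers on the reactant and product sides.

    Assumes the string has the form 'reactants>agents>products'.
    """
    reactants, _agents, products = rxn.split(">")
    r_maps = map_numbers(reactants)
    p_maps = map_numbers(products)
    return {
        # Mapped reactant atoms that appear in no recorded product.
        "lost_in_products": sorted(r_maps - p_maps),
        # Mapped product atoms with no reactant counterpart.
        "only_in_products": sorted(p_maps - r_maps),
    }


# Hypothetical example: a substitution where the bromide product is unrecorded.
example = "[CH3:1][Br:2].[OH-:3]>>[CH3:1][OH:3]"
gaps = mapping_gaps(example)
print(gaps["lost_in_products"])
```

In this toy reaction, atom 2 (bromine) is mapped in the reactants but absent from the products, which marks it as belonging to an unrecorded product. Atoms written without brackets carry no map number and are invisible to this check, so a real pipeline would first need a complete mapping for every atom.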
Step 2: Atomic remapping. As shown by the atom remapping steps for the different reaction classes in Fig. 2, the reaction class and the atom-mapping indexes of the missing product can be determined from the available atom-mapping information and the differences in bond features between the reactants and products. For substitution and elimination reactions, the SMARTS of the missing product is constructed by directly connecting the leaving-group atoms that carry the missing mapping numbers. For addition and rearrangement reactions, all products are already recorded in the reaction SMARTS, so the algorithm only needs to operate on the bond changes to infer the atom-mapping relationships. After the omitted hydrogen atoms and their atom-mapping relationships are added, the remapping of the reaction atoms is complete. More details are given in the Supplementary Information.
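The bond comparison that Step 2 relies on can be sketched as follows. This is an illustrative simplification rather than the authors' procedure: bonds are assumed to be given as pairs of atom-map numbers, bond orders and hydrogens are ignored, and the helper names `bond_changes` and `leaving_atoms` are hypothetical.

```python
def bond_changes(reactant_bonds, product_bonds):
    """Return (broken, formed) bonds as sorted lists of map-number pairs.

    Each bond is a pair of atom-map numbers; the order within a pair
    does not matter.
    """
    r = {frozenset(b) for b in reactant_bonds}
    p = {frozenset(b) for b in product_bonds}
    broken = sorted(tuple(sorted(b)) for b in r - p)
    formed = sorted(tuple(sorted(b)) for b in p - r)
    return broken, formed


def leaving_atoms(broken, product_maps):
    """Atoms on broken bonds whose map number is absent from the products.

    These are candidate atoms of an unrecorded (missing) product.
    """
    atoms = {a for bond in broken for a in bond}
    return sorted(atoms - set(product_maps))


# Hypothetical substitution: bond 1-2 breaks, bond 1-3 forms, and
# atom 2 does not appear in any recorded product.
broken, formed = bond_changes([(1, 2)], [(3, 1)])
print(broken, formed, leaving_atoms(broken, {1, 3}))
```

Here one bond is broken and one is formed, and atom 2 is flagged as a leaving-group atom that would go into the reconstructed missing product. When no atoms are flagged, as the text describes for addition and rearrangement reactions, only the bond changes themselves are available to infer the mapping.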