Fig. 1: Overview of the RSscore model framework
2.2 Dataset preparation
The USPTO dataset in the ORD database [39] is selected as the data
source for dataset generation. Since some reactions in the ORD database
do not have complete atom-mapping information (see the known atom
matching in Fig. 1(a), where the hydroxyl functional group and hydrogen
are shown in blue), the Reaction Total Atom-Atom Mapping (RTAAM)
algorithm is designed to complete the atom-mapping relationships.
Moreover, there is no clear boundary between superior and inferior
reactions, so label generation and label smoothing are used for
subsequent model training.
2.2.1 Reaction Total Atom-Atom Mapping (RTAAM) algorithm
In the dataset, unimportant products are not always recorded, so not
all atoms in a reaction have a mapping index. To supply the missing
products and complete the atom-mapping relationships, the RTAAM
algorithm is developed (Fig. 1(a)). The RTAAM algorithm can be divided
into three steps.
Step 1: Known atom mapping. The SMARTS of the missing product
is determined during the reaction completion process based on the
transfer relationship of the atoms. In the ORD database, known
atom-mapping relationships already exist in the reaction SMARTS, so
atomic remapping can be performed once the known atom-mapping
relationships have been identified. In cases where atom-mapping
relationships are absent, rxnmapper [40] is used to generate a
reasonable mapping so that the known atoms in the reaction can be
matched.
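The known-atom identification in Step 1 can be illustrated with a minimal sketch. It assumes the reaction is given as a mapped reaction string of the form reactants>agents>products, with map numbers written as `:n` inside bracket atoms. The function names (`map_numbers`, `mapping_status`) and the toy reactions are hypothetical and are not taken from the RTAAM implementation; the sketch only reports which map numbers are shared, which are missing from the recorded products, and whether an external mapper would be needed.

```python
import re
from typing import Dict, Set

# Captures the map number of a bracket atom such as [CH3:1] or [OH-:3].
_MAP_RE = re.compile(r"\[[^\]]*?:(\d+)\]")


def map_numbers(side: str) -> Set[int]:
    """Return all atom map numbers found on one side of a reaction."""
    return {int(n) for n in _MAP_RE.findall(side)}


def mapping_status(rxn: str) -> Dict[str, object]:
    """Summarise the known atom mapping of a 'reactants>agents>products' string."""
    reactants, _agents, products = rxn.split(">")
    r_maps = map_numbers(reactants)
    p_maps = map_numbers(products)
    known = r_maps & p_maps
    return {
        # Map numbers present on both sides: the known atom mapping.
        "known": sorted(known),
        # Mapped reactant atoms with no counterpart in the recorded products.
        "missing_in_products": sorted(r_maps - p_maps),
        # Hypothetical flag: no shared map numbers, so a mapper would be needed.
        "needs_mapper": not known,
    }


# Toy substitution in which the bromide product is not recorded.
status = mapping_status("[CH3:1][Br:2].[OH-:3]>>[CH3:1][OH:3]")
print(status)
```

In this toy case, map number 2 (the bromine) appears only among the reactants, which marks it as belonging to an unrecorded product. A reaction string without any map numbers would set `needs_mapper`, corresponding to the case where a tool such as rxnmapper is called first.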
Step 2: Atomic remapping. Following the atom remapping procedure for
the different reaction classes in Fig. 2, the reaction class and the
atom mapping indexes of the missing product are determined from the
available atom-mapping information and the differences in bond features
between the reactants and products. For substitution and elimination
reactions, the SMARTS of the missing product is constructed by directly
connecting the leaving-group atoms whose mapping numbers are missing
from the products. For addition and rearrangement reactions, all
products are already recorded in the reaction SMARTS, so the algorithm
only needs to operate on the bond changes to infer the atom-mapping
relationships. After the omitted hydrogen atoms and their corresponding
atom-mapping relationships are added, the remapping of the reaction
atoms is complete. More details are given in the Supplementary
Information.
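The bond-difference idea in Step 2 can be sketched as follows. This is an illustrative reading, not the RTAAM implementation: molecules are reduced to sets of bonds between atom map numbers (a hypothetical representation), bond orders are ignored, and the rule "if exactly two leaving-group atoms lost a bond to a retained atom, connect them" is an assumption made here to mimic the direct connection of leaving groups. All names (`bond`, `atoms_of`, `complete_missing_product`) are invented for the example.

```python
from typing import Dict, FrozenSet, Set

Bond = FrozenSet[int]  # undirected bond between two atom map numbers


def bond(a: int, b: int) -> Bond:
    return frozenset((a, b))


def atoms_of(bonds: Set[Bond]) -> Set[int]:
    # Assumption: every atom takes part in at least one bond in the input,
    # so isolated atoms (e.g. a lone ion) are not visible to this sketch.
    return set().union(*bonds)


def complete_missing_product(
    reactant_bonds: Set[Bond], product_bonds: Set[Bond]
) -> Dict[str, object]:
    retained = atoms_of(product_bonds)
    missing = atoms_of(reactant_bonds) - retained
    # Bonds lying entirely inside the unrecorded fragment are carried over.
    internal = {b for b in reactant_bonds if b <= missing}
    # Bonds that disappear and involve at least one retained atom.
    broken = (reactant_bonds - product_bonds) - internal
    formed = product_bonds - reactant_bonds
    # Leaving-group atoms that lost a bond to a retained atom.
    attach = sorted({a for b in broken for a in b if a in missing})
    # Assumed rule: two attachment atoms are joined to form the missing product.
    link = bond(*attach) if len(attach) == 2 else None
    missing_bonds = internal | ({link} if link else set())
    return {
        "broken": broken,
        "formed": formed,
        "missing_atoms": missing,
        "missing_product_bonds": missing_bonds,
    }


# Toy esterification with explicit map numbers:
# C:1(=O:2)-O:3 (acid) + H:5-O:4-C:6 (alcohol) -> C:1(=O:2)-O:4-C:6 (ester)
reactants = {bond(1, 2), bond(1, 3), bond(4, 5), bond(4, 6)}
products = {bond(1, 2), bond(1, 4), bond(4, 6)}
result = complete_missing_product(reactants, products)
print(result)
```

For the toy esterification, the hydroxyl oxygen (3) and the alcohol hydrogen (5) are absent from the recorded product, and joining them yields the O-H bond of the unrecorded water molecule. For addition or rearrangement reactions, `missing_atoms` would be empty and only the `broken` and `formed` sets would carry information.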