For ab initio calculations using the cc-pVTZ basis set, relativistic effective core potentials were not available for molecules containing iodine; such species were therefore omitted from comparisons with the DLPNO-CCSD(T) and RI-MP2 methods. Similarly, the ANI-1x and ANI-1ccx methods support only molecules containing C, H, O, and N atoms, so evaluations were performed only on the supported subset. The ANI-2x method supports additional elements, but not bromine or iodine, and was likewise evaluated only on its supported subset.
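The element restrictions above amount to a per-method membership test. The following is a minimal sketch, not the analysis code used in this work: the function name and data layout are hypothetical, and the ANI-2x element set (H, C, N, O, F, S, Cl) is taken from the ANI-2x publication rather than from this text.

```python
# Elements each ANI model can handle. The ANI-2x set is an assumption
# based on its publication; the text only states that Br and I are excluded.
ANI_ELEMENTS = {
    "ANI-1x": {"C", "H", "O", "N"},
    "ANI-1ccx": {"C", "H", "O", "N"},
    "ANI-2x": {"C", "H", "O", "N", "F", "S", "Cl"},
}

# Elements that cause a molecule to be skipped for the cc-pVTZ ab initio methods.
EXCLUDED_ELEMENTS = {
    "DLPNO-CCSD(T)": {"I"},
    "RI-MP2": {"I"},
}


def is_supported(method, elements):
    """Return True if a molecule with these element symbols is evaluated by `method`."""
    elements = set(elements)
    if method in ANI_ELEMENTS:
        # ANI models: every element must be in the supported set.
        return elements <= ANI_ELEMENTS[method]
    # Other methods: only reject molecules containing an excluded element.
    return not (elements & EXCLUDED_ELEMENTS.get(method, set()))
```

Methods not listed in either table are treated as supporting every molecule, which matches the text only insofar as no other restrictions are mentioned.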
For bag-of-features ML testing, the training set comprised five conformers of each molecule, with the remaining conformers used as the test/validation set. Molecules with fewer than five conformers contributed all of their conformers to the training set and were omitted from the test set.
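The split described above can be sketched as follows. This is an illustrative sketch only: the text does not state which five conformers were chosen for training, so taking the first five in input order is an assumption, and the function name and dictionary layout are hypothetical.

```python
def split_conformers(conformers_by_molecule, n_train=5):
    """Split conformers into training and test lists of (molecule, conformer) pairs.

    Assumption: the first `n_train` conformers of each molecule go to training.
    Molecules with fewer than `n_train` conformers contribute everything to
    training and therefore never appear in the test list.
    """
    train, test = [], []
    for molecule, conformers in conformers_by_molecule.items():
        train.extend((molecule, c) for c in conformers[:n_train])
        test.extend((molecule, c) for c in conformers[n_train:])
    return train, test
```

Because slicing past the end of a list yields an empty list, molecules with five or fewer conformers drop out of the test set without a separate branch.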
Results
In this work, we focus on evaluating single-point atomization energy calculations on a subset of ~700 organic molecules. Conformers were initially generated with Open Babel as a set of 250 diverse poses with maximal heavy-atom root mean squared deviation (RMSD). At most 10 poses per molecule were then selected based on the lowest heat of formation calculated by PM7, followed by full geometry optimization using B3LYP-D3BJ with the def2-SVP basis set.\cite{Kanal_2017}
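The pose-selection step can be illustrated with a short sketch. It assumes each pose is a hypothetical `(pose_id, heat_of_formation)` tuple with the PM7 value already computed; it does not reproduce the Open Babel or PM7 calculations themselves.

```python
import heapq


def select_lowest_energy_poses(poses, max_poses=10):
    """Return up to `max_poses` poses, ordered by increasing heat of formation.

    `poses` is an iterable of (pose_id, heat_of_formation) tuples; this layout
    is an assumption made for illustration.
    """
    return heapq.nsmallest(max_poses, poses, key=lambda pose: pose[1])
```

Molecules with fewer than 10 generated poses simply keep all of them, which is consistent with "at most 10" in the text.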
Using this set of DFT-optimized minima, single-point atomization energies were computed using the DLPNO-CCSD(T) method\cite{Liakos_2015,Guo_2018} with the cc-pVTZ basis set (Dunning 1989, Kendall 1992). This approach has been found to be highly accurate for calculating thermochemical properties, at a significantly lower computational cost than canonical CCSD(T) methods for medium to large organic molecules.
\cite{Paulechka_2017,Liakos_2019,Liakos_2015} Using only the set of molecules for which all standard (i.e., not machine-learning-based) methods completed leaves 6511 entries. Of those, 9 molecules (out of 690) had 2 or fewer poses and were also removed, leaving 681 unique molecules and ~6500 entries for comparison. To our knowledge, this is the most extensive computational validation set for studying low-energy molecular conformers in terms of the number of compounds, geometries, and computational methods. We provide all data and analysis scripts as open data and open source via a GitHub repository to allow future reuse.
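The two filtering steps described above (keeping only entries for which every standard method completed, then dropping molecules with 2 or fewer remaining poses) can be sketched as follows. This is an illustrative sketch under assumed data structures, not the analysis script from the repository: each entry is taken to be a dictionary with hypothetical `"molecule"` and `"energies"` keys, and a failed calculation is represented by a missing or `None` energy.

```python
from collections import Counter


def filter_entries(entries, methods, min_poses=3):
    """Keep complete entries belonging to molecules with at least `min_poses` poses.

    An entry is complete when every method in `methods` has a non-None energy.
    `min_poses=3` corresponds to removing molecules with 2 or fewer poses.
    """
    complete = [
        e for e in entries
        if all(e["energies"].get(m) is not None for m in methods)
    ]
    poses_per_molecule = Counter(e["molecule"] for e in complete)
    return [e for e in complete if poses_per_molecule[e["molecule"]] >= min_poses]
```

Pose counts are taken after the completeness filter, so a molecule that loses poses to failed calculations can fall below the threshold; whether the original analysis counted poses in this order is an assumption.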