2.1 SEM Data Augmentation by StyleGAN2-ADA
GANs are a class of generative network models typically consisting of two competing players: a generator network \(G\) and a discriminator network \(D\) (Goodfellow et al., 2014). The generator \(G\) tries to synthesize fake samples \(G(z)\) from input random noise \(z\sim p_{z}(z)\) to fool \(D\) by mimicking real samples \(x\sim p_{\text{data}}(x)\), while \(D\) conversely aims to distinguish between real samples \(x\sim p_{\text{data}}(x)\) and fake samples \(G(z)\). The notation \(y\sim p(y)\) indicates that the random variable \(y\) is distributed according to the probability distribution \(p(y)\). The two networks \(G\) and \(D\) are trained in an adversarial manner, which is equivalent to playing a minimax game in which \(G\) minimizes and \(D\) maximizes the loss function \(\mathcal{L}_{\text{GAN}}\left(D,G\right)\) given by
\begin{equation} \min_{G}{\max_{D}{\mathcal{L}_{\text{GAN}}(D,G)}}=E_{x\sim p_{\text{data}}(x)}\left[\log{D\left(x\right)}\right]+E_{z\sim p_{z}(z)}\left[\log\left(1-D\left(G\left(z\right)\right)\right)\right].\ \ \ \ \ \ \ \ \ \ (1)\nonumber \end{equation}
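For concreteness, the sketch below shows one way the two terms of Equation (1) could be evaluated in practice (a minimal sketch, assuming PyTorch; the networks `G` and `D`, the latent dimension, and the batch handling are illustrative placeholders rather than the implementation used in this work). The discriminator is trained to maximize \(\mathcal{L}_{\text{GAN}}\), i.e., to minimize its negative, while the generator is trained to minimize the second term.

```python
# Minimal sketch (PyTorch assumed) of the minimax objective in Eq. (1).
# G, D, and latent_dim are illustrative placeholders; D is assumed to output
# a probability in (0, 1) for "sample is real".
import torch

def gan_losses(D, G, real_x, latent_dim):
    z = torch.randn(real_x.size(0), latent_dim)            # z ~ p_z(z)
    fake_x = G(z)                                           # G(z)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    # (implemented as minimizing the negative of that sum).
    d_loss = -(torch.log(D(real_x)).mean()
               + torch.log(1.0 - D(fake_x.detach())).mean())

    # Generator step: minimize log(1 - D(G(z))).
    g_loss = torch.log(1.0 - D(fake_x)).mean()
    return d_loss, g_loss
```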
With such an adversarial training scheme, GANs are able to produce high-quality, sharp images, outperforming approaches based on a pixel-wise mean square error (MSE) loss (Goodfellow et al., 2014).
Karras et al. (2018, 2019) developed StyleGAN and StyleGAN2, two variants of GANs that are powerful for high-resolution image generation. They can control not only the style (global features) of the image at different scales but also its stochastic details (local features). Since StyleGANs are unsupervised, they are well suited to datasets without conditional labels, which are common in real applications. To address the long-standing challenge of training GANs with small datasets, in which case the discriminator quickly overfits the training samples and the training diverges, Karras et al. (2020) improved StyleGAN2 by introducing an adaptive discriminator augmentation (ADA) mechanism; the resulting model is called StyleGAN2-ADA. With this augmentation mechanism, the style-based GANs perform well with several thousand or even only several hundred training samples. A detailed description of ADA can be found in the Supporting Information S1.
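To illustrate the adaptive idea (the full details are given in the Supporting Information S1), the sketch below shows one plausible form of the ADA update: an overfitting heuristic \(r_t\) is estimated from the sign of the discriminator outputs on real images, and the augmentation probability \(p\) is raised or lowered accordingly, as described by Karras et al. (2020). The target value and adjustment speed used here are illustrative assumptions, not the exact settings of the official implementation.

```python
# Hedged sketch of an ADA-style update of the augmentation probability p.
# The constants (target=0.6, speed_img, batch_size) are illustrative only.
import numpy as np

def update_ada_p(p, d_real_logits, target=0.6,
                 speed_img=500_000, batch_size=64):
    # Overfitting heuristic r_t: average sign of the discriminator's raw
    # outputs on real images (close to +1 when D is overfitting).
    r_t = np.mean(np.sign(d_real_logits))

    # Increase p (more augmentation) when r_t exceeds the target, decrease it
    # otherwise; the step size lets p traverse [0, 1] over `speed_img` images.
    step = batch_size / speed_img
    p += step if r_t > target else -step
    return float(np.clip(p, 0.0, 1.0))
```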
StyleGAN2 consists of two components: a mapping network and a synthesis network. The goal of the mapping network is to encode the input random noise \(\mathbf{z}\in\mathbb{Z}\) into a set of intermediate vectors \(\mathbf{w}\in\mathbb{W}\) using fully connected layers. Each intermediate vector \(\mathbf{w}\) is further transformed by a learned affine transformation \(\mathbf{A}\) to produce a style \(s\). The major benefit of the mapping network is that it helps disentangle the latent representation and therefore makes the model easier to interpret. The synthesis network then incorporates the styles \(s\) via a weight demodulation operation to generate artificial images, starting from low resolution (4×4) and continuing to higher resolutions (8×8, 16×16, …, 1024×1024) through convolutional layers. The shallower the layer into which a style \(s\) is injected, the coarser the level of detail it affects: the first few styles control the coarse levels of detail (4×4), while the last few styles control the fine levels of detail (1024×1024). This architecture enables StyleGAN2 to control the global features of images at different scales. To control stochastic variations of the generated images at different levels of detail, random noise, after another affine transformation \(\mathbf{B}\), is injected into the feature maps of the convolutional blocks. This allows the generator to change only the local features, leaving the overall styles and high-level features intact. A detailed network architecture of StyleGAN2 can be found in the Supporting Information Figure S2.
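The sketch below illustrates, under stated assumptions, the two ideas described above: a mapping network that encodes \(\mathbf{z}\) into \(\mathbf{w}\), and a convolution whose weights are modulated by the style \(s=\mathbf{A}(\mathbf{w})\) and then demodulated back to unit norm. PyTorch is assumed; the layer widths and kernel size are illustrative, and the noise injection \(\mathbf{B}\) and the progressive-resolution blocks are omitted for brevity. This is a simplified sketch, not the official StyleGAN2 implementation.

```python
# Simplified sketch (PyTorch assumed) of a StyleGAN2-style mapping network and
# a style-modulated convolution with weight demodulation. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MappingNetwork(nn.Module):
    def __init__(self, z_dim=512, w_dim=512, n_layers=8):
        super().__init__()
        layers = []
        for i in range(n_layers):
            layers += [nn.Linear(z_dim if i == 0 else w_dim, w_dim),
                       nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)                       # intermediate latent w in W

class ModulatedConv2d(nn.Module):
    def __init__(self, w_dim, in_ch, out_ch, k=3):
        super().__init__()
        self.affine = nn.Linear(w_dim, in_ch)    # A: w -> style s (per channel)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k))

    def forward(self, x, w):
        s = self.affine(w)                                   # style s = A(w)
        # Modulate: scale the conv weights per input channel by s,
        # then demodulate so each output feature map has unit norm.
        weight = self.weight[None] * s[:, None, :, None, None]
        demod = torch.rsqrt((weight ** 2).sum(dim=[2, 3, 4]) + 1e-8)
        weight = weight * demod[:, :, None, None, None]
        # Grouped convolution applies each sample's modulated weights at once.
        b, c, h, wd = x.shape
        x = x.reshape(1, b * c, h, wd)
        weight = weight.reshape(-1, c, *self.weight.shape[2:])
        out = F.conv2d(x, weight, padding=1, groups=b)
        return out.reshape(b, -1, h, wd)

# Illustrative usage:
# w = MappingNetwork()(torch.randn(4, 512))
# y = ModulatedConv2d(512, 64, 128)(torch.randn(4, 64, 16, 16), w)
```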