2.1 SEM Data Augmentation by StyleGAN2-ADA
GANs are a class of generative network models typically consisting of
two competing players: a generator network \(G\) and a discriminator
network \(D\) (Goodfellow et al., 2014). The generator \(G\) tries to
synthesize fake samples \(G(z)\) from the input random noise \(z\sim p_{z}(z)\) to fool \(D\) by mimicking real samples \(x\sim p_{\text{data}}(x)\), while \(D\) conversely aims to
distinguish between real samples \(x\sim p_{\text{data}}(x)\) and fake
samples \(G(z)\). The notation \(y\sim p(y)\) indicates that the random
variable \(y\) is distributed according to the probability distribution \(p(y)\). The two networks \(G\) and \(D\) are trained in an adversarial
manner, which is equivalent to playing a minimax game with a loss
function \(\mathcal{L}_{\text{GAN}}\left(D,G\right)\) given by
\begin{equation}
\min_{G}\max_{D}\ \mathcal{L}_{\text{GAN}}(D,G)=E_{x\sim p_{\text{data}}(x)}\left[\log{D\left(x\right)}\right]+E_{z\sim p_{z}(z)}\left[\log\left(1-D\left(G\left(z\right)\right)\right)\right].\ \ \ \ \ \ \ \ \ \ (1)\nonumber
\end{equation}
With such an adversarial training scheme, GANs are able to produce
sharp, high-quality images, outperforming approaches based on
pixel-wise mean square error (MSE) loss (Goodfellow et al., 2014).
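As an illustration, Equation (1) can be written directly as the discriminator and generator losses in a short PyTorch-style sketch; the function and variable names below are ours, and \(D\) is assumed to output the probability that its input is real.

```python
import torch

def gan_losses(D, G, real, z):
    """Minimal sketch of the minimax objective in Eq. (1)."""
    fake = G(z)

    # Discriminator ascends E[log D(x)] + E[log(1 - D(G(z)))],
    # i.e., it minimizes the negative of that sum.
    d_loss = -(torch.log(D(real)).mean()
               + torch.log(1.0 - D(fake.detach())).mean())

    # Generator descends E[log(1 - D(G(z)))]; in practice the
    # non-saturating form -E[log D(G(z))] is often used instead.
    g_loss = torch.log(1.0 - D(fake)).mean()
    return d_loss, g_loss
```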
Karras et al. (2018, 2019) developed StyleGAN and StyleGAN2, two
variants of GANs, which are powerful for high-resolution image
generation. They can control not only the style (global features) of the
image at different scales but also stochastic details (local features).
Since StyleGANs are unsupervised, they are well-suited for datasets
without conditional labels, which are common in real applications.
To address the long-standing challenge of training GANs with a small
dataset (in which case the discriminator quickly overfits to the
training samples, causing the training to diverge), Karras et al.
(2020) improved StyleGAN2 by introducing an adaptive discriminator
augmentation (ADA) mechanism; the resulting model is called StyleGAN2-ADA. With such
an augmentation mechanism, the style-based GANs perform well with
several thousand or even only several hundred training samples. A
detailed description of ADA can be found in the Supporting Information
S1.
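As a rough illustration of the ADA mechanism described in Supporting Information S1, the sketch below adjusts the augmentation probability \(p\) using the overfitting heuristic \(r_{t}=E\left[\operatorname{sign}(D(x))\right]\) evaluated on real images; the target value and ramp length are illustrative assumptions, not the exact settings used in this work.

```python
def update_ada_p(p, d_real_logits, batch_size,
                 target=0.6, ramp_imgs=500_000):
    """Sketch of ADA's adaptive augmentation-probability update.

    d_real_logits: discriminator outputs on real images (a torch tensor).
    target, ramp_imgs: illustrative values; when the discriminator grows
    too confident on real images (r_t > target), p is raised so that more
    augmentation is applied, otherwise p is lowered.
    """
    r_t = d_real_logits.sign().mean().item()  # overfitting heuristic
    step = batch_size / ramp_imgs             # fixed step per update
    p += step if r_t > target else -step
    return min(max(p, 0.0), 1.0)              # keep p within [0, 1]
```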
StyleGAN2 consists of two components: a mapping network and a synthesis
network. The goal of the mapping network is to encode the input random
noise \(\mathbf{z}\in\mathbb{Z}\) into a set of intermediate vectors \(\mathbf{w}\in\mathbb{W}\) using fully connected layers. Each
intermediate vector \(\mathbf{w}\) is further transformed to produce a
style scalar \(s\) by an affine transformation \(\mathbf{A}\). The major
benefit of the mapping network is that it helps disentangle the
latent representation, making the model easier to
interpret. Then, the synthesis network incorporates the styles \(s\) via a weight demodulation operation to generate the artificial images
starting from low resolution (4×4) and continuing to higher resolution
(8×8, 16×16, …, 1024×1024) by convolutional layers. The shallower
the layer in which the style \(s\) is incorporated, the coarser the
level of detail it affects. For example, the first few styles affect
the coarse levels of detail (4×4), while the last few styles affect the
fine levels of detail (1024×1024). This architecture enables
StyleGAN2 to control the global features of images at different scales.
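A compact sketch of the weight demodulation step is given below; the tensor shapes and the grouped-convolution trick for applying a different style per sample are our assumptions for illustration, not a verbatim excerpt of the reference implementation.

```python
import torch
import torch.nn.functional as F

def modulated_conv2d(x, weight, style, eps=1e-8):
    """Sketch of StyleGAN2 weight (de)modulation for one synthesis layer.

    x:      feature maps, shape (N, C_in, H, W)
    weight: convolution weights, shape (C_out, C_in, k, k)
    style:  per-sample style scalars s from the affine transform A,
            shape (N, C_in)
    """
    n, c_in, h, w_sp = x.shape
    c_out, _, k, _ = weight.shape

    # Modulation: scale each input channel of the weights by its style.
    w = weight.unsqueeze(0) * style.view(n, 1, c_in, 1, 1)

    # Demodulation: rescale so each output feature map keeps unit
    # expected standard deviation, preventing style-dependent magnitudes.
    demod = torch.rsqrt((w ** 2).sum(dim=[2, 3, 4]) + eps)
    w = w * demod.view(n, c_out, 1, 1, 1)

    # Grouped convolution applies a per-sample weight in a single call.
    x = x.reshape(1, n * c_in, h, w_sp)
    w = w.reshape(n * c_out, c_in, k, k)
    out = F.conv2d(x, w, padding=k // 2, groups=n)
    return out.reshape(n, c_out, h, w_sp)
```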
To control stochastic variations of generated images at different levels
of detail, random noise, after another affine transformation \(\mathbf{B}\), is injected into the feature maps of the convolutional
blocks. This allows the generator to change only the local features,
leaving the overall styles and high-level details intact. A detailed
network architecture of StyleGAN2 can be found in the Supporting
Information Figure S2.
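The noise injection can be sketched as follows; the learned per-channel strength below stands in for the transformation \(\mathbf{B}\) and is an illustrative simplification rather than the exact parameterization of the original network.

```python
import torch

class NoiseInjection(torch.nn.Module):
    """Sketch of per-layer noise injection (transformation B) in StyleGAN2."""

    def __init__(self, channels):
        super().__init__()
        # Learned scaling of the noise, one strength per feature channel.
        self.strength = torch.nn.Parameter(torch.zeros(channels))

    def forward(self, x):
        n, c, h, w = x.shape
        # A single-channel Gaussian noise image, freshly drawn per call,
        # perturbs only local, stochastic details of the feature maps.
        noise = torch.randn(n, 1, h, w, device=x.device)
        return x + self.strength.view(1, c, 1, 1) * noise
```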