Fig 1 Convolution computing array for DwC mode.
For a DwC with both input and output channel parallelism of l, only l groups of PEs can be used simultaneously when implemented on traditional CNN accelerators, while the rest of the PEs in the computation array sit idle. Therefore, to ensure that the
computation engine shares the same computation array for both DwC and
standard convolution modes, and to achieve high resource utilization
efficiency during computation, an additional control module is used to
manage the depthwise convolution mode. As shown in Fig 1, the standard
convolution computation array is divided into q image processing
units (PPE). Each PPE reads l different channels of input data
and the corresponding convolution kernel weights from the buffers. Each
PPE also contains l groups of window processing units (WPE) for
multiplication. Each WPE performs parallel computation on a single
sliding window. Since the depthwise convolution layer has a uniform 3×3
kernel size, the parallelism of the PEs in each WPE is p = 9.
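As a point of reference, the work of a single WPE — p = 9 parallel multiplications over one 3×3 sliding window, whose products are later accumulated — can be sketched as a minimal NumPy model (illustrative only, not the hardware implementation):

```python
import numpy as np

def wpe_window(window, kernel):
    """Model of one WPE: p = 9 parallel multiplications on a single
    3x3 sliding window; the products are then summed (addition tree)."""
    assert window.shape == (3, 3) and kernel.shape == (3, 3)
    products = window * kernel   # 9 multiplications done in parallel in hardware
    return products.sum()        # accumulation, performed by the addition tree

window = np.arange(9, dtype=np.float32).reshape(3, 3)
kernel = np.ones((3, 3), dtype=np.float32)
print(wpe_window(window, kernel))  # sum of 0..8 = 36.0
```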
In each clock cycle, the computation array first reads l × p convolution weights from the weight buffer for each column. When the
input feature map channels remain the same, the same batch of
convolution kernel weights can be used for each convolution operation,
so the weights need to be loaded only once. Then, p pixels of a
single convolution window from l input channels of q different images
are read from the input buffer and loaded into different
PEs for convolution. Therefore, the number of parallelizable
multiplications is q × l × p. The convolution window is
traversed by prioritizing the entire feature map before moving on to the
next set of channel feature map computations, until all q different input images have completed the convolution. In addition, the size
of the convolution array is set to 27×27. The parallelism for
both standard convolution and DwC is equal, ensuring that both
convolution modes can fully utilize the hardware resources of the
computation module without wasting resources.
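The per-cycle throughput can be checked with a small model of the DwC array: q PPEs, each containing l WPEs, each WPE holding p = 9 PEs, for q × l × p multiplications per cycle. The values q = l = 9 below are an assumption, chosen only so that q × l × p matches the stated 27×27 array size; the source does not give q and l explicitly:

```python
# Hypothetical parallelism parameters; q and l are assumptions chosen so
# that q * l * p equals the 27x27 = 729 PEs stated for the array.
q, l, p = 9, 9, 9   # PPEs, WPEs per PPE, PEs per WPE (3x3 kernel)

mults_per_cycle = q * l * p
print(mults_per_cycle)   # 729
print(27 * 27)           # 729, matching the convolution array size
```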
After the computation is completed in the convolution array, the feature
map data enters the post-processing module for further processing,
including the addition tree module, activation module, pooling module,
and channel shuffle module. The hardware accelerator architecture is
shown in Fig 2. After all the computation is finished, the output
feature map data is sent to the output buffer and then returned to
the BRAM for the next round of computation.
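Of the post-processing stages, channel shuffle is the least standard; a common formulation (the ShuffleNet-style reshape–transpose–flatten over g channel groups) can be sketched as below. Whether the accelerator uses exactly this grouping is an assumption:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Channel shuffle over the channel axis: split C channels into
    `groups` groups and interleave them (ShuffleNet-style sketch)."""
    c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 0, 2, 3)               # interleave the groups
    return x.reshape(c, h, w)                 # flatten back to C channels

x = np.arange(4).reshape(4, 1, 1)             # channels [0, 1, 2, 3]
print(channel_shuffle(x, 2).ravel())          # channels become [0, 2, 1, 3]
```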