Fig 1  Convolution computing array for DwC mode.
For a DwC layer with input and output channel parallelism l, only l groups of PEs can be active at a time on a traditional CNN accelerator, leaving the rest of the computation array idle. Therefore, so that the computation engine can share a single computation array between the DwC and standard convolution modes while maintaining high resource utilization, an additional control module manages the DwC mode. As shown in Fig 1, the standard convolution computation array is divided into q image processing units (PPEs). Each PPE reads l different channels of input data and the corresponding convolution kernel weights from the buffers, and contains l groups of window processing units (WPEs) for multiplication. Each WPE computes a single sliding window in parallel. Since every DwC layer uses a uniform 3×3 kernel, the PE parallelism within each WPE is p = 9.
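The PPE/WPE organization above can be sketched as a simple functional model. This is an illustrative simulation, not the authors' RTL; the parameter values and the function name `dwc_array_cycle` are assumptions for the example.

```python
import numpy as np

# Assumed example parameters: q PPEs, l channels per PPE, 3x3 kernels (p = 9).
Q, L, K = 4, 8, 3

def dwc_array_cycle(windows, kernels):
    """Model one clock cycle of the DwC-mode array.

    windows: (Q, L, 3, 3) -- one 3x3 input window per WPE
    kernels: (L, 3, 3)    -- one depthwise kernel per channel
    Returns (Q, L) output pixels, one per WPE.
    """
    # Each of the q*l WPEs performs its p = 9 multiplications in parallel;
    # the 9 products are then accumulated into a single output pixel.
    products = windows * kernels[None, :, :, :]
    return products.sum(axis=(2, 3))

rng = np.random.default_rng(0)
wins = rng.standard_normal((Q, L, K, K))
ks = rng.standard_normal((L, K, K))
out = dwc_array_cycle(wins, ks)
```

Each call to `dwc_array_cycle` corresponds to one clock cycle producing q × l output pixels, matching the per-cycle behavior described in the text.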
In each clock cycle, the computation array first reads l × p convolution weights per column from the weight buffer. As long as the input feature-map channels remain the same, the same batch of kernel weights serves every convolution operation, so the weights need to be loaded only once. Then, the p pixels of a single convolution window from l input channels of q different images are read from the input buffer and dispatched to different PEs for convolution. The number of parallel multiplications is therefore q × l × p. The convolution window traversal completes an entire feature map before moving on to the next set of channel feature maps, until all q input images have finished their convolutions. Besides, the size of the convolution array is set to 27×27, so the parallelism of standard convolution and DwC is equal, ensuring that both convolution modes fully utilize the hardware resources of the computation module without waste.
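The cycle count implied by the q × l × p parallelism can be estimated with a short helper. This is illustrative arithmetic under the stated parallelism model, ignoring weight-load and pipeline stalls; the function name and layer dimensions are assumptions for the example.

```python
def dwc_cycles(num_images, channels, out_h, out_w, q, l, p=9):
    """Estimate clock cycles for a DwC layer on the q x l x p array.

    Every output pixel of every channel needs p = 9 multiplications;
    the array retires q * l * p multiplications per cycle.
    """
    total_muls = num_images * channels * out_h * out_w * p
    per_cycle = q * l * p
    # Ceiling division: a partially filled cycle still costs a full clock.
    return -(-total_muls // per_cycle)
```

For instance, when the batch and channel counts match the array exactly (`num_images = q`, `channels = l`), the estimate reduces to one cycle per output pixel of a single feature map: `dwc_cycles(4, 32, 28, 28, q=4, l=32)` gives 28 × 28 = 784 cycles.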
After the convolution array completes its computation, the feature-map data enters the post-processing module, which comprises the addition-tree, activation, pooling, and channel-shuffle modules. The hardware accelerator architecture is shown in Fig 2. Once all computation is finished, the output feature-map data is sent to the output buffer and then written back to BRAM for the next round of computation.
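The post-processing chain can be sketched functionally as follows. This is an assumed software model of the four stages the text lists; the specific choices (ReLU activation, 2×2 max pooling, ShuffleNet-style shuffle over `groups` groups) are illustrative assumptions, not details given in the source.

```python
import numpy as np

def post_process(partials, groups=2):
    """Model the post-processing module on (n_terms, C, H, W) partial sums."""
    x = partials.sum(axis=0)                     # addition tree: accumulate partials
    x = np.maximum(x, 0.0)                       # activation (ReLU assumed)
    c, h, w = x.shape
    x = x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))  # 2x2 max pool
    # Channel shuffle: split channels into groups, transpose, and flatten,
    # interleaving channels across groups (ShuffleNet-style, assumed).
    x = x.reshape(groups, c // groups, *x.shape[1:])
    return x.transpose(1, 0, 2, 3).reshape(c, *x.shape[2:])
```

The output of this chain corresponds to the feature map that is sent to the output buffer and written back to BRAM.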