The core of the accelerator is a highly parallel array that supports both standard convolution and DwC. For standard convolution, we exploit channel-level parallelism. In one clock cycle, each row of the convolution array reads the convolution window at the same position from l input channels, while each column reads the convolution kernel weights for m output feature maps. The array uses a total of l × m PEs to perform the element-wise multiplications. The input feature maps and weights are accessed in address order, a parallel access pattern that matches the natural data storage layout. Moreover, each datum is read only once within a given clock cycle, which reduces the bandwidth requirement. After Dk × Dk cycles (where Dk is the kernel size), a set of convolution results is obtained. The sliding window then moves to the next position to traverse the entire feature map, and the computation continues for the next input feature-map channels. This process repeats until all m output feature maps have completed the convolution, after which the calculation starts for the next group of l × m channel dimensions of the feature maps.
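To make the dataflow concrete, the sketch below models one l × m group of the array in NumPy. Each of the Dk × Dk cycles broadcasts one input pixel per row (one per input channel) and one weight per column (one per output map), the l × m PEs perform element-wise multiplications, and the per-column partial sums are reduced into the corresponding output map. The array dimensions, feature-map size, and stride-1/no-padding setup are illustrative assumptions, not taken from the accelerator itself.

```python
import numpy as np

# Illustrative parameters; l, m, Dk follow the text, the rest is assumed.
l, m = 4, 4          # rows = parallel input channels, columns = parallel output maps
Dk = 3               # kernel size
H = W = 8            # input feature-map height/width (stride 1, no padding)
H_out = W_out = H - Dk + 1

ifmaps = np.random.rand(l, H, W).astype(np.float32)        # l input channels
weights = np.random.rand(m, l, Dk, Dk).astype(np.float32)  # m output maps x l input channels

out = np.zeros((m, H_out, W_out), dtype=np.float32)

# Slide the Dk x Dk window over every output position.
for y in range(H_out):
    for x in range(W_out):
        # Accumulators for the l x m PE array at this window position.
        acc = np.zeros((l, m), dtype=np.float32)
        # One kernel element per clock cycle: Dk * Dk cycles per window.
        for ky in range(Dk):
            for kx in range(Dk):
                # Each row broadcasts one pixel of the window (same position,
                # channel i); each column broadcasts one weight (output map j).
                pixels = ifmaps[:, y + ky, x + kx]   # shape (l,)
                taps = weights[:, :, ky, kx]         # shape (m, l)
                # l x m element-wise multiplies, one per PE, accumulated in place.
                acc += pixels[:, None] * taps.T      # shape (l, m)
        # Reduce the l partial sums of each column into its output feature map.
        out[:, y, x] += acc.sum(axis=0)
```

Covering a full layer would simply repeat this loop nest over successive groups of l input channels and m output maps, as the text describes.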