mechanism. Although Transformer-based video algorithms have achieved state-of-the-art accuracy, their memory footprint and computational overhead are large, so subsequent research has mainly focused on reducing memory cost and computational complexity.
Methods: A new framework for video human action recognition is proposed that requires neither additional extraction of optical flow information nor 3D convolution kernels and numerous attention modules. In addition, a method that exploits the three dimensions of time, space, and channel is developed to obtain more discriminative features. The temporal and spatial characteristics and the action features of human behavior are excited and integrated across the time, space, and channel dimensions to form more discriminative spatio-temporal action features, so that the network can process temporal, spatial, and action features simultaneously, thereby improving computational effectiveness.
The main computational process is illustrated in Figure 1; it is based on a deep learning model that combines motion salience and multi-dimensional excitation at coarse and fine time granularity.
  1. Based on the dual-stream temporal segment deep learning network, the video is divided into T non-overlapping segments of equal duration, V = {V1, V2, V3, ..., VT}. In each segment, one frame is sampled as the key-frame, Kframe = {K1, K2, K3, ..., KT}, and n further frames are sampled as non-key-frames, NKframe (a minimal sampling sketch is given after this list).
  2. Using the time-domain Fourier transform, the pixel changes across frames caused by motion in the time dimension are used to locate motion-salient regions and generate a motion salience map. This map is applied to the original key-frames Kframe and non-key-frames NKframe to highlight the salient motion regions where human behavior occurs and to suppress irrelevant background information (a sketch of this step also follows the list).
  3. A spatio-temporal difference feature extraction method based on coarse and fine time granularity is used to model actions explicitly. Spatio-temporal differences at the coarse time granularity represent the long-term action changes between video segments, while differences at the fine time granularity represent more refined, short-term action changes; the long-term and short-term action changes are then integrated (see the differencing sketch below the list).
  4. A 2D CNN is used to extract spatial information from the video and generate appearance features; the appearance features are then integrated with the action features to jointly construct new spatio-temporal video features with stronger expressive ability.
  5. To reduce computational complexity while preserving recognition accuracy, a deformable-convolution-based motion excitation method (DCME) is used to excite the motion information in the spatio-temporal action features. It uses two deformable convolutions of different scales to compute feature differences between adjacent key-frames across the whole video, obtaining excitation parameters that enhance the action features and thereby generate video-level spatio-temporal action features (a deformable-convolution sketch follows the list).
  6. Because different channel features have varying importance, a spatio-temporal channel excitation method based on temporal correlation is used to obtain temporal and spatial information in the channel dimension, and the two are fused to generate channel-wise spatio-temporal features. These are combined with the video-level spatio-temporal action features produced by the motion excitation method to jointly construct a description of the action with richer motion and channel information (see the channel-excitation sketch after this list).
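As a minimal, illustrative sketch of the segment-wise sampling in step 1 (the uniform random choice of the in-segment positions and the function name are assumptions, not part of the original method):

```python
import numpy as np

def sample_frames(num_frames, T=8, n=4, seed=None):
    """Split a video of `num_frames` frames into T equal, non-overlapping
    segments and return one key-frame index plus n non-key-frame indices
    per segment (uniform random within the segment; an assumption here)."""
    rng = np.random.default_rng(seed)
    bounds = np.linspace(0, num_frames, T + 1, dtype=int)   # segment boundaries
    key_frames, non_key_frames = [], []
    for start, end in zip(bounds[:-1], bounds[1:]):
        segment = np.arange(start, end)
        key_frames.append(int(rng.choice(segment)))
        non_key_frames.append(
            rng.choice(segment, size=min(n, len(segment)), replace=False).tolist())
    return key_frames, non_key_frames

# Example: a 240-frame clip split into 8 segments, 1 key-frame + 4 non-key-frames each.
kf, nkf = sample_frames(240, T=8, n=4, seed=0)
```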
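For step 2, the sketch below approximates the motion salience map as the energy of the non-zero-frequency components of a per-pixel temporal Fourier transform, i.e. the part of each pixel that changes over time; the exact spectral weighting and normalization used by the method are not specified here, so this is only an illustrative approximation.

```python
import torch

def motion_salience(frames):
    """frames: tensor [T, C, H, W] for one video segment (float).
    Returns a salience map [H, W] that is large where pixel values
    change over time and small in static background regions."""
    gray = frames.mean(dim=1)                      # [T, H, W], collapse channels
    spectrum = torch.fft.fft(gray, dim=0)          # temporal FFT per pixel
    motion_energy = spectrum[1:].abs().sum(dim=0)  # drop the DC (static) component
    return motion_energy / (motion_energy.max() + 1e-8)   # [H, W] in [0, 1]

# Weight the original frames so that motion regions are emphasized.
frames = torch.rand(8, 3, 224, 224)
weighted = frames * motion_salience(frames)[None, None]   # broadcast over T and C
```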
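For step 3, a hedged sketch of spatio-temporal differencing at two time granularities, assuming that coarse differences are taken between adjacent key-frame features (segment level) and fine differences between each key-frame and the non-key-frames of the same segment; the simple additive fusion is an assumption.

```python
import torch

def coarse_fine_motion(key_feats, nonkey_feats):
    """key_feats:    [N, T, C, H, W]    features of one key-frame per segment
    nonkey_feats: [N, T, n, C, H, W] features of n non-key-frames per segment
    Returns motion features [N, T, C, H, W] combining both granularities."""
    # Coarse granularity: differences between adjacent segments (long-term change).
    coarse = key_feats[:, 1:] - key_feats[:, :-1]             # [N, T-1, C, H, W]
    coarse = torch.cat([coarse, coarse[:, -1:]], dim=1)       # pad back to length T
    # Fine granularity: differences inside each segment (short-term change).
    fine = (nonkey_feats - key_feats.unsqueeze(2)).mean(dim=2)  # [N, T, C, H, W]
    return coarse + fine    # simple fusion; the paper's fusion may differ

motion = coarse_fine_motion(torch.rand(2, 8, 64, 56, 56),
                            torch.rand(2, 8, 4, 64, 56, 56))
```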
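For step 5, a sketch of deformable-convolution motion excitation built on torchvision.ops.DeformConv2d. The 3x3 and 5x5 kernels, the offset predictors, and the sigmoid gating are illustrative assumptions rather than the exact DCME configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DCMEBlock(nn.Module):
    """Excite motion in key-frame features with two deformable convolutions of
    different kernel sizes applied to adjacent key-frame feature differences."""
    def __init__(self, channels, k_small=3, k_large=5):
        super().__init__()
        # Plain convs predict the sampling offsets for the deformable convs.
        self.offset_s = nn.Conv2d(channels, 2 * k_small * k_small, 3, padding=1)
        self.offset_l = nn.Conv2d(channels, 2 * k_large * k_large, 3, padding=1)
        self.deform_s = DeformConv2d(channels, channels, k_small, padding=k_small // 2)
        self.deform_l = DeformConv2d(channels, channels, k_large, padding=k_large // 2)

    def forward(self, x):                        # x: [N, T, C, H, W] key-frame features
        diff = x[:, 1:] - x[:, :-1]              # adjacent key-frame differences
        diff = torch.cat([diff, diff[:, -1:]], dim=1)
        n, t, c, h, w = diff.shape
        d = diff.reshape(n * t, c, h, w)
        small = self.deform_s(d, self.offset_s(d))
        large = self.deform_l(d, self.offset_l(d))
        gate = torch.sigmoid(small + large).reshape(n, t, c, h, w)
        return x + x * gate                      # residual excitation of the input

out = DCMEBlock(64)(torch.rand(2, 8, 64, 56, 56))
```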
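For step 6, a squeeze-and-excitation-style sketch of spatio-temporal channel excitation: spatial statistics are pooled per channel, temporal context is added with a 1D convolution, and the fused descriptor re-weights the channels; the kernel size and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class SpatioTemporalChannelExcitation(nn.Module):
    """Re-weight channels using fused temporal and spatial statistics."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                            # x: [N, T, C, H, W]
        n, t, c, h, w = x.shape
        spatial = x.mean(dim=[3, 4])                 # [N, T, C] spatial squeeze
        temporal = self.temporal(spatial.transpose(1, 2)).transpose(1, 2)  # temporal context
        weights = self.fc(spatial + temporal)        # fused channel descriptor -> [N, T, C]
        return x * weights.view(n, t, c, 1, 1)       # channel-wise re-weighting

out = SpatioTemporalChannelExcitation(64)(torch.rand(2, 8, 64, 56, 56))
```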
In summary, the differences between the presented algorithm and existing 2D CNN methods that integrate spatio-temporal and motion features are: (1) the introduction of motion salience, which makes this type of 2D CNN method more robust for recognizing human actions in videos of complex scenes; (2) by combining the interaction information between frames within video segments and between video segments, the spatio-temporal and action features are built and fused at fine and coarse time granularity, respectively, giving a more accurate model of video-segment-level spatio-temporal action features in the time dimension; (3) two deformable convolutions of different scales are adopted to excite action features with diverse spatial scale changes and irregular deformations, and these features are combined with the channel-dimension features, so the spatio-temporal action features are excited in both the spatial and channel dimensions.
Data Set: Four popular video recognition benchmarks are used in the experiments in this paper: UCF101 [9] (101 action categories and 13,320 video clips), HMDB51 [10] (51 action categories and 6,766 video clips), Kinetics-400 [11] (400 action categories, each containing at least 400 video clips), and Something-Something [12] (two versions, SSV1 and SSV2, both containing 174 action categories). All data are divided into training and testing sets following the official split rules. UCF101, HMDB51, and Kinetics-400 are scene-related datasets, while Something-Something is temporally related.
Computational Process and Parameters: The experiments ran on Ubuntu 18.04 with an Intel Xeon E5-2620 2.10 GHz CPU, 64 GB RAM, and 4 GeForce GTX 1080 Ti GPUs, using Python and the PyTorch 1.8 + CUDA 11.0 deep learning framework.
Training phase. The video is first divided into T equal-interval segments, and 5 frames are randomly sampled in each segment, with the 4th frame selected as the key-frame. The short side of these frames is resized to 256, and they are then adjusted to 224 x 224 by random cropping and center cropping as inputs to the network. The network input is therefore [N, T, 3, 224, 224], where N is the batch size and T is the number of segments (key-frames) per video; in the experiments T is set to 8 or 16.
ResNet50 is taken as the backbone network and initialized with a model pre-trained on ImageNet. The learning rate starts at 0.01, and mini-batch stochastic gradient descent (mini-batch SGD) with a momentum of 0.9 and a weight decay of 0.0001 is used. The learning rate is adjusted at the 30th, 40th, and 50th epochs, and a total of 70 epochs are trained.
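A minimal sketch of this optimization setup in PyTorch (the decay factor of 0.1 at each milestone is an assumption, since the text only states that the learning rate is adjusted at those epochs):

```python
import torch
import torchvision

# ResNet50 backbone initialized from ImageNet pre-training
# (pretrained=True matches the PyTorch 1.8-era torchvision API).
model = torchvision.models.resnet50(pretrained=True)

# Mini-batch SGD: initial lr 0.01, momentum 0.9, weight decay 0.0001.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

# Learning rate adjusted at epochs 30, 40, and 50; 70 epochs in total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 40, 50], gamma=0.1)  # gamma=0.1 is assumed

for epoch in range(70):
    # ... one training pass over the data loader would go here ...
    scheduler.step()
```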
Testing phase. Two testing strategies are used to balance recognition accuracy and efficiency. One is center crop/1 clip, which center-crops each sampled video frame to 224 x 224 as network input; because of the randomness and sparsity of sampling, this strategy yields lower recognition accuracy. The other, more accurate strategy is 3-crop/10-clip: each video is sampled 10 times, each resulting set of frames is cropped 3 times as input to the network, and all recognition results are averaged to obtain the final prediction. Because of the multiple samplings and crops, this strategy gathers more comprehensive information, giving higher recognition accuracy but lower efficiency.
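A sketch of the score averaging in the 3-crop/10-clip protocol, assuming a model that maps an input of shape [N, T, 3, 224, 224] to class scores; the helper name and batching of views are illustrative.

```python
import torch

def multi_view_score(model, views):
    """views: tensor [num_clips * num_crops, T, 3, 224, 224]
    (30 views per video under the 3-crop/10-clip protocol).
    Returns the class scores averaged over all views."""
    with torch.no_grad():
        scores = model(views)          # [30, num_classes], one score vector per view
    return scores.mean(dim=0)          # averaged prediction for the video
```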
The evaluation indicators used in this paper cover efficiency and accuracy. The efficiency indicator is the number of floating-point operations (FLOPs), i.e., the computational complexity of the model. Recognition accuracy is evaluated by the average accuracy (A). Depending on the rule used to decide whether a positive sample is correctly classified, accuracy is divided into average accuracy (mA), Top-1 accuracy (A_top1, where a sample counts as correct if the class with the highest predicted probability is the true class), and Top-5 accuracy (A_top5, where a sample counts as correct if the five classes with the highest predicted probabilities include the true class).
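The Top-1 and Top-5 definitions above correspond to the standard top-k accuracy computation, sketched here (the function name is an assumption):

```python
import torch

def topk_accuracy(logits, labels, ks=(1, 5)):
    """logits: [N, num_classes] predictions, labels: [N] ground-truth classes.
    Returns Top-k accuracy for each k, matching the A_top1 / A_top5 definitions."""
    _, pred = logits.topk(max(ks), dim=1)       # [N, max_k] highest-probability classes
    correct = pred.eq(labels.unsqueeze(1))      # True where the true class appears
    return [correct[:, :k].any(dim=1).float().mean().item() for k in ks]

top1, top5 = topk_accuracy(torch.randn(32, 101), torch.randint(0, 101, (32,)))
```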
Ablation Experiment: The goal of the ablation experiment is to verify the effectiveness of each component of the algorithm. The original video frames and the video frames with highlighted motion regions were used as inputs for action feature extraction, and experiments were conducted on the four datasets, as shown in Table 1 (the recognition accuracy on UCF101 and HMDB51 was obtained with 16 video segments, while the other two datasets used 8 segments).
Table 1: Validation of the effectiveness of the motion salience map.