Figure: (a) raw video frame input; (b) video frames enhanced using motion salience maps.
It can be seen that processing the input with motion salience maps significantly improves accuracy on scene-related datasets: UCF101 rises from 94.9% to 96.8%, HMDB51 from 71.2% to 73.5%, and Kinetics-400 from 75.1% to 77.2%. However, SSV1, a temporally related dataset with simple backgrounds, shows no significant improvement in recognition accuracy. The main purpose of the motion salience maps is to reduce interference from irrelevant background information. Since scene-related datasets have cluttered backgrounds, highlighting motion-salient regions helps the network focus on the human action, yielding more accurate results.
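The construction of the motion salience maps belongs to the presented method itself; purely as an illustration of how such a map can suppress static background, the sketch below (our own assumption, not the authors' implementation) derives a crude salience map from frame differencing and uses it to re-weight the raw frames. The function name enhance_frames and the 0.5 background-retention factor are hypothetical choices.

```python
import torch
import torch.nn.functional as F

def enhance_frames(frames: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: re-weight frames by a crude motion salience map.

    frames: (T, C, H, W) float tensor of consecutive RGB frames in [0, 1].
    Returns frames of the same shape with static background attenuated.
    """
    # Crude motion cue: absolute difference between consecutive frames,
    # averaged over colour channels -> (T-1, 1, H, W).
    diff = (frames[1:] - frames[:-1]).abs().mean(dim=1, keepdim=True)

    # Reuse the first difference for frame 0 so every frame has a map.
    salience = torch.cat([diff[:1], diff], dim=0)

    # Smooth each map and normalise it to [0, 1] per frame.
    salience = F.avg_pool2d(salience, kernel_size=7, stride=1, padding=3)
    flat = salience.flatten(1)
    mn = flat.min(dim=1)[0].view(-1, 1, 1, 1)
    mx = flat.max(dim=1)[0].view(-1, 1, 1, 1)
    salience = (salience - mn) / (mx - mn + 1e-6)

    # Emphasise motion-salient regions while keeping some scene context.
    return frames * (0.5 + 0.5 * salience)
```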
Comparison with State-of-the-art: We compare the presented method with three representative 2D CNN methods, TSN [3], TSM [13], and TEA [14], using TSN as the baseline (almost all advanced and popular 2D CNN-based methods originate from TSN). The experimental results show that the presented method achieves higher recognition accuracy than TSN and TSM on all four datasets, whereas the improvement over TEA [14] is relatively small or negligible. The reason is that TSN [3] contains no module for modeling temporal information and therefore lacks motion cues. Consequently, on the scene-related datasets UCF101, HMDB51, and Kinetics-400 its accuracy is only slightly lower than that of the other three methods, while on the temporally related dataset SSV1 it lags far behind.
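For context, TSM [13] obtains its temporal modeling ability by shifting a fraction of the feature channels along the time axis before each 2D convolution, which lets otherwise frame-wise 2D CNNs exchange information between neighbouring frames at no extra computation. The sketch below illustrates that published temporal-shift idea under our own simplifying assumptions (an (N·T, C, H, W) feature layout and a fixed 1/8 shift ratio); it is not the authors' code.

```python
import torch

def temporal_shift(x: torch.Tensor, n_segment: int, fold_div: int = 8) -> torch.Tensor:
    """Shift part of the channels along the temporal dimension (TSM-style).

    x: (N * T, C, H, W) features stacked over T frames per video.
    """
    nt, c, h, w = x.shape
    n = nt // n_segment
    x = x.view(n, n_segment, c, h, w)

    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift 1/8 of channels backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift another 1/8 forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels stay in place
    return out.view(nt, c, h, w)
```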
Table 2: Comparison between the presented method and three baseline 2D CNN methods (the recognition accuracies for UCF101 and HMDB51 were obtained with 16 video segments per video; those for the other two datasets were obtained with 8 segments).