Violence behavior recognition using a two-cascade temporal shift module with an attention mechanism

Abstract. Violence behavior recognition is an important research topic within behavior recognition and has broad application prospects in network content review and intelligent security. Inspired by the long short-term memory network, we conjecture that the temporal shift module (TSM) has room for improvement in its ability to extract long-term temporal information. To verify this conjecture, we conducted an exploration based on the TSM. After many attempts, we propose connecting two TSMs in a cascaded manner, which expands the receptive field of the model. In addition, an efficient channel attention module is introduced at the front end of the network, which strengthens the model's spatial feature extraction capability. Because behavior recognition is prone to over-fitting, we also extended and processed several open-source datasets to form a larger violence dataset, which alleviates the over-fitting problem. The final experimental results show that the proposed algorithm improves the model's ability to extract features of violent behavior in the spatial and temporal dimensions and realizes the recognition of violent behavior, which verifies the above conjecture.


Introduction
With the rapid popularization of mobile terminals, massive amounts of video data are uploaded to the Internet all the time, and some of these videos involve violent scenes, which adversely affect the health of the network environment. To maintain social safety and stability, functional departments such as police agencies and security companies have broad application requirements for intelligent video recognition systems in on-duty security. Intelligent recognition of violent scenes can promptly report emergency security incidents to duty personnel in the rear, facilitating timely handling of incidents. Therefore, violent behavior recognition plays an important role in maintaining the safety and health of society and cyberspace. 1 In terms of the recognition process, behavior recognition mainly includes three steps: video preprocessing, feature extraction, and behavior classification. 2 In terms of the feature extraction method, behavior recognition can be divided into traditional behavior recognition 3,4 and behavior recognition based on deep learning. 5-8 Traditional behavior recognition methods mainly extract features manually, and these features are mainly global or local. Global feature extraction mainly includes two approaches: silhouettes and human joint points. For example, Bobick and Davis 9 established a motion energy map to classify behaviors based on background subtraction. Yang 10 built a three-dimensional contour of the human body for feature extraction by determining the coordinates of the joint points. Local feature extraction mainly includes two approaches: spatiotemporal interest point sampling and trajectory tracking, for example, the dense trajectories and improved dense trajectories algorithms proposed by Wang et al. 11 and Wang and Schmid. 12
According to the feature extraction model, current deep-learning-based behavior recognition methods can be divided into three categories: two-stream CNN models, temporal models, and spatiotemporal models. The two-stream CNN model mainly extracts spatiotemporal information through two parallel channels and fuses the channels appropriately to achieve behavior classification. For example, Simonyan and Zisserman 13 first proposed the two-stream approach for behavior recognition. Wang et al. 14 adopted the temporal segment network to realize the recognition of long-term motion. Inspired by the two-stream CNN model, Feichtenhofer et al. 15 designed SlowFast, a lightweight two-stream network that reduces model complexity.
Temporal models mainly rely on recurrent neural networks and their variants to extract temporal information and on convolutional neural networks to extract spatial information. For example, Donahue et al. 16 introduced ConvLSTM 17 to replace the traditional long short-term memory (LSTM) network and achieve the fusion of spatiotemporal information. Li et al. 18 merged ConvLSTM with attention LSTM and constructed a new network structure, VideoLSTM. Spatiotemporal models mainly use 3D convolution to extract the spatial and temporal information of behaviors simultaneously. In recent years, some scholars have adopted appropriate video preprocessing so that spatiotemporal models can also achieve behavior classification with simple 2D convolution. Ji et al. 19 first applied 3D convolution to video behavior analysis and realized the extraction of spatial and temporal features from video. Tran et al. 20 built on 3D convolution and proposed convolutional 3D (C3D). C3D learns the spatiotemporal characteristics of video through large-scale video dataset training, which improves the generalization ability of related algorithms. Subsequent 3D models were implicitly pretrained on ImageNet, and 3D convolutional pretrained models were obtained on Kinetics. Lin et al. 21 proposed the temporal shift module (TSM), which shifts and splices adjacent frames in the temporal dimension so that 2D convolution can extract spatial and temporal information at the same time; this achieves the effect of 3D convolution while avoiding its problems in parameters and calculations.
However, the long-term information acquired by the TSM network during behavior recognition is limited, its network structure is simple, and over-fitting is prone to occur during feature learning. To solve these problems and further improve the accuracy of behavior recognition, this paper improves on the TSM network and conducts an experimental exploration. The main contributions of this paper are as follows.
(1) A simple two-cascade TSM is proposed, which expands the receptive field in the temporal dimension and enhances the extraction of long-term information.
(2) The efficient channel attention (ECA) module is introduced at the front end of the TSM network, which improves the network's extraction of spatial information to a certain extent and reduces the impact of over-fitting on network performance.

Temporal Shift Module
Behavior recognition mainly obtains the spatial and temporal information contained in the data during feature extraction. Traditional 3D convolution uses a 3D convolution kernel to perform convolution operations over multiple adjacent frames at the same time, which can extract the spatiotemporal feature information in the video but inevitably increases computation. The TSM uses a simple data preprocessing method to convert the temporal information that is invisible in a single frame into extractable spatial feature information.
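The channel shift that the TSM performs can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the shift fraction `shift_div` and the zero-filling of vacated slots follow the common residual-shift design of TSM, and the exact proportions used in the paper's model may differ.

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """Shift a fraction of channels along the temporal axis.

    x: array of shape (T, C, H, W) -- a clip of T stacked frames.
    The first C // shift_div channels are shifted toward the past,
    the next C // shift_div toward the future; vacated slots are
    zero-filled, and the remaining channels are left untouched.
    """
    t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # shift backward in time
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # shift forward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]              # unshifted channels
    return out
```

After the shift, each frame's tensor mixes channels from its temporal neighbors, so a plain 2D convolution applied per frame can see across time.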
As shown in Fig. 1(a), several adjacent frames are stacked to form the original tensor, and the same color in the figure represents the same frame. Figure 1(b) shows the TSM. The TSM moves channels forward and backward in the temporal dimension to perform simple feature fusion between adjacent frames. This fusion makes an independent single frame contain certain temporal information, so simple 2D convolution can achieve spatiotemporal feature extraction. The effect of convolution can be achieved through a shift and multiply-accumulate operation, and 3D convolution can be reduced in dimensionality in this way. For an infinite one-dimensional vector $X$ and a convolution kernel $W = (w_1 \; w_2 \; w_3)$, the convolution operation is

$y_i = w_1 x_{i-1} + w_2 x_i + w_3 x_{i+1}$. (1)

The above equation can be decoupled into a shift operation

$X^{-1}_i = x_{i-1}$, (2)

$X^{+1}_i = x_{i+1}$, (3)

and a multiply-accumulate operation

$Y = w_1 X^{-1} + w_2 X + w_3 X^{+1}$. (4)

Among them, $x_i$ represents an element of $X$, $y_i$ represents the result of the convolution, $X^{-1}$ and $X^{+1}$ represent the infinite one-dimensional vector shifted backward and forward by one unit, and $Y$ represents the sum of the multiply-accumulate results.
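The shift and multiply-accumulate decoupling described above can be verified numerically. The sketch below uses arbitrary illustrative kernel weights, with zero padding standing in for the boundary of the "infinite" vector:

```python
import numpy as np

# A 1-D convolution y_i = w1*x_{i-1} + w2*x_i + w3*x_{i+1}, decoupled into
# two steps: shift the vector by one unit in each direction, then
# multiply-accumulate the shifted copies.
w1, w2, w3 = 0.2, 0.5, 0.3
x = np.array([1.0, 4.0, 2.0, 7.0, 3.0, 5.0])

# Step 1: shift (zero padding at the boundary).
x_prev = np.concatenate(([0.0], x[:-1]))   # element i holds x_{i-1}
x_next = np.concatenate((x[1:], [0.0]))    # element i holds x_{i+1}

# Step 2: multiply-accumulate.
y = w1 * x_prev + w2 * x + w3 * x_next

# Reference: direct convolution. np.convolve flips its kernel, so the
# reversed kernel (w3, w2, w1) yields the correlation with (w1, w2, w3).
y_ref = np.convolve(x, [w3, w2, w1], mode="same")
assert np.allclose(y, y_ref)
```

The equality shows that the convolution itself needs no temporal kernel once the data have been shifted, which is exactly why the TSM can rely on 2D convolutions.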

Efficient Channel Attention Module
The structure of the TSM behavior recognition network is simple, and it is susceptible to interference from background information, causing serious over-fitting. To improve the network's extraction of spatial information, this paper introduces the ECA module. 22 As shown in Fig. 2, for the input tensor, global average pooling is first performed without reducing the dimensionality; local cross-channel interaction is then realized through one-dimensional convolution, and the result is activated by the nonlinear sigmoid function. The activation result is multiplied by the input tensor to give the final output. The ECA module realizes local cross-channel interaction through fast one-dimensional convolution with an adaptive kernel size, which avoids channel dimensionality reduction and reduces the interference of background information in feature extraction.

Intuition
The TSM realizes the effective integration of spatiotemporal information in a single frame by performing simple channel shift in the temporal dimension. The shift of temporal dimension is similar to the function of RNN to a certain extent, which can realize the transfer of "memory" at different moments (Fig. 3).
The unidirectional TSM can be expressed mathematically as

$Y^{(t)} = w_1 X^{(t-1)} + w_2 X^{(t)}$. (5)

The RNN can be expressed mathematically as

$h^{(t)} = f(u h^{(t-1)} + w x^{(t)} + b)$. (6)

Among them, $h^{(t)}$ is the state of the RNN at time $t$, $u$ and $w$ are the weights of the RNN nodes, and $x^{(t)}$ is the input at time $t$. Judging from the network structures and the mathematical formulas, there is a certain similarity between the TSM and the RNN, which is the source of inspiration for our follow-up work.
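The analogy between the unidirectional shift and the RNN update can be made concrete with a small numerical sketch. The weights are illustrative, and the RNN here uses tanh as its activation with zero bias; the point is only that both mix the current input with information from the previous step, the shift being the linear, single-step special case with no recurrence on its own output.

```python
import numpy as np

w1, w2 = 0.4, 0.6
x = np.array([1.0, 2.0, 0.5, 3.0])

# Unidirectional temporal shift: y_t = w1 * x_{t-1} + w2 * x_t
# (x_{-1} taken as 0 at the boundary).
x_prev = np.concatenate(([0.0], x[:-1]))
y_shift = w1 * x_prev + w2 * x

# RNN unrolling h_t = f(u * h_{t-1} + w * x_t + b), with f = tanh,
# u = w1, w = w2, b = 0 for comparison.
h = 0.0
y_rnn = []
for x_t in x:
    h = np.tanh(w1 * h + w2 * x_t)
    y_rnn.append(h)
```

Unlike the RNN, the shift carries information only one step, which is precisely the limitation the two-cascade design below addresses.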
An RNN cannot obtain long-term information when applied to behavior recognition, so some scholars have adopted its variant, LSTM, 23 to enhance the model's extraction of long-term information. Similarly, does the TSM have room for further improvement in its extraction of long-term information? This paper launches an experimental analysis.

Two-Cascade TSM Residual Module
To strengthen the network's extraction of long-term information, the simplest approach is to move more channels forward and backward in the temporal dimension of the TSM. Based on this idea, this paper attempted various modifications to the TSM, for example, introducing two temporal shifts in the channel dimension, changing the proportion of the tensor occupied by the two temporal shifts, and manually weighting the various shifts. However, a large number of experiments proved that these changes do not help improve the network's extraction of long-term information.
The above schemes unilaterally emphasize the channel shift in the temporal dimension and ignore overall feature fusion, so the shift of temporal information remains limited to local areas of the tensor, which destroys the integrity of the temporal and spatial information to a certain extent. Therefore, when strengthening the temporal shift, the global fusion of spatiotemporal information must also be considered. The TSM reshapes the data before and after the temporal shift; this design helps integrate the original data with the shifted data, which is conducive to the global fusion of temporal and spatial information. Therefore, on the basis of the TSM behavior recognition network, this paper uses a simple two-cascade TSM, which strengthens the model's extraction of temporal information to a certain extent and also realizes the effective integration of spatiotemporal information.
Similarly, suppose there are an infinite one-dimensional vector $X$ and a convolution kernel $W$ with a size of $1 \times 3$. Assume that the vector after one shift is $Z$:

$z_i = \alpha x_{i-1} + \beta x_i + \gamma x_{i+1}$, (7)

where $\alpha$, $\beta$, and $\gamma$ are weighting factors. Then, after two cascades, the convolution result $Y$ is

$y_i = w_1 z_{i-1} + w_2 z_i + w_3 z_{i+1}$ (8)

$= w_a x_{i-2} + w_b x_{i-1} + w_c x_i + w_d x_{i+1} + w_e x_{i+2}$, (9)

and

$w_a = \alpha w_1$, (10)

$w_b = \beta w_1 + \alpha w_2$, (11)

$w_c = \gamma w_1 + \beta w_2 + \alpha w_3$, (12)

$w_d = \gamma w_2 + \beta w_3$, (13)

$w_e = \gamma w_3$. (14)

Then, through inverse decoupling, the following conclusion can be drawn:

$Y = X * W'$, with $W' = (w_a \; w_b \; w_c \; w_d \; w_e)$. (15)

That is, the cascade realizes a convolution between the infinite one-dimensional vector $X$ and the new convolution kernel $W'$. In other words, without changing the original convolution kernel, a $1 \times 3$ kernel can achieve the effect of a $1 \times 5$ convolution through two simple cascades.
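The cascade algebra can be checked numerically. The sketch below uses arbitrary illustrative values for α, β, γ and the 1 × 3 kernel, and verifies that a weighted shift followed by a 1 × 3 convolution matches a single 1 × 5 convolution with the derived kernel W′. Only interior elements are compared, because the finite test vector is zero-padded twice at its boundaries, unlike the infinite vector in the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)

alpha, beta, gamma = 0.25, 0.5, 0.25     # illustrative shift weights
w1, w2, w3 = 0.3, 0.4, 0.3               # illustrative 1x3 kernel

def conv3(v, k):
    """Correlate v with a length-3 kernel (zero-padded at the ends)."""
    vp = np.concatenate(([0.0], v, [0.0]))
    return k[0] * vp[:-2] + k[1] * vp[1:-1] + k[2] * vp[2:]

# Two cascades: mix once with (alpha, beta, gamma), then convolve with W.
z = conv3(x, (alpha, beta, gamma))
y_cascade = conv3(z, (w1, w2, w3))

# Equivalent single 1x5 kernel W' = (w_a, w_b, w_c, w_d, w_e).
wa = alpha * w1
wb = beta * w1 + alpha * w2
wc = gamma * w1 + beta * w2 + alpha * w3
wd = gamma * w2 + beta * w3
we = gamma * w3
xp = np.concatenate(([0.0, 0.0], x, [0.0, 0.0]))
y_direct = (wa * xp[:-4] + wb * xp[1:-3] + wc * xp[2:-2]
            + wd * xp[3:-1] + we * xp[4:])
```

The interior agreement of `y_cascade` and `y_direct` is exactly the receptive-field expansion claimed above: two cascaded length-3 operations behave as one length-5 convolution.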
As shown in Fig. 4(a), based on the residual module, this paper adds two cascaded TSMs before the convolutional layer, forming a two-cascade TSM residual module. This expands the receptive field in the temporal dimension without changing the size of the convolution kernel. The experimental results show that the cascaded TSMs shift the temporal information independently, which improves feature fusion in the temporal dimension and strengthens the model's extraction of long-term information. At the same time, the cascaded modules reshape the shifted tensors and integrate spatiotemporal information before the second shift, avoiding a one-sided and fragmented temporal shift.
As shown in Figs. 4(b) and 4(c), this paper also tried further changes on the basis of the two-cascade TSM, such as introducing a shortcut between the two TSMs and expanding the cascade to three stages. However, as shown in Fig. 5, experimental results on the RWF-2000 dataset, obtained by using the different residual modules as the basic unit of a ResNet50 24 network for violent behavior recognition, show that modifications beyond the two-cascade TSM residual module do not further improve the model's feature extraction ability. Therefore, this paper chooses the simple two-cascade TSM residual module as the basic unit of the ResNet50 behavior recognition network.

ECA-Two-Cascade TSM Network
The TSM network introduces the TSM into the residual module of ResNet50 and realizes the fusion of spatiotemporal information through a simple data shift, so behavior recognition can be realized with a 2D convolutional neural network. This paper likewise uses the two-cascade TSM as the basic unit to construct a two-cascade TSM behavior recognition network on the basis of ResNet50. The specific structure is shown in Table 1; a stage that uses the two-cascade TSM is recorded as 1, and otherwise as 0.
In the model construction, this paper attempted to introduce the ECA module directly into the residual module of ResNet50 to form an ECANet, but the results show that this greatly increases the number of model parameters without improving recognition accuracy.
As shown in Fig. 6, for the input video frame $F_i$ of the i'th frame, the attention module first extracts the key information from the data to complete the preprocessing, which reduces the interference caused by background information to a certain extent. Then, a 2D CNN (ResNet50) composed of two-cascade TSM residual modules performs feature extraction and classification on the video frames that incorporate temporal and spatial information.

Dataset
In order to fully test the performance of the algorithm and verify the proposed conjecture, this paper has conducted experiments on three open-source violent behavior recognition datasets and an expanded new dataset.
The crowd violence 25 dataset contains 246 video clips with durations from 1.04 to 6.52 s and an average duration of 3.6 s. This dataset mainly depicts scenes of crowd violence, but the scenes are relatively blurry. The hockey dataset contains 1000 violent and non-violent videos collected from ice hockey games; the training set includes 800 video clips, and the validation set includes 200 video clips. The main content is the violent actions in ice hockey games, and each video lasts 2 s and contains 41 frames. Because the hockey dataset has few videos, a single scene type, and limited application value, it is difficult to meet the needs of deep neural network learning, so this paper introduces the recent RWF-2000 26 dataset. This dataset contains 2000 surveillance video clips collected from YouTube; the training set includes 1600 video clips, and the validation set includes 400 video clips. Each video clip lasts 5 s and contains 150 frames. It mainly includes violent behaviors between two persons, among multiple persons, and in crowds. The scenes are rich, recognition is difficult, and the clips were all captured by security cameras without multimedia modification, which fits actual scenes and gives the dataset high research value. However, during the experiments, we found that the TSM network shows serious over-fitting on the RWF-2000 dataset, so this paper expands the dataset on the basis of previous work. Based on the open-source violence recognition dataset UCF-Crime, we collect the hockey, movies, violent-flow, and HMDB51 datasets, among others, as the main sources of violent scenes, and the UCF101 and HMDB51 datasets as the main sources of non-violent scenes in the expanded dataset.
The collected videos are edited with Adobe Premiere Pro, video clips irrelevant to behavior recognition are removed, and the data are unified into video clips of 1 and 5 s in length. Finally, this paper constructs a violence recognition dataset containing 5000 video clips, which greatly increases the number of samples and offers richer scenes than RWF-2000, thereby alleviating the over-fitting problem. Figure 7 shows the basic composition of the dataset. This paper selects 178 video clips from the crowd violence dataset as the training set and the remaining 98 clips as the validation set. From the hockey dataset, 200 videos are randomly selected as the validation set, and each video is extracted into 41 consecutive images for the experiments. From RWF-2000, 400 videos are randomly selected as the validation set, and the rest form the training set; one frame in every two is sampled to form a 75-frame continuous image sequence, reducing the amount of data while keeping the temporal information as complete as possible. For the expanded dataset, the video duration is mainly 1 or 5 s; 1000 of the 5000 video clips are randomly selected as the validation set, and all videos are converted into image sequences. After all the datasets are processed into continuous image sequences, the total size of the crowd violence dataset is 533 MB, the hockey dataset is 219 MB, the RWF-2000 dataset is 10.7 GB, and the expanded dataset is 25.7 GB. Before loading the data into the model, we apply random data preprocessing, such as cropping, scaling, and rotation, to realize data augmentation.
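As a hedged sketch of the clip-level random preprocessing mentioned above, the following applies a shared random crop consistently across all frames of a clip, plus a random horizontal flip (a common additional augmentation, not explicitly listed in the text). Scaling and rotation are omitted for brevity, and the crop size is an illustrative value.

```python
import numpy as np

def random_augment(frames, rng, crop=200):
    """Apply one random crop (and possibly a flip) to a whole clip.

    frames: array of shape (T, H, W, C), the clip's image sequence.
    rng:    a numpy Generator, so the same random choice is shared by
            every frame in the clip (per-frame randomness would break
            temporal consistency).
    """
    t, h, w, c = frames.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    out = frames[:, top:top + crop, left:left + crop, :]
    if rng.random() < 0.5:            # horizontal flip, same for all frames
        out = out[:, :, ::-1, :]
    return out
```

The key design point is that all randomness is drawn once per clip, so the transformation is identical for every frame and the temporal information remains intact.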

Parameter Configuration
The deep learning framework used throughout training and testing is PyTorch 1.5, the operating system is Ubuntu 16.04, and the CPU is an Intel i9-10920X. CUDA 10.2 is used for GPU acceleration, with two NVIDIA RTX 2080 Super GPUs (8 GB of video memory each) for parallel computing. SGD is used to optimize the algorithm, and the TSM model pretrained on Kinetics is used to reduce the risk of over-fitting and the computational cost of network training. In the comparative experiments, the experimental environment and datasets are set as introduced in this paper, and other basic configurations, such as the learning rate, optimization method, and pretrained model, follow the instructions of the respective open-source projects.
The learning rate schedule of the original TSM algorithm trains for 100 epochs with an initial learning rate of 0.01, decaying the learning rate to 10% of its value at epochs 20 and 40. When reproducing the original experiment on the RWF-2000 dataset, we found that the training loss decreases from the beginning of training until it stabilizes, whereas the validation loss drops rapidly during the first 20 epochs and then keeps increasing, which indicates that over-fitting occurred. In response, this paper designs a new learning rate schedule: the initial learning rate is 0.01, and the learning rate is decayed to 90% of its value every two epochs. To a certain extent, this accelerates both the adjustment of the learning rate and the rate of model learning. As shown in Fig. 8, the adjusted validation loss curve no longer shows a significant increase after 20 epochs, and the loss value is lower than with the traditional schedule, indicating that the over-fitting problem has been alleviated.

Results
After 100 epochs of training and validation, Fig. 9 shows the accuracy curves of the experiments. The blue, green, and red curves are the validation accuracy curves of the TSM, two-cascade TSM, and ECA-two-cascade TSM algorithms on each dataset, respectively.
As can be seen from Fig. 9, the accuracy of the two algorithms proposed in this paper is slightly higher than that of the traditional TSM algorithm, and the accuracy curves are stable with small fluctuations, which shows that the algorithms achieve effective feature extraction. Figure 9(a) shows the accuracy curves on the crowd violence dataset, where the improved algorithms achieve higher accuracy. Figure 9(b) shows the accuracy curves on the hockey dataset, where the improved algorithms are clearly more accurate than the traditional algorithm and their curves are more stable. Figure 9(c) shows the accuracy curves on the RWF-2000 dataset; the accuracy of the improved algorithms is slightly higher than that of the traditional algorithm, but the accuracy decreases markedly after 20 epochs, which indicates some over-fitting.
To solve the over-fitting problem and further verify the performance of the algorithm, experiments are carried out on a larger dataset, as shown in Fig. 9(d). The results show that the accuracy of the three algorithms on the larger dataset improves rapidly and the accuracy curves are stable, which proves that the larger dataset does solve the over-fitting problem and further verifies the performance of the algorithm. Table 2 shows the specific violence recognition results of the different algorithms.
As can be seen from Fig. 10, the algorithms proposed in this paper improve considerably on the traditional algorithm. On the crowd violence dataset, the two-cascade TSM is 0.989% higher than the TSM, and the ECA-two-cascade TSM is 2.009% higher. On the hockey dataset, the two-cascade TSM is 0.55% higher than the TSM, and the ECA-two-cascade TSM is 1.495% higher. On the RWF-2000 dataset, the two-cascade TSM is 0.997% higher than the TSM, and the ECA-two-cascade TSM is 1.247% higher. On the expanded dataset, the two-cascade TSM is 0.2% higher than the TSM, and the ECA-two-cascade TSM is 0.4% higher.
The above results show that cascading two TSM modules expands the model's receptive field in the temporal dimension, which also proves that there is room for improvement in the TSM module's extraction of long-term information. At the same time, the ECA module suppresses the interference of background information, and together these changes improve the performance of violence recognition.

Discussion
To recognize violent behavior in videos, this paper improves on the TSM behavior recognition network. Inspired by LSTM, and in order to strengthen the TSM module's extraction of long-term information, this paper proposes a two-cascade TSM behavior recognition network, which expands the model's receptive field in the temporal dimension. To suppress the interference of background information, an ECA module is inserted at the front end of the network to enhance the model's sensitivity to spatial information. At the same time, to solve the over-fitting problem observed on some datasets, this paper expands and processes existing datasets. Verification experiments on multiple datasets show that the proposed algorithm achieves higher accuracy than the traditional algorithms. This means that the proposed algorithm improves the network's understanding of temporal and spatial characteristics, alleviates over-fitting in the experiments, and realizes effective recognition of violent behavior.