1. INTRODUCTION

Attention is a generally accepted indicator of students’ classroom efficiency and learning gains [1]. To improve the quality of classroom teaching and promote students’ overall development, it is necessary to move beyond the single assessment method of traditional classroom teaching and pay attention to students’ classroom concentration. However, traditional teaching suffers from problems such as the inability to attend to all students, single evaluation criteria, and delayed feedback. How to identify and quantify students’ learning status is therefore an urgent problem. Montagnini et al. [2] used students’ grades as a basis for analyzing the efficiency of students’ learning and the effort they put into it. Chen et al. [3] concluded through observation that female students sustained classroom attention longer than male students. However, these methods have disadvantages such as high subjectivity, high labor cost, and an inability to be replicated at scale. Regarding the link between physiological signals and classroom concentration, Zhang et al. [4] used a wearable device to read a series of physiological signals, including students’ visual focus, and analyzed their classroom attention. Other researchers have analyzed changes in students’ classroom attention, their patterns, and related influences through eye tracking [5]. However, such methods rely on high-cost data collection equipment and are based on a single criterion. In this paper, we combine computer vision technology with the field of education and propose a multimodal approach to determine students’ level of classroom concentration. The method collects real classroom data through a camera, recognizes and analyzes students’ head posture and facial expressions respectively, and then fuses the two behavioral modalities to determine students’ concentration level.
We established a database of students’ classroom concentration to quantify, analyze, and visually present classroom concentration. On the one hand, the automatic collection of classroom data saves human and material resources. On the other hand, we integrated multimodal information, explored the connection between classroom performance and classroom attention, and verified the validity of the findings on the self-built student classroom concentration database.

2. MATERIALS AND METHODS

2.1 Database

We built a student classroom concentration database that captures the meaning behind students’ behavior in their real learning state. The source data were obtained from a real classroom at a domestic university and from an online classroom during the epidemic. For the offline classroom, the recording period was three months, with 48 sessions of 45 minutes each. Two cameras (Sony alpha 7 III) were used for source data collection, one at the instructor’s podium and the other in the front-left corner of the classroom. The camera resolution was 1440 × 1080 and the frame rate was 25 fps. For the online classroom, 40 college student volunteers (16 female, 24 male) were recruited. The volunteers used the webcams on their computers to make video recordings, each lasting no less than 10 minutes. After data collection and screening, 11,782 valid images that were clear and contained little noise were selected for manual concentration labeling. Following a Likert scale, we divided students’ classroom concentration into five levels, each corresponding to one point on a five-point scale. Each image was labeled multiple times; the mode of the labels was taken as the concentration level and their mean as the concentration score. A total of thirty annotators performed the labeling. When the labels diverged too widely, the decision was referred to an experienced teacher.
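The labeling rule described above, taking the mode of the annotators’ labels as the level and their mean as the score, can be sketched as follows; the function name is illustrative, not from the original system.

```python
from collections import Counter

def aggregate_labels(labels):
    """Aggregate several annotators' 1-5 Likert labels for one image:
    the mode (most common label) gives the concentration level,
    the mean gives the concentration score."""
    level = Counter(labels).most_common(1)[0][0]
    score = sum(labels) / len(labels)
    return level, score
```

For example, labels of 4, 4, and 5 yield level 4 with a score of about 4.33; ties in the mode would correspond to the divergent cases the paper refers to a teacher.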
2.2 Overview

Xu et al. [6] concluded that under severe head deflection, participants’ attention was significantly reduced. Meanwhile, research has shown that emotions play an important role in individual cognition [7]. Therefore, in this paper, head posture and facial expressions are used as evaluation indicators, and the two behavioral modalities are combined into a multimodal concentration recognition and analysis system. The system consists of a head posture module, a facial expression recognition module, and a concentration calculation module. It first captures classroom video through a camera and crops out the target student. The target video is analyzed frame by frame for head posture, and the student’s concentration is predicted by regression analysis. If the score is lower than a threshold, the student is judged as not focused. If the score is higher than the threshold, the system proceeds to the expression module and performs expression recognition and concentration judgment on the target face. Combining the two, the final concentration level of the student is output. The workflow is shown in Figure 1.

2.3 Euler angles for head posture

In this study, a model-based approach is used to estimate the user’s head pose from the geometric relationships and feature points of the object. Head rotation is represented by Euler angles: pitch, roll, and yaw. The basic idea is to rotate a standard head model until the 3D feature points, projected onto the 2D image plane, coincide with the feature points detected in the real image. To accommodate multi-scale faces in the classroom, SSH [8] is used as the face detection algorithm in this paper. The detected face feature maps are then passed as parameters to the facial feature point capture module. We use the ERT [9] algorithm to capture 68 facial feature points for determining the target motion.
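As a minimal illustration of the Euler-angle representation used here, the sketch below composes and decomposes a rotation matrix in the common Z-Y-X convention; in the full pipeline the rotation would instead be recovered by fitting the standard head model to the 68 detected landmarks (e.g., with a PnP solver), and the convention and function names are assumptions for illustration.

```python
import numpy as np

def euler_to_rotation(pitch, yaw, roll):
    """Compose R = Rz(roll) @ Ry(yaw) @ Rx(pitch); angles in radians."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])  # pitch about x
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])  # yaw about y
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])  # roll about z
    return Rz @ Ry @ Rx

def rotation_to_euler(R):
    """Recover (pitch, yaw, roll) from R, valid away from gimbal lock."""
    yaw = np.arcsin(np.clip(-R[2, 0], -1.0, 1.0))
    pitch = np.arctan2(R[2, 1], R[2, 2])
    roll = np.arctan2(R[1, 0], R[0, 0])
    return pitch, yaw, roll
```

Round-tripping a known pose through both functions recovers the original three angles, which is the property the pitch/roll/yaw readout relies on.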
ERT regresses the face shape step by step, from an initial estimate to the true shape, by building gradient-boosted decision trees (GBDT). This module involves three coordinate systems: the world coordinate system, the camera coordinate system, and the image coordinate system. The facial feature points of the real face lie in the image coordinate system, while the feature points of the standard face model lie in the world coordinate system. Once the rotation and translation vectors of the object are obtained, points in world coordinates can be converted to camera coordinates. Points in camera coordinates are then converted to the image coordinate system using DLT together with the intrinsic parameters of the camera, such as focal length and optical center. The basic principle is shown in Figure 2, and an example of a student’s estimated head posture is shown in Figure 3.

2.4 Facial expression recognition

Psychologists have classified human expressions into seven categories through cross-cultural studies: anger, fear, disgust, happiness, sadness, surprise, and neutral. The module first preprocesses the detected face images with size normalization, grayscale normalization, and image segmentation. Next, ResNet18 [10] is used to recognize and classify the seven expressions. A dropout [11] strategy is added before the fully connected layer to increase model robustness. We remove the multiple fully connected layers of the traditional ResNet18, classify directly after a single fully connected layer, and obtain the confidence levels of the seven emotions of the target object. An example of the resulting emotion output is shown in Figure 4.

2.5 Attention prediction

To avoid multicollinearity, this study uses Ridge regression [12] to model and predict students’ classroom attention, with the three head pose angles and the seven emotion confidences as input features. The multimodal information is fused at the score level.
Features are extracted from each modality of the student’s classroom behavior and regressed separately, and the scores obtained from the two modalities are then fused to produce the final result.

3. RESULTS AND DISCUSSION

3.1 Univariate correlation analysis

We define 10 factors as independent variables: the pitch, yaw, and roll angles, and the seven emotions (anger, fear, disgust, happiness, sadness, surprise, and neutral). We use the following formula to measure the linear correlation r between a single independent variable X and attention Y:

r = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / ((n − 1) S_X S_Y)

where n is the sample size, X̄ and Ȳ are the sample means, and S_X and S_Y are the sample standard deviations. The larger the absolute value of r, the stronger the correlation. After removing invalid data in which no face could be recognized, the correlation coefficient between each independent variable and attention is shown in Table 1.

Table 1. The correlation coefficient between each independent variable and attention.
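The measure above is the sample Pearson correlation coefficient; a direct sketch:

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation between one independent variable
    (e.g., pitch angle) and the labeled attention score. Equivalent to
    sum((x - xbar)(y - ybar)) / ((n - 1) * S_x * S_y), since the
    (n - 1) factors in the sample standard deviations cancel."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm @ ym) / np.sqrt((xm @ xm) * (ym @ ym)))
```

By construction r lies in [−1, 1], with the endpoints attained for exact linear relationships.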
The data show that the correlation between head posture and concentration is significantly higher than that of emotion. It can be inferred that head orientation better reflects students’ concentration in the classroom, while the relationship between emotion and concentration is very weak. The yaw angle has a very weak correlation with concentration, the roll angle a weak correlation, and the pitch angle a moderate correlation; the pitch angle is the most strongly correlated with concentration. It can be concluded that whether students bow or raise their heads is an important criterion for judging their attention. Among the seven emotions, the non-negative emotions of happiness and neutrality were significantly positively correlated with students’ concentration, while the negative emotions of anger, disgust, fear, sadness, and surprise were significantly negatively correlated with it. Therefore, improving students’ emotional experience of learning is an effective way to improve classroom attention.

3.2 Model performance comparison

To verify the accuracy and applicability of the regression equation, this study conducted experiments on the data of several volunteers selected from the self-built database. The results are shown in Table 2. The three model structures are a head-pose-only prediction model (HPR), a facial-expression-only prediction model (FER), and the multimodal prediction model (CR). The root mean square error (RMSE) is used as the evaluation index.

Table 2. Root mean square error of the three models for each volunteer.
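The evaluation metric used in this comparison can be computed directly (a straightforward sketch):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between labeled and predicted attention
    scores, the metric used to compare the single-modality models
    against the fused multimodal model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```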
As can be seen from the table, HPR generally outperforms FER, and CR performs best. From this it can also be inferred that head posture accounts for a larger share of the factors affecting students’ concentration. For volunteers 12 and 15 in particular, FER performance dropped sharply; tracing the original data, the confidence levels of these volunteers’ seven emotions were all similar, which caused the anomaly. The root mean square error varies greatly between participants, which indicates that students have individual habitual movements and postures, and that individual differences are pronounced [13]. In conclusion, the multimodal attention recognition model has better performance and robustness than either single modality.

4. CONCLUSION

This study constructs a database of students’ classroom concentration, proposes a simple and convenient attention identification method, and explores the relationship between students’ classroom performance and attention. The method extracts features from head pose and facial expression and combines the two modalities through ridge regression modeling. Verification and experiments are carried out on the self-built database, and the experimental results are analyzed and discussed. The method addresses the problem that teachers cannot attend to every student in traditional classrooms, helps students understand their own learning status, and reduces the labor cost of classroom data collection. In summary, it has theoretical and practical significance for promoting the development of education.

ACKNOWLEDGMENTS

This work was supported by the National College Student Innovation and Entrepreneurship Training Program of China (202110384258) and the Social Science Program of Fujian Province (FJ2020B062).

REFERENCES

[1] Li, X. and Yang, X.,
“Effects of learning styles and interest on concentration and achievement of students in mobile learning,” Journal of Educational Computing Research, 54(7), 922–945 (2016). https://doi.org/10.1177/0735633116639953
[2] Montagnini, A. and Castet, E., “Spatiotemporal dynamics of visual attention during saccade preparation: Independence and coupling between attention and movement planning,” Journal of Vision, 7(14), 8 (2007). https://doi.org/10.1167/7.14.8
[3] Chen, C. M. and Wang, J. Y., “Effects of online synchronous instruction with an attention monitoring and alarm mechanism on sustained attention and learning performance,” Interactive Learning Environments, 26(4), 427–443 (2018). https://doi.org/10.1080/10494820.2017.1341938
[4] Zhang, X., Wu, C. W. and Fournier-Viger, P., “Analyzing students’ attention in class using wearable devices,” IEEE 18th Inter. Symp. on a World of Wireless, Mobile and Multimedia Networks, 1–9 (2017).
[5] Rosengrant, D., Hearrington, D. and O’Brien, J., “Investigating student sustained attention in a guided inquiry lecture course using an eye tracker,” Educational Psychology Review, 33(1), 11–26 (2021). https://doi.org/10.1007/s10648-020-09540-2
[6] Xu, X. and Teng, X., “Classroom attention analysis based on multiple euler angles constraint and head pose estimation,” in Inter. Conf. on Multimedia Modeling, 329–340 (2020).
[7] Vanderlind, W. M., Millgram, Y. and Baskin-Sommers, A. R., “Understanding positive emotion deficits in depression: From emotion preferences to emotion regulation,” Clinical Psychology Review, 76, 101826 (2020). https://doi.org/10.1016/j.cpr.2020.101826
[8] Najibi, M., Samangouei, P. and Chellappa, R., “SSH: Single stage headless face detector,” in Proceedings of the IEEE Inter. Conf. on Computer Vision, 4875–4884 (2017).
[9] Kazemi, V. and Sullivan, J., “One millisecond face alignment with an ensemble of regression trees,” in Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition, 1867–1874 (2014).
[10] Hao, Z., Ren, F. and Kang, X., “Classification of steel strip surface defects based on optimized ResNet18,” in 2021 IEEE Inter. Conf. on Agents (ICA), 61–62 (2021).
[11] Poernomo, A. and Kang, D. K., “Biased dropout and crossmap dropout: Learning towards effective dropout regularization in convolutional neural network,” Neural Networks, 104, 60–67 (2018). https://doi.org/10.1016/j.neunet.2018.03.016
[12] Ndabashinze, B. and Üstündağ Şiray, G., “Comparing ordinary ridge and generalized ridge regression results obtained using genetic algorithms for ridge parameter selection,” Communications in Statistics-Simulation and Computation, 1–11 (2020).
[13] Raca, M. and Dillenbourg, P., “System for assessing classroom attention,” in Proc. of the Third Inter. Conf. on Learning Analytics and Knowledge, 265–269 (2013).