In human activity classification, detecting speaking activity supports further behavior analysis, such as studying student learning behavior in an active learning environment. This paper presents a method for classifying whether or not a person is speaking based on lip movement in a video sequence. Assuming that a person of interest is tracked within a room using multiple cameras, at least one camera can capture the face of the target person at every instant. Using this sequence of frames, the proposed method continuously decides whether the person is speaking. First, the head is segmented based on (1) the position of the top of the head and (2) the head's width together with the golden ratio between the head's height and width. Second, the face area is extracted using a skin detection technique. Third, the mouth area in each frame is segmented based on its geometric position on the face and on the fact that the mouth differs in color from the facial skin. Next, mouth opening is roughly detected based on the fact that the opening area has a darker gray level than the average gray level. Finally, only the frequency components between 1 Hz and 10 Hz of the detected feature signal are extracted and used to classify the speaking activity by comparison with a threshold. The proposed method was tested on three sets of videos. The results show that speaking classification and mouth detection achieved 93% and 94% accuracy, respectively.
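The final step described above (keeping only the 1 Hz to 10 Hz components of the feature signal and comparing with a threshold) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the feature signal is a per-frame mouth-opening measure, that the band-limited content is summarized as signal power, and that the threshold value is chosen by the user. The function names `bandpass_energy` and `is_speaking` are hypothetical.

```python
import cmath
import math


def bandpass_energy(signal, fps, low_hz=1.0, high_hz=10.0):
    """Power of the DFT components of `signal` that fall in [low_hz, high_hz].

    Assumption: `signal` holds one mouth-opening value per frame, sampled at
    `fps` frames per second. Summarizing the band as power is our choice.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC component
    energy = 0.0
    for k in range(1, n // 2 + 1):
        freq = k * fps / n  # frequency of DFT bin k in Hz
        if low_hz <= freq <= high_hz:
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(centered)
            )
            energy += abs(coeff) ** 2
    return energy / (n * n)


def is_speaking(signal, fps, threshold):
    """Classify a window as speaking when the in-band power exceeds `threshold`."""
    return bandpass_energy(signal, fps) > threshold


# Usage: two seconds of video at 30 frames per second.
fps = 30
talking = [math.sin(2 * math.pi * 4 * t / fps) for t in range(60)]    # 4 Hz lip motion
drifting = [math.sin(2 * math.pi * 0.5 * t / fps) for t in range(60)]  # 0.5 Hz drift
print(is_speaking(talking, fps, threshold=0.05))
print(is_speaking(drifting, fps, threshold=0.05))
```

Slow head or lighting drift below 1 Hz falls outside the band, so it does not contribute to the decision in this sketch.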
Computer vision computations require a large number of multiplications, which creates a bottleneck. According to the work of Zhenhong Liu, the multiplications in these algorithms do not always require the high precision provided by processors. As a result, computational redundancy can be reduced by approximating multiplications. Following this approach, this paper investigates two major algorithms, the convolutional neural network (CNN) and the scale-invariant feature transform (SIFT), to determine their tolerance to the errors introduced by multiplication approximation. Multiplication approximation is modeled by injecting a random value into each precise multiplication result. The INRIA and OXFORD datasets were used for the SIFT analysis, while the CIFAR-10 and MNIST datasets were used for the CNN experiments. The results show that SIFT can withstand only a small percentage of multiplication approximation, while CNN can tolerate over 30% of multiplication approximation.
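The error-injection idea can be illustrated with a small sketch. It assumes, as one possible reading of the abstract, that the injected random value is a bounded relative error applied to each exact product; the abstract does not specify the error distribution. The names `approx_mul` and `approx_dot` are hypothetical.

```python
import random


def approx_mul(a, b, max_rel_error, rng):
    """Return a * b perturbed by a random relative error.

    Assumption: the error is uniform in [-max_rel_error, +max_rel_error];
    the source does not state how the random value is drawn.
    """
    exact = a * b
    return exact * (1.0 + rng.uniform(-max_rel_error, max_rel_error))


def approx_dot(xs, ws, max_rel_error, rng):
    """Dot product in which every multiplication is approximated."""
    return sum(approx_mul(x, w, max_rel_error, rng) for x, w in zip(xs, ws))


# Usage: compare an exact dot product with a 30% approximated one.
rng = random.Random(0)
xs = [0.5, -1.0, 2.0, 0.25]
ws = [1.5, 0.5, -0.75, 4.0]
exact = sum(x * w for x, w in zip(xs, ws))
noisy = approx_dot(xs, ws, max_rel_error=0.30, rng=rng)
print(exact, noisy)
```

A dot product is used here because it is the core operation of a convolution layer; sweeping `max_rel_error` and measuring task accuracy would mimic the kind of tolerance study the abstract describes.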
Object tracking based on image processing algorithms is used in various applications. In many object tracking methods, feature extraction is the key step for object identification, and image pre-processing, which transforms an RGB image into a binary image, plays an important role. Conventional pre-processing is applied to the whole image frame. In this paper, we propose cropping the regions of the tracked objects using their current tracking positions and then feeding each cropped region to the pre-processing stage. With this approach, interference from regions of no interest is eliminated, resulting in an improved pre-processed image. However, N pre-processing operations must be performed, where N is the number of tracked objects in the frame. This cost is alleviated in an FPGA implementation, which is our target platform. The proposed approach is evaluated by comparing its results with those of the conventional pre-processing method using the same tracking system.
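The difference between whole-frame and per-region pre-processing can be sketched in software as follows. This is only an illustration under stated assumptions: the abstract does not say how binarization is done, so a mean-value threshold is assumed, images are plain nested lists of RGB tuples, and the box format `(top, left, height, width)` and all function names are hypothetical.

```python
def rgb_to_gray(image):
    """Convert rows of (r, g, b) tuples to luminance values."""
    return [[(299 * r + 587 * g + 114 * b) / 1000 for r, g, b in row] for row in image]


def crop(gray, box):
    """Cut out the region given by box = (top, left, height, width)."""
    top, left, height, width = box
    return [row[left:left + width] for row in gray[top:top + height]]


def binarize(gray):
    """Threshold at the mean gray level (an assumed, simple binarization rule)."""
    pixels = [p for row in gray for p in row]
    threshold = sum(pixels) / len(pixels)
    return [[1 if p > threshold else 0 for p in row] for row in gray]


def preprocess_tracked(image, boxes):
    """Run one pre-processing pass per tracked object, N passes for N boxes."""
    gray = rgb_to_gray(image)
    return [binarize(crop(gray, box)) for box in boxes]


# Usage: a dim object on the left and a bright object on the right.
levels = [[10, 40, 10, 200, 250, 200],
          [10, 40, 10, 200, 250, 200]]
image = [[(v, v, v) for v in row] for row in levels]
boxes = [(0, 0, 2, 3), (0, 3, 2, 3)]
regions = preprocess_tracked(image, boxes)
whole_frame = binarize(rgb_to_gray(image))
print(regions)
print(whole_frame)
```

In this toy frame the whole-frame threshold is dominated by the bright right half, so the dim object disappears from the binary image, while the per-region passes recover both objects. The N passes are independent of each other, which is what makes them natural to run in parallel on hardware such as an FPGA.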
In an active learning environment, a student's activities are crucial to his or her learning achievement. However, it is almost impossible for teaching staff to keep track of students' activities. Hence, using technology for such a tedious but important job has become attractive or even necessary. Focusing on such an environment, this paper proposes a method for classifying whether a student is writing, reading, or working on other things such as doing experiments, based on sequential image frames from a single camera. For each frame, an area containing the student is cropped out using background subtraction and thresholding. Then, using a skin detection technique, the face and hands of the target student are detected. The face and hand areas from a sequence of n frames are combined into a Gait Energy Image (GEI), which is used as the feature image for classification and to which Principal Component Analysis (PCA) is applied. The sum of the PCA scores obtained when each row is treated as an observed sample is taken as one feature, while the PCA score obtained when each column is treated as an observed sample is taken as another feature. Using a support vector machine, the two features are first used to classify whether a student is “reading” or “not reading”. Then, a “not reading” sample is further classified as “writing” or “doing experiment”. Based on a sequence of simulated activities, the proposed method distinguishes “reading” from “not reading” with 93% accuracy and “writing” from “doing experiment” with 90% accuracy.
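Two parts of this pipeline lend themselves to a short sketch: combining n binary face-and-hand masks into a GEI, and the two-stage decision. The sketch below assumes the GEI is the pixel-wise mean of the masks, which is the usual definition, and it replaces the PCA features and the trained support vector machines with user-supplied decision functions. The names `gait_energy_image`, `classify_activity`, `is_reading`, and `is_writing` are hypothetical.

```python
def gait_energy_image(masks):
    """Pixel-wise mean of n binary masks, each a list of rows of 0/1 values."""
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [sum(mask[r][c] for mask in masks) / n for c in range(width)]
        for r in range(height)
    ]


def classify_activity(features, is_reading, is_writing):
    """Two-stage decision: reading vs. not reading, then writing vs. experiment.

    `is_reading` and `is_writing` stand in for the two trained classifiers;
    any callable that maps a feature pair to True or False will do.
    """
    if is_reading(features):
        return "reading"
    return "writing" if is_writing(features) else "doing experiment"


# Usage with two tiny 2x2 masks and toy decision rules.
masks = [
    [[1, 0], [0, 1]],
    [[1, 1], [0, 0]],
]
gei = gait_energy_image(masks)
print(gei)

toy_is_reading = lambda f: f[0] < 1.0   # placeholder rule, not a trained model
toy_is_writing = lambda f: f[1] < 1.0   # placeholder rule, not a trained model
print(classify_activity((0.5, 2.0), toy_is_reading, toy_is_writing))
```

Pixels that stay on in every frame approach 1 in the GEI, while pixels touched only occasionally take intermediate values, so the image encodes how much the face and hands move over the window.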