Semantic segmentation is a high-level computer vision task that associates each pixel of an image with a semantic (class) label. Fine semantic segmentation is a pixel-level task that provides the detailed information needed to identify the region of an object of interest. Hands are one of the main channels of communication, enhancing human-object and human-environment interaction; in egocentric videos they are ubiquitous and at the center of vision and activity, hence our interest in hand segmentation. Fine semantic segmentation of hands locates, identifies, and groups together the pixels associated with hands under a hand semantic label. We perform fine semantic segmentation of hands by improving the architecture of a state-of-the-art deep convolutional neural network, RefineNet. We achieve a finer and more accurate result by amending how high- and low-level features are obtained and combined, and how pixels are grouped for pixel-level classification. We perform this task on a public egocentric video dataset, EgoHands. We evaluate our model, RefineNet-Pix, using the existing pixel-level metric mean precision (mPrecision). Compared with the baseline reported in Urooj's work, we obtain accuracy higher than the 87.9% benchmark. Our finer and more accurate semantic segmentation maintains good performance under varied lighting conditions and complex backgrounds, making it suitable for both indoor and outdoor environments. Fine hand semantic segmentation can be applied to image analysis, medical systems (with a focus on understanding hand motion for prediction, diagnosis, and monitoring), hand gesture recognition (human-computer interaction and action understanding), and robotics (grasping and manipulating objects).
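The abstract does not spell out how mPrecision is computed; the sketch below assumes the standard pixel-level definition, in which precision is computed per class over all pixels (TP / (TP + FP)) and then averaged across the predicted classes. The function name and the toy masks are illustrative, not taken from the paper.

```python
import numpy as np

def mean_precision(pred, gt, num_classes):
    """Mean per-class pixel precision (mPrecision), assuming the standard
    definition: precision_c = TP_c / (TP_c + FP_c), averaged over the
    classes that actually appear in the prediction."""
    precisions = []
    for c in range(num_classes):
        predicted_c = (pred == c)           # pixels predicted as class c
        tp = np.logical_and(predicted_c, gt == c).sum()
        denom = predicted_c.sum()           # TP + FP for class c
        if denom > 0:
            precisions.append(tp / denom)
    return float(np.mean(precisions)) if precisions else 0.0

# Toy 2x2 masks: 0 = background, 1 = hand
gt   = np.array([[0, 1], [1, 1]])
pred = np.array([[0, 1], [0, 1]])
# class 0: 2 pixels predicted, 1 correct -> 0.5
# class 1: 2 pixels predicted, 2 correct -> 1.0
print(mean_precision(pred, gt, 2))  # -> 0.75
```

In practice the counts would be accumulated over every frame of the EgoHands test split before averaging, rather than per image.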