Object recognition has wide applications in the area of human-machine interaction and multimedia retrieval.
However, owing to visual polysemy and concept polymorphism, obtaining reliable recognition results from 2D images
remains a great challenge. Recently, with the emergence and easy availability of RGB-D sensors such as Kinect, this
challenge can be alleviated because the depth channel provides additional information. A special and important case
of object recognition is hand-held object recognition, since the hand is a direct and natural means of both
human-human and human-machine interaction. In this paper, we study the problem of 3D object recognition by combining
heterogeneous features of different modalities and extraction techniques. Hand-crafted features preserve low-level
information such as shape and color, but they are weaker at representing high-level semantic information than
automatically learned features, especially deep features. Deep features have shown great advantages in large-scale
recognition but are not always as robust to rotation or scale variation as hand-crafted features.
We therefore propose a method that combines hand-crafted point cloud features with deep features learned from the
RGB and depth channels. First, the hand-held object is segmented using depth cues and human skeleton information.
Second, the extracted heterogeneous 3D features are fused at different stages using linear concatenation and multiple
kernel learning (MKL). Finally, a trained model is used to recognize 3D hand-held objects. Experimental results
validate the effectiveness and generalization ability of the proposed method.
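The segmentation step is described only as using "depth cues and human skeleton information". A minimal sketch of one plausible reading, assuming the skeleton tracker supplies the hand joint's pixel position and depth: keep valid pixels that lie both near the hand in the image and within a depth band around the hand. The function name `segment_handheld_region` and the thresholds `depth_band` and `radius` are hypothetical, not taken from the paper.

```python
import numpy as np


def segment_handheld_region(depth, hand_uv, hand_depth, depth_band=150.0, radius=80):
    """Boolean mask of pixels likely belonging to the hand and the held object.

    depth:      HxW depth map in millimetres (0 = missing measurement).
    hand_uv:    (row, col) of the hand joint, assumed to come from a skeleton tracker.
    hand_depth: depth of the hand joint in millimetres.
    depth_band, radius: hypothetical thresholds, to be tuned per sensor and setup.
    """
    h, w = depth.shape
    rows, cols = np.mgrid[0:h, 0:w]
    r0, c0 = hand_uv
    # Spatial cue: stay inside a disc centred on the tracked hand joint.
    near_hand = (rows - r0) ** 2 + (cols - c0) ** 2 <= radius ** 2
    # Depth cue: keep measured pixels whose depth is close to the hand's depth.
    valid = depth > 0
    in_band = np.abs(depth - hand_depth) <= depth_band
    return near_hand & valid & in_band
```

In a real pipeline the mask would typically be cleaned with morphological operations and the hand pixels separated from the object; both steps are omitted here.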
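The two fusion stages can be illustrated as follows: early fusion normalizes each modality's feature vectors and concatenates them, while MKL-style fusion forms a combined kernel $K = \sum_m \beta_m K_m$ with $\beta_m \ge 0$ and $\sum_m \beta_m = 1$. This is a sketch under stated assumptions: the RBF kernels are an assumed choice, and the weights $\beta_m$ are set by a simple kernel-target alignment heuristic instead of by solving the actual MKL optimization problem the paper uses.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Row-wise L2 normalization so that no modality dominates by scale.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def concat_fusion(*feature_blocks):
    # Early fusion: normalize each modality, then concatenate along the feature axis.
    return np.hstack([l2_normalize(f) for f in feature_blocks])


def rbf_kernel(x, gamma):
    # Gaussian (RBF) kernel matrix on the rows of x (assumed kernel choice).
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-gamma * d2)


def alignment(k, y):
    # Kernel-target alignment with target T_ij = +1 (same class) or -1 (different class).
    y = np.asarray(y)
    t = np.where(y[:, None] == y[None, :], 1.0, -1.0)
    return np.sum(k * t) / (np.linalg.norm(k) * np.linalg.norm(t))


def combine_kernels(kernels, y):
    # K = sum_m beta_m K_m with beta on the simplex.
    # NOTE: beta comes from an alignment heuristic, not from solving the MKL problem.
    scores = np.array([max(alignment(k, y), 0.0) for k in kernels])
    if scores.sum() > 0:
        beta = scores / scores.sum()
    else:
        beta = np.full(len(kernels), 1.0 / len(kernels))
    return sum(b * k for b, k in zip(beta, kernels)), beta
```

The combined kernel can then be passed to any kernel classifier that accepts a precomputed Gram matrix, for example an SVM.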
Athlete identification is important for sports video content analysis, since users often care about video clips
featuring their preferred athletes. In this paper, we propose a method for athlete identification that combines
segmentation, tracking and recognition in a coarse-to-fine scheme for jersey number (the digits printed on a sports
shirt) detection.
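The recognition and decision stages of this scheme, a K-NN classifier that labels each candidate as a digit "0-9" or negative, followed by voting over the labels collected along a track, can be sketched as below. The sketch assumes feature vectors are already extracted; the label `NEGATIVE = -1`, the default `k=3` and the acceptance ratio `min_ratio=0.5` are hypothetical choices, not values from the paper.

```python
from collections import Counter

import numpy as np

NEGATIVE = -1  # hypothetical label for "not a jersey digit"


def knn_predict(train_x, train_y, query, k=3):
    # Euclidean distances from the query feature vector to all training samples.
    d = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(d)[:k]
    # Majority vote among the k nearest neighbours.
    votes = Counter(int(train_y[i]) for i in nearest)
    return votes.most_common(1)[0][0]


def vote_over_frames(frame_labels, min_ratio=0.5):
    # Combine labels predicted every few frames along one track.
    if not frame_labels:
        return None
    label, n = Counter(frame_labels).most_common(1)[0]
    if label == NEGATIVE or n / len(frame_labels) < min_ratio:
        return None  # rejected: not a confident jersey digit
    return label
```

Deferring the accept/reject decision to the vote lets occasional per-frame misclassifications (motion blur, shirt folds) be outvoted instead of producing false detections.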
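The shape features used in the recognition step are Zernike moments. A minimal sketch of the standard definition $Z_{nm} = \frac{n+1}{\pi}\sum_{x^2+y^2\le 1} f(x,y)\,R_{nm}(\rho)\,e^{-im\theta}\,\Delta A$, assuming a square candidate patch mapped onto the unit disc, is given below. Only the magnitude $|Z_{nm}|$ is rotation invariant; scale and translation invariance are assumed to come from normalizing the patch (e.g. cropping and resizing around the character) beforehand, which is not shown.

```python
from math import factorial

import numpy as np


def radial_poly(n, m, rho):
    # Zernike radial polynomial R_nm(rho); requires n - |m| even and |m| <= n.
    m = abs(m)
    r = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        c = ((-1) ** s * factorial(n - s)
             / (factorial(s) * factorial((n + m) // 2 - s) * factorial((n - m) // 2 - s)))
        r += c * rho ** (n - 2 * s)
    return r


def zernike_magnitude(img, n, m):
    # |Z_nm| of a square image whose pixel grid is mapped onto [-1, 1] x [-1, 1].
    assert img.shape[0] == img.shape[1] and abs(m) <= n and (n - abs(m)) % 2 == 0
    size = img.shape[0]
    coords = (np.arange(size) - (size - 1) / 2.0) / ((size - 1) / 2.0)
    x, y = np.meshgrid(coords, coords)
    rho = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0  # only pixels in the unit disc contribute
    basis = radial_poly(n, m, rho) * np.exp(-1j * m * theta)
    pixel_area = (2.0 / (size - 1)) ** 2
    z = (n + 1) / np.pi * np.sum(img[inside] * basis[inside]) * pixel_area
    return abs(z)
```

A feature vector for the K-NN classifier could then be formed by stacking `zernike_magnitude(patch, n, m)` over a set of valid orders $(n, m)$.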
First, image segmentation is employed to separate jersey number regions from their background, and the size and
pipe-like attributes of digit characters are used to filter out false candidates. Then, a K-NN (K-nearest neighbor)
classifier assigns each candidate to one of the digits "0-9" or to the negative class. In the recognition procedure,
we use Zernike moment features, which are invariant to rotation and scale, to describe digit shapes. Synthetic
training samples in different fonts are used to represent digit patterns under non-rigid deformation. Once a character
candidate is detected, an SSD (sum of squared differences)-based tracking procedure is started, and the recognition
procedure is performed every few frames during tracking. After tens of frames have been tracked, the per-frame
recognition results are combined by voting to determine whether the candidate is a true jersey number. Experiments on
several types of sports video show