Face recognition has been widely studied recently while video-based face recognition still remains a challenging task because of the low quality and large intra-class variation of video captured face images. In this paper, we focus on two scenarios of video-based face recognition: 1)Still-to-Video(S2V) face recognition, i.e., querying a still face image against a gallery of video sequences; 2)Video-to-Still(V2S) face recognition, in contrast to S2V scenario. A novel method was proposed in this paper to transfer still and video face images to an Euclidean space by a carefully designed convolutional neural network, then Euclidean metrics are used to measure the distance between still and video images. Identities of still and video images that group as pairs are used as supervision. In the training stage, a joint loss function that measures the Euclidean distance between the predicted features of training pairs and expanding vectors of still images is optimized to minimize the intra-class variation while the inter-class variation is guaranteed due to the large margin of still images. Transferred features are finally learned via the designed convolutional neural network. Experiments are performed on COX face dataset. Experimental results show that our method achieves reliable performance compared with other state-of-the-art methods.