Human embryonic stem cell classification: random network with autoencoded feature extractor

Abstract. Significance: Automated understanding of human embryonic stem cell (hESC) videos is essential for the quantified analysis and classification of various states of hESCs and their health for diverse applications in regenerative medicine. Aim: This paper aims to develop an ensemble model based on bagging of deep learning classifiers for hESC classification on a video dataset collected using a phase contrast microscope. Approach: The paper describes a deep-learning-based random network (RandNet) with an autoencoded feature extractor for the classification of hESCs into six classes: (1) cell clusters, (2) debris, (3) unattached cells, (4) attached cells, (5) dynamically blebbing cells, and (6) apoptotically blebbing cells. The approach uses unlabeled data to pre-train the autoencoder network and fine-tunes it using the available annotated data. Results: The proposed approach achieves a classification accuracy of 97.23 ± 0.94% and outperforms the state-of-the-art methods. Additionally, the approach has a very low training cost compared with other deep-learning-based approaches, and it can be used as a tool for annotating new videos, saving many hours of manual labor. Conclusions: RandNet is an efficient and effective method that uses a combination of subnetworks trained using both labeled and unlabeled data to classify hESC images.


Introduction
The classification of hESCs in video is essential for quantifiable analysis of hESC processes and behavior. 9 However, manual analysis of stem cells is laborious, tedious, and often inaccurate due to three main human limitations. First, human classification accuracy degrades over long working hours. Second, the wide variety of object appearances within a class introduces uncertainty into the classification. Third, the sheer volume of data to be examined can lead to confusion in assigning hESCs to the right classes. Figure 1 shows a modularized system overview for an automated segmentation and classification process. In this paper, we focus primarily on the classification of the detected components from hESC videos; the detected components belong to the six general classes shown in Fig. 1. Guan et al. 3 provide details of a method for the fast detection and segmentation of individual video components.
Because phase contrast imaging is a non-invasive microscopy technique, it is widely used to study the behavior of live hESCs in video. 10 In this study, the hESC videos were taken with a BioStation IM. 11 The BioStation IM has an incubator with time-lapse video capability, and each video captures an assay. It enables video capture of living cells under a stable and optimal environment. More details about the BioStation IM and the images can be found in Talbot et al. 7 The hESC videos consist of frames of phase contrast images. Each frame can contain any of the following six general components: (1) cell clusters, (2) debris, (3) unattached cells, (4) attached cells, (5) dynamically blebbing cells, and (6) apoptotically blebbing cells. Among these, unattached, attached, dynamically blebbing, and apoptotically blebbing cells are the four classes of significant interest in experimental work; they are regarded as the four intrinsic cell types in a video. Figure 2 shows examples of the six classes. Conceptually, the six classes of hESCs can be distinguished with three fundamental human perceptual cues for identifying and classifying objects: (1) shape, (2) intensity, and (3) texture. Each class can be uniquely identified by one or a combination of these cues. For instance, the apoptotically blebbing cells in Fig. 2(f) are similar in intensity, shape, and texture among themselves. The hESCs in Figs. 2(e) and 2(f) are dissimilar in intensity but similar in shape and texture. The debris in Fig. 2(b) has intensity values similar to those of several other classes shown in Fig. 2. Traditionally, a feature vector would be derived by hand from these perceptual cues. However, with the advent of deep learning techniques, we can learn classification models directly given an abundance of labeled data. Manually engineered feature vectors therefore remain appropriate only when data are quite limited.
Considering that unlabeled data are often far more abundant than labeled data, we propose a random network (RandNet) with an autoencoded feature extractor. The proposed method focuses on building random subnetworks with a feature extractor derived from unlabeled data. Moreover, the proposed method incorporates an ensemble methodology into the network to reduce overfitting.

Related Work
To develop a practical system with high classification accuracy, a modularized structure is often preferred over a deep learning approach that simultaneously performs detection and segmentation because modularized components allow for flexibility and adaptability, as shown in Fig. 3 and Refs. 12-14. We consider segmentation and classification to be two separate modularized components or subsystems. Additionally, direct classification from the input videos is extremely challenging because these are dynamic images evolving over time. In this paper, we focus on the classification component. There has been very limited work on building an automated classification system for stem cells in video with both labeled and unlabeled datasets. 8 Niioka et al. 15 used a convolutional neural network (CNN) to study cellular differentiation from myoblasts to myotubes. Their classification model was built upon the observation that cellular morphology changes during differentiation, and this feature was easily captured in stained fluorescent images. In addition, Xie et al. 16 worked on fluorescent images with a CNN for cell counting. Although their experiments were successful, their classification problem was simple since their images contained only circular dots. Chang et al. 17 also used a CNN for classifying human induced pluripotent stem cell regions. Their study focused on classifying cell cluster patterns. The datasets used in the works by Niioka et al., 15 Xie et al., 16 and Chang et al. 17 came from experiments that use staining techniques; staining is a very intrusive technique for enhancing the contrast of cells. However, our hESC experiments were done without staining.
Similar work on stem cell classification with phase contrast images was proposed by Theagarajan et al. 18,19 They suggested using a generative method to synthesize training data for the network and then classify real data. However, they did not consider realistic unlabeled data, which can be obtained efficiently for training; typical generative methods have a huge computational cost for synthetic dataset generation as well as for training with a large set of synthetic data. Therefore, this paper proposes using the unlabeled data (without the use of generative methods) for model training and fine-tuning the model with labeled data.

Contributions of this Paper
In this paper, we focus on the classification component. From Fig. 2, we can infer that there are four major challenges in hESC classification. First, when attached cells spread thinly on the substrate, they blend into the background. Second, dynamically blebbing cells and apoptotically blebbing cells are similar in intensity. Third, when a large attached cell goes through the apoptotic process, it appears as a cluster of apoptotically blebbing cells. Fourth, image data are obtained under both 10× and 20× objectives, which adds to the challenge of discerning individual blebbing cells from cell clusters. In light of the state of the art, the contributions of this paper are as follows.
• We introduce the concept of creating a modularized system to automatically segment and classify hESCs in video. This reduces the complexity of the problem since it is extremely challenging to classify hESCs directly from the video in a single step.
• We introduce the concept of building a feature extractor with unlabeled data and unsupervised learning. Hence, we do not require the huge amounts of labeled data that typical deep-learning-based approaches demand.
• We incorporate ensemble methodology into the proposed RandNet to handle the diversity of data generated during experiments that last from 48 to 100 h. We are not aware of any such work in biological image analysis.
• We provide experimental results and comprehensive comparison with state-of-the-art techniques.
Section 2 presents the materials and methods in detail. Section 3 provides experimental results, and Sec. 4 provides a discussion on the proposed and compared methods. Finally, Sec. 5 presents the conclusions of the paper.

Materials
All time-lapse videos were obtained with the phase contrast microscope in the BioStation IM. 7,11 The videos were acquired using either a 10× or 20× objective at 600 × 800 pixel resolution. A total of 27,603 unlabeled gray scale images and 3559 labeled gray scale images were obtained from six 10× videos and eight 20× videos. Both unlabeled and labeled images were obtained automatically by the method described in Guan et al. 3,20,21 The labeled dataset had the following number of gray scale images for each class: (1) 636 cell cluster images, (2) 773 debris images, (3) 519 unattached cell images, (4) 704 attached cell images, (5) 413 dynamically blebbing cell images, and (6) 514 apoptotically blebbing cell images. The ground truth for the datasets was generated manually by stem cell experts. We used 75% of the dataset for training and the remaining 25% for out-of-sample testing for each class. To generalize the classifier, fivefold cross validation was performed during model learning. Model learning was performed with the training data only.
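As a concrete illustration of this protocol, the following is a minimal sketch, assuming the labeled crops and class indices are already loaded into the hypothetical arrays `images` and `labels`; it performs a stratified 75/25 split and then fivefold cross validation over the training portion only.

```python
# Minimal sketch of the data protocol above, using scikit-learn.
# `images` and `labels` are assumed to be numpy arrays holding the
# 3559 labeled crops and their class indices (0-5); names are ours.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def split_dataset(images, labels, seed=0):
    # Stratified 75% training / 25% out-of-sample testing per class.
    return train_test_split(images, labels, test_size=0.25,
                            stratify=labels, random_state=seed)

def cv_folds(x_train, y_train, n_folds=5, seed=0):
    # Fivefold cross validation over the training data only, matching
    # the model-selection protocol described in the text.
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for tr_idx, va_idx in skf.split(x_train, y_train):
        yield (x_train[tr_idx], y_train[tr_idx],
               x_train[va_idx], y_train[va_idx])
```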

Methods
In this section, we first present the motivation for our proposed approach. This is followed by a method for automated cell region detection, which is the segmentation component. We then describe RandNet and elaborate on the autoencoded feature extractor as well as the pre-trained subnetworks for the classification component. The classification component is part of the modularized system as shown in Fig. 3. Pseudocode for building the RandNet model is also provided.

Motivation of the approach
Domain knowledge often comes from human perception, which is the most complex yet efficient cognitive system. Through hypotheses and visual inspection, we can sometimes identify useful features of hESCs for classification. However, domain knowledge is limited by the amount of information the brain can absorb. With tens of thousands of unlabeled and labeled data, experts can have a hard time either conceptualizing or generalizing the hidden information contained in the data. Deep learning techniques can help make sense of vast amounts of data and automate repetitious tasks performed by humans. Consider the task of studying apoptotic processes of cells with test chemicals in a toxicity experiment. Observing the dynamic changes in the texture and shape of apoptotic processes of a cell requires a significant amount of manual labor for annotating individual video frames. Currently, biologists spend hours of manual labor annotating these images, which is a very tedious and menial task. Our deep-learning-based approach learns from the vast amount of available data in an unsupervised manner and can automatically annotate these frames, thus significantly reducing the amount of time biologists spend annotating images and improving their efficiency. The proposed approach uses an unsupervised technique to build the foundation of the encoder network. The proposed method also uses both the unlabeled and labeled data to build a reliable classification system.

Segmentation component
Guan et al. 3 proposed a model-based method for automatically segmenting hESCs. This automated cell region detection is an essential algorithm in developing automated frame component decomposition in hESC phase contrast videos. They considered the foreground and background intensity distribution to be a mixture of two Gaussians. The objective of their algorithm is to find a threshold that maximizes a criterion derived from the intensity distributions of the foreground and background; the optimal segmentation is achieved at the highest criterion value. Since the segmentation method yields a binary image for each frame, we were able to extract a pool of individual components from each frame. Figure 4 shows the detected components of frames under 10× and 20× objectives. These detected components are then ready to be classified into one of the six aforementioned classes.
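The exact criterion is defined in Guan et al.; 3 the sketch below is only an illustrative stand-in that fits a two-component Gaussian mixture to the frame intensities and thresholds at the point where the two posterior probabilities cross.

```python
# Illustrative two-Gaussian thresholding in the spirit of the
# segmentation component; this GMM-crossing criterion is our
# assumption, not the exact criterion of Ref. 3.
import numpy as np
from sklearn.mixture import GaussianMixture

def gaussian_mixture_threshold(frame):
    # Fit a mixture of two Gaussians to the pixel intensities.
    pixels = frame.reshape(-1, 1).astype(np.float64)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(pixels)

    # Pick the intensity at which the two posterior probabilities
    # cross, i.e. where the foreground/background densities meet.
    candidates = np.linspace(pixels.min(), pixels.max(), 256).reshape(-1, 1)
    posteriors = gmm.predict_proba(candidates)
    threshold = candidates[np.argmin(np.abs(posteriors[:, 0] - 0.5)), 0]

    # Binarize; which side is foreground depends on the imaging, so
    # the polarity here is an assumption for illustration.
    return frame >= threshold
```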

Classification system overview
The proposed classification system is built with both labeled and unlabeled data, and it consists of many random pre-trained subnetworks. The proposed method utilizes unlabeled data to build the encoder component in the pre-trained subnetworks and labeled data to fine-tune the RandNet. The RandNet structure also incorporates ensemble methodology to constrain overfitting. Figure 5 shows a graphical depiction of how RandNet is built with pre-trained subnetworks and the ensemble concept.

Random network
RandNet utilizes the concept of bagging in deep learning by creating subnetworks. Bagging, or bootstrap aggregation, is a machine learning technique used to reduce variance and avoid overfitting. [22][23][24][25] RandNet, developed in this paper, is a method that contains many subnetworks that share a common pre-trained model and are fine-tuned with random samples. RandNet collects the results from each subnetwork and passes them to a stacking network in which the final decision is made. The details of the stacking network are shown in Fig. 6. The stacking network is designed to be simple and has only two main dense layers.
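A minimal sketch of this layout is given below, assuming Keras; the hidden width of the stacking network and the helper names are illustrative, and each subnetwork is assumed to output softmax scores over the six classes.

```python
# Sketch of the stacking stage of RandNet (Fig. 6), assuming Keras.
# Layer sizes are illustrative; only the two-dense-layer shape and the
# concatenation of subnetwork outputs follow the text.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 6

def build_stacking_network(num_subnetworks):
    # Two main dense layers over the concatenated subnetwork scores.
    inputs = keras.Input(shape=(num_subnetworks * NUM_CLASSES,))
    hidden = layers.Dense(128, activation="relu")(inputs)  # width assumed
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(hidden)
    return keras.Model(inputs, outputs)

def randnet_predict(subnetworks, stacker, x):
    # Collect every subnetwork's class scores and let the stacking
    # network make the final decision.
    scores = np.concatenate([net.predict(x) for net in subnetworks], axis=1)
    return stacker.predict(scores)
```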

Autoencoded feature extractor
The autoencoder network is an efficient unsupervised learning method that learns a representation of a set of data. The autoencoder network contains two major components: an encoder and a decoder. [26][27][28] In this paper, we used a structure similar to AlexNet as the basis of the encoder, and then we designed a decoder network from it. Although the VGG architecture 29 slightly outperforms AlexNet 30 as shown in Sec. 3.3, the difference is not significant, and since the AlexNet architecture requires fewer computational resources, we chose it for its simpler implementation. As shown in Fig. 5(a), the encoder generates a set of latent representations for the unlabeled data. The details of both the encoder and decoder structures are shown in Fig. 7. The autoencoder network used the Adadelta optimizer 31 and the pixel-wise binary cross-entropy loss function. Since the final layer in the autoencoder network was chosen to be a sigmoid activation layer, pixel-wise binary cross entropy is an applicable loss measure. The loss function is given as

$$\mathrm{Loss}_{\mathrm{AE}} = -\sum_{i=1}^{N_S} \sum_{r=1}^{N_R} \sum_{c=1}^{N_C} \left[ I^{(i)}(r,c) \log K^{(i)}(r,c) + \left(1 - I^{(i)}(r,c)\right) \log\left(1 - K^{(i)}(r,c)\right) \right], \quad (1)$$

where $\mathrm{Loss}_{\mathrm{AE}}$ is the total pixel-wise loss in the autoencoder network, $N_S$ is the total number of sample images in a batch, and $N_R$ and $N_C$ are the total numbers of rows and columns, respectively. $I^{(i)}(r,c)$ and $K^{(i)}(r,c)$ are the ground-truth and predicted label values, respectively, in the $r$'th row and $c$'th column of the $i$'th sample. Both $I^{(i)}(r,c)$ and $K^{(i)}(r,c) \in [0,1]$.
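A much-reduced sketch of this autoencoder is shown below, assuming Keras; the real encoder/decoder follow the AlexNet-like structure of Fig. 7, whereas the filter counts and depths here are placeholders.

```python
# Reduced sketch of the autoencoder; the actual AlexNet-like layer
# stack is in Fig. 7, and the filter counts below are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(input_shape=(224, 224, 1)):
    inputs = keras.Input(shape=input_shape)

    # Encoder: convolution + pooling stages yield the latent maps.
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    latent = layers.MaxPooling2D(2)(x)

    # Decoder: upsampling stages mirror the encoder back to image size.
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(latent)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
    x = layers.UpSampling2D(2)(x)
    # Sigmoid output keeps pixels in [0, 1], matching the pixel-wise
    # binary cross-entropy loss of Eq. (1).
    outputs = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, latent)
    autoencoder.compile(optimizer="adadelta", loss="binary_crossentropy")
    return autoencoder, encoder

# Pre-training on unlabeled crops: the image serves as its own target.
# autoencoder.fit(x_unlabeled, x_unlabeled, epochs=10, batch_size=128)
```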

Pre-trained subnetwork
The subnetwork used the encoder structure derived from the autoencoder network [in Step 2, Fig. 5(b)] as the basis for building a subclassifier. Each pre-trained subnetwork is fine-tuned with random samples and has a topper structure. The layers of the topper structure are shown in Fig. 8.
Since the encoder structure was unfrozen in each subnetwork, fine-tuning with random samples affects the weights in the encoder structure. Therefore, we were able to emulate bagging in the proposed method. For each subnetwork, we use categorical cross entropy as our loss function, which is given as

$$\mathrm{Loss}_{\mathrm{CCE}} = -\sum_{i=1}^{N_S} \sum_{j=1}^{M} y^{(i,j)} \log p^{(i,j)}, \quad (2)$$

where $\mathrm{Loss}_{\mathrm{CCE}}$ is the total categorical cross entropy in the pre-trained subnetwork, $N_S$ and $M$ are the total numbers of sample images in a batch and of classes, respectively, and $y^{(i,j)}$ and $p^{(i,j)}$ are the ground-truth and predicted values, respectively, for the $i$'th sample and $j$'th class, where $y^{(i,j)} \in \{0,1\}$ and $p^{(i,j)} \in [0,1]$. Table 1 shows the pseudocode for building the classifier model; a sketch of this procedure is given below.
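Since Table 1 is not reproduced here, the following is a hedged sketch of the loop it describes, assuming Keras: each subnetwork takes a fresh copy of the pre-trained encoder, adds an illustrative topper, keeps the encoder unfrozen, and is fine-tuned on a bootstrap sample. The factory `make_encoder` is hypothetical.

```python
# Hedged sketch of the RandNet-building loop outlined in Table 1.
# `make_encoder()` is a placeholder that must return a fresh copy of
# the pre-trained encoder; the topper layers are illustrative (Fig. 8),
# and `y_train` is assumed to be one-hot encoded.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 6

def build_subnetwork(encoder):
    # Topper: flatten the latent maps and classify into six classes.
    x = layers.Flatten()(encoder.output)
    x = layers.Dense(256, activation="relu")(x)  # width assumed
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = keras.Model(encoder.input, outputs)
    # The encoder stays trainable (unfrozen), so fine-tuning on a random
    # sample perturbs its weights and emulates bagging, per Eq. (2).
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

def build_randnet(make_encoder, x_train, y_train, num_subnetworks=33, seed=0):
    rng = np.random.default_rng(seed)
    subnetworks = []
    for _ in range(num_subnetworks):
        idx = rng.integers(0, len(x_train), size=len(x_train))  # bootstrap
        net = build_subnetwork(make_encoder())
        net.fit(x_train[idx], y_train[idx],
                epochs=25, batch_size=50, verbose=0)
        subnetworks.append(net)
    return subnetworks
```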

Parameters and Optimization
In our approach, all cropped images from the detection module were resized to 224 × 224 with bicubic interpolation, and the image intensities were normalized by dividing them by 255. No additional data augmentation was performed. For the autoencoder network, there are two fixed parameters, the number of epochs and the batch size, which are set to 10 and 128, respectively, and the default Adadelta optimizer is used. 31 Each subnetwork was trained independently, and its latent representation was used to train the topper network. For RandNet, there are five parameters: the number of epochs, the batch size, the number of subnetworks, the learning rate, and the decay rate. We used 25 epochs with early stopping, a batch size of 50, and a total of 33 subnetworks, with the default Adam optimizer 34 and a learning rate of 0.001. All parameters are fixed except the number of subnetworks, which was searched over the range 1 to 37 with a step size of 2. Figure 9 shows that the highest average validation accuracy and the lowest average validation loss are achieved with 33 subnetworks. It should also be noted that the processing speed of our approach using all 33 subnetworks during inference is 6.25 frames per second (FPS), compared with 4.16 FPS for the approach proposed by Theagarajan et al. 19 Using an ensemble of classifiers is similar to using dropout during training, but they are not the same. 35 Ensemble training focuses on training each network with a different subset of data, while dropout randomly reduces the feature space. Although both the ensemble method and dropout can help generalize the network, the former influences the model through the data while the latter manipulates the extracted features. The proposed method uses a simple subnetwork, and each subnetwork was trained independently; therefore, dropout was not used within the subnetworks. Most importantly, the data-driven model preserves all essential features for reconstructing the input image in a simple autoencoder network. Figure 10 compares reconstructed images with and without dropout. It can be seen that, with dropout, the reconstructed images are blurrier due to missing feature information.
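The preprocessing described above is simple enough to show directly; the sketch below uses OpenCV, which is our assumption since the paper does not name a resizing library.

```python
# Preprocessing as stated in the text: bicubic resize to 224 x 224 and
# division by 255. OpenCV is our choice; the paper does not name one.
import cv2
import numpy as np

def preprocess_crop(crop):
    # `crop` is a grayscale uint8 image from the detection module.
    resized = cv2.resize(crop, (224, 224), interpolation=cv2.INTER_CUBIC)
    return resized.astype(np.float32) / 255.0
```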

Performance Measures
For performance analysis and comparison, we used the confusion matrix for evaluation. 36 The average classification rate and individual true positive rate (TPR) are computed from the confusion matrix as

$$\mathrm{Accuracy} = \frac{\sum_{i=1}^{N_{\mathrm{class}}} CM_{ii}}{N}, \quad (3)$$

$$\mathrm{TPR}_j = \frac{CM_{jj}}{N_j}, \quad (4)$$

where $CM_{ii}$ is the $ii$'th element of the confusion matrix $CM \in \mathbb{R}^{N_{\mathrm{class}} \times N_{\mathrm{class}}}$, $N_{\mathrm{class}}$ is the total number of classes, $N$ is the total number of evaluated observations, $\mathrm{TPR}_j$ is the true positive rate/recall for the $j$'th class, $N_j$ is the total number of samples in the $j$'th class, and $CM_{ij}$ is the element of $CM$ in the $i$'th row and $j$'th column. There are three categories of accuracy for evaluating the performance of a model: (1) training accuracy, (2) validation accuracy, and (3) out-of-sample testing accuracy. Training and validation accuracies refer to cross validation accuracies for the training and validation sets, respectively. The out-of-sample testing accuracy is slightly different from the validation scheme. Once the best model parameters are learned from the model selection process, the final model is trained on the entire training dataset with the best parameters. This final model is then evaluated on the testing dataset, which produces the out-of-sample accuracy. Typically, the training and validation accuracies estimate the bias and variance of the final model, while the out-of-sample testing accuracy shows the true variance of the final model.
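For concreteness, the two measures can be computed from a confusion matrix as in the snippet below; it assumes columns index the true classes, so that the column sum gives $N_j$.

```python
# Eqs. (3) and (4) from a confusion matrix with numpy. We assume
# columns index true classes, so column sums give N_j; flip the axis
# if your convention is transposed.
import numpy as np

def overall_accuracy(cm):
    # Eq. (3): sum of diagonal elements over all N observations.
    return np.trace(cm) / cm.sum()

def per_class_tpr(cm):
    # Eq. (4): TPR_j = CM_jj / N_j for each class j.
    return np.diag(cm) / cm.sum(axis=0)
```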

Experimental Results
The proposed RandNet is compared with state-of-the-art methods in Table 2. The top two performers are the proposed RandNet and the fused CNN triplet. 19 The proposed RandNet achieves 97.23% mean accuracy in five-fold cross validation with a low standard deviation in its validation results. The reason that both RandNet and the fused CNN triplet outperformed the other methods is that additional data were used: both methods were trained with data beyond the given labeled data. RandNet used unlabeled data to pre-train its subnetworks and then fine-tuned them with the labeled data, whereas the fused CNN triplet 19 used both synthetic data and real labeled data in training. ResNets, 37 VGGs, 29 and AlexNet 30 were trained with only the labeled data. Consequently, they show higher variance in their performance. They also perform worst in out-of-sample testing, as shown in Table 3.

Discussions
When compared with ResNets, VGGs, and AlexNet, the proposed method outperformed these methods by at least 6%, as shown in Table 3. The performance of these other methods was close within their individual standard deviations. The proposed method has a significantly lower standard deviation than ResNets, VGGs, and AlexNet, and it also performed better in out-of-sample testing. Since the proposed method incorporates the concept of bagging and uses 33 random subnetworks, it has a low standard deviation. When compared with the fused CNN triplet, 19 RandNet outperformed it in both five-fold cross validation and out-of-sample testing. As shown in Table 2, RandNet was about 2% better than the fused CNN triplet in validation. In out-of-sample testing, the proposed method had a slight 0.45% lead over the fused CNN triplet, as shown in Table 3. The confusion matrix of the proposed method on the testing dataset is shown in Table 4. The proposed method also outperformed the fused CNN triplet of Ref. 19 in terms of training cost; RandNet's computational cost in training is significantly lower. According to Theagarajan et al., 18

Misclassification Samples
The proposed method had at least 93% TPR/recall for each individual class, as shown in Table 5. It performed best at identifying attached cells, with a recall of 98.30%. However, it performed worst on unattached cells, even though unattached cells are generally easy to identify visually, as shown in Fig. 2(c).
From the typical misclassified images in out-of-sample testing shown in Fig. 11, we conclude that the blurring effects in the autoencoder network might be the cause of misclassifications. As shown in Figs. 11(b) and 11(c), two unattached cells were blurred out after passing through the autoencoder network; as a result, these cells visually resembled attached cells. Moreover, this blurring effect might be even more significant in the hidden representation generated by the encoder that was used to build the subnetworks. To evaluate the classification component with a segmentation method other than the proposed one, we also segmented the images with Mask RCNN 38 and passed the segmented images as input to our classification component. The classification results and recall for each cell type are shown in Tables 6 and 7, respectively.
As shown in Table 7, the recall for each cell type was above 89%, and the proposed classification component had an accuracy of 93.79% on the Mask RCNN segmented images. Since the proposed classification component was not trained with samples from Mask RCNN, a small accuracy degradation was expected. The proposed classification component still showed good performance reliability on data samples that were not generated by the proposed segmentation method.

Conclusions
Automated classification of hESCs in phase contrast videos is essential for fast, quantifiable analysis of hESC behaviors. The proposed RandNet utilizes unlabeled data for pre-training, and it incorporates both transfer and ensemble learning concepts. RandNet not only has a lower training cost thanks to its pre-trained models, but it also improves performance through fine-tuning with labeled data. It had low performance variance in the cross validation results. This paper has demonstrated that RandNet is an efficient and effective method. In terms of efficiency, it uses the combination of subsampling and pre-trained models to generate subnetworks. In terms of effectiveness, it is a robust method that provides a generalized solution for hESC classification. Our objective in this paper has been to show that we can use both labeled and unlabeled datasets. This software enables quantitative analysis of changes in and behavior of hESCs in video. In the future, we will explore additional deep networks for building subnetworks. Since the blurring effects of the current simple network affected classification performance, we will explore deeper networks to learn a finer hidden representation for hESC classification.

Disclosures
The authors have no potential conflicts of interest to disclose.