Global chlorophyll-a concentration estimation from moderate resolution imaging spectroradiometer using convolutional neural networks

Abstract. The accurate and timely estimation of global chlorophyll-a (Chla) concentration from large volumes of remote sensing data is crucial for supporting various applications. The moderate resolution imaging spectroradiometer (MODIS) is one of the most widely used Earth observation data sources, characterized by global coverage, high spectral resolution, and a short revisit period, so estimating global Chla concentration from MODIS imagery in a fast and accurate manner is of great significance. Nevertheless, estimating Chla concentration from MODIS with traditional machine learning approaches is challenging because of their limited modeling capability for capturing the complex relationship between MODIS spatial-spectral observations and the Chla concentration, and because of their low computational efficiency when processing large volumes of MODIS data in a timely manner. We therefore explore the potential of deep convolutional neural networks (CNNs) for Chla concentration estimation from MODIS imagery. The Ocean Color Climate Change Initiative (OC-CCI) Chla concentration images are used as ground truth because OC-CCI is a well-recognized Chla concentration product, produced by assimilating different satellite data through a complex processing chain. A total of 12 monthly OC-CCI global Chla concentration maps and the associated MODIS images are used to investigate the CNN approach in a cross-validation setting. A classical machine learning approach, support vector regression (SVR), is used as a baseline for comparison. Compared with the SVR, the CNN performs better, with a mean log root-mean-square error of 0.129 and an R² of 0.901, indicating that, using the MODIS images alone, the CNN approach can achieve results close to the OC-CCI Chla concentration images.
These results demonstrate that CNNs may provide Chla concentration images that are reliable, stable, and timely, and as such the CNN constitutes a useful technique for operational Chla concentration estimation from large MODIS data.

Space Agency (ESA) have global ocean monitoring capability (Table 1). Accurate and fast estimation of global Chla concentration from these satellites' images is crucial for various purposes, such as net primary production derivation,1,2 harmful algal bloom detection,3-6 physical and biological interaction studies,7,8 and ecosystem model validation.9,10 The twin moderate resolution imaging spectroradiometers (MODIS), Terra-MODIS and Aqua-MODIS, have a high temporal resolution, viewing the entire Earth's surface every 1 to 2 days, and have thus produced big remote sensing data. The big data characteristics of satellite remote sensing call for a Chla concentration estimation algorithm that can handle such data volumes in an efficient and effective manner. Nevertheless, the relationship between the Chla concentration and the remote sensing observations is a very complex nonlinear function; as a consequence, designing an end-to-end machine learning approach with strong learning capability that can capture this complex nonlinear relationship is an important research issue. The support vector machine, a classical machine learning approach, has limitations due to its relatively weak modeling capability, compared with neural networks, for handling the complexity and uncertainty of the inverse problem, and also due to its low computational efficiency.11 Deep learning, especially convolutional neural networks (CNNs), is well known for exploiting both spatial and spectral information, for its strong modeling capability, and for leveraging GPUs for high computational efficiency, greatly outperforming other machine learning techniques in many computer vision tasks.12-16 Remote sensing scientists have recognized the value of CNNs for tasks such as classification, segmentation, object detection, change detection, and super resolution.
17-19 However, very few studies have focused on using CNNs to solve regression problems. Although CNNs have great potential for addressing the problems in Chla estimation, to the best of our knowledge, no study has explored the capability of CNNs for the estimation of global Chla concentration. Therefore, this paper explores deep CNNs for the estimation of global Chla concentration. However, deep CNNs require a lot of training samples, whereas ground-truth field data are very limited; we therefore adopt the Ocean Color Climate Change Initiative (OC-CCI) images as ground truth.
The OC-CCI project is one of the 13 projects in the ESA CCI program, each of which addresses a particular essential climate variable from satellite observations. The objective of this project is to produce a long-term, multisensor time series of satellite ocean color data, e.g., Chla concentration, with specific information on errors and uncertainties.20 The OC-CCI Chla concentration dataset is produced by applying the ocean chlorophyll 3 (OC3) algorithm, the ocean chlorophyll 5 (OC5) algorithm, and the OCI algorithm [the ocean chlorophyll 4 (OC4) algorithm + the color index (CI) algorithm]21-23 to the merged data of SeaWiFS, MERIS, MODIS, and VIIRS. Implemented through a sophisticated processing chain, OC-CCI aims to provide a global Chla concentration product of the highest quality, particularly for Chla retrievals from case 2 waters at the global scale, although these may not be the latest data.24 Although inevitable biases exist in the OC-CCI products when matched against in situ observations, they are the most accessible and reliable Chla concentration product. This paper explores the use of a CNN regression method for the end-to-end estimation of the global Chla concentration from the imagery recorded by the MODIS sensor on board the Aqua satellite.

CNN
A CNN usually takes an image as input and outputs its corresponding class label. CNN architectures typically comprise several layer types: convolutional layers, pooling layers, and fully connected layers. A slightly modified version of the classical LeNet12 architecture, shown in Fig. 1, is used here to explain the CNN. Convolutional layer 1 takes an RGB image X and the kernels W^(1), which consist of learnable weights, as inputs and outputs the feature maps a^(1). The size of X is channel × height × width = 3 × 28 × 28, and the size of W^(1) is number × channel × height × width = 20 × 3 × 5 × 5, meaning there are 20 three-dimensional (3-D) kernels. A kernel must have the same number of channels as its input. In general, multiple filters produce multiple feature maps, which are stacked to form a multichannel feature map.27,28 Each feature map is obtained by taking the dot product between the kernel and the receptive field as the kernel scans over the input image according to a specific rule; the kernel weights are learned during training. The process of convolutional layer 1 can be expressed as

$a^{(1)} = f[W^{(1)} * X + B^{(1)}]$,   (1)

where * denotes the convolution operation and $f(\cdot)$ is the activation function. $B^{(1)}$ is the bias of the first convolutional layer and has the same size as $a^{(1)}$. The height and width of the feature maps $a^{(1)}$ are determined by the height and width of the kernels $W^{(1)}$ and the step size of the kernels, named the stride:

$\text{width} = \frac{\text{width}_X - \text{width}_W}{\text{stride}} + 1$,   (2)

$\text{height} = \frac{\text{height}_X - \text{height}_W}{\text{stride}} + 1$.   (3)

The stride of $W^{(1)}$ is 1, so the height and width of $a^{(1)}$ are $28 - 5 + 1 = 24$. The size of $a^{(1)}$ is thus channel × height × width = 20 × 24 × 24, where the channel count equals the number of kernels in $W^{(1)}$; the 20 feature maps $a^{(1)}$ can therefore be treated as a 3-D image, just like X.

Fig. 1 Structure of a CNN that is slightly different from the classical LeNet. The 3-D X and a are ordered as channel × height × width. The kernels W in each convolutional layer are four-dimensional and the number before "at" stands for the number of 3-D kernels, so the four-dimensional kernels W are ordered as number × channel × height × width.

The activation function applies a nonlinear transformation [e.g., sigmoid, tanh, rectified linear unit (ReLU), etc.] to each element of the convolution output. Conventionally, each convolutional layer is followed by a pooling layer in order to reduce the number of parameters to learn and to reduce the variance of the features,29-31 which can be accomplished by operations such as the maximum or average; no weights need to be learned in this layer. The process of pooling layer 1 can be expressed as

$a^{(2)} = \mathrm{down}[a^{(1)}]$,   (4)

where down[·] represents down-sampling. In general, a pooling layer halves the spatial size of its input, so the size of $a^{(2)}$ is channel × height × width = 20 × 12 × 12. Convolutional layer 2 and pooling layer 2 operate in the same way as convolutional layer 1 and pooling layer 1:

$a^{(3)} = f[W^{(3)} * a^{(2)} + B^{(3)}]$,   (5)

$a^{(4)} = \mathrm{down}[a^{(3)}]$.   (6)

A fully connected layer takes all the neurons of its input layer, whether a convolutional or fully connected layer, and connects them to each neuron in its own layer. Fully connected layers are regularly used as the last few layers of a CNN architecture and yield as many outputs as there are labels for the classifier layer. The two fully connected layers can be expressed as

$a^{(5)} = f[W^{(5)} \cdot a^{(4)}(:) + B^{(5)}]$,   (7)

$a^{(6)} = W^{(6)} \cdot a^{(5)} + B^{(6)}$.   (8)

In fully connected layer 1, the 3-D $a^{(4)}$ is flattened to a vector, its size changing from channel × height × width = 50 × 4 × 4 to (50 × 4 × 4) = 800. An inner product is performed between $W^{(5)}$ and the flattened $a^{(4)}$, and $B^{(5)}$ has the same size as $a^{(5)}$. Fully connected layer 2 is similar but has no activation function, and its output is set to 10 neurons according to the number of classes, which is the 10 handwritten and machine-printed characters recognized by LeNet. In the last layer, a softmax function $\sigma(\cdot)$ is used as the classifier to generate the label $\hat{Y}$. Based on the equations above, the output of the CNN is formulated as

$\hat{Y} = \sigma\{W^{(6)} \cdot f[W^{(5)} \cdot \mathrm{down}\{f[W^{(3)} * \mathrm{down}\{f[W^{(1)} * X + B^{(1)}]\} + B^{(3)}]\}(:) + B^{(5)}] + B^{(6)}\}$.   (9)
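The size arithmetic above can be checked with a small helper (a minimal sketch, assuming valid convolution with no padding; not code from the paper):

```python
def conv_out(size, kernel, stride=1):
    """Spatial size of a convolution output: (size - kernel)//stride + 1, per Eqs. (2)-(3)."""
    return (size - kernel) // stride + 1

def pool_out(size, window, stride=None):
    """Spatial size after pooling; the stride defaults to the window size."""
    stride = window if stride is None else stride
    return (size - window) // stride + 1

# Trace the LeNet-like example from the text: 3x28x28 input, 5x5 kernels, 2x2 pooling.
h1 = conv_out(28, 5)   # conv layer 1 -> 24
h2 = pool_out(h1, 2)   # pool layer 1 -> 12
h3 = conv_out(h2, 5)   # conv layer 2 -> 8
h4 = pool_out(h3, 2)   # pool layer 2 -> 4
```

The final 4 × 4 spatial size matches the 50 × 4 × 4 shape of $a^{(4)}$ stated in the text.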

Methodology
In this paper, a patch-based CNN regression method is used to estimate the global Chla concentration from MODIS imagery. The patch-based regression CNN model scans the images and crops a patch at each location of the scan. For each patch, the regression CNN predicts the Chla concentration value of the pixel at the patch center. The model then assembles all the predicted values into the global Chla concentration map. The complete procedure is shown in Fig. 2. It consists of three parts: the preprocessing of the input data, the training of the CNN model, and the prediction with the trained CNN, which follows the same procedure as testing. The performance of the CNN method is evaluated on the whole dataset.
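The scanning-and-prediction procedure can be sketched as follows, with `model_predict` standing in as a hypothetical placeholder for the trained regression CNN:

```python
import numpy as np

def predict_map(rrs, model_predict, patch=15):
    """Scan a (bands, H, W) Rrs stack, crop a patch at each location, and predict
    the Chla value of the pixel at the patch center. Patches containing NaN
    (flagged pixels) or lying outside the boundary are skipped."""
    bands, H, W = rrs.shape
    half = patch // 2
    out = np.full((H, W), np.nan)
    for i in range(half, H - half):
        for j in range(half, W - half):
            p = rrs[:, i - half:i + half + 1, j - half:j + half + 1]
            if np.isnan(p).any():        # flagged pixels excluded
                continue
            out[i, j] = model_predict(p)
    return out

# Toy check with a stand-in "model" that returns the center-pixel mean across bands.
rrs = np.random.rand(4, 20, 20)
rrs[0, 0, 0] = np.nan                    # one flagged pixel (e.g., land)
chla = predict_map(rrs, lambda p: p[:, 7, 7].mean())
```

Any center whose 15 × 15 window covers the flagged pixel stays NaN, mirroring the patch-rejection rule described in Sec. "Preprocessing".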

Data
In order to train and evaluate the CNN model on the task of Chla concentration estimation, Aqua MODIS level 3 monthly standard mapped image Rrs products and the OC-CCI monthly Chla concentration product, version 3.1, are used as input and ground truth, respectively. The monthly data of January 2016 are used for training and validation, and the monthly data of the whole year (including January) are used for testing. Details are shown in Table 2. Usually, an inverse problem requires only a training and a testing data set; because CNNs have a large number of parameters, however, it is common practice to use a validation data set to help tune them. The derived CNN model is validated after a number of iterations by calculating the cost function on the validation data with the current model, and the CNN with the smallest validation error is selected as the trained CNN. The Rrs products are obtained from the NASA ocean color website.32 These data have a monthly temporal resolution and 4 km (at the equator) spatial resolution with global coverage in an equirectangular projection. The atmospheric correction algorithm employed by NASA has been described in detail in previous studies.33-35 We download the Rrs images at 443, 488, 547, and 667 nm for each month of 2016. These wavelengths are selected because they are the main input data for the OC3 and CI algorithms used in the OC-CCI processing chain. In total, 48 images, 4 for each month, are used, all with the same image size of 4320 × 8640.
On the other hand, 12 images from the year 2016, one for each month, derived from the OC-CCI Chla concentration product, are used as ground truth. All Chla images have the same spatial resolution, coverage, map projection, and size as the Rrs images described above, so they can readily be used to establish the relationship between the MODIS images and the OC-CCI Chla concentration. In addition, according to the OC-CCI processing chain, all available ocean color satellites should be used; for the Chla concentration product of 2016, MODIS and VIIRS are the only data available to merge, owing to the sensors' service periods. Further information about the Chla data can be found in Ref. 36.

Preprocessing
The ground-truth OC-CCI Chla images are logarithmically transformed because of the heavy-tailed distribution of Chla.37-40 The four input MODIS Rrs images associated with a ground-truth image are scaled to the same range as the log-transformed Chla (Table 3). Compared with the normalization operation,41-43 the logarithmic transformation proves in this study to be a more effective way to improve the performance of a CNN. Pixels flagged for land, clouds, failure in atmospheric correction, stray light, etc., are assigned the value NAN and excluded from training.
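The two preprocessing steps can be sketched as below; the target range values are placeholders for the actual per-band ranges listed in Table 3:

```python
import numpy as np

def log_transform_chla(chla):
    """log10-transform Chla; NaN flags propagate through the transform."""
    return np.log10(chla)

def scale_rrs(rrs, target_min, target_max):
    """Linearly rescale one Rrs band to the range of the log-transformed Chla.
    target_min/target_max are assumed placeholder values (see Table 3)."""
    lo, hi = np.nanmin(rrs), np.nanmax(rrs)
    return target_min + (rrs - lo) * (target_max - target_min) / (hi - lo)

chla_log = log_transform_chla(np.array([0.1, 1.0, 10.0]))   # -> [-1, 0, 1]
rrs = np.array([0.001, 0.005, 0.02])
scaled = scale_rrs(rrs, -2.0, 2.0)                          # spans [-2, 2]
```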
An image patch of 15 × 15 pixels (60 km × 60 km) scans the whole set of Rrs images to generate the samples used in this study. Because the Chla image has the same size as the Rrs images, it is convenient to obtain the corresponding Chla concentration located at the patch center. If an image patch contains NAN-valued pixels or lies outside the boundary, it is abandoned. A total of 18,566,240 sample pairs are created, each consisting of a 3-D input matrix (a 4 × 15 × 15 image patch) and a floating-point output label. Validation is crucial for training a strong CNN model, so nearly 3,000,000 samples are selected for validation and the remaining roughly 15,000,000 samples are used for training. The preprocessing procedure is the same during testing.

Structure of the Patch-Based Regression CNN
The structure of the patch-based regression CNN is shown in Fig. 3. The main difference from Fig. 1 is the last layer, which is specifically designed to output only one value so as to solve the regression problem. The CNN consists of two convolutional layers, two pooling layers, and two fully connected layers. The strategy for tuning the parameters to obtain this structure is explained in Sec. 5.
The input X is an image patch of size 4 × 15 × 15. The output of convolutional layer 1 is the 3-D feature map a^(1) of size 20 × 13 × 13, the result of convolving the input patch with the kernels W^(1) at stride 1. A max-pooling layer follows, reducing the size to 20 × 6 × 6; the maximum operation is chosen for pooling owing to its simplicity and effectiveness. Another convolution with the kernels W^(3) at stride 1, followed by max-pooling of a^(3), yields the feature map a^(3) of size 50 × 4 × 4 and then a^(4) of size 50 × 2 × 2. The first fully connected layer takes an input of dimension (50 × 2 × 2) = 200 and outputs 100 neurons. Unlike the classical LeNet, which employs the sigmoid activation function, the ReLU activation, max{0, z}, is used in each convolutional layer and in the first fully connected layer because it helps accelerate convergence during training.44 Unlike the classification setting, in which the output size of the last fully connected layer is the number of classes, here, to address the regression problem, the last fully connected layer takes the 100 neurons as input and yields one neuron, which corresponds to the Chla concentration value.
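As a concrete illustration, the layer sizes above can be reproduced in a short PyTorch sketch (the paper used Caffe; the 3 × 3 kernel sizes here are inferred from the stated 15 → 13 and 6 → 4 feature-map transitions and are therefore assumptions):

```python
import torch
import torch.nn as nn

class ChlaCNN(nn.Module):
    """Sketch of the patch-based regression CNN: a 4x15x15 Rrs patch in,
    one Chla concentration value out."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 20, kernel_size=3), nn.ReLU(),   # 4x15x15 -> 20x13x13
            nn.MaxPool2d(2),                              # -> 20x6x6
            nn.Conv2d(20, 50, kernel_size=3), nn.ReLU(),  # -> 50x4x4
            nn.MaxPool2d(2),                              # -> 50x2x2
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),                                 # -> 200
            nn.Linear(200, 100), nn.ReLU(),               # fully connected layer 1
            nn.Linear(100, 1),                            # one output neuron: Chla
        )

    def forward(self, x):
        return self.regressor(self.features(x))

patch = torch.randn(8, 4, 15, 15)   # a mini-batch of Rrs patches
out = ChlaCNN()(patch)              # shape (8, 1)
```

ReLU follows each convolution and the first fully connected layer, while the last layer is linear, exactly as the text describes.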

Training and Testing
After modeling a CNN network, a loss function needs to be defined, which is then minimized by training. In place of the softmax loss, which is regularly adopted in classification CNNs, the Euclidean loss is adopted for this regression CNN to minimize the error between the CNN output and the ground-truth Chla concentration provided by the OC-CCI Chla image. The loss function is

$L = \frac{1}{2N} \sum_{n=1}^{N} \|\hat{Y}_n - Y_n\|^2$,   (10)

where W represents the weights in the kernels, B the biases, N the number of samples in each mini-batch, which is a small subset of the training samples,45 $\hat{Y}$ the output of the CNN with those parameters, and Y the ground-truth Chla concentration from the OC-CCI Chla image. $\hat{Y}$ is calculated as

$\hat{Y} = W^{(6)} \cdot f[W^{(5)} \cdot \mathrm{down}\{f[W^{(3)} * \mathrm{down}\{f[W^{(1)} * X + B^{(1)}]\} + B^{(3)}]\}(:) + B^{(5)}] + B^{(6)}$,   (11)

which is similar to Eq. (9) in Sec. 2 but without the softmax layer. With the loss function defined, the CNN is trained to minimize the cost using the stochastic gradient descent (SGD) optimization algorithm.46,47 The update rule of the weights W in SGD is given as

$\Delta W_{i+1} = \alpha \Delta W_i - \varepsilon \lambda W_i - \varepsilon \frac{\partial L}{\partial W_i}$,   (12)

$W_{i+1} = W_i + \Delta W_{i+1}$,   (13)

where $W_{i+1}$ and $\Delta W_{i+1}$ are the weights and the weight update at iteration $i+1$, the learning rate $\varepsilon$ decides how much to learn at each step, the momentum $\alpha$ is motivated by a physical perspective on the optimization problem and accelerates the convergence rate, and the weight decay $\lambda$ helps to avoid overfitting.

Fig. 3 Structure of the patch-based regression CNN. The 3-D X and a are ordered as channel × height × width. The kernels W in each convolutional layer are four-dimensional and the number before "at" stands for the number of 3-D kernels, so the four-dimensional kernels W are ordered as number × channel × height × width. The red $T^{(n)}$ is the residual error in the $n$'th layer during backpropagation, which has the same size as the corresponding a.

In Fig. 3, $T^{(n)}$ is the residual error in the $n$'th layer, which has the same size as the corresponding a. The gradient $\partial L / \partial W$ is calculated in each layer from back to front. In fully connected layer 2, it is computed as

$T^{(6)} = \hat{Y} - Y$,   (14)

$\frac{\partial L}{\partial W^{(6)}} = T^{(6)} \cdot [a^{(5)}]^T$.   (15)

In fully connected layer 1, it is computed as

$T^{(5)} = f'[(W^{(6)})^T \cdot T^{(6)}]$,   (16)

$\frac{\partial L}{\partial W^{(5)}} = T^{(5)} \cdot [a^{(4)}(:)]^T$,   (17)

where $f'(\cdot)$ represents the derivative of the ReLU. No weights or biases need to be trained in the pooling layers, and the gradient in convolutional layer 2 is computed as

$T^{(4)} = (W^{(5)})^T \cdot T^{(5)}$,   (18)

$T^{(3)} = f'\{\mathrm{up}[T^{(4)}]\}$,   (19)

$\frac{\partial L}{\partial W^{(3)}} = T^{(3)} \otimes a^{(2)}$,   (20)

where $\otimes$ represents the reverse convolution operation and up(·) represents upsampling. The gradient in convolutional layer 1 is computed as

$T^{(2)} = W^{(3)} \otimes T^{(3)}$,   (21)

$T^{(1)} = f'\{\mathrm{up}[T^{(2)}]\}$,   (22)

$\frac{\partial L}{\partial W^{(1)}} = T^{(1)} \otimes X$.   (23)

The update rule of the biases B is the same as for the weights W, and $\partial L / \partial B$ in each layer equals the corresponding residual $T^{(n)}$. We train the CNN for 100,000 iterations with a mini-batch size of 128. During training, validation is carried out every 1000 iterations, using 3000 forward passes on validation mini-batches of size 1000. All the parameters of this CNN are chosen after comprehensive experiments with different values, and the setting with the best validation performance is selected. The learning rate, momentum, and weight decay are set to 0.0001, 0.9, and 0.0005, respectively.
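The SGD update with momentum and weight decay described above can be sketched in NumPy (a minimal illustration of the update rule with the paper's hyperparameter values, not the Caffe implementation):

```python
import numpy as np

def sgd_step(W, v, dW, lr=1e-4, momentum=0.9, weight_decay=5e-4):
    """One SGD update: the velocity v accumulates the momentum term, minus the
    weight-decay term, minus the scaled gradient; the weights then add v."""
    v = momentum * v - lr * weight_decay * W - lr * dW
    return W + v, v

W = np.array([1.0, -2.0])       # toy weights
v = np.zeros_like(W)            # initial velocity
W1, v1 = sgd_step(W, v, dW=np.array([0.5, -0.5]))
```

Repeating `sgd_step` over mini-batches for 100,000 iterations corresponds to the training loop described in the text.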
Once the patch-based CNN model is trained, the Chla concentration is estimated by applying the trained model to the Rrs images pixel by pixel. The testing results are evaluated against the corresponding OC-CCI Chla images.

Comparison with SVR
For the purpose of comparison, an SVR is adopted to estimate the Chla concentration using the same data source as the CNN method. The SVR solves complex regression problems by calculating a linear regression function in a higher-dimensional feature space into which the input data are mapped by a nonlinear function. The SVR is implemented using the fitrsvm function in MATLAB. A grid search is used to tune the soft margin parameter and the Gaussian kernel parameter to maximize performance; the two tuned parameters are 1 and 2, respectively. Unlike the patch-based CNN method, the SVR estimates the Chla concentration in a pixel-based manner. Since the SVR in MATLAB cannot handle large training data, only 50,000 representative samples are adopted for training.
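For illustration, a pixel-based SVR baseline of this kind can be sketched with scikit-learn (the paper used MATLAB; the data here are synthetic, `C=1` mirrors the tuned soft margin, and the `gamma` value is an assumed stand-in since it is not numerically equivalent to MATLAB's tuned Gaussian kernel parameter):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((200, 4))                 # 4 Rrs bands per pixel (synthetic)
y = np.log10(0.01 + X.sum(axis=1))       # synthetic log-Chla target

# Gaussian-kernel SVR, trained pixel-wise on band values.
model = SVR(kernel="rbf", C=1.0, gamma=0.25).fit(X, y)
pred = model.predict(X)                  # one Chla estimate per pixel
```

In practice the 50,000 representative training pixels mentioned above would replace the synthetic `X` and `y`.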

Numerical Assessment of the Method
The estimation performance is assessed with the root-mean-square error (RMSE) and R², computed in log10 space as

$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} [\log_{10}(\mathrm{Chla}_{\mathrm{CNN},i}) - \log_{10}(\mathrm{Chla}_{\mathrm{OC\mbox{-}CCI},i})]^2}$,

$R^2 = 1 - \frac{\sum_{i=1}^{N} [\log_{10}(\mathrm{Chla}_{\mathrm{OC\mbox{-}CCI},i}) - \log_{10}(\mathrm{Chla}_{\mathrm{CNN},i})]^2}{\sum_{i=1}^{N} [\log_{10}(\mathrm{Chla}_{\mathrm{OC\mbox{-}CCI},i}) - \overline{\log_{10}(\mathrm{Chla}_{\mathrm{OC\mbox{-}CCI}})}]^2}$,

where Chla_CNN denotes the Chla concentration values estimated by the CNN model and Chla_OC-CCI denotes the ground-truth OC-CCI Chla concentration values. Therefore, R² is a measurement of the agreement between the predicted and the ground-truth data sets, and RMSE is used as a measurement of the absolute error.48 Pixels with NAN values in the CNN prediction map or in the OC-CCI Chla ground-truth map are not used for calculating the above-described measures. Moreover, because the natural distribution of Chla is lognormal, the evaluation is performed in log10 space.43,49,50

Results

Using the patch-based CNN method, 12 monthly global Chla concentration images are predicted, and the prediction of each image takes about 2 h. A one-month Chla concentration map estimated from the Rrs images by the CNN model and by the SVR model, together with the corresponding OC-CCI ground-truth image, is shown in Fig. 4. For both the SVR and the CNN predictions, values below 0.001 or above 100 mg m−3 (representing <0.001% of the Chla estimates) are removed to limit the product to the same range as the OC-CCI ground-truth image. From Figs. 4(a) and 4(b), we can see that the prediction map produced by the CNN model has a strong spatial consistency with the OC-CCI ground-truth Chla concentration map, which is itself obtained by assimilating various sources of information, such as the proximity to land, the depth of the ocean, and ocean currents.51

Fig. 4 Spatial distribution of the monthly global Chla concentration in March 2016 (a) obtained using the CNN method, (b) the OC-CCI data, and (c) obtained using the SVR method. In general, Figs. 4(a) and 4(c) show strong spatial consistency with Fig. 4(b), but in the three red rectangle areas, Fig. 4(c) is much deeper red than Fig. 4(a), indicating heavier overestimation in the SVR prediction map than in the CNN one.

Figure 6(a) shows the scatter plot of the values predicted by the CNN model against the ground-truth OC-CCI Chla concentration values. In a scatter plot, good performance means that the points lie close to the equal-value line. Figure 6(a) shows that the majority of the data points are distributed around the equal-value line, indicating that the CNN method can predict very accurately the Chla concentration information in the OC-CCI data by learning the complex relationship between the Rrs spectral data and the OC-CCI data.
In Table 4, we can see that the R² achieved by the CNN model is between 0.879 and 0.918, with a mean of 0.901, indicating a high consistency between the Chla concentration values predicted by the CNN model and the ground-truth OC-CCI values. The other numerical measures also indicate the small error and high accuracy of the proposed CNN model: the overall RMSE of the estimated Chla concentration is 0.129, and the maximum bias and MAPE are 0.054 and 26.687, respectively. The maximum and minimum MAE are 0.153 and 0.102, respectively, with a mean value of 0.125. We want to highlight the fact that although the proposed CNN model is trained only on the month 1 data, it achieves stably good performance on the other 11 months' data, indicating a strong generalization capability of the CNN model. Figure 7 shows the RMSE and R² performance comparison between the CNN model and the SVR method on the 12-month Chla concentration estimation. The R² curve achieved by the CNN model is above that of the SVR method, indicating that the CNN outperforms the SVR over all 12 monthly predictions; the RMSE values achieved by the CNN are also consistently lower than those of the SVR. The mean (RMSE, R²) over the 12 months are (0.129, 0.901) for the CNN and (0.175, 0.828) for the SVR. Figures 4(c) and 5 show that some case 2 water areas, i.e., the Yellow Sea and East China Sea (red rectangle 1), the English Channel and the south of the North Sea (red rectangle 2), and the east of the Bering Sea (red rectangle 3), are much deeper red than in Figs. 4(a) and 5, indicating heavier overestimation in the prediction map achieved by the SVR method than by the CNN model. Figure 6 supports this visual analysis: although both the CNN model and the SVR method overestimate in the high Chla concentration areas, the SVR method performs much worse. From Fig. 6, we can also see that the CNN model outperforms the SVR method in the low Chla concentration areas.
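The evaluation described above can be sketched as follows; the NaN masking and log10 transform mirror the text, while the R² shown here is the coefficient of determination (one common reading of the paper's correlation-based measure):

```python
import numpy as np

def log_metrics(pred, truth):
    """RMSE and R^2 between predicted and OC-CCI Chla, computed in log10 space
    with NaN pixels (land, clouds, etc.) excluded."""
    valid = ~(np.isnan(pred) | np.isnan(truth))
    p, t = np.log10(pred[valid]), np.log10(truth[valid])
    rmse = np.sqrt(np.mean((p - t) ** 2))
    ss_res = np.sum((t - p) ** 2)
    ss_tot = np.sum((t - t.mean()) ** 2)
    return rmse, 1.0 - ss_res / ss_tot

# Toy check: a perfect match on the valid pixels, one NaN pixel ignored.
pred = np.array([0.1, 1.0, 10.0, np.nan])
truth = np.array([0.1, 1.0, 10.0, 5.0])
rmse, r2 = log_metrics(pred, truth)
```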

Discussion
Based on the above experimental results, we can conclude that the CNN approach learns the complex relationship between the Rrs and the OC-CCI Chla values better than the SVR, demonstrating that the CNN may be used to build an end-to-end approach for efficient Chla concentration estimation. Song et al.52 proposed a novel inversion method for rough surface parameters using a CNN, in which microwave images of rough surfaces are used for training. In Ref. 53, a CNN-based method is employed to establish a solar flare forecasting model, in which forecasting patterns are learned from line-of-sight magnetograms of solar active regions; magnetograms observed by SOHO/MDI and SDO/HMI and the corresponding soft x-ray solar flares observed by GOES constitute the training dataset. CNNs have thus previously been used for rough surface parameter and solar flare inversion; here we demonstrate that they can also be used for Chla concentration inversion by learning the complex nonlinear relationship between the Rrs and the OC-CCI Chla values.
CNNs have a large number of hyperparameters; in this paper, the patch size, kernel size, number of kernels, number of neurons in the fully connected layers, number of layers, and training parameters need to be tuned to obtain good performance for estimating the global Chla concentration. Adopting a grid search to obtain the best combination of hyperparameters would be very time-consuming, so the hyperparameters are tuned with the following strategy. The structure shown in Fig. 1 is the classical LeNet, and the default hyperparameters in this experiment are based on the parameters in Fig. 1. The number of layers is first changed manually from Fig. 1 to generate a few experimental groups with different depths. The other hyperparameters in each group are then tuned, and the best-performing CNN model, whose structure is described in Fig. 3, is obtained.
In this paper, only the January 2016 data are used for training, and the whole-year, 12-month data are used for testing. The CNN still works very well, which indicates strong robustness and generalization capability and suggests that it may be used for predicting long-term Chla concentration without the need to fine-tune on new observations. The mean R² of the CNN is 0.901, very close to the OC-CCI Chla concentration, suggesting that, using MODIS data only, the CNN approach can approximate the OC-CCI products; this is probably due to the high learning capability of the CNN, which can fully exploit the information in MODIS by capturing the complex nonlinear relationship between the Rrs and the OC-CCI Chla values. In addition, the OC-CCI Chla concentration dataset is adopted as the ground truth because it is a well-recognized Chla concentration product, produced by assimilating different satellite data through a complex processing chain. An Nvidia Quadro M2000 GPU is used in this study, with 4 GB of GPU memory, 768 NVIDIA CUDA cores, and a boost clock of 1180 MHz. A more powerful GPU such as the Nvidia GeForce GTX 1080 Ti, with 11 GB of GPU memory, 3584 CUDA cores, and a boost clock of 1582 MHz, would help deal with big data. With a better GPU, more GPUs, and batch processing during the prediction stage, the processing time can be shortened further.
Although the mean RMSE and R² of the CNN are acceptable, overestimation by the CNN can be observed in Figs. 4(a), 5, and 6(a) in the high Chla concentration areas, which may be caused by the imbalanced quantities of case 1 and case 2 water training samples: because the whole global Chla image is scanned to generate the training samples, there are far more case 1 samples than case 2 samples. More comprehensive studies, such as combining sea surface temperature and digital bathymetry model data as additional inputs, are required in the next stage to solve the overestimation problem. Although OC-CCI is the most accessible and reliable Chla concentration product, uncertainties may exist that affect the performance of the model. In the future, if more reliable data become available, further studies will be conducted to train and evaluate the CNN model.

Conclusion
In this study, a CNN method was applied to MODIS Rrs images to estimate the global Chla concentration. The CNN took patches of four MODIS Rrs images as input and generated the Chla concentration directly. The OC-CCI Chla concentration images were used as the ground truth for training. A total of 12 monthly global Chla concentration images were produced, with the generation of each image taking about 2 h. Qualitative and quantitative analyses were used for evaluation and comparison, and the results imply that the CNN constitutes an accurate, fast, and robust method for the estimation of the global Chla concentration. Considering the big data characteristics of remotely sensed global Chla data, these properties of the CNN model are of great importance.

Appendix
All experiments are conducted on a 64-bit Intel Xeon E5-2640 v3 workstation with a 2.6 GHz clock and 32 GB of RAM, together with an Nvidia Quadro M2000 GPU with 4 GB of memory and CUDA version 8.0. Caffe,54 a popular deep learning framework especially for CNNs, whose name is short for convolutional architecture for fast feature embedding, running under Ubuntu 14.04 LTS, is used in this paper. Caffe provides a convenient way to implement CNN models by defining the architecture of the CNN, such as the number of layers, the type of each layer, and the optimization strategy. The preprocessing of the data and the numerical assessment are performed in MATLAB R2014a. The data layer of the CNN in Caffe reads the image patches and their corresponding Chla values in HDF5 format instead of the default Lightning Memory-Mapped Database format, to easily store multichannel images and float labels.
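Packing the patch/label pairs into HDF5 might look like the following h5py sketch (the file path and array contents are illustrative; "data" and "label" are the dataset names Caffe's HDF5 data layer conventionally expects):

```python
import os
import tempfile
import numpy as np
import h5py

# Illustrative patch/label pairs: 100 Rrs patches of 4x15x15 with float log-Chla labels.
X = np.random.rand(100, 4, 15, 15).astype(np.float32)
y = np.random.rand(100, 1).astype(np.float32)

path = os.path.join(tempfile.gettempdir(), "train.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=X)    # multichannel image patches
    f.create_dataset("label", data=y)   # float regression targets

# Read back to confirm the stored layout.
with h5py.File(path, "r") as f:
    shape = f["data"].shape
```

HDF5 handles multichannel float arrays natively, which is why it is preferred here over the default LMDB image database.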