Urban land surface temperature prediction using parallel STL-Bi-LSTM neural network

Abstract. Accurate temperature prediction is of great significance to human life and the social economy. A series of traditional and machine learning methods have been proposed for temperature prediction, but it remains a challenging problem. We propose a temperature prediction model that combines seasonal and trend decomposition using loess (STL) with the bidirectional long short-term memory (Bi-LSTM) network to achieve high-accuracy prediction of the daily average temperature of Chinese cities. The proposed model uses STL to decompose the temperature data into trend, seasonal, and remainder components. The decomposition components and the original temperature data are input into two-layer Bi-LSTM networks to learn the features of the temperature data; the sum of the three component predictions and the prediction from the original temperature data are then combined using learnable weights to give the final prediction. The experimental results show that the average root mean square error and mean absolute error of the proposed model on the testing data are 0.11 and 0.09, respectively, which are lower than the 0.35 and 0.27 of STL-LSTM, the 2.73 and 2.07 of EMD-LSTM, and the 0.39 and 0.15 of STL-SVM, achieving higher-precision temperature prediction.


Introduction
Temperature is closely related to human life, agricultural production, and social economy, and it affects all aspects of life. Studies showed that the increase in temperature reduced crop yields. 1,2 Doi et al. 3 presented that temperature change had an impact on deep-sea biodiversity. Dottori et al. 4 proposed river floods with warming would increase human and economic loss. Temperature change also increased building energy consumption. 5 The most serious issue is that temperature change would have an impact on the spread of diseases and endanger people's lives. 6,7 Accurate prediction of temperature is significant to protect people's lives and property and maintain stable economic development. However, temperature prediction is very challenging due to various uncertain relevant factors.
A series of forecasting methods, including conventional methods and machine learning methods, have been proposed to predict temperature. Wang et al. 8 proposed an improved support vector machine (SVM) to predict the daily minimum temperature; Babu et al. 9 used different autoregressive integrated moving average (ARIMA) models to predict the average global temperature; Jallal et al. 10 proposed an artificial neural network (ANN) with delayed exogenous input to forecast air temperature on a half-hour scale. With the appearance of the recurrent neural network (RNN), more and more RNN-based methods have been used to solve the problem of temperature prediction. The long short-term memory (LSTM) network is one of the most popular. Li et al. 11 provided temperature prediction results every half an hour using a stacked LSTM network. Qi and Guo 12 predicted the next hour's temperature using the LSTM network by fully considering the historic temperature and meteorological conditions. Huang et al. 13 proposed a multistep temperature prediction model using the LSTM network based on temperature data of surrounding cities. Sadeque and Bui 14 provided a cascaded LSTM network for weather forecasting that can outperform some of the existing well-known models. Joanna et al. 15 presented an outdoor air-temperature time-series prediction model for a multifamily building using ANNs and obtained outstanding prediction results by selecting the best combination of predictors and the optimal number of neurons in a hidden layer. Wang et al. 16 developed and evaluated a new algorithm based on pattern approximate matching to predict the temperature of five cities in China. Yu et al. 17 presented an air temperature forecasting framework based on a graph attention network and the gated recurrent unit, which overcame the flaw of the conventional graph network and achieved the best performance. Hrachya et al. 18 implemented a weather prediction technique based on machine learning to improve the hourly air temperature prediction for up to 24 h. Toni et al. 19 combined LSTM and the Prophet model to forecast 5-year daily air temperatures in Bandung; the results showed that the combination of the two models performed well for the prediction of low temperature and high temperature.
For time series data prediction, time series decomposition methods have an inspiring effect on improving the accuracy of time series data forecasting. Zhang 20 achieved foreign exchange rate forecasting using the combined model of empirical mode decomposition (EMD) and LSTM network. Jin et al. 21 suggested a vegetable price forecasting model using seasonal and trend decomposition using loess (STL) and LSTM network. Wang and Lou 22 proposed a hydrological time series forecast model based on wavelet denoising and ARIMA-LSTM that can be well adapted to the hydrological time series forecast and has the best forecast effect. Duan et al. 23 forecasted base station traffic using STL-LSTM networks, which have better performance compared with the other algorithms. Huo et al. 24 solved the problem of long-term span traffic prediction using STL and LSTM model. Yin et al. 25 updated the STL-LSTM model using an attention mechanism to achieve high accuracy of vegetable price forecasting. Chen et al. 26 forecasted the short-term metro ridership using the STL-LSTM model and proved it can achieve high accuracy.
In this paper, we propose a prediction model combining the STL method and bidirectional long short-term memory (Bi-LSTM) neural network to achieve the prediction of daily average temperature. Because temperature is affected by a variety of uncertain factors, it is difficult to obtain satisfactory results in temperature prediction directly using deep learning models. The time series decomposition method can be used to decompose periodic time series into trend components, seasonal components, and residual components. In a general periodic time series, the trend component generally represents the low-frequency variation, whereas the seasonal component represents the periodic variation. For temperature data, the seasonal component represents the periodic fluctuation of temperature with seasonal changes. The trend component represents changes in temperature that are influenced by other factors, such as increase in carbon dioxide. This paper first uses the STL method to decompose temperature data into trend component, seasonal component, and remainder component and then inputs the decomposition components and the original data into the two-layer Bi-LSTM neural network for training. The final output of the network is the predicted temperature.
The structure of the paper is as follows: Sec. 1 is the introduction; Sec. 2 introduces the study area; Sec. 3 is the related works, including the STL method, LSTM model, and Bi-LSTM model; Sec. 4 shows the structure of the proposed model; Sec. 5 is the experimental results; Sec. 6 is the conclusion.

Study Area
China is located in eastern Asia and on the western coast of the Pacific Ocean. It has a vast territory with a total land area of ∼9.6 million km². The terrain of China is higher in the west and lower in the east and is distributed in a stepped manner. The combination of temperature and precipitation is diverse, forming a diverse climate. There are a total of 34 provincial-level administrative units, including 23 provinces, 5 autonomous regions, 4 municipalities, and 2 special administrative regions.

Materials and Methods
We introduce the materials and related methods including seasonal and trend decomposition using loess, LSTM model, and Bi-LSTM model.

Materials
The temperature data for training and testing are land surface temperatures acquired from weather stations around the capital cities of China's 34 provincial-level administrative regions through Ref. 27. According to the official documentation, the daily average temperature is the mean of the temperatures recorded over the 24 h of the day, in degrees Fahrenheit to tenths. We downloaded the daily average temperature of these 34 cities from 2010 to 2020, about 136,510 values in total, of which 95% are used for training and the rest for testing. To test the robustness of the proposed model, additional testing data are added in the testing stage. According to climatic conditions and geographic location, China is divided into four regions: the north region, south region, northwest region, and Tibetan region. 28 Fifteen cities were selected from each region to test the robustness of the model to the temperature of different regions. About 240,900 daily average temperature values from 2010 to 2020 were downloaded for these cities, and all of them are used for model testing.
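The sample counts quoted above can be checked with simple arithmetic. The sketch below assumes 365-day years (leap days ignored) and a split that rounds the training share down; both are assumptions made here for illustration, not details stated by the data source.

```python
DAYS_PER_YEAR = 365  # assumption: leap days are ignored
YEARS = 11           # 2010 through 2020 inclusive


def n_samples(n_cities, years=YEARS, days_per_year=DAYS_PER_YEAR):
    """Number of daily values for n_cities over the study period."""
    return n_cities * years * days_per_year


main_total = n_samples(34)       # 34 capital cities
extra_total = n_samples(4 * 15)  # 15 cities in each of 4 regions

# Assumed 95% / 5% split, using integer arithmetic to round down.
train_size = main_total * 95 // 100
test_size = main_total - train_size

print(main_total, extra_total, train_size, test_size)
```

Under these assumptions the totals reproduce the figures of about 136,510 and 240,900 values given in the text.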

Seasonal and trend decomposition using loess
STL is a time series decomposition method proposed by Cleveland et al., 29 which decomposes a time series into trend, seasonal, and remainder components based on loess. Given a temperature time series $X_v$, STL decomposes $X_v$ into three additive components, the trend component $T_v$, the seasonal component $S_v$, and the remainder component $R_v$, which can be expressed as follows:

$$X_v = T_v + S_v + R_v.$$

Suppose $x_i$ and $y_i$ are measurements of an independent and a dependent variable for $i = 1$ to $n$. The loess regression curve $g(x)$ is a smoothing of $y$ given $x$ that can be computed for any value $x$ along the scale of the independent variable. That is, loess is defined everywhere and not just at the $x_i$; this is an important feature that allows STL to deal with missing values and to detrend the seasonal component in a straightforward way. Loess can be used to smooth $y$ as a function of any number of independent variables, but for STL only the case of one independent variable is needed. $g(x)$ is computed as follows: given a positive integer $q$ with $q \le n$, the $q$ values of the $x_i$ that are closest to $x$ are selected and each is given a neighborhood weight based on its distance from $x$. Let $\lambda_q(x)$ be the distance of the $q$th farthest $x_i$ from $x$. Let $W$ be the tricube weight function:

$$W(u) = \begin{cases} (1 - u^3)^3, & 0 \le u < 1, \\ 0, & u \ge 1. \end{cases}$$

The neighborhood weight for any $x_i$ is

$$v_i(x) = W\!\left(\frac{|x_i - x|}{\lambda_q(x)}\right).$$

The next step is to fit a polynomial of degree $d$ to the data with weight $v_i(x)$ at $(x_i, y_i)$. The value of the locally fitted polynomial at $x$ is $g(x)$.
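The loess evaluation described above can be illustrated with a minimal sketch for the case of a local linear fit (degree d = 1) and q ≤ n. The function names `tricube` and `loess_at` are hypothetical, the x values are assumed to be distinct, and this is not the implementation used inside any STL library.

```python
def tricube(u):
    """Tricube weight: (1 - u^3)^3 on [0, 1), zero elsewhere."""
    return (1.0 - u ** 3) ** 3 if 0.0 <= u < 1.0 else 0.0


def loess_at(x, xs, ys, q):
    """Evaluate a degree-1 loess fit g(x) from the q nearest points (q <= n)."""
    if not 1 < q <= len(xs):
        raise ValueError("q must satisfy 1 < q <= n")
    dists = sorted(abs(xi - x) for xi in xs)
    lam = dists[q - 1]  # distance of the q-th farthest selected x_i from x
    w = [tricube(abs(xi - x) / lam) for xi in xs]
    sw = sum(w)
    if sw == 0.0:
        raise ValueError("all neighborhood weights are zero; increase q")
    # Weighted least squares for y = a + b * (x_i - x); g(x) is the intercept a.
    mz = sum(wi * (xi - x) for wi, xi in zip(w, xs)) / sw
    my = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    szz = sum(wi * (xi - x - mz) ** 2 for wi, xi in zip(w, xs))
    szy = sum(wi * (xi - x - mz) * (yi - my) for wi, xi, yi in zip(w, xs, ys))
    slope = szy / szz if szz > 0 else 0.0
    return my - slope * mz


xs = list(range(10))
ys = [2 * xi + 1 for xi in xs]  # exactly linear toy data
print(loess_at(4.5, xs, ys, q=5))
```

On exactly linear data a local linear fit recovers the line, so the printed value is close to 2 × 4.5 + 1 = 10; note that x = 4.5 is not one of the sample points, which illustrates that loess is defined everywhere.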
STL consists of two procedures, an inner loop and an outer loop, with the inner loop nested inside the outer loop. The inner loop is used to update the seasonal component and the trend component; the $k$th pass proceeds as follows:
(1) Detrending. Calculate a detrended series by subtracting the trend series from the original series: $X_v - T_v^{(k)}$.
(2) Cycle-subseries smoothing. Each cycle-subseries of the detrended series obtained from step (1) is smoothed by loess, and the result is the preliminary seasonal series $C_t^{(k+1)}$, consisting of $N + 2 \times \text{frequency}$ values that range from $v = -\text{frequency} + 1$ to $N + \text{frequency}$, in which $N$ is the length of the data.
(3) Low-pass filtering of smoothed cycle-subseries. The preliminary seasonal series of step (2) is processed by a low-pass filter consisting of moving average of length frequency and the remainder trend series L

LSTM model
The RNN is designed to process sequence information in tasks such as speech recognition and machine translation. Its main disadvantage is that it struggles to learn long-term dependencies, because gradients vanish as they are propagated through many time steps. The LSTM was introduced to fix this problem. Different from the RNN, the LSTM proposed by Hochreiter and Schmidhuber 30 adds a forget gate, an input gate, and an output gate that control which information is discarded from, written to, and read out of the cell state. The structure of LSTM is shown in Fig. 1. The first step of LSTM is to determine what information in the cell state needs to be discarded, which is handled by the forget gate using a sigmoid function:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),$$

where $x_t$ is the input of the current step, $h_{t-1}$ is the hidden state of the previous step, $W_f$ is the learnable weight of the forget gate, $b_f$ is the learnable bias, $\sigma$ is the sigmoid function, and $f_t$ is the output of the forget gate. The input gate determines what new information obtained from the current input $x_t$ can be saved in the current cell state $C_t$. This process has three steps. First, determine what information to update from the current input $x_t$ using a sigmoid function. Then, obtain a candidate vector $\tilde{C}_t$ using a tanh function. Finally, update the current cell state $C_t$ from the cell state of the previous step $C_{t-1}$ and the candidate vector $\tilde{C}_t$. The operations are as follows:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C),$$

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t.$$

The output gate determines the output state of the current step:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o),$$

$$h_t = o_t * \tanh(C_t),$$

where $*$ denotes element-wise multiplication.
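The gate operations described above can be traced in a minimal pure-Python sketch of one LSTM time step. It assumes the common formulation in which each gate applies its weights to the concatenation of the previous hidden state and the current input; the helper names (`affine`, `lstm_step`) and the all-zero toy weights are hypothetical and chosen only so the result is easy to verify by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W @ v + b for a list-of-rows matrix W."""
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi
            for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'c', 'o' to (W, b) pairs."""
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(*params["f"], z)]        # forget gate
    i = [sigmoid(a) for a in affine(*params["i"], z)]        # input gate
    c_hat = [math.tanh(a) for a in affine(*params["c"], z)]  # candidate vector
    o = [sigmoid(a) for a in affine(*params["o"], z)]        # output gate
    # New cell state: keep part of the old state, add part of the candidate.
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_hat)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t


hidden, n_in = 2, 1
zero_wb = ([[0.0] * (hidden + n_in) for _ in range(hidden)], [0.0] * hidden)
params = {name: zero_wb for name in "fico"}  # toy weights, not trained values
h, c = lstm_step([0.3], [0.0, 0.0], [1.0, -2.0], params)
print(h, c)
```

With all-zero weights every gate outputs 0.5 and the candidate vector is zero, so the new cell state is half the previous one; this makes the role of the forget gate visible without any training.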

Bi-LSTM model
Bi-LSTM is based on LSTM and combines the information of the input sequence in both the forward and backward directions, which can better capture two-way dependence in the sequence. The structure of Bi-LSTM is shown in Fig. 2. A single LSTM layer is limited in that, at each step, it can draw only on the earlier part of the input sequence and not on the later part. Bi-LSTM overcomes this limitation: it consists of two separate LSTM hidden layers that process the sequence in opposite directions. Under this structure, both backward and forward information can be utilized in the output layer.
Because Bi-LSTM can exploit context from both directions, which a unidirectional LSTM cannot, we choose the Bi-LSTM network as the temperature prediction network.
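The bidirectional idea can be sketched independently of the LSTM cell itself. In the toy example below, a scalar tanh recurrence stands in for the LSTM layer (a deliberate simplification), and the function names and weight values are hypothetical; the point is only how the forward and backward passes are aligned step by step.

```python
import math


def run_rnn(xs, w_x, w_h):
    """Run a toy scalar tanh recurrence over xs and return all hidden states."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional(xs, fwd=(0.5, 0.1), bwd=(0.3, 0.2)):
    """Pair each step's forward state with the backward state for that step."""
    forward = run_rnn(xs, *fwd)
    backward = run_rnn(xs[::-1], *bwd)[::-1]  # re-align to the original order
    return list(zip(forward, backward))


window = [1.0, 2.0, 3.0]
for step, (f_state, b_state) in enumerate(bidirectional(window)):
    print(step, f_state, b_state)
```

At the first step the forward state depends only on the first value, whereas the backward state has already seen the whole window; this is the additional context that the output layer of a Bi-LSTM can use.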

Proposed Model
To improve the accuracy of temperature prediction, we built a temperature prediction model combining STL and Bi-LSTM, named the parallel STL-Bi-LSTM neural network. As shown in Fig. 3, the proposed model mainly has five steps. First, the original temperature data are decomposed into three components, the trend, seasonal, and remainder components, using the STL method. Then, the original temperature data and the three components are each input into a two-layer Bi-LSTM neural network. Each of the three components passes through its own two-layer Bi-LSTM network and fully connected layer to give one predicted value, and the three predicted values are added to obtain one predicted value for the decomposed data. The original temperature data likewise pass through a two-layer Bi-LSTM network and a fully connected layer to give one predicted value. The two predicted values are merged into a feature containing both. The final predicted value is obtained by combining the two values through a fully connected layer with learnable weights.
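The final merging step can be written out as a small sketch. It assumes the last fully connected layer is a plain linear combination of the two predictions with an optional bias; the function name, the example weights, and the example temperatures are all hypothetical, and in the actual model the weights would be learned during training.

```python
def merge_predictions(trend, seasonal, remainder, original, weights, bias=0.0):
    """Combine the decomposed-branch and original-branch predictions."""
    decomposed = trend + seasonal + remainder  # sum of the three components
    w_dec, w_orig = weights                    # learnable in the real model
    return w_dec * decomposed + w_orig * original + bias


# Illustrative values only (degrees Fahrenheit), not results from the paper.
final = merge_predictions(trend=60.0, seasonal=5.0, remainder=-1.0,
                          original=63.0, weights=(0.5, 0.5))
print(final)  # 63.5
```

Setting the weights to (1, 0) or (0, 1) reduces the model to the decomposition-only or original-only branch, which is why the learned combination can do no worse than either branch on the training data.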

Experiment
The experiment has three parts: first, the original temperature data are processed to input into the proposed model. Second, we determine the optimal parameters of the proposed model based on the performance of different parameters on the same testing data. Third, the proposed model is evaluated on testing data and the prediction results are compared with other prediction models.

Data Preparation
The temperature data need to be processed before being input into the neural network. For the proposed model, the original temperature data and the trend, seasonal, and remainder components need to be converted into input and output pairs for training. The process is as follows. First, decompose the original data using the STL method into three components: the trend, seasonal, and remainder components. Figure 4 shows an example of STL decomposition results. Frequency is the number of observations in each period, or cycle, of the seasonal component. The trend component differs from year to year because the temperature itself differs from year to year. The seasonal component obtained by STL decomposition is the variation in the data at or near the seasonal frequency. 29 When plotted as a connected curve it looks like a continuous waveform, but it still represents discrete, regularly changing data: it captures the regular, periodic part of the otherwise irregularly changing series. Then set time_step, which denotes how many consecutive values are used to predict the next value. Slide a window of length time_step over the data to generate the training set, and take the value immediately following each window, at position time_step + 1, as the corresponding target value. In this way, generate the input and output pairs of the original data and of the trend, seasonal, and remainder components for the proposed model.
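The sliding-window step can be sketched as follows. The function name `make_windows` and the toy series are hypothetical; the same routine would be applied separately to the original series and to each STL component.

```python
def make_windows(series, time_step):
    """Split a series into (input window, next value) training pairs."""
    if time_step < 1 or time_step >= len(series):
        raise ValueError("time_step must be in [1, len(series) - 1]")
    inputs, targets = [], []
    for start in range(len(series) - time_step):
        inputs.append(series[start:start + time_step])  # time_step values
        targets.append(series[start + time_step])       # the value that follows
    return inputs, targets


toy_series = [10, 11, 12, 13, 14]  # illustrative values only
X, y = make_windows(toy_series, time_step=3)
print(X)  # [[10, 11, 12], [11, 12, 13]]
print(y)  # [13, 14]
```

A series of length N yields N − time_step pairs, so a larger time_step gives each sample more context but slightly fewer samples.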

Evaluation Metrics
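To quantitatively evaluate the performance of the proposed model in temperature prediction, we used the root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination $R^2$ for evaluation and comparison. The evaluation indices are defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - y'_i)^2},$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - y'_i|,$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - y'_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2},$$

where $y_i$ is the true value, $y'_i$ is the predicted value, $\bar{y}$ is the mean of the true values, and $n$ is the number of data in the testing set.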
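The three metrics can be computed with a few lines of standard-library Python. The sketch below assumes the standard definitions of RMSE, MAE, and the coefficient of determination; the function names and the toy values are illustrative and are not taken from the experiments.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]  # toy values for illustration
y_pred = [1.0, 2.0, 3.0, 6.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

In this toy case a single error of 2 gives an RMSE of 1.0 but an MAE of only 0.5, which shows how RMSE penalizes large individual errors more heavily than MAE.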

Parameters Setting
The proposed model consists of four parallel two-layer Bi-LSTM networks and fully connected layers. The units of the two-layer Bi-LSTM are set to 50. The activation function is tanh. To avoid overfitting, the dropout rate is set to 0.2. We use the Adam optimizer with a learning rate of 0.0001. Our model and the compared models are trained on a GeForce RTX 2070 Super using Python 3.6.1 and Keras 2.2.4.
To determine the optimal parameters of the proposed model, the model was trained with different parameter settings. We evaluate each setting by its RMSE and MAE and keep the best-performing model.
For the frequency of STL decomposition, we decomposed temperature data using different frequencies to test the forecasting result. Tables 1 and 2 show RMSE and MAE on part of testing set with different decomposition frequencies. When decomposition frequency is 2, the proposed model has the best prediction result. We think that a small decomposition frequency strengthens the learning of features at adjacent time points, and as the decomposition frequency increases, the effect weakens.
At the same time, we also test the influence of different time_step and batch_size values on the prediction results. The time_step is set to 30 and the batch_size is set to 32.
The decomposition frequency and the time_step are not parameters of the network, but both affect the input of the network. The final results also show that a frequency of 2 gives the best training effect and that a small time_step may not give a good prediction result. A large time_step gives an outstanding prediction result but increases the training time, so a moderate time_step is required. As for the parameters of the network, after these two parameters are determined, the optimal training results are obtained by adjusting the learning rate and batch_size.

Model Evaluation
The training set and testing set are separated from temperature data of 34 cities. Additional testing data of four regions are added to test the robustness of the proposed model. We tested the model using the testing set and the temperature data of four regions and calculated the average RMSE and MAE to compare with other prediction models.
The proposed model can be divided into two parts according to the input: the part that takes the STL decomposition components as input (part 1) and the part that takes the original temperature data as input (part 2). To show that the proposed model combines the advantages of the two parts to improve the prediction, we used the same parameters to train one model using only the STL decomposition components as input and another using only the original data as input. Tables 3 and 4 show that the proposed model achieves good prediction results for the temperature in different regions and outperforms the other two comparative models. Compared with the model that uses only the original temperature data, the proposed model greatly improves the prediction accuracy. At the same time, the high accuracy of the model that uses only the STL decomposition components also shows that incorporating the time series decomposition method can improve the accuracy of the model.
The proposed model was compared with other time series prediction models that combine time series decomposition methods with the LSTM neural network. At the same time, the proposed model was compared with the linear regression (LR) model and the nonlinear regression model SVM, in which SVM uses either the temperature data alone or the temperature data combined with the STL decomposition data as input. Finally, we compare our model with the Bi-LSTM model used by Liang et al. for atmospheric temperature prediction. 31 The comparison results are shown in Tables 5-7. The average RMSE and MAE are 0.11 and 0.09 for the proposed model, 0.35 and 0.27 for STL-LSTM, and 2.73 and 2.07 for EMD-LSTM. The RMSE and MAE of the proposed model are significantly lower than those of the comparison networks STL-LSTM and EMD-LSTM, indicating that the proposed model improves the accuracy of temperature prediction. The comparison results also show that the neural network-based methods outperform the regression-based methods in temperature prediction. Table 7 shows that the proposed model has the highest coefficient of determination, indicating that the proposed model has outstanding performance and successfully realizes city temperature prediction for China. Figure 5 shows the predicted and actual temperature of Jixi using the proposed model, which also indicates that the proposed model can obtain outstanding prediction results.

Conclusion
This paper proposed a parallel STL-Bi-LSTM model, which combines STL and the Bi-LSTM model to predict the daily average temperature of cities. Experiments showed that combining the prediction model with STL can greatly improve the prediction accuracy. Moreover, the prediction error of the proposed model is smaller than that of STL-LSTM and EMD-LSTM, which indicates that the proposed model is well suited to temperature prediction. The prediction model proposed in this paper can theoretically be used for temperature prediction in other countries, provided that sufficient temperature data of the country are used for network training and the optimal network weights are obtained. In this study, we predicted temperature using only the decomposition components and the original temperature series; in future studies, more influencing factors will be incorporated into the model to improve the prediction accuracy.
Lingling Ma received her PhD from the Institute of Remote Sensing Applications, Chinese Academy of Sciences, Beijing, China, in 2008. She is currently a professor at the Aerospace Information Research Institute, Chinese Academy of Sciences. Her research interests include the calibration, validation, and quality assurance in remote sensing.
Bohui Tang received his PhD in cartography and geographical information system from the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, in 2007. He is currently a professor at the Kunming University of Science and Technology, Kunming. His research mainly includes the retrieval and validation of surface net radiation and surface temperature.
Ronglin Tang received his PhD in cartography and geographical information system from the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China, in 2011. He is currently a professor at the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences. His research interests include the remote sensing retrieval and validation of land surface evapotranspiration and soil moisture.
Kun Shao is an associate professor at the School of software, Hefei University of Technology. His research interests include software modeling and development, software requirements analysis and modeling, graphics and image processing, etc.
Xinhong Wang received his PhD in cartography and geography information system from the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences (CAS), in 2008. He is an associate professor at the Aerospace Information Research Institute, CAS. His research interests include performance evaluation of remote sensors and quantitative infrared remote sensing.