Machine learning approach to locate desert locust breeding areas based on ESA CCI soil moisture

Abstract. Desert locusts have attacked crops since antiquity. To prevent or mitigate their effects on local communities, it is necessary to locate their breeding areas precisely. Previous works have relied on precipitation and vegetation index datasets obtained by satellite remote sensing. However, these products have limitations in arid and semiarid environments. We explored an alternative parameter, soil moisture (SM), and examined its influence on wingless desert locust juveniles (hoppers). We used two machine learning algorithms, a generalized linear model (GLM) and random forest (RF), to evaluate the link between hopper presences and SM conditions under different time scenarios. RF obtained the best model performance, with very good validation results according to the true skill statistic and receiver operating characteristic curve statistics. An area was found to become suitable for breeding when the minimum SM values stay above 0.07 m³/m³ for 6 days or more. These results demonstrate that breeding areas in Mauritania can be identified by means of SM, and that the ESA CCI SM product is suitable to complement or substitute current monitoring techniques based on precipitation datasets.

to complete their life cycle. 5 During this final stage, the locusts are very mobile and can travel great distances. 6 Like other species in the animal kingdom, desert locusts exhibit phase polyphenism, which implies drastic changes when population density increases, in either the adult or the nymph stage. 2,7,8 Even though behavioral gregarization may occur within hours, 9 it takes several generations to fully display gregarious characters. 10 The phase transition induces physiological changes in lifespan, metabolism, immune responses, and reproductive physiology. 11,12 In their solitarious phase, locusts are generally bigger 10 and they present higher fecundity and smaller eggs. 13 Solitarious desert locust populations are usually confined to the recession areas, where annual rainfall is <200 mm. 14 However, they are able to rapidly increase their numbers when suitable conditions are met. 4 These insects are very well adapted to arid environments with erratic but sometimes high-intensity precipitation episodes. 15
Some environmental events, such as green vegetation blooms or rainfall, are closely linked to desert locust development, triggering and enhancing outbreaks. 16,17 Temperature variability has also been shown to affect some Schistocerca species, as described in Ref. 18, which indicated that the frequency of locust outbreaks may be altered by changes in climatic patterns. Among the many environmental factors that may affect locusts, SM is the variable that most influences egg-laying location, egg survival, and egg-hatching rate, 19 in addition to temperature. 20 In general, female locusts prefer open and warm sites with dry, soft, sandy soils in which the layers deeper than 6 cm are sufficiently moist. 3,21 Successful breeding conditions are usually triggered by rainfall, which provides enough moisture to the soil for egg laying, development, and hatching, 16 as well as adequate vegetation for the hoppers to feed on. 6,14 The success of preventive measures is limited by the inaccessibility of some important breeding areas. 5 Within the recession area, some breeding areas are seasonal, and a lack of rain may leave some of them uninfested in a particular year. Thus, even though breeding areas are confined to the recession area, they may vary according to the ecological conditions. 5
Some authors have proposed the use of remote sensing platforms to monitor large and inaccessible locust breeding areas, 16,22-27 which usually occur away from crops. 28 Remotely sensed vegetation and precipitation are used to derive potential grasshopper and locust habitats 22 by means of satellite platforms such as LANDSAT, NOAA, Meteosat, SPOT, TERRA, or AQUA. 29 International organizations such as the Desert Locust Information Service (DLIS) of the FAO have been using Earth observation methods since the 1980s to assess favorable environmental conditions for the desert locust. 29 However, monitoring arid environments has some limitations. The vegetation is usually sparse and geomorphological features are not always well identified. 30,31 The normalized difference vegetation index (NDVI) is a proxy for vegetation presence 32 and it has been widely used to assess suitable environmental conditions for desert locust. 31 Nevertheless, this index is highly sensitive to soil background noise. 33 In NDVI values, sparse vegetation often cannot be distinguished from bare soil, because bare soils often have similar spectral characteristics in the red and near-infrared. 34 Furthermore, the vegetation is drought tolerant due to adaptive mechanisms such as canopy architecture, leaf structure, and leaf angle.
Another common proxy to identify suitable conditions for desert locust is precipitation. 35 By means of remote sensing, rainfall detection probabilities in arid and semiarid regions may range from 70% down to 20%, with a high overestimation of rainfall occurrences. 36 Currently, there is an ongoing initiative of the European Space Agency (ESA), "Soil Moisture for dEsert Locust earLy Survey (SMELLS)," to derive SM for forecasting purposes. It proposes dividing each month into three 10-day periods (dekads) to provide averaged surface SM derived from daily estimates. According to this initiative, the relevant SM range for locust monitoring lies between 0.10 and 0.20 m³/m³. Satellite SM estimates stand out as a very useful tool to overcome the high uncertainty of precipitation in arid and semiarid areas, improving the probability of locust prediction. 37 Although this approach is very promising, very few studies have addressed the link between remotely sensed SM and desert locusts. 19 Traditional SM measurements are ground based, so survey areas are usually limited because the activity is expensive and time consuming. 38,39 Laboratory and ground-based experiments have demonstrated that SM governs egg development and its interruption under particular humidity conditions. 40 According to the same authors, eggs may remain viable in an arrested state for as long as 1.5 months and then hatch after returning to wet sand. In addition, locust densities are associated with relatively high moisture availability. 41 These studies indicate that SM is a good proxy to identify desert locust presence, and that it can substitute rainfall products. 42
Species distribution models (SDMs) are numerical tools to analyze the link between species occurrences and environmental factors. They provide ecological insight to predict species distribution over space or time given certain environmental characteristics. 43 Machine learning methods improve on traditional predictive performance and can incorporate complex interactions among variables, 44 which makes them suitable for large ecological datasets. 45 The random forest (RF) 46 and the generalized linear model (GLM) 47 are two commonly used machine learning algorithms to model species distributions. RF has been available for almost 20 years, and it performs very well in ecological predictions. 48 GLMs are mathematical extensions of linear models that do not force data into unnatural scales, and thereby allow for nonlinearity and nonconstant variance structures in the data. 49 They have also been used to analyze ecological relationships given their flexibility in comparison to classical Gaussian distributions. 50
The aim of this study is to identify suitable SM conditions for desert locust eggs and hoppers in the solitarious phase. It is based on SM estimates from satellite remote sensing imagery and ground-based observations of desert locust hoppers. We used SDMs to better understand the link between SM and desert locusts and to predict their likely distribution across landscapes and breeding areas. The study area is Mauritania and the survey period runs from 1985 to 2015.

Study Area
The study site is Mauritania, which is located in the Maghreb region of Western Africa (Fig. 1). We chose this study area because it is one of the major breeding and recession regions for desert locust. 51 Mauritania is a vast country of 1,030,700 km² with large arid plains and only one permanent watercourse, the Senegal River.
According to the Köppen classification, 52 two climate types are present: hot desert climate "BWh" and hot semiarid climate "BSh." BWh predominates in most of the country and spatially coincides with part of the Sahara Desert (north) and the Sahelian belt (south). Rainfall is scarce but intense, generally <150 mm/year on average (Fig. 2). BSh covers the southernmost strip, where average rainfall is higher than 200 mm/year and "day-night" temperatures are cooler and fluctuate less.

Survey Data
Schistocerca WARning and Management System (SWARMS) is a database used by the Desert Locust Information Service (DLIS) at FAO for global desert locust monitoring and early warning. It compiles desert locust data collected since 1985 by the national survey and control teams of affected countries. It geolocates field observations on a daily basis, although some uncertainties may be expected. 26,53 For this study, we selected hoppers in the solitarious phase as the target population for two reasons: the solitarious phase accounts for nonrestricting conditions, and the hopper stage (wingless nymph) may have lower mobility than adults due to the lack of wings. There were 12,027 solitarious hopper sightings for the time span 1985 to 2015, spatially distributed as seen in Fig. 1. Even though the database contains absence records, we did not consider them for two reasons. First, during recession periods, individuals are mostly in the solitarious phase and often go unnoticed by survey teams. 54 Second, the number of absence records is very low, which causes an imbalance between the presence and absence samples.

Satellite Data
The ESA CCI SM v03.2 is a multidecadal, global, satellite-observed SM dataset generated via the Climate Change Initiative (CCI) of the ESA. It combines various single active and passive sensors into three harmonized products: a merged active product, a merged passive product, and a product merged from both active and passive sensors. Based on the existing literature, these merged products generally outperform the single-sensor input products. 55 For this study, we used the merged active and passive product because it is the most complete. It uses the pixel from either the active or the passive source, or the average of both, depending on the vegetation optical depth from the Advanced Microwave Scanning Radiometer for EOS (AMSR-E) C-band observations. 56 The combination of radar (active) and radiometer (passive) observations provides information about the volumetric surface SM (up to 5 cm depth), expressed in m³/m³. Its spatial resolution is 0.25 deg and it offers daily worldwide coverage from 1978 to 2015. 55,57,58 This product comprises active data retrieved from C-band scatterometers on board the ERS-1, ERS-2, MetOp-A, and MetOp-B satellites (generated by TU Wien) and passive data obtained from microwave observations by the following sensors: Nimbus 7 SMMR, DMSP SSM/I, TRMM TMI, Aqua AMSR-E, Coriolis WindSat, GCOM-W1 AMSR2, and SMOS (generated by VU University Amsterdam in collaboration with NASA) (Table 1). This product has been validated against ground-based reference measurements and alternative estimates from other projects and sensors. 55,57 In general, the ESA CCI SM dataset provides good estimates of SM with respect to land surface models and in situ observations. Nevertheless, it presents some uncertainties under particular surface conditions such as dense vegetation or organic soils, 55 neither of which is the case in our study area.

Methods
The ESA CCI SM v03.2 product was used to compare geographically, month by month, the seasonal presence of solitarious desert locust hoppers with SM values from 1985 to 2015. Breeding areas in Mauritania vary widely throughout the year according to the National Centre for Prevention and Control of Desert Locusts in Mauritania (CNLA). During the summer months, desert locusts usually breed in the southern parts of the country, whereas breeding occurs in the center and the northwest from September to December, and in the northern areas of Mauritania from December to May. 59 It is widely accepted that these insects migrate regionally following certain environmental conditions. 60 We extracted the coordinates and the corresponding date of each solitarious hopper record from the SWARMS database. Even though the database does have some absence records, we did not use them because they are heavily outnumbered by the presences. In addition, those records can also be considered "pseudoabsences" because hoppers in the solitarious phase may go unnoticed at low densities. 26 Thus, we found it convenient to randomly generate a grid of "pseudoabsences," as reported in other studies using SDMs. 61,62 Pseudoabsence samples were computed based on two principles. First, they were located within a mask with a maximum radius of 50 km around every desert locust presence (1985 to 2015), aiming to select areas with environmental and geophysical potential and to reduce geographical bias. We chose this distance because it visually matches the density map (Fig. 1), so that most of the areas with no presences are masked out; otherwise, such areas could misguide the SDM predictions. 63 Second, dates were allocated by uniform random sampling in R.
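The two pseudoabsence principles above (a 50-km mask around presences and uniformly random dates) can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study, which is not shown in the text; the function names, the rejection-sampling strategy, the bounding-box padding, and the example coordinates are all assumptions made for the sketch.

```python
import math
import random
from datetime import date, timedelta

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sample_pseudoabsences(presences, n, first_day, last_day, radius_km=50.0, seed=0):
    """Draw n (lat, lon, date) pseudoabsences (hypothetical helper).

    Locations are drawn by rejection sampling inside the padded bounding box
    of the presences and kept only if they fall within radius_km of at least
    one presence. Dates are uniform between first_day and last_day.
    Note: uniform sampling in latitude/longitude is not strictly equal-area.
    """
    rng = random.Random(seed)
    lats = [p[0] for p in presences]
    lons = [p[1] for p in presences]
    pad = 1.0  # degrees; assumed padding, more than 50 km at Mauritanian latitudes
    span_days = (last_day - first_day).days
    points = []
    while len(points) < n:
        lat = rng.uniform(min(lats) - pad, max(lats) + pad)
        lon = rng.uniform(min(lons) - pad, max(lons) + pad)
        if any(haversine_km(lat, lon, plat, plon) <= radius_km for plat, plon in presences):
            day = first_day + timedelta(days=rng.randint(0, span_days))
            points.append((lat, lon, day))
    return points


# Made-up presence coordinates, for illustration only.
example_presences = [(18.0, -15.9), (20.5, -13.0)]
example_points = sample_pseudoabsences(
    example_presences, 5, date(1985, 1, 1), date(2015, 12, 31)
)
```

Rejection sampling is simple but slow when the mask covers a small share of the bounding box; a production workflow would more likely sample directly from a rasterized mask.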
Each pseudoabsence location was assigned a date between the first and the last hopper presence dates of the SWARMS database (1985 to 2015). These pseudoabsence points were generated randomly and weighted equally to the presences (the weighted sums of pseudoabsences and presences are equal) for predicting species occurrences or distribution. 64 Some presences and pseudoabsences may coincide geographically within the same pixel; however, it is very unlikely that they share the same assigned date. Because each pseudoabsence date was randomly allocated between 1985 and 2015, they will likely not have the same SM values. The duration of locust life cycles is variable, depending on the environmental conditions of the habitat; 65 nevertheless, we rely on the following premises to create the variables in our study. Eggs are laid at 5 to 10 cm depth, and the egg incubation period may range from 10 to 65 days. 4 After hatching, the nymph phase may last until 24 to 95 days after the egg was laid. Thus, under the most severe environmental circumstances, the maximum expected egg-hopper development time would be 95 days. 5 The SWARMS database registers the sighting date and phase but not the age of each individual, so we set the analysis period to the 95 days prior to the sighting record. Figure 3 shows the sequence of the proposed method as a flow chart.
Table 1 List of satellite platforms, onboard sensors to measure SM at specific frequency, producer of the product, and time availability of each single product. 55
Given the coordinates of each presence and pseudoabsence record, the corresponding daily SM values were extracted from the sighting or assigned date back to 95 days earlier. Based on these antecedent SM conditions, we generated variables by dividing the analysis period into time intervals of different lengths (16, 12, 8, and 6 days) and assessed the performance of the model with each of them. With this method, we aim to cover and differentiate critical events in the locust life cycle, such as egg laying, egg hatching, and the early stages of the nymph phase, as well as to deal with occasional missing data (Fig. 3). Some areas of the SM imagery had missing data due to the revisit times of the satellites used to generate ESA CCI SM v03.2. We computed the minimum, mean, and maximum SM values within each time interval to obtain a representative value for that period. Then, we assessed which descriptive statistic gives the model the best performance. If no value was found for a particular time interval, the presence or pseudoabsence record was not included in the model. In this way, we mitigate the effect that missing information could have on the model results. Even though SM may vary greatly on a daily basis, 66 egg and hopper development takes some days to be altered, 5 so we found this approach convenient to generate the model variables.
Therefore, we studied four different scenarios: A, B, C, and D. As previously mentioned, we first extracted daily SM values up to 95 days before the presence or pseudoabsence date. Each scenario divides this period into intervals of a different length: A = 16 days, B = 12 days, C = 8 days, and D = 6 days. We aimed to obtain one representative SM value for each time interval within each scenario. To obtain this representative value, we computed the minimum, mean, and maximum of the daily SM values contained in every time interval.
Fig. 3 Flowchart of the proposed methodology to study the link of ESA CCI SM with desert locusts using machine learning approach.
Figure 4 shows the variable creation for each scenario (A, B, C, and D) based on SM and the presence and pseudoabsence dates. For instance, scenario A uses equal time intervals of 16 days, so that SM1 is the SM value on the local pixel between −95 and −80 days (both included) before the presence or pseudoabsence date, SM2 is the SM value on the local pixel between −79 and −64 days before that date, and so on, as detailed in Fig. 4. The time interval for scenario A is 16 days, which generates 6 variables; 12 days for B, with 8 variables; 8 days for C, with 12 variables; and 6 days for D, with 16 variables. Time zero (t = 0) corresponds to the presence or pseudoabsence date. Within each scenario, three alternatives are tested independently (minimum, mean, and maximum SM value within the given time interval).
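The variable creation described above can be illustrated with a short sketch. It assumes that each record comes with a series of 96 daily SM values covering days −95 to 0 (both included), which is what the interval counts imply (6 × 16 = 8 × 12 = 12 × 8 = 16 × 6 = 96), and that missing days are stored as None. The function and variable names are hypothetical and the code is written in Python, not taken from the study.

```python
SCENARIOS = {"A": 16, "B": 12, "C": 8, "D": 6}  # interval length in days
N_DAYS = 96  # days -95 ... 0, both included


def window_features(daily_sm, interval_days, stat="min"):
    """Return [SM1, SM2, ...] for one record, or None if the record is dropped.

    daily_sm: list of 96 values, index 0 = day -95, index 95 = day 0;
              None marks a day without satellite data.
    stat:     "min", "mean", or "max" within each interval.
    """
    if len(daily_sm) != N_DAYS:
        raise ValueError("expected 96 daily values (days -95 to 0)")
    if N_DAYS % interval_days != 0:
        raise ValueError("interval length must divide 96")
    reducers = {
        "min": min,
        "max": max,
        "mean": lambda values: sum(values) / len(values),
    }
    reduce = reducers[stat]
    features = []
    for start in range(0, N_DAYS, interval_days):
        valid = [v for v in daily_sm[start:start + interval_days] if v is not None]
        if not valid:
            # No SM value in this interval: the record is excluded from the model.
            return None
        features.append(reduce(valid))
    return features


# Made-up series: a wet first week followed by dry conditions.
series = [0.10] * 6 + [0.05] * 90
sm_d_min = window_features(series, SCENARIOS["D"], "min")   # 16 variables
sm_a_mean = window_features(series, SCENARIOS["A"], "mean")  # 6 variables
```

Taking the statistic over the valid days only is what lets an interval tolerate a few missing days, while an interval with no data at all removes the record.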
Some publications suggest that machine-learning (ML) approaches are suitable to model species distributions, since they may perform better than traditional regression-based algorithms. 44 In this study, we used the BIOMOD2 tool 67 implemented for R software. 68 We tested two different ML modeling techniques to describe and model the link between desert locust and SM: GLM 47 and RF. 46 GLM is a very popular modeling approach that has been widely used to model and predict habitats and species distribution. 69,70 The formula object was set to "quadratic" (default) and the information criterion for the stepwise selection procedure was the Akaike information criterion. The GLM approach implemented in BIOMOD2 only runs on presence-absence data, so the binomial distribution family was used. The RF algorithm is a flexible and easy-to-use ML approach that has been demonstrated to have good predictive performance in ecology and species distribution. 48 It can be used for both classification and regression problems. The most important tuning parameters are "mtry" (number of variables randomly selected at each split of the tree as it grows) and "ntree" (number of trees). We set these two parameters to their default values: "ntree" = 500 71,72 and "mtry" (in classification) = the square root of the number of variables. 73 The minimum size of terminal nodes "NodeSize" and the maximum number of terminal nodes "MaxNodes" were also left at their default values, which are five and null, respectively. 74 Despite the widespread use of some statistics to assess model performance, there is still an ongoing debate about their use. 75,76 We selected three broadly used evaluation methods for cross-comparisons: receiver operating characteristic "ROC," 77 Cohen's Kappa "KAPPA," 78 and true skill statistic "TSS." 75 The ROC evaluation method uses the area under the curve (AUC) to discriminate between events and nonevents. 
Its score ranges from 0 (worst score) to 1 (perfect score), and values of 0.5 or lower indicate a prediction no better than random chance. 79 The KAPPA statistic is one of the most widely used measures of model performance on presence-absence predictions, and it indicates the accuracy of the forecast relative to random chance. It ranges from −1 (worst score) to 1 (perfect score), where values of 0 or lower indicate no predictive skill. Although these evaluation procedures could be used independently, it is recommended to use several of them to assess the accuracy of statistical models. Model prediction accuracy was classified according to the index in Table 2.
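For reference, the threshold-dependent statistics can be computed from a 2 × 2 confusion matrix using their standard definitions: TSS = sensitivity + specificity − 1, and Cohen's Kappa = (p_o − p_e)/(1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below is in Python with a hypothetical function name, and the counts in the usage line are invented for illustration; they are not results from this study.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, TSS, and Cohen's Kappa from a 2x2 confusion matrix.

    tp: presences predicted as presences      fn: presences predicted as absences
    fp: pseudoabsences predicted as presences tn: pseudoabsences predicted as absences
    """
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    tss = sensitivity + specificity - 1.0
    observed = (tp + tn) / n
    # Chance agreement from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (observed - expected) / (1.0 - expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "tss": tss,
        "kappa": kappa,
    }


# Invented counts for illustration only.
example = confusion_metrics(tp=87, fn=13, fp=13, tn=87)
```

With equal numbers of presences and pseudoabsences (prevalence 0.5) and symmetric errors, TSS and Kappa coincide, which is one reason the two statistics tend to track each other closely in balanced designs.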
The BIOMOD2 package allows the user to randomly split the original dataset into two subsets: 70% of the data to calibrate the models and 30% to validate the predictions. Once the best scenario and variables were found, we repeated the process five times with the best performing algorithm to obtain a robust test of the model, where each replicate uses a unique random 70%/30% split of the data. 67 Presences and pseudoabsences were set to have the same importance in the calibration process, with a prevalence value of 0.5. The most effective SDMs require data on both species presence and the environmental conditions available at random locations in the area where no presences were reported (known as pseudoabsence data). 64 Based on the model results, the best performing algorithm with the best scenario and representative SM statistic was selected. Then, we applied an optimization process to ensure that the selected algorithm delivers the best possible performance. 80 We tuned the algorithm hyperparameters to find their best combination in terms of predictive performance and finally compared the results objectively. The best tuning parameters were chosen to run the final model.
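The repeated 70%/30% calibration and validation split can be sketched in a few lines. This is a generic Python illustration of the idea with hypothetical names, not the internal BIOMOD2 implementation.

```python
import random


def repeated_splits(n_records, n_repeats=5, calib_fraction=0.7, seed=0):
    """Yield (calibration_indices, validation_indices) for each replicate.

    Every replicate reshuffles all record indices, so each one uses its own
    random 70%/30% partition of the same dataset.
    """
    rng = random.Random(seed)
    n_calib = round(n_records * calib_fraction)
    for _ in range(n_repeats):
        indices = list(range(n_records))
        rng.shuffle(indices)
        yield indices[:n_calib], indices[n_calib:]


# Example with 100 hypothetical records.
for calib, valid in repeated_splits(100):
    pass  # fit the model on `calib`, score it on `valid`, then average the scores
```

Averaging the evaluation scores over the replicates reduces the dependence of the reported performance on any single random partition.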
We used response curves, which are independent of the SDM algorithm used, to assess the prediction of the model. The response curves allow comparing the probability of presence based on ROC, TSS, and Kappa metrics with the variables used in the model. They facilitate the interpretation of relationships between environmental variables and predicted species responses, even though these may not be apparent from the outputs of the model. 81 The contribution of each variable to the final model was also analyzed: the higher the value, the more influential the variable is in the model, and a value of 0 means no influence at all.
The aim is to evaluate desert locust presence probabilities to locate potential breeding areas, based on remotely sensed SM conditions.

Results
SM monthly averages (Figs. 5 and 6) suggest a spatial correlation with the usual breeding areas, with high SM values in the south in July, August, September, and October, whereas higher values are found in the north and northeastern parts of Mauritania during December, January, and February. In general, autumn breeding sites (blue dots in Fig. 6) do not show a visual correlation with the monthly mean SM values. However, the statistical analysis was not done on a monthly basis but as detailed in Fig. 4.
GLM and RF algorithms were run with SM variables built on various time intervals (16, 12, 8, and 6 days) and their maximum, minimum, or mean SM values (Tables 3 and 4). Based on the ROC, TSS, and KAPPA statistics, we obtained performance scores from an independent test dataset. The results showed that RF obtained the best performance for our study, whereas GLM performed far behind. The highest scores were obtained when the time interval was 6 days (scenario D) and the representative SM value was the minimum within the time interval. According to Table 2, the RF algorithm obtained a high or very good performance with respect to ROC-AUC, with 0.95, and a good performance for the Kappa and TSS statistics, with 0.75. The sensitivity and specificity were over 87%. Slightly lower values were found when using the maximum or mean SM values in scenario D, demonstrating the suitability of the 6-day interval to build the SM variables of the model. Scenario A (16 days) obtained the worst model performance when using mean SM values as representative of the given interval. Nevertheless, this scenario still obtained a fair performance of 0.6 for the TSS and Kappa statistics, and ROC-AUC = 0.90, when using the minimum SM value in each interval. Model performance increases as the time interval of the variables gets shorter and when the representative SM value is the minimum for the period. Therefore, we suggest using minimum SM values over 6-day periods to link solitarious hopper presences with SM values of the ground.
RF was the best performing algorithm, using scenario D and the minimum SM values obtained in each time interval. We tuned the RF algorithm for its two most important hyperparameters: the number of trees "ntree" (50, 500, 1000, 2000, and 4000) and the number of variables randomly sampled as candidates at each split "mtry" (2, 4, 6, 8, and 10). We optimized the number of trees first and mtry second. As shown in Fig. 7, the default parameters established by BIOMOD2 for RF (ntree = 500 and mtry = 4) obtained the best model performance, although the evaluation metrics did not differ greatly from those of the other tuning options. The poorest performance was obtained with ntree = 50 and mtry = 2 (values lower than the BIOMOD2 defaults). Increasing ntree or mtry did not improve the model results and produced only very small changes in model performance. It is also noticeable that the ROC-AUC metric remains more or less constant across the different attempts, whereas the changes in TSS and KAPPA are slightly larger.
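The two-step tuning order (number of trees first, then mtry) amounts to a one-parameter-at-a-time search. The Python sketch below shows that search with a placeholder scoring function; in practice `evaluate` would fit an RF with the given hyperparameters and return a validation score such as TSS. The function names are hypothetical and the toy score is arbitrary, chosen only to make the example deterministic; it does not reproduce the results in Fig. 7.

```python
import math


def sequential_tuning(evaluate, ntree_grid, mtry_grid, default_mtry):
    """Tune ntree with mtry fixed at its default, then tune mtry with the best ntree."""
    best_ntree = max(ntree_grid, key=lambda ntree: evaluate(ntree, default_mtry))
    best_mtry = max(mtry_grid, key=lambda mtry: evaluate(best_ntree, mtry))
    return best_ntree, best_mtry


def toy_score(ntree, mtry):
    """Arbitrary placeholder score, NOT a model result."""
    return 1.0 - 0.01 * abs(math.log10(ntree / 500)) - 0.005 * abs(mtry - 4)


best = sequential_tuning(
    toy_score,
    ntree_grid=[50, 500, 1000, 2000, 4000],
    mtry_grid=[2, 4, 6, 8, 10],
    default_mtry=4,
)
```

A sequential search needs 5 + 5 model fits instead of the 25 required by a full grid, at the cost of possibly missing interactions between the two hyperparameters.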
Therefore, after the tuning phase, the best algorithm (RF) was set to ntree = 500 and mtry = 4. The best model results were obtained using the variables created with scenario D and the minimum SM reached in each time interval. Finally, we ran RF for five iterations to obtain robust results. Model performance scores are compiled in Table 5.
The metric scores are in accordance with the ones obtained in Table 3 for the same scenario (D) and chosen variables (minimum SM). In general, testing values and sensitivity are slightly lower, whereas ROC-AUC and TSS specificity are somewhat higher. In essence, score values do not differ considerably when running more iterations and averaging their metrics. The impact of SM variables in the final model results (RF, scenario D, and minimum SM) is summarized in Fig. 8.
The most relevant variables for the final model were SM1, SM2, SM3, and SM4, which stand for the minimum SM values obtained between 95 and 90, 89 and 84, 83 and 78, and 77 and 72 days before the sighting record, respectively. Figure 8 shows the greater impact of these variables (mostly over 10%) in comparison with the rest, none of which exceeds 5%. Figure 9 shows the response curves of these four most relevant variables, those with over 5% of importance. The plots suggest some potential thresholds of SM content above which the probability of presence increases. The minimum SM values during SM1, SM2, SM3, and SM4 have a positive influence on hopper occurrences. The range of SM values in which the probability of presence is over 0.5 varies among variables. Presence probabilities tend to level off at around 0.5 when SM values reach 0.15 for SM1, SM2, and SM4, whereas SM3 keeps a high probability beyond that value. Nevertheless, there is a common trend: from about 0.07 m³/m³ upward, the probability of presence 72 to 95 days afterward increases.
Fig. 7 Comparison of different RF results using different tuning parameters, with scenario D and the minimum SM value per interval (best performances in the previous step). The x-axis represents the parameter changes and the y-axis the model performance of each tuning combination according to the ROC, KAPPA, and TSS statistics.

Discussion
It is widely assumed that rainfall over 25 mm in two consecutive months is generally enough for locust breeding and development. 82 Nevertheless, remotely sensed precipitation in arid environments has some limitations, such as high rainfall overestimation due to subcloud evaporation. 83 Aiming to solve the problems associated with remotely sensed precipitation, we analyzed the link between the ESA CCI SM remote sensing product and field surveys of desert locust hoppers from SWARMS-FAO. In addition, we assessed the suitability of this SM product to derive desert locust breeding sites. The importance of SM in egg laying and development has long been known, as has the role of fresh vegetation, which is largely determined by water availability in the soil. 4 SM monthly averages suggest a spatial correlation with summer and winter breeding areas. This coincides with the regional climatic conditions of Mauritania as reported in other works. 59,60 Winter rainfall is usual in the north and summer rainfall in the south of the country. Nevertheless, typical autumn breeding areas do not seem to be accounted for by the monthly SM patterns. In arid environments, there is a direct relationship between rainfall and SM, 84,85 so problems such as subcloud evaporation 83 may be avoided with the applied methodology. Although ESA CCI SM only senses the top 5 cm of the soil and desert locusts usually lay eggs at depths down to 10 cm, this product seems appropriate due to the strong relationship of the top SM with deeper layers. 86
Our analysis reveals the importance of variable creation as a step prior to modeling. We tested different time intervals for the variable creation. In addition, we chose different representative SM values for the given time span (maximum, mean, and minimum) at presence and pseudoabsence sites. The use of pseudoabsences may be controversial in certain fields because it brings some uncertainty into the results. 87 However, their use is generally justified because they provide a set of conditions available in the region that needs to be included in the SDM. 88 The highest performance was achieved by the RF algorithm when dividing the whole analysis period into intervals of 6 days and selecting the minimum SM as the variable value. Even though previous literature 70 has used the GLM with a binomial distribution to identify potential factors that determine species presences or absences, the GLM approach did not perform well in our study. RF performance did not change greatly with hyperparameter values larger than ntree = 500 and mtry = 4 (default values in BIOMOD2 for RF), whereas lower ntree and mtry values performed slightly worse in terms of the TSS and KAPPA metrics. According to Ref. 67, our RF model had an excellent performance based on the ROC-AUC metric, with 0.946, and a good performance for the TSS and Kappa statistics, with 0.740 and 0.738, respectively. The probability of hopper detection (sensitivity) is over 85%, and the model correctly identifies over 86% of the pseudoabsence records (specificity). The variables with the most weight in the model results were SM1, SM2, SM3, and SM4, which cover the period from 95 to 72 days before the sighting record. Locust eggs develop and hatch successfully when there is enough moisture in the soil, 40 whereas insufficient moisture may stop egg development or dry the eggs out. 4 Our results indicate that the minimum SM over at least 6 days should remain higher than 0.07 m³/m³. This value is in accordance with, although slightly lower than, the SM range proposed by Ref. 89, which is between 0.10 and 0.20 m³/m³. Hopper mortality is closely linked to food shortage, 4 which in arid environments is closely linked with inadequate precipitation. 6,41 Thus, remotely sensed SM may also be a good indicator of suitable conditions to infer hopper presences and locate breeding areas.
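As a rough illustration of how the reported threshold could be screened on a single pixel, the sketch below flags a daily SM series that stays above 0.07 m³/m³ for at least 6 consecutive days. This is a deliberately simplified rule of thumb written in Python, not the RF model of this study; the function name is hypothetical, and treating a missing day (None) as a break in the wet spell is an assumption.

```python
def has_wet_spell(daily_sm, threshold=0.07, min_days=6):
    """True if daily_sm contains at least min_days consecutive values above threshold.

    daily_sm: sequence of daily SM values in m3/m3; None marks missing data
              and is treated (conservatively) as interrupting the spell.
    """
    run = 0
    for value in daily_sm:
        if value is not None and value > threshold:
            run += 1
            if run >= min_days:
                return True
        else:
            run = 0
    return False


# Made-up series: 6 wet days in the middle of a dry period.
example_flag = has_wet_spell([0.05] * 10 + [0.08] * 6 + [0.05] * 4)
```

Such a screening only approximates the model behavior, because the RF prediction also depends on when the wet spell occurs relative to the sighting date (the variables covering days 95 to 72 carry most of the weight).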
A good understanding of the geographical relationship between desert locust populations and their potential breeding habitats can improve desert locust survey and control operations. 41 The applied methodology offers very promising results for correctly identifying breeding areas based on 30 years of SM values. The ESA CCI SM dataset is the most complete and consistent global SM data record available. 58 To the best of the authors' knowledge, no previous desert locust analysis has used this SM dataset. Given the acknowledged importance of SM for desert locusts and the length of the ESA CCI SM dataset, our results may represent a significant step toward complementing the locust monitoring techniques currently in use.

Conclusions
This paper aimed to assess the importance of satellite SM products for locating breeding areas of desert locusts in the solitarious phase. Although remote sensing techniques have evolved greatly, very few works have addressed the relationship between SM and desert locusts by earth observation methods. This study is based on the ESA CCI SM product, the most complete and consistent SM dataset available. We have used a machine learning approach to assess the relationship between desert locust presences and antecedent SM conditions and to estimate the accuracy of our model. This study confirmed the robustness of the applied methodology, in which 30 years of locust records and SM values were used to feed the model, although some uncertainty is expected due to the use of pseudoabsence data.
The monthly SM values suggest a spatial correlation with the usual breeding areas in Mauritania. So far, sites suitable for desert locusts have been delimited mainly on the basis of rainfall estimates from satellite remote sensing. However, some studies report that these products strongly overestimate rainfall over dry regions. Therefore, we suggest using the ESA CCI SM product to overcome that problem, either to complement rainfall products or to substitute them in instances of high uncertainty.
Furthermore, we have quantitatively modeled the relationship between hopper presences and SM under different scenarios and variables. The best model performance was obtained by RF when using the minimum SM value within 6-day intervals, for a maximum survey time of 95 days before the sighting date. The validation phase confirmed the suitability of this methodology for identifying hopper presences, with an ROC-AUC of 0.94 and TSS and Kappa of 0.74. The importance of SM thresholds and survey time has also been addressed: when the minimum SM value at a given location exceeds 0.07 m³/m³ for 6 days or more, the area becomes favorable as a breeding zone. However, these figures should be interpreted with caution. Variable importance showed that the most relevant variables of the model cover the period between 95 and 72 days before the sighting record. This implies, as highlighted in other works, that certain SM levels need to be maintained over time, not just for egg laying but also for egg development and hatching. Therefore, favorable areas should be monitored for periods longer than 6 days to cover successful egg development and hatching. This paper proposes a machine learning approach based on SM time series to predict breeding areas by means of remote sensing. According to these results, the SM observed during certain periods is a reliable contributor to predicting hopper presences in Mauritania; consequently, its monitoring may help reduce the locust impact on local communities. Future research may aim to combine other studied environmental variables with SM datasets to implement more developed warning systems. The increasing amount of information provided by remote sensing platforms will require the use of artificial intelligence approaches. For instance, the correct use of ensemble SDM may sometimes improve on the performance of individual models, which might help to solve problems such as those presented in this work.
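The TSS and Kappa statistics quoted above can be computed from a 2 × 2 confusion matrix of presences and pseudoabsences. The sketch below uses the standard definitions (TSS = sensitivity + specificity − 1; Cohen's Kappa from observed and chance agreement). The function name and the counts in the example are hypothetical and are not taken from the validation data of this study.

```python
from typing import Dict


def skill_scores(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Compute sensitivity, specificity, TSS, and Cohen's Kappa.

    tp: presences predicted as presences
    fn: presences predicted as absences
    fp: pseudoabsences predicted as presences
    tn: pseudoabsences predicted as absences
    """
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    tss = sensitivity + specificity - 1.0
    # Observed agreement and agreement expected by chance.
    observed = (tp + tn) / n
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (observed - expected) / (1.0 - expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "tss": tss,
        "kappa": kappa,
    }


# Hypothetical counts for illustration only (not the study's results).
scores = skill_scores(tp=86, fn=14, fp=13, tn=87)
```

The sketch does not guard against empty classes; a confusion matrix without presences or without pseudoabsences would raise a division error.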

Disclosures
All authors declare that they have no conflict of interest.

Acknowledgments
We would like to show our gratitude to Keith Cressman and the FAO-DLIS team from the Food and Agriculture Organization of the United Nations for providing access to the SWARMS database and making this research possible, as well as to all the current and past locust field workers and the National Centres for Locust Control of the affected countries for collecting information about the desert locust and its environment. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.