Predicting large wildfires in the Contiguous United States using deep neural networks

Abstract. Over the last several decades, large wildfires have become increasingly common across the United States, causing a disproportionate impact on forest health and function, human well-being, and the economy. Here, we examine the severity of large wildfires across the Contiguous United States over the past decade (2011 to 2020) using a wide array of meteorological, land cover, and topographical features in a deep neural network model. A total of 4,538 wildfire incidents were used in the analysis, covering 87,305 square miles of burned area. We observed the highest number of large wildfires in California, Texas, and Idaho, with lightning causing 43% of these incidents. Importantly, results indicate that the severity of wildfire occurrences is highly correlated with the weather, land cover, and elevation of the study area, as indicated by their SHapley Additive exPlanations values. Overall, different variants of data-driven models and their results could provide useful guidance in managing landscapes for large wildfires under changing climate and disturbance regimes.


Introduction
In the realm of forestry, various machine learning (ML) algorithms have been employed to explore aspects of forest ecology, such as species distribution models, carbon cycles, hazard assessment, and prediction. 8,9 Wang et al. 10 and Sharma et al. 11 showcased the use of deep learning methods, such as YOLOv4 and YOLOv5m, in forest resource investigation, vegetation coverage statistics, and plant growth monitoring. Similarly, Firebanks-Quevedo et al. 12 employed ML-based methods to formulate forestry policies and identify economic incentives for reforestation. However, a limited number of studies have been conducted to predict the spread of wildfires, a crucial aspect given the multifaceted challenges posed by wildfires, including ecological damage, deteriorating air quality, biodiversity loss, erosion, and soil degradation.
Wildfires have increased fourfold over the past 40 years, primarily due to fuel accumulation and fuel aridity resulting from fire suppression and climatic variability. 13 In 2022 alone, 68,988 wildfires burned a total of 7.8 million acres in the United States. Over the past decade, approximately 70,000 wildfires have occurred each year, burning about 7 million acres annually. Wildfires depend on ecoregions and ignition sources and are reported to cause serious repercussions on climate and ecology. 14 They impair wildlife habitat, alter forest structure and composition, reduce biodiversity, 15 change soil structure and watershed processes, 16 and affect human values, property, 15 health, and well-being. 17 Recently, Burke et al. 13 estimated that nearly 25% of PM2.5 across the United States results from wildfires. However, a paradigm shift in wildfire policy has been apparent in recent years to counteract long-term risks and restore ecological functionality. 15,18 Fires and associated problems are increasingly viewed through socio-ecological lenses and addressed with different management approaches, such as prescribed fire, 19 fuel treatments (mastication, thinning), 20 and polycentric all land management. 21
Yet, wildfire risk assessment and modeling are challenging due to dynamic climatic variables and complex fire behavior. Improved predictive tools and approaches are, therefore, necessary for predicting wildfires and managing unprecedented fires across temporal and spatial scales. Much progress has been made in using artificial neural networks, particularly the multilayer perceptron, in predicting wildfires, but studies focusing on the use of deep neural networks (DNN) in predicting wildfire spread are generally few. DNNs, such as convolutional neural networks and recurrent neural networks (in particular, long short-term memory networks), are deep learning methods that have multiple nonlinear hidden layers and have been successfully applied in detecting wildfires from satellite observations 22 or predicting wildfire spread using meteorological variables, such as wind, temperature, and humidity. 23 However, many such studies are limited to small spatial and temporal scales. In this short communication, we examine the severity of large wildfires across the Contiguous United States over the past decade (2011 to 2020) using a wide array of meteorological, land cover, and topographical features in a DNN model. Here, large wildfires refer to fires with burned areas greater than 500 acres in the eastern and 1,000 acres in the western United States. The data-driven approaches in this paper will be instrumental in understanding the factors influencing the occurrence and severity of wildfires, thereby facilitating wildfire management and policies.

Materials and Methods
The study area comprises the Contiguous United States (CONUS), which is divided into 11 Level I Ecoregions and 967 Level IV sub-Ecoregions. 24 The western regions of the study area typically experience a higher number of wildfire incidents and encompass larger burned areas compared with the eastern United States, 25 due largely to the heterogeneity in the landscape caused by human development and fragmentation of forest land cover. 14 The GIS data for wildfire locations and burned area boundaries were obtained from the Monitoring Trends in Burn Severity (MTBS) program. 26,27 The program assesses the frequency, extent, and magnitude of all large wildland fires in the United States. The thresholds for large wildfires are set to greater than 1,000 acres in the western United States and greater than 500 acres in the eastern United States. A period of 10 years between 2011 and 2020 was selected for analysis, and the "prescribed wildfires" were removed from the dataset. A total of 4,538 wildfire incidents were used in the analysis, covering 87,305 square miles of burned area. Additionally, a 1992-2015 spatial wildfire occurrence dataset 28 was used to analyze large wildfires. 29 To identify potential wildfire hotspots, the number of wildfire occurrences and the burned areas were also evaluated within each Level IV ecoregion. Figure 1 shows the point locations of large wildfires and potential wildfire hotspots between 2011 and 2020 in the Contiguous United States.
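The selection rule described above can be illustrated with a minimal sketch. The record fields (`acres`, `region`, `fire_type`) and the example records are hypothetical and are not MTBS attribute names or data; only the two size thresholds and the removal of prescribed fires come from the text.

```python
# Size thresholds stated in the text (acres). The region labels "west"/"east"
# are hypothetical field values, not MTBS attribute names.
LARGE_FIRE_THRESHOLD_ACRES = {"west": 1000.0, "east": 500.0}
ACRES_PER_SQUARE_MILE = 640.0


def is_large_wildfire(burned_acres, region):
    """Return True if the burned area exceeds the regional threshold."""
    return burned_acres > LARGE_FIRE_THRESHOLD_ACRES[region]


def filter_large_fires(records):
    """Keep large, non-prescribed fires from a list of dict records."""
    return [
        r for r in records
        if r["fire_type"] != "prescribed"
        and is_large_wildfire(r["acres"], r["region"])
    ]


# Made-up records for illustration only.
records = [
    {"id": "A", "acres": 750.0, "region": "east", "fire_type": "wildfire"},
    {"id": "B", "acres": 750.0, "region": "west", "fire_type": "wildfire"},
    {"id": "C", "acres": 5000.0, "region": "west", "fire_type": "prescribed"},
    {"id": "D", "acres": 19000.0, "region": "west", "fire_type": "wildfire"},
]
kept = filter_large_fires(records)
kept_ids = [r["id"] for r in kept]
total_square_miles = sum(r["acres"] for r in kept) / ACRES_PER_SQUARE_MILE
```

In this toy example, the same 750-acre fire counts as large in the east but not in the west, which is the asymmetry the two thresholds introduce.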
Meteorological variables were obtained from different sources (Table 1) for wildfire prediction. Briefly, monthly climate attributes, including total monthly precipitation, mean monthly temperature, and maximum and minimum vapor pressure deficit, were obtained from the PRISM dataset. The Palmer Drought Severity Index (PDSI) was obtained from GRIDMET to infer the relative dryness in the region. The index typically ranges from -10 (dry) to +10 (wet). 36 The land cover data were obtained from the National Land Cover Dataset (NLCD). The 30-meter NLCD raster for the year 2016 was used to obtain land cover percentages within a 4-kilometer buffer around the point of wildfire occurrence. The 4-kilometer radius was selected based on the mean burned area of all wildfires in the dataset to represent the amount of forest and shrubland available near the fire area that could potentially increase the extent of wildfires. The relationship between land cover and wildfires was examined using NLCD land cover classes within the 4,538 burned area boundaries. The Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI) at 1-kilometer resolution were obtained from the Moderate Resolution Imaging Spectroradiometer (MODIS) satellite dataset (MOD13A3). Elevation data were obtained from the United States Geological Survey (USGS) Digital Elevation Model (DEM) dataset at 100-meter spatial resolution. All these datasets were spatially and temporally linked to each of the 4,538 wildfires that occurred in the contiguous US between 2011 and 2020 using R 4.3.0 and ArcGIS (Version 10.2).
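As an illustration of the buffer-based land cover features, the sketch below computes class percentages inside a circular 4-kilometer buffer on a 30-meter grid. It is not the R/ArcGIS workflow used in the study: the function name, the grouping of NLCD codes (41 to 43 for forest, 52 for shrub/scrub, 71 for grassland/herbaceous), and the synthetic raster are assumptions, and distances are measured in cell units on the assumption of a projected raster with square cells.

```python
import numpy as np

# NLCD class codes grouped as in the text (forest, shrub/scrub, grassland/herbaceous).
LAND_COVER_GROUPS = {
    "forest": (41, 42, 43),
    "shrub": (52,),
    "grassland": (71,),
}


def buffer_land_cover_percent(raster, row, col, radius_m=4000.0, cell_m=30.0):
    """Percent of cells of each land cover group within a circular buffer.

    raster: 2-D integer array of land cover codes; (row, col) is the cell
    containing the ignition point. Cells outside the raster are ignored.
    """
    radius_cells = radius_m / cell_m
    rows, cols = np.ogrid[:raster.shape[0], :raster.shape[1]]
    inside = (rows - row) ** 2 + (cols - col) ** 2 <= radius_cells ** 2
    values = raster[inside]
    return {
        name: 100.0 * np.isin(values, codes).mean()
        for name, codes in LAND_COVER_GROUPS.items()
    }


# Synthetic 400 x 400 raster (12 km x 12 km): west half grassland, east half
# evergreen forest, with the ignition point on the boundary.
raster = np.full((400, 400), 71, dtype=np.int16)
raster[:, 200:] = 42
percent = buffer_land_cover_percent(raster, row=200, col=200)
```

With the ignition point on the boundary between the two halves, forest and grassland each cover roughly half of the buffer and shrubland covers none of it.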
Different datasets in Table 1 were analyzed using ML models. The best model was selected using the lowest testing Mean Absolute Error (MAE) criterion. Consequently, based on the comparison in Table 2, a DNN model was trained to predict wildfires based on climatological and geological attributes surrounding the point of wildfire origin. The features used in the DNN model to predict large wildfire burned areas are shown in Table 3. Keras and TensorFlow libraries in Python were used to design the deep-learning approach. The dataset was divided into training and testing sets using an 80/20 split. Thus, 80% of the data were used for training and validation and 20% for testing the accuracy of the models. Further, the data were split three times with different random seeds to generate multiple random samples of training and test data and to evaluate the accuracy over multiple test set combinations. Wildfires with missing attributes were removed from the study, resulting in a total of 4,536 observations for model development. Prior to being used in the DNN model, the wildfire acres were log-transformed to account for any skewness in the observed data and to normalize the target distribution. The features in Table 3 were transformed using a standard scaler and were fed as inputs to a DNN model with five layers. The DNN layers had 512, 256, 64, 16, and 1 neurons, respectively. ReLU was used as the activation function for each of the five DNN layers. The DNN model was trained using the root mean square propagation (RMSprop) optimizer with a learning rate of 0.01. Callbacks were used to monitor validation loss. Mean squared error was used as the loss function and MAE as the performance metric. The model was trained for 200 epochs with a batch size of 32 and a validation split of 0.2. For each of the three random seed values used to generate the training and test sets, the plots of training loss and validation loss were convex in nature, as shown in Fig. 2.
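The preprocessing and network shape described above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the Keras/TensorFlow code used in the study: the data are synthetic, the number of features (12), the seed values, the natural-log base, the weight initialization, and fitting the scaler on training data only are all assumptions. Training itself (RMSprop, learning rate 0.01, mean squared error loss, 200 epochs, batch size 32) is left to a deep learning library and is not shown.

```python
import numpy as np

LAYER_SIZES = (512, 256, 64, 16, 1)  # neurons per layer, as stated in the text


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices and split them into train and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    return order[n_test:], order[:n_test]


def standardize(train, test):
    """Scale features to zero mean and unit variance using training statistics only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return (train - mean) / std, (test - mean) / std


def init_weights(n_inputs, rng):
    """Random (untrained) weights; the initialization scheme is an assumption."""
    sizes = (n_inputs,) + LAYER_SIZES
    return [(rng.normal(scale=np.sqrt(2.0 / a), size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]


def forward(X, weights):
    """Forward pass with ReLU applied on every layer, as the text states."""
    h = X
    for W, b in weights:
        h = np.maximum(h @ W + b, 0.0)
    return h[:, 0]


# Synthetic stand-in data: 4536 fires, 12 hypothetical features.
rng = np.random.default_rng(42)
n_fires, n_features = 4536, 12
X = rng.normal(size=(n_fires, n_features))
acres = rng.lognormal(mean=8.0, sigma=1.2, size=n_fires)  # right-skewed areas
y = np.log(acres)  # log transform of the target (natural log assumed)

splits = []
for seed in (0, 1, 2):  # three random splits; seed values are illustrative
    train_idx, test_idx = train_test_split_indices(n_fires, 0.2, seed)
    X_train, X_test = standardize(X[train_idx], X[test_idx])
    splits.append((X_train, y[train_idx], X_test, y[test_idx]))

weights = init_weights(n_features, rng)
n_parameters = sum(W.size + b.size for W, b in weights)
untrained_predictions = forward(splits[0][2], weights)
```

With 4,536 observations, an 80/20 split leaves 3,629 fires for training and validation and 907 for testing. Because ReLU is also applied on the output layer, predictions are constrained to be non-negative on the log scale.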
A schematic depicting the proposed deep learning framework is given in Table 4. The error rate for the test data was determined using the equation below:

$$\text{Error rate (MAE)} = \frac{\sum_{1}^{N} \left| y_{\text{obs}} - y_{\text{pred}} \right|}{\sum_{1}^{N} \left| y_{\text{obs}} \right|}.$$

SHapley Additive exPlanations (SHAP) values were used to determine the impact (positive or negative) of each model feature on the burned area. SHAP is a surrogate explanation method for ML models, which computes values that quantify the contribution of each feature to a prediction based on cooperative game theory. 37 Thus, SHAP values could be used in interpreting the DNN model and determining the potential drivers of wildfire. For each data point, the model predicted value equals the sum of all feature SHAP values and the average prediction. A positive SHAP value indicates an increase in the predicted value due to the feature, whereas negative SHAP values indicate a decrease in the predicted value.
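The error rate and the additive property of SHAP values can be checked numerically with a small sketch. The error rate follows the definition above. For the SHAP part, a linear model stands in for the DNN because its SHAP values have a closed form when features are assumed independent (each value is the coefficient times the feature's deviation from its background mean); the coefficients, bias, and data below are hypothetical.

```python
import numpy as np


def error_rate(y_obs, y_pred):
    """Sum of absolute errors divided by the sum of absolute observations."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_obs - y_pred).sum() / np.abs(y_obs).sum()


def linear_shap_values(weights, X, background):
    """Exact SHAP values of a linear model, assuming independent features."""
    return weights * (X - background.mean(axis=0))


# Error rate on three made-up log-area values: 1.5 / 27.
rate = error_rate([8.0, 9.0, 10.0], [8.5, 8.5, 10.5])

rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))   # reference sample for the average prediction
weights = np.array([0.6, -0.4, 0.3])     # hypothetical coefficients
bias = 7.0


def predict(X):
    """Hypothetical linear stand-in for the trained model."""
    return X @ weights + bias


X = rng.normal(size=(5, 3))
phi = linear_shap_values(weights, X, background)
base_value = predict(background).mean()          # average prediction
reconstructed = base_value + phi.sum(axis=1)     # additivity: base + sum of SHAP values
mean_abs_shap = np.abs(phi).mean(axis=0)         # global importance per feature
```

Here `reconstructed` matches `predict(X)` for every row, which is the additive property stated above, and `mean_abs_shap` is the kind of per-feature summary commonly used to rank feature importance.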

Results and Discussion
Wildfires are natural or human-induced events occurring in forests, grasslands, and prairies, driven by ignition, fuel, droughts, and conducive weather conditions. 38 The distribution of total large wildfires by state and potential cause is shown in Fig. 3. The highest number of large wildfires between 2011 and 2020 occurred in California (448 incidents), followed by Texas (434 incidents) and Idaho (426 incidents). About 43% of large wildfires were caused by lightning, followed by "miscellaneous" (18%), unidentified (10%), arson (9%), equipment use (8%), and debris burning (6%). Importantly, our data exclude small wildfires (< 500 acres), which are more frequent and are caused largely by human activities. 39 The percentage of burned area per Level IV ecoregion illustrates the severity of wildfires in various ecosystems (Fig. 4). The area consumed by wildfires was higher in Mediterranean California, the Marine West Coast Forest, and North American Deserts, and smaller in Northern and Eastern Temperate Forests (Fig. 4). Most of these burned areas were grassland, forest, and shrub/scrub land covers (Fig. 5). The mean absolute SHAP values for grassland, forest, and shrub cover were 0.6, 0.43, and 0.35, respectively (Fig. 6), indicating their predominant positive role in wildfire spread. It was also observed that the highest number of wildfires occurred in July and August, which are typically the hottest and driest months. Mean temperatures in these months were approximately 21°C and 24°C, respectively (Fig. 7). Indeed, warmer temperatures and extended droughts may exacerbate the vulnerability of forests and the occurrence of wildfire events. The climatic dependency of wildfire behavior and spread further highlights the importance of managing fuel and restoring ecology in combating fire hazards and associated impacts. 40
Here, several ML models, such as polynomial regression, 41 support vector regression, 42 decision tree regression, 43 random forest regression, 44 gradient boosting regression, 45 and a DNN model, were utilized to predict wildfire occurrence based on climatological and geological features. Only a few studies have attempted to utilize ML models in wildfire studies. For example, Zhang et al. 46 compared four multilayer perceptron and CNN architectures in wildfire modeling and reported the highest accuracy in predicting seasonal peaks in fire activity and vulnerable areas with CNN-2D, a DNN model. Langford et al. 47 used a DNN to detect wildfire events in Alaska for the wildfire year 2004 and highlighted the utility of the validation-loss weight selection approach for accurately mapping wildfire on an imbalanced dataset. In another study, deep neural computing optimized using the adaptive moment estimation algorithm showed the highest accuracy in forest fire prediction compared with stochastic gradient descent, root mean square propagation, and Adadelta optimizers. 48 In our model, for the test sets generated with each of the three random seed values, the MAE was between 0.055 and 0.06. This low MAE value indicates a high accuracy of wildfire prediction.
The land cover classes within a 4-km buffer around the point of occurrence, including the percentage of grasslands/herbaceous, percentage of forests, and percentage of shrublands, were found to be the most influential in predicting wildfire burned area. Fire activities in such locations are largely associated with fuel loads and flammability. Fuels in grasslands are generally dry, which allows fires to spread easily and rapidly. 49 The location of wildfires, as represented by latitude, was also important in predicting burned areas. Indeed, precipitation regimes vary with latitude and longitude, with lower latitudes exhibiting reduced rainfall and moisture and drier conditions.
A non-linear relationship existed between features and their impact on the predicted burned area (Fig. 8), consistent with many other global studies. 50 Also, the predicted burned areas exhibited a trend closely resembling that of the actual burned areas (Fig. 9). Large forest cover within a 4-km buffer zone surrounding the point of wildfire occurrence had a large positive impact on the burned area. A forest cover of 30% or more increased the predicted burned area above the mean. More western longitudes presented a significant increase in the burned area. Higher elevations also had positive SHAP values, indicating larger burned areas in regions with higher elevations. In general, fire activities are higher in steeper areas. 49 In the western United States, Westerling et al. 40 observed the greatest wildfires in the mid-elevation range, occurring mostly as episodic events. These events were further associated with spring snowmelt timing. Topographic features may, however, become decisive in fire spread when burning conditions are less extreme. 51 Finally, we observed that values of NDVI greater than 0.5, indicating vigorous green vegetation, had a positive impact on the burned area, but NDVI values less than 0.5, indicating sparse vegetation, had no net effect on the burned area.

Conclusion
This study analyzed and predicted large wildfires across the contiguous United States from 2011 to 2020. Results showed that the highest number of large wildfires and the largest areas consumed by wildfires occurred in California. Also, wildfires occurred mostly during July and August. A comparison of different models showed that a five-layer DNN model outperformed other ML models. Further, land cover and the location (latitude and longitude) of wildfire occurrence were most likely to determine the severity and extent of wildfires in the United States, as inferred from their SHAP values. Indeed, predictive models utilizing ML and remote sensing tools, climate, and geospatial data are useful in understanding wildfire complexity and in predicting and mitigating fire hazards. However, additional features, such as soil characteristics and 100-h fuel moisture, could be integrated into the DNN model to improve model accuracy and prediction.

Fig. 1 Table 2
Fig. 1 Large wildfire incidents in the contiguous United States between 2011 and 2020.

Fig. 2 Training and validation losses for the proposed deep learning framework.

Fig. 3 Fig. 4
Fig. 3 (a) Average annual large wildfire incidents by states and (b) cause of large wildfires (> 500 acres) in the contiguous United States between 2011 and 2015.

Fig. 6 Feature importance in the DNN model obtained from SHAP values.

Fig. 5 (a) Burned area by NLCD land cover in large wildfires between 2011 and 2020 and (b) an example of NLCD land cover within the burned area in the September 2011 Riley Road wildfire northwest of Houston burning 19,000 acres of land.

Fig. 8 Partial dependence plots showing the interactions between features and burn area using SHAP values.

Fig. 7 Plot showing the relationship between average monthly large wildfires (primary y-axis) in the contiguous US and the mean monthly temperature (line).

Table 1
List of datasets used in the study to model burned areas in large US wildfires.

Table 3
Features used in the DNN model to predict large wildfire burn area with minimum and maximum values in the dataset.

Table 4
Schematic of the proposed deep learning framework.