Accurate tree species information is a substantial part of any forest inventory and supports forest managers’ efforts to make sound management decisions.1 Tree species identification provides valuable spatial data that may benefit operational tasks such as modeling the spread of pests and pathogens, such as Sirex noctilio,2 promoting effective weed control strategies in relation to particular forest species,3,4 determining optimal bioclimatic site conditions,5 and quantifying species-level carbon sequestration.6,7 Additionally, determining the composition and distribution of tree species is valuable for assessing indicators related to the ecological integrity of forest ecosystems and could assist in monitoring ecosystem health and ultimately guide forest management policies.8,9 However, obtaining information on forest tree species is challenging when using traditional approaches.
While ground-based methods such as field measurements prove to be costly, time consuming and labor intensive, remote sensing provides a reliable alternative for obtaining information for forest inventory.10 Hyperspectral remotely sensed data have often provided more effective results for mapping tree species than multispectral data, due to the improved spectral resolution that samples the electromagnetic spectrum using hundreds of narrow wavebands.11,12 Mapping forests at species level, however, is challenging since tree species exhibit reflectances that are strongly correlated.11 The variation present at canopy scale may further hamper tree species discrimination applications due to the effects of tree age, phenology, nonphotosynthetic material, and background.13,14 Additionally, studies have generally expressed difficulties in classifying tree species that are closely related and within the same genus,15–18 since the variation between species of the same genus is less than the variation between species of different genera.
For example, Goodwin et al.15 showed that the majority of the Eucalyptus species considered in their study were individually inseparable compared to other mesic vegetation; however, they obtained an overall accuracy of 94% when merging all of the Eucalyptus species into one class. Reference 16 discriminated 11 forest types including mixed species and produced an overall accuracy of 75%, yet the study was unsuccessful in classifying individual deciduous species. Recently, studies have applied feature selection methods in the context of tree species classification.19–22 For instance, Dalponte et al.19 used support vector machines (SVM) and Airborne Imaging Spectrometer for Applications (AISA) Eagle hyperspectral data to classify 11 Southern Alps tree species and produced the best kappa accuracy of 0.70, with user’s and producer’s accuracies ranging between 60% and 100%. Hyperspectral data were combined with Lidar data to map five tree species at different scales using SVM and random forest (RF) classifiers.21 Minimum noise fraction (MNF) transformed bands with an 8-m spatial resolution produced the best accuracy of 86% and a kappa value of 0.83. Using SVM and RF, Fassnacht et al.22 compared three feature selection methods to classify tree species at three different test sites. SVM classification results in conjunction with MNF input data proved significant in most cases and outperformed results produced by RF when using genetic algorithm (GA), SVM wrapper and sparse generalized PLS selection methods. Finally, using AISA Eagle image data, Peerbhay et al.20 showed that it was possible to accurately classify six commercial forest species using the PLS discriminant analysis (PLS-DA) algorithm. The study produced an overall accuracy of 80.61%, a kappa value of 0.77 and user’s and producer’s accuracies ranging from 50% to 100%.
The PLS-DA algorithm is able to suppress background effects, address the spectral similarity between tree species and can effectively deal with the computational and statistical problems associated with hyperspectral datasets.20 The method is based on the decomposition of explanatory variables (i.e., the hyperspectral wavebands) into PLS latent components that retain the most important information.23,24 However, the generation of fewer initial components from highly correlated wavelengths is suggested to reduce the chances of model overfitting.24
While only a few studies have investigated the utility of PLS for classification in remote sensing,20,25 PLS-DA has become popular in other research domains. Some of these domains include genetics,26 biology,27,28 and chemometrics.29,30 However, in the analysis of hyperspectral data, it is also of interest to identify the most effective spectral regions that allow for the best discrimination between samples.31 While PLS alone does not provide insight on the most effective bands that may contribute to the final classification task,32 the utility of novel variable selection techniques has been advocated. Many studies often adopt preselection approaches for variable selection in order to improve the performance of PLS classifications.26,33,34 Usually, these approaches are based on some criterion to select high ranking variables which are later included for PLS analysis. For instance, Peerbhay et al.20 showed that selecting wavebands based on the variable importance in the projection (VIP) score is a robust measure for determining individual waveband importance and for producing the best PLS classification accuracies. In their study, incorporating the optimal subset of VIP selected wavebands () in the PLS-DA model resulted in an improved overall accuracy of 88.78% and a kappa value of 0.87, with user’s and producer’s accuracies ranging from 70% to 100%.
Although preselection approaches have been effective, their execution does not involve a complete and computationally efficient way of selecting important variables while performing simultaneous classification. Nonetheless, certain studies have extended the PLS approach to impose sparseness within the technique for the combined purpose of variable selection and dimension reduction.35 Designed explicitly for optimal group discrimination in high-dimensional settings, SPLS-DA effectively overcomes the problem of being affected by a large number of predictors.35 This ability makes SPLS-DA well suited for analyzing high-dimensional data and for selecting important variables when classifying features of interest. It is within this context that this study aims to determine whether simultaneous variable selection and dimension reduction improve the classification of Pinus tree species (Pinus taeda, Pinus elliotii, Pinus patula) using SPLS-DA and AISA Eagle hyperspectral imagery. In addition, the incorporation of wavebands selected by the VIP method into PLS-DA was assessed.
Methods and Materials
The research was conducted in the 6391 ha Sappi Hodgsons plantation (Centroid: 29° 13′18′′ S and 30° 23′13′′ E) in KwaZulu-Natal, South Africa (Fig. 1). Evenly aged stands consisting of P. patula, P. elliotii, and P. taeda are the dominant commercial softwood tree species occurring in the study area. The plantation is situated in the mist belt grassland bioregion of the KwaZulu-Natal midlands with average temperatures in the region of 15.9°C. Rainfall ranges between 730 and , with highly variable precipitation occurring during the summer and additional moisture provided by heavy mist during the winter.36 The relief of the area is generally hilly and covered by diminutive grasslands with slopes peaking between 1030 and 1590 m above sea level.37 The establishment of the invasive tree, Solanum mauritianum (bugweed), within the study area has not gone unnoticed. Bugweed trees primarily grow in association with the Pinus trees in low to high densities. The prolific dispersal of bugweed is particularly evident where extensive occurrences dominate parts of the forest canopy, whereas other Pinus stands are richly invaded in the forest understory. Due to the prevalence of bugweed trees occurring within the Pinus stands, the invader species was included in this study to provide a more realistic assessment of the classification method.
Hyperspectral Image Acquisition and Preprocessing
In February 2009, during the austral summer, AISA Eagle hyperspectral imagery was obtained under cloudless conditions. Four AISA flight lines with a pixel size of 2.4 m were collected. The sensor delivers hyperspectral imagery in 272 bands with a spectral range between 393.23 and 994.09 nm.
A light aircraft was used to collect the hyperspectral imagery at a mean GPS altitude of 2728.42 m and a swath width of 3058 m. The image was atmospherically calibrated using the empirical line method,38 which is based on the linear relationship between in situ measured ground reflectance and the sensor spectral signal. The Analytical Spectral Devices (ASD) FieldSpec® 3 spectrometer (350 to 2500 nm) was used for the acquisition of field measurements to calibrate the reflectance data. The image was topographically corrected using a digital elevation model with contours of 5 m created from topographic maps. The image was referenced to the Universal Transverse Mercator (UTM zone 36S) projection using the WGS-84 geodetic system. Although wavebands beyond 900 nm showed the presence of spectral noise, these bands were included in this study. ENVI 4.7 image processing software39 was used for the preprocessing of the AISA Eagle imagery.
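The empirical line method amounts to fitting, for each band, a straight line between the sensor signal over calibration targets and the field-measured reflectance of the same targets, and then applying that line to every pixel. The sketch below shows this for a single band using ordinary least squares. The function names and the target values are hypothetical and are not taken from this study.

```python
def fit_empirical_line(sensor_signal, ground_reflectance):
    """Least-squares gain and offset for one band.

    Model: reflectance = gain * signal + offset.
    """
    n = len(sensor_signal)
    if n != len(ground_reflectance) or n < 2:
        raise ValueError("need at least two paired calibration targets")
    mean_x = sum(sensor_signal) / n
    mean_y = sum(ground_reflectance) / n
    sxx = sum((x - mean_x) ** 2 for x in sensor_signal)
    if sxx == 0:
        raise ValueError("calibration targets must differ in sensor signal")
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(sensor_signal, ground_reflectance)
    )
    gain = sxy / sxx
    offset = mean_y - gain * mean_x
    return gain, offset


def apply_empirical_line(pixels, gain, offset):
    """Convert raw sensor values of one band to estimated reflectance."""
    return [gain * value + offset for value in pixels]


# Hypothetical dark, medium and bright targets for a single band.
signal = [120.0, 480.0, 900.0]
reflectance = [0.04, 0.22, 0.43]
gain, offset = fit_empirical_line(signal, reflectance)
calibrated = apply_empirical_line([300.0, 650.0], gain, offset)
```

In practice the fit would be repeated independently for each of the 272 bands, since gain and offset differ with wavelength.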
Field data for P. taeda, P. elliotii, and P. patula consisted of four forest stands per species that were randomly selected from all the forest stands occurring in the study area. A field visit was conducted to assess the condition of the selected Pinus species and coincided with the acquisition of the AISA Eagle imagery during February. Each pine stand was further subsampled randomly using field points to collect image spectra from single pixels (Table 1). Additionally, the occurrences of bugweed within the selected Pinus stands were recorded in the field and used as point samples to collect image spectra. Using the R statistical software package,40 the number of test and training samples for each species was then statistically balanced. This was implemented to ensure the ideal optimization of the PLS-DA models and classification using hyperspectral data.41 Figure 2 displays the average spectral reflectance curves of each of the tree species considered in this study.
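The text does not state how the samples were balanced. One simple possibility, shown here purely as an assumption, is to downsample every class at random to the size of the smallest class. The function name `balance_classes` and the fixed seed are illustrative.

```python
import random
from collections import defaultdict


def balance_classes(samples, labels, seed=0):
    """Randomly downsample each class to the size of the smallest class.

    This is an assumed balancing strategy, not necessarily the one
    used by the authors.
    """
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = min(len(items) for items in by_class.values())
    rng = random.Random(seed)
    balanced_samples, balanced_labels = [], []
    for label in sorted(by_class):
        # rng.sample draws without replacement.
        for sample in rng.sample(by_class[label], target):
            balanced_samples.append(sample)
            balanced_labels.append(label)
    return balanced_samples, balanced_labels
```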
Table 1. The sample size for the respective tree species considered in the study.
|Species||Number of tree stands||Point samples||Total sample size|
Partial least squares discriminant analysis
PLS-DA is based upon the classical PLS regression method for constructing predictive models,42 in which dimension reduction and the latent decomposition of the X and Y matrices are central. PLS projects the X matrix in the dimension space where each column of X defines one coordinate axis. In an -dimensional hyperplane, which is represented by one line and one direction per component, the X matrix is projected down onto an orthogonal axis, whereas at the same time, the positions of the projected data are related to the values of the response matrix (Y).42 Since the latent component matrix (T) produces linear combinations or scores for X and Y, finding the direction vectors within T is focal to a PLS operation. PLS seeks the columns of which direction vectors relate to X and Y and obtains the most effective variable directions in the X space.23,43 The method can be statistically described by
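In conventional notation, the PLS decomposition is usually written as in the sketch below. This is the standard textbook form and not a reproduction of the authors' equation. The symbols P, Q, E, F and W* are assumed notation that is not defined in the surrounding text.

```latex
\begin{aligned}
X &= T P^{\top} + E, \\
Y &= T Q^{\top} + F, \\
T &= X W^{*},
\end{aligned}
```

Here T holds the latent component scores, P and Q are the loading matrices for X and Y, E and F are residual matrices, and the columns of W* are the direction vectors that map the original wavebands onto the latent components.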
Variable importance in the projection
While PLS-DA provides no insight regarding the most effective wavelengths that may contribute toward the final classification,32 studies have demonstrated the benefit of utilizing the VIP score for identifying individual waveband importance34,42,44 and determining the most effective spectral regions for classification.26,27 The VIP method42 computes the importance of each waveband by producing scores that serve as a ranked measure of importance amongst the explanatory variables.33 Using the VIP scores to preselect important wavebands in a dataset is, therefore, an essential requirement for a PLS model to achieve good classification performance26 and is defined as follows:45 The important variables of the PLS-DA model were identified by selecting those wavebands that had a VIP score of , since the average of squared VIP scores is equal to 1.33 A new PLS-DA model using the selected VIP bands was developed and then used to classify the test dataset.
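The VIP calculation and the band selection step can be sketched as follows, using the commonly cited definition in which each band's squared, normalized weight is averaged over components in proportion to the response variance each component explains. The inputs `weights` and `explained_ss` are hypothetical and would come from a fitted PLS model. The default cut-off of 1 is the usual convention and is an assumption here.

```python
import math


def vip_scores(weights, explained_ss):
    """Variable importance in the projection for each band.

    weights: one weight vector per latent component, each of length p.
    explained_ss: response sum of squares explained by each component.
    """
    n_components = len(weights)
    n_bands = len(weights[0])
    total_ss = sum(explained_ss)
    if total_ss <= 0:
        raise ValueError("components must explain some response variance")
    squared_norms = [sum(w * w for w in vector) for vector in weights]
    scores = []
    for j in range(n_bands):
        weighted = sum(
            explained_ss[a] * weights[a][j] ** 2 / squared_norms[a]
            for a in range(n_components)
        )
        scores.append(math.sqrt(n_bands * weighted / total_ss))
    return scores


def select_bands(scores, threshold=1.0):
    """Indices of bands whose VIP exceeds the (assumed) threshold."""
    return [j for j, score in enumerate(scores) if score > threshold]
```

With this definition the squared VIP scores average to 1 across bands, which is why a cut-off near 1 separates above-average from below-average contributors.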
Sparse partial least squares discriminant analysis
SPLS-DA closely follows the PLS-DA approach whereby the categorical response variables are initially treated as continuous in order to construct latent components. However, SPLS-DA imposes sparseness within the latent components to promote variable selection while performing simultaneous dimension reduction. Irrelevant and noisy variables are set to zero by imposing a penalty,46 thus eliminating any contribution toward the model’s discrimination power. In addition, the latent components are built to explain the best discrimination among classes by using only the few informative variables (non-zero variables). Class membership of each variable is then assigned by reference cell coding the response matrix (Y) with dummy variables.35 Y is assumed to be one of the classes () indicated by . The recoded response matrix is then defined as an matrix with:35
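Two ingredients of this description can be illustrated in a few lines: recoding class labels as dummy variables, and shrinking small entries of a direction vector to exactly zero. The sketch below uses a plain indicator coding with one column per class and a soft-thresholding rule scaled by the largest absolute weight. Both are common choices and are assumptions here, since the exact coding and penalty are not recoverable from the text. The names `indicator_code` and `soft_threshold` are illustrative.

```python
def indicator_code(labels):
    """Recode class labels as 0/1 dummy variables, one column per class."""
    classes = sorted(set(labels))
    coded = [[1 if label == c else 0 for c in classes] for label in labels]
    return classes, coded


def soft_threshold(direction, eta):
    """Shrink a direction vector and zero out its small entries.

    Entries whose magnitude is at most eta * max|w| become zero; the
    rest are shrunk toward zero by that same amount.
    """
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must lie in [0, 1)")
    cutoff = eta * max(abs(w) for w in direction)
    result = []
    for w in direction:
        magnitude = abs(w) - cutoff
        if magnitude > 0:
            result.append(magnitude if w > 0 else -magnitude)
        else:
            result.append(0.0)
    return result
```

A larger `eta` zeroes more wavebands, so the parameter directly controls how many bands survive into the latent components.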
Optimizing PLS-DA and SPLS-DA
To determine the number of components for PLS-DA, 10-fold cross validation (CV) was implemented.42 Each component was systematically added to the PLS-DA model and the cross validated error was then calculated. The process was repeated on the training data until the addition of further components did not improve the significance of the PLS-DA model.20 In the case of SPLS-DA, there are only two key tuning parameters that require optimization for ideal model performance.35,46 These include the number of latent components “” and a sparsity thresholding parameter “eta” that can be optimized using CV. While “” largely depends on the number of variables and sample size, it has been recommended to search for components between 1 and 10 with a thresholding parameter ranging between 0 and 1.46 The optimal latent component, therefore, retains the most effective wavebands, whereas other non-important bands would have a probability of zero. The optimized SPLS-DA model developed was then used to classify the test dataset. PLS-DA and SPLS-DA model optimization, VIP calculations and classification were done using the R statistical software package.40
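The tuning procedure is a grid search over the number of components and the thresholding parameter, scored by cross-validated error. The sketch below is a generic illustration in Python rather than the R code used in the study. The classifier is passed in as a hypothetical `fit_predict` callable, and all function names are illustrative.

```python
import random


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validated_error(fit_predict, samples, labels, folds):
    """Fraction of held-out samples misclassified across all folds.

    fit_predict(train_x, train_y, test_x) must return predicted labels.
    """
    errors = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(samples)) if i not in held]
        predicted = fit_predict(
            [samples[i] for i in train],
            [labels[i] for i in train],
            [samples[i] for i in held_out],
        )
        errors += sum(p != labels[i] for p, i in zip(predicted, held_out))
    return errors / len(samples)


def tune(component_grid, eta_grid, cv_error):
    """Return (error, n_components, eta) with the lowest CV error."""
    best = None
    for n_components in component_grid:
        for eta in eta_grid:
            error = cv_error(n_components, eta)
            if best is None or error < best[0]:
                best = (error, n_components, eta)
    return best
```

A search following the recommendation in the text would call `tune(range(1, 11), eta_values, cv_error)`, where `eta_values` is a hypothetical list of candidates between 0 and 1 and `cv_error` wraps `cross_validated_error` around the chosen classifier.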
Classification accuracy assessments
The dataset () was divided into training (70%; ) and validation data (30%; ). Confusion matrices were calculated based on classification results conditioned on the validation dataset. The entire process was repeated 100 times to account for the variation in classification accuracy due to differing compositions of training and validation samples.22,47 The quantity and allocation disagreements were then used to measure the disagreement within the error matrix as suggested by Pontius and Millones,48 who criticize the utility of kappa analysis. The quantity disagreement quantifies the number of tree samples in the training data that differs from the quantity of samples of the same tree species in the test data, while the allocation disagreement measures the number of tree samples of a particular species in the training dataset that were allocated to different locations of the same species in the test dataset. For the purpose of this study, the quantity and allocation disagreements were combined and the total disagreement of the error matrix reported.48 Additionally, individual class accuracies are reported as user’s and producer’s accuracies. The former is calculated by dividing the number of correctly classified samples by the total number of samples that were classified in that particular class and is represented by the row total in the confusion matrix. Producer’s accuracy is computed by dividing the number of correctly classified samples in each class by the number of training data used for that particular class and is expressed by the column total in the confusion matrix.47
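The accuracy measures described above can all be derived from one confusion matrix. The sketch below assumes rows hold the classified labels and columns hold the reference labels, and it computes quantity and allocation disagreement from the row and column totals in the manner proposed by Pontius and Millones. The function name and the dictionary layout are illustrative.

```python
def accuracy_report(matrix):
    """Summarise a square confusion matrix.

    matrix[i][j] counts samples classified as class i (rows) whose
    reference class is j (columns).
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(n_classes)]
    correct = [matrix[g][g] for g in range(n_classes)]

    # User's accuracy uses row totals; producer's uses column totals.
    users = [c / r if r else float("nan") for c, r in zip(correct, row_totals)]
    producers = [c / t if t else float("nan") for c, t in zip(correct, col_totals)]

    # Quantity: mismatch in class proportions between rows and columns.
    quantity = sum(abs(r - t) for r, t in zip(row_totals, col_totals)) / (2 * total)
    # Allocation: errors that would remain if proportions matched.
    allocation = sum(
        2 * min(r - c, t - c)
        for r, t, c in zip(row_totals, col_totals, correct)
    ) / (2 * total)

    return {
        "overall": sum(correct) / total,
        "users": users,
        "producers": producers,
        "quantity": quantity,
        "allocation": allocation,
    }
```

All values are returned as proportions; multiplying by 100 gives percentages. Quantity plus allocation disagreement always equals one minus the overall accuracy, which provides a quick consistency check on any reported error matrix.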
Figure 3 illustrates a significant decrease in the CV error from the first component (59.63%) to using 10 components which yields a CV error of 17.08%. The lowest error was produced by using five components (11.60%), with the model stabilizing when using nine components to produce a constant CV error (17.08%). The five latent components were used to develop the PLS-DA model and VIP scores for individual bands were then calculated.
PLS-DA variable importance using VIP
Figure 4 shows the waveband importance as determined by the VIP method. The VIP method placed importance on bands located within the visible (393 to 700 nm) region of the electromagnetic spectrum. A total of 80 bands obtained VIP scores of and were located within the blue (393 to 500 nm), green (521 to 560 nm), and red (676 to 700 nm) regions. More specifically, 49 bands were considered important in the blue region, 19 in the green, and 12 in the red portion of the spectrum.
Results indicate that utilizing the VIP bands () produced an overall accuracy of 71.88% and a total disagreement of 28. User’s and producer’s accuracies for individual species ranged from 58% to 83% (Table 2). In comparison, using all the AISA Eagle bands () produced a lower classification accuracy of 68.75% with user’s and producer’s accuracies ranging between 50% and 79%. For comparison purposes, linear discriminant analysis (LDA) was used to classify the AISA dataset using the VIP bands. The LDA results revealed an overall classification accuracy of 66.42% with user’s and producer’s accuracies ranging between 50% and 77%.
Table 2. Summed confusion matrix based on the PLS-DA classification algorithm and wavebands selected by the VIP (wavebands=80). The values in bold indicate the number of correctly classified samples.
|P. elliotii||P. patula||P. taeda||S. mauritianum||Row total||User’s Accuracy (%)|
|Producer’s accuracy (%)||75||83||71||58|
|Overall accuracy = 71.88%|
|Allocation disagreement = 26|
|Quantity disagreement = 2|
SPLS-DA model optimization
Figure 5 indicates the significance of each SPLS-DA latent component. The first component yielded a CV error of 40.05% which was later reduced to 13.33% by using 10 latent components. The most significant component (), however, was achieved by using eight latent components with an “eta” of 0.9 and produced the lowest CV error rate of 10.36%. The model eventually stabilized at a constant value of 13.33%. The eight latent components were then used to develop the SPLS-DA model.
Test dataset results indicate that using the AISA Eagle hyperspectral bands with eight SPLS-DA components produced an overall accuracy of 80.21% and a total disagreement of 20. User’s and producer’s accuracies for each species ranged from 67% to 92% (Table 3).
Table 3. Summed confusion matrix based on the SPLS-DA classification algorithm and the Airborne Imaging Spectrometer for Applications (AISA) Eagle hyperspectral dataset. The values in bold indicate the number of correctly classified samples.
|P. elliotii||P. patula||P. taeda||S. mauritianum||Row Total||User’s Accuracy (%)|
|Producer’s accuracy (%)||88||92||75||67|
|Overall accuracy = 80.21%|
|Allocation disagreement = 18|
|Quantity disagreement = 2|
Figure 6 displays the variation in classification accuracy produced by SPLS-DA when using 100 iterations for splitting the training and validation dataset. Classification means were found to be with a standard deviation of 2.87.
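The repeated-split procedure behind this kind of summary can be sketched as follows. The split is stratified by class here, which is an assumption, and the classifier is abstracted as a hypothetical `evaluate` callable that takes training and validation indices and returns an accuracy.

```python
import random
import statistics
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, rng=None):
    """Split sample indices into training and validation sets per class."""
    if rng is None:
        rng = random.Random()
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, validation = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        validation.extend(indices[cut:])
    return train, validation


def repeated_accuracy(labels, evaluate, n_repeats=100, seed=0):
    """Mean and standard deviation of accuracy over repeated splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_repeats):
        train, validation = stratified_split(labels, 0.7, rng)
        accuracies.append(evaluate(train, validation))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Reporting the spread alongside the mean shows how sensitive the result is to the particular samples that end up in the validation set.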
Figure 7 shows the most effective wavebands selected by the SPLS-DA algorithm and that were automatically used in the classification process. The method placed importance on bands located within the visible (415 to 694 nm) region of the electromagnetic spectrum. The SPLS-DA model used a total of 55 bands which best explained the discrimination among the tree species and were located in intervals within the blue (415 to 436 nm; 457 to 483 nm), green (515 to 521 nm; 530 to 565 nm), and red regions (674 to 694 nm), respectively. In total, 24 bands were considered important in the blue, 21 in the green, and 10 in the red portion of the spectrum.
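Summaries of this kind, with intervals of adjacent selected bands and counts per spectral region, can be produced with two small helpers. The region limits in `REGIONS` are assumed round-number boundaries for illustration and are not the limits used by the authors.

```python
# Assumed region limits in nm; boundaries vary between studies.
REGIONS = {
    "blue": (393.0, 500.0),
    "green": (500.0, 600.0),
    "red": (600.0, 700.0),
}


def contiguous_runs(band_indices):
    """Group selected band indices into (first, last) runs of neighbours."""
    runs = []
    for index in sorted(set(band_indices)):
        if runs and index == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], index)
        else:
            runs.append((index, index))
    return runs


def count_by_region(wavelengths_nm, regions=REGIONS):
    """Count selected band centres falling in each spectral region."""
    counts = {name: 0 for name in regions}
    counts["other"] = 0
    for wavelength in wavelengths_nm:
        for name, (low, high) in regions.items():
            if low <= wavelength < high:
                counts[name] += 1
                break
        else:
            # No region matched, e.g. a near-infrared band.
            counts["other"] += 1
    return counts
```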
In comparison to the SPLS-DA results, utilizing these bands in LDA revealed an overall accuracy of 72.9% with user’s and producer’s accuracies between 56% and 80%. Since the SPLS-DA classification produced the best results, a classified tree species map was produced using a subset of the AISA Eagle imagery (Fig. 8). The map is comparable to that of the AISA Eagle airborne hyperspectral image, with P. patula being the dominant tree species. P. taeda and S. mauritianum have the most confusion with each other and are the least correctly mapped species.
One of the most prominent challenges in discriminating forest species using remotely sensed data is to use the subtle spectral variations between species to classify them correctly. This study presents valuable evidence for the use of hyperspectral remote sensing to classify commercial tree species in KwaZulu-Natal, South Africa. Results show the capability of the AISA Eagle image data in effectively dealing with the spectral similarity existing between the closely related Pinus species considered in this study. In addition, the SPLS-DA algorithm proved more effective than PLS-DA with VIP while providing an accurate framework for executing simultaneous variable selection and dimension reduction of high-dimensional datasets, which is necessary if we are to fully exploit hyperspectral image data in classifying commercial forest tree species.
PLS-DA and SPLS-DA classification using AISA Eagle hyperspectral data
The generation of fewer initial components within a PLS-DA model is critical in reducing the risk of overfitting and removing the low order components which do not contribute toward the model’s performance.23,24 Subsequently, the results indicate that the systematic addition of latent components to the PLS-DA models significantly improves model performance based on the CV error. Using five optimal latent components in PLS-DA in conjunction with VIP selected bands produced an overall classification accuracy of 71.88%. When utilizing eight optimal components, SPLS-DA improved the overall classification accuracy by 8.33 percentage points. This classification result is comparable to that of previous forest species discrimination studies using hyperspectral datasets.6,11,13–16,49,50 However, this classification result has been achieved using a low number of species (i.e., four species) when compared to the number of species considered in Refs. 9–11, 49, and 50. Alternatively, other feature selection and extraction techniques have been applied for the classification of tree species using hyperspectral data. These include stepwise LDA,8,51,52 out-of-bag and best-first search method,53 MNF transformations,21,54 sequential forward floating selection,19,55 GA, SVM wrapper, and sparse generalized PLS selection.22
When observing the individual classification accuracies of each tree species considered in this study, SPLS-DA produced higher individual class accuracies (67% to 92%) compared to the accuracies produced by PLS-DA and the VIP selected bands (58% to 83%). Furthermore, there was an improvement in the user’s and producer’s accuracies for P. elliotii and P. patula when compared to the accuracies obtained in a previous study20 that discriminated Pinus, Eucalyptus and Acacia tree species. This result confirms the findings of Wolter et al.24 and Wolter et al.,43 who suggest that separate PLS models could be constructed to improve individual class accuracies. As a result, individual PLS models use the spectral information to explain the variance for species within a genus (for example, P. elliotii and P. patula), as in this study, as opposed to species from different genera (for example, E. grandis and P. patula). However, most of the confusion occurred between Pinus trees and bugweed (S. mauritianum). The results show that bugweed was the least correctly classified class and that the majority of the confusion occurred between bugweed and P. taeda. Nonetheless, the classification accuracies obtained in this study for each tree species may be influenced by a variety of other factors linked to the spectral variability within the canopy of each forest stand. For example, the variation in reflectance within forest species primarily occurs as a result of canopy shadowing, differences in light absorption, and spectral scattering of wavelengths.14 Additionally, researchers have noted that the classification of tree species may also be affected by the overall structure of the forest canopy, sensor optical properties, and the effects of the nonphotosynthetic material.13
PLS-DA and SPLS-DA variable importance
While both models (PLS-DA and SPLS-DA) performed classification successfully, the exclusive variable selection approaches provided valuable insight on the most effective wavebands when classifying the tree species. The VIP method successfully reduced the large number of hyperspectral bands to 80 important wavebands to produce a reasonable level of accuracy (71.88%) compared to when all the bands () were utilized (68.75%). SPLS-DA, however, executed variable selection automatically to include only important variables within the PLS classification and successfully reduced the hyperspectral bands to 55 relevant wavebands to produce the best classification accuracies. Nonetheless, given the spectral range of the AISA Eagle sensor, 80 and 55 bands are still a high number when compared to other forest species classification studies and could be a potential drawback of the methodology. For example, Clark et al.13 applied 30 bands at crown level and obtained a high accuracy of 86%. Using 30 AISA bands, Dalponte et al.19 obtained a kappa accuracy of 0.70. Similarly, Jones et al.8 applied 40 AISA bands and mapped most tree species with accuracies ranging from to 90%. Liu et al.54 used 26 spectral bands and obtained 80.67% classification accuracy for mapping temperate forest species. However, their results were based on an MNF transformation. Additionally, Jones et al.8 and Clark et al.13 investigated larger spectral ranges beyond the visible and near infrared regions that were used in this study.
When comparing the important bands selected by PLS-DA using VIP and those inherently selected by SPLS-DA, results show that bands in the visible region of the spectrum (393 to 700 nm) were most effective in the classification. More specifically, PLS-DA and VIP placed importance on 49 bands in the blue (393 to 500 nm), 19 bands within the green (521 to 560 nm), and 12 bands in the red (676 to 700 nm). In comparison, SPLS-DA selected fewer bands, also within the visible portion and along narrower wavelength intervals than the band ranges of VIP. For instance, SPLS-DA placed importance on 24 blue wavebands located between 415 and 436 nm and between 457 and 483 nm, 21 green wavebands between 515 and 521 nm and between 530 and 565 nm, and 10 bands within the red at 674 to 694 nm. While wavebands within the blue portion of the spectrum are recognized for classifying tree species, those located within the green region confirm the importance of the green reflectance peak around 550 nm.20,56 The significance of the red region is also recognized for the discrimination of tree species20 and is a result of the red portion being sensitive to plant pigment concentrations within the leaf tissue.57–59 Overall, the importance of visible wavebands selected in this study for the classification of tree species is comparable to that of other studies that also recognize the significance of wavebands in the visible for the classification of tree species using hyperspectral data.13–15,20,60
The operational limitation of this study is, however, highlighted by the procurement of relatively homogeneous pixels of each tree species to exploit the subtle variations existing between them. Nonetheless, the proposed methodology of this study should be tested in areas that have a heterogeneous composition of the selected tree species and could be expanded to species that are native to South Africa. This would require some variation in the methodology due to the denser spatial configuration of native trees. Future studies should also consider the application of stability measures or iterative bootstrap classification approaches.21,22 Such approaches would capture the variation created by changing the composition of training and validation datasets to improve the reliability and quality of classification results. The robustness of the waveband regions selected by the SPLS-DA technique should also be investigated using other commercially available sensors for classifying tree species. For example, spectral regions of importance included narrow band ranges in the blue (415–483 nm), green (515–565 nm), and red (674–694 nm) portions of the spectrum. This provides an opportunity to exploit the new generation of multispectral sensors (such as WorldView-2), with fine spatial and spectral resolutions, to discriminate among tree species in South Africa.
This study has shown the capability of utilizing SPLS-DA for the combined purpose of variable selection and dimension reduction of high-dimensional data for the classification of commercial tree species. SPLS-DA produced an overall accuracy of 80.21% and a total disagreement value of 20. Accuracies for the individual tree species ranged between 67% and 92% with the most effective wavebands located in the visible portion (415 to 694 nm) of the spectrum. Overall, the utility of SPLS-DA provided an accurate and computationally efficient methodology for selecting important variables within the PLS framework, while performing simultaneous classification for the successful discrimination of commercial tree species.
The authors would like to acknowledge the support from the Applied Centre for Climate and Earth Systems Science (ACCESS) and Sappi forest-SA for the successful completion of this paper.
Kabir Yunus Peerbhay is a PhD candidate at the University of KwaZulu-Natal specializing in remote sensing. He received the MSc degree (cum laude) in applied environmental science in 2011. His research is focused on using machine learning algorithms to maximize the benefit of remotely sensed data for forest inventory practices and weed detection applications.
Onisimo Mutanga is a professor and academic leader in research at the University of KwaZulu-Natal, South Africa. His research is focused on ecological assessment and monitoring with special emphasis on vegetation pattern analysis using GIS and remote sensing. He is currently expanding this domain into mapping vegetation species, wetland mapping, disease detection in plantation forests and agricultural crops as well as quantifying forest fragmentation and its impact on biodiversity and ecosystem condition.
Riyad Ismail received the MSc degree in GIS (cum laude) and PhD degree in remote sensing from the University of KwaZulu-Natal, South Africa. He has over 15 years of experience in implementing spatial technologies (GIS, GPS and remote sensing) at commercial, academic and research institutions. He was recently appointed as a senior research associate at the University of KwaZulu-Natal and is currently employed as a principal research officer at Sappi forests.