Pattern recognition model for aerosol classification with atmospheric backscatter lidars: principles and simulations

Abstract. A pattern recognition model for aerosol classification with atmospheric backscatter lidars is proposed and studied in detail. The theoretical framework and the implementation process of the proposed model are presented. Computer simulations have been carried out to verify the practicability and robustness of this model. The k-fold cross-validation method is employed in the process of classifier designing to choose the proper decision rule, which is mainly based on statistical pattern recognition theory. At the same time, the validity of the model is evaluated. The generalized self-validation is also carried out in the computer simulations to verify the stability of the model. The analysis of the performances in reduced status, especially the instance of application to Cloud-Aerosol Lidar with Orthogonal Polarization, demonstrates the generalization ability and performance of this model.


Introduction
Air pollution is becoming increasingly serious in China, and it is well understood that more must be known about it before it can be controlled. Many methods have been proposed for analyzing the components of haze. Offline analysis can characterize air samples comprehensively and accurately, but it is time consuming and cannot represent the original characteristics of the aerosols, because their chemical components are liable to change after sampling. Real-time online methods, including single-particle aerosol mass spectrometry and some in situ instruments (such as hygroscopic tandem differential mobility analyzers and cavity ring-down spectroscopy), can analyze aerosol components continuously. 1,2 However, these methods work only by single-point sampling in a limited area and are therefore impractical for analysis over a larger area.
In contrast, remote sensing methods have very good prospects, as they can obtain observation data on a larger scale and can rapidly detect air pollution and the distribution of aerosols without being placed in the air being probed. 3,4 In particular, lidar is well suited to remote sensing of atmospheric aerosols because of its high spatial and temporal resolution, and it can be developed for vehicular and airborne applications. 5,6 A large body of observation data has been obtained from lidar stations and airborne lidars worldwide, and the optical characteristics of the aerosols can be retrieved from the raw data. Classifying aerosol types from atmospheric lidar remote sensing is important for understanding how the effects of aerosols differ from one type to another.
In the authors' previous work, a coarse pattern recognition model was proposed for aerosol identification with atmospheric backscatter lidars. 7 In this paper, the principle and implementation process of this method are enriched and described in more detail. Computer simulation shows that a dual-wavelength polarized high-spectral-resolution lidar (HSRL) is better than other types of lidars for the classification of aerosols. Moreover, validation and analysis using data from the Cloud-Aerosol Lidar with Orthogonal Polarization (CALIOP) reveal that this model for aerosol classification can be used widely in the analysis of lidar data.
This paper is organized as follows. The optical properties of aerosol that can be employed to construct the optical feature vector of the pattern recognition model are briefly reviewed in Sec. 2. Section 3 gives a detailed description of the construction of the pattern recognition model, which is followed by simulation studies in Sec. 4. To validate the proposed model, the application to CALIOP is analyzed in Sec. 5. Section 6 gives the concluding remarks of this paper.

Optical Properties of Aerosol
With elastic backscatter lidar, some optical properties of aerosol can be retrieved, such as the backscatter coefficient, depolarization ratio, color ratio, and so on. Some of these optical properties are important for aerosol type classification and are required input data for the pattern recognition model.

a. Lidar ratio
The backscatter coefficient $\beta$ and the extinction coefficient $\alpha$ can be retrieved from the standard elastic backscatter lidar equation 8

$$P_{\parallel}(r) = C_{\parallel}\,\frac{\Psi(r)}{r^{2}}\left[\beta_{\parallel}^{m}(r) + \beta_{\parallel}^{a}(r)\right]\exp\left\{-2\int_{0}^{r}\left[\alpha_{m}(r') + \alpha_{a}(r')\right]\mathrm{d}r'\right\},$$

$$P_{\perp}(r) = C_{\perp}\,\frac{\Psi(r)}{r^{2}}\left[\beta_{\perp}^{m}(r) + \beta_{\perp}^{a}(r)\right]\exp\left\{-2\int_{0}^{r}\left[\alpha_{m}(r') + \alpha_{a}(r')\right]\mathrm{d}r'\right\}, \tag{1}$$

where $P_{\parallel}$ and $P_{\perp}$ represent the backscatter signals from the parallel and perpendicular channels, respectively; $C_{\parallel}$ and $C_{\perp}$ represent the system constants for the parallel and perpendicular channels, respectively; $r$ is the range of the scattering volume from the lidar; and $\Psi$ is the transmitter-receiver geometric overlap function. The indices $m$ and $a$ denote molecules and aerosols, respectively. Standard elastic backscatter lidars often need to assume the lidar ratio (aerosol extinction-to-backscatter ratio) in advance, as there are three unknowns ($\beta_{\parallel}^{a}$, $\beta_{\perp}^{a}$, $\alpha_{a}$) in two lidar equations, 9 that is,

$$S_{a} = \frac{\alpha_{a}}{\beta_{\parallel}^{a} + \beta_{\perp}^{a}}. \tag{2}$$

b. Optical depth
The aerosol optical depth (AOD) is the integral of the aerosol extinction coefficient in the vertical direction and describes how strongly the atmospheric aerosol attenuates light. That is,

$$\tau_{a} = \int_{0}^{\infty} \alpha_{a}(z)\,\mathrm{d}z, \tag{3}$$

where $z$ is the altitude. The extinction Angstrom exponent is often used to parameterize the wavelength behavior of AOD. The relationship between the AOD and the wavelength is described as follows: 10

$$\tau_{a}(\lambda) = B\lambda^{-A}, \tag{4}$$

$$A = -\frac{\ln\left[\tau_{a}(\lambda_{1})/\tau_{a}(\lambda_{2})\right]}{\ln\left(\lambda_{1}/\lambda_{2}\right)}, \tag{5}$$

where $B$ represents the turbidity coefficient of aerosols, $\lambda_{1}$ and $\lambda_{2}$ represent two reference wavelengths, and $A$ represents the Angstrom wavelength exponent, which is an important optical parameter for describing the size of aerosol particles. The greater its value, the smaller the particles, and vice versa.
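As a minimal numerical sketch of the power-law relation $\tau_a(\lambda) = B\lambda^{-A}$ in Eq. (4), the snippet below solves for the Angstrom exponent and the turbidity coefficient from AOD values at two wavelengths. The function names and the AOD values are hypothetical and serve only as an illustration.

```python
import math


def angstrom_exponent(tau_1, tau_2, lambda_1, lambda_2):
    """Angstrom exponent A, assuming tau(lambda) = B * lambda**(-A)."""
    return -math.log(tau_1 / tau_2) / math.log(lambda_1 / lambda_2)


def turbidity_coefficient(tau, wavelength, exponent):
    """Turbidity coefficient B that reproduces tau at the given wavelength."""
    return tau * wavelength ** exponent


# Hypothetical AOD values at 532 nm and 1064 nm (wavelengths in micrometres).
A = angstrom_exponent(0.40, 0.16, 0.532, 1.064)
B = turbidity_coefficient(0.40, 0.532, A)
print(f"A = {A:.3f}, B = {B:.3f}")
```

A larger value of `A` for the same pair of wavelengths indicates a steeper spectral decrease of the AOD and hence smaller particles.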

c. Color ratio
The backscatter-related Angstrom exponent is often used to characterize aerosols, and can be expressed as

$$\alpha_{\beta} = -\frac{\ln\left(\beta_{a}^{532}/\beta_{a}^{1064}\right)}{\ln\left(532/1064\right)} = \frac{\ln C_{a}}{\ln 2}, \tag{6}$$

where $\beta_{a}^{i}$ represents the aerosol backscatter coefficient at wavelength $i$ and $C_{a} = \beta_{a}^{532}/\beta_{a}^{1064}$ represents the backscatter color ratio, which is independent of the particle concentration. 11 The Angstrom exponent $\alpha_{\beta}$ is related to the size of aerosol particles and is typically below 0.5 for larger dust particles, whereas it is greater than 1 for anthropogenic particles. 12

d. Depolarization ratio
There are also two important parameters derived from the backscatter coefficient. One of them is the particle linear depolarization ratio

$$\delta_{a} = \frac{\beta_{\perp}^{a}}{\beta_{\parallel}^{a}}, \tag{7}$$

and the other is the spectral depolarization ratio

$$R_{a} = \frac{\delta_{a}^{1064}}{\delta_{a}^{532}}. \tag{8}$$

The optical parameters of atmospheric aerosols (such as the depolarization ratio, lidar ratio, and so on) are closely related to the physical properties of the particles. These optical parameters differ among kinds of particles at a single wavelength, and each parameter also varies with wavelength. Taking the depolarization ratio as an example, researchers found that pure dust typically has a larger 532-nm depolarization ratio (up to 30% to 35%), 13 whereas that of a mixture of dust and spherical particles is about 8% to 10%. 14 The aerosol lidar ratio depends on the aerosol size distribution and refractive index and is highly variable (20 to 100 sr). 16,17 Therefore, it is not hard to understand that these optical characteristic parameters (such as the depolarization ratio, lidar ratio, and so on) play an indispensable role in aerosol classification.
Moreover, it is necessary to point out how the aforementioned optical parameters can be obtained. Standard elastic backscatter lidars need to assume the lidar ratio in advance. Under this assumption, a single-wavelength polarized Mie lidar can retrieve the depolarization ratio $\delta_a$, and a dual-wavelength Mie lidar with polarization sensitivity only at 532 nm, which is widely adopted at present in AD-Net and in the satellite-based lidar CALIOP, 18,19 can retrieve the depolarization ratio $\delta_a$ along with the color ratio $C_a$. It should be noted that even with elastic backscatter lidar, the optical properties of aerosol can be retrieved with a certain degree of accuracy. An HSRL, in contrast, separates the molecular return from the aerosol return spectrally, so the lidar ratio can be retrieved rather than assumed. 22 Thus, a single-wavelength polarized HSRL can retrieve the depolarization ratio $\delta_a$ and lidar ratio $S_a$, and a dual-wavelength polarized HSRL can accurately retrieve the depolarization ratio $\delta_a$, spectral depolarization ratio $R_a$, backscatter color ratio $C_a$, and lidar ratio $S_a$, as it works at two wavelengths.
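The intensive parameters discussed in this section are simple ratios of retrieved coefficients. The following sketch computes the backscatter color ratio, the backscatter-related Angstrom exponent for the 532/1064-nm pair, and the particle linear depolarization ratio. The function names and the input values are hypothetical, and the retrieval of the coefficients themselves is not shown.

```python
import math


def color_ratio(beta_532, beta_1064):
    """Backscatter color ratio C_a = beta_a(532 nm) / beta_a(1064 nm)."""
    return beta_532 / beta_1064


def backscatter_angstrom(beta_532, beta_1064):
    """Backscatter-related Angstrom exponent for the 532/1064-nm pair."""
    return -math.log(beta_532 / beta_1064) / math.log(532.0 / 1064.0)


def depolarization_ratio(beta_perp, beta_par):
    """Particle linear depolarization ratio: perpendicular over parallel backscatter."""
    return beta_perp / beta_par


# Hypothetical aerosol backscatter coefficients in m^-1 sr^-1.
c_a = color_ratio(2.0e-6, 1.0e-6)
exponent = backscatter_angstrom(2.0e-6, 1.0e-6)
delta_a = depolarization_ratio(0.3e-6, 1.0e-6)
print(c_a, exponent, delta_a)
```

Because every quantity is a ratio, scaling all backscatter coefficients by a common factor (a change in particle concentration) leaves the results unchanged.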
A pattern recognition model for aerosol classification with atmospheric backscatter lidars can be built on the optical properties of aerosols measured by lidars, as shown in Fig. 1. In general, the main process of this model can be described as follows. First, one must design the pattern recognition classifier for aerosol classification with atmospheric backscatter lidars. Obtaining the decision function for pattern recognition is the key to this stage. To this end, the pattern recognition characteristic database is divided into two parts, one of which is used as the "database" and the other as the "training data," so that the proper decision function and rule can be chosen according to the identification results. Then, the samples to be identified are input into the designed classifier, which consists of the pattern recognition characteristic database and the decision rule. Finally, the classification results are obtained according to the decision rule.

Characteristics Sample Database for Aerosol Identification
From the block diagram shown in Fig. 1, it is clear that establishing the pattern recognition characteristic database is an important step in implementing aerosol identification. Figure 2 shows the process of establishing the pattern recognition characteristic database. First, one obtains the coarse optical feature vector X 0 and the aerosol coarse optical property database H 0 by analyzing the collected lidar database of aerosols according to the existing categories of aerosols. Then, considering the efficiency and complexity of the model in practical applications, the dimension of X 0 is reduced by dimension-reduction analysis, which yields the final optical feature vector X used in the model. Finally, the pattern recognition characteristic database H is obtained by processing the aerosol coarse optical property database H 0 according to the final optical feature vector X.
Dimension-reduction analysis, which is an essential issue in pattern recognition, is mainly achieved through feature extraction and feature selection, and has been proven in both theory and practice to be effective in enhancing learning efficiency, increasing predictive accuracy, and reducing the complexity of learned results. 23 Feature extraction requires data fusion and creates new features based on transformations or combinations of the original feature set. 24 Feature selection aims to choose, from the input feature set, an optimal feature subset that contains just the relevant and nonredundant features for the classifier. Many methods have been developed for feature selection, which can be characterized as "embedded," "filter," or "wrapper" approaches. 24-27 It is worth noting that, as many studies have been done and the lidar retrieval results have a clear physical meaning, feature selection should be employed if only lidar data are considered. If multisource data are used for aerosol classification, feature extraction should be considered first, as features can be extracted from the raw data after a series of transformations. Thus, a correlation-based feature selection method was adopted to aid the dimension-reduction analysis. The correlation between any feature and the class (C-correlation) as well as the correlation between any pair of features (F-correlation) should be considered. Then, a relevant and nonredundant feature subset can be obtained by selecting the features that have high C-correlation and low F-correlation with each other. 23 In conclusion, the pattern recognition characteristic database H is simplified compared with the aerosol coarse optical property database, and the superfluous optical properties are eliminated. Thus, a more accurate and efficient pattern recognition can be realized.
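The selection principle (keep features with high C-correlation, drop features with high F-correlation to an already selected feature) can be sketched as a greedy filter. This is only an illustration under stated assumptions: the absolute Pearson correlation with a numerically coded class label stands in for the correlation measure, the thresholds `c_min` and `f_max` are arbitrary, and the feature names and toy values are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def select_features(features, labels, c_min=0.3, f_max=0.8):
    """Greedy filter: rank by C-correlation, reject redundant features.

    features: dict mapping feature name -> list of values
    labels:   list of numeric class codes (an assumption of this sketch)
    """
    c_corr = {name: abs(pearson(vals, labels)) for name, vals in features.items()}
    relevant = [name for name in features if c_corr[name] >= c_min]
    ranked = sorted(relevant, key=lambda name: -c_corr[name])
    selected = []
    for name in ranked:
        # Keep the feature only if it is not strongly correlated
        # with any feature that has already been selected.
        if all(abs(pearson(features[name], features[s])) <= f_max for s in selected):
            selected.append(name)
    return selected


# Toy data: two classes, four candidate features.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "depol_532": [0, 1, 0, 1, 1, 2, 1, 2],
    "lidar_ratio_532": [0, 0, 1, 1, 1, 1, 2, 2],
    "depol_532_dup": [0, 1, 0, 1, 1, 2, 1, 3],  # nearly a copy of depol_532
    "noise": [1, 2, 2, 1, 1, 2, 2, 1],          # unrelated to the class
}
selected = select_features(features, labels)
print(selected)
```

In this toy case the near-duplicate feature is rejected for redundancy and the noise feature for irrelevance, which leaves the two informative, weakly correlated features.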

Decision Rules
Once the pattern recognition characteristic database for aerosol classification has been established, the classifier can be designed using a number of possible approaches, such as template matching, statistical classification, syntactic or structural matching, neural networks, and so on. Statistical pattern recognition is a very active area of study and research, which has been used successfully to design a number of commercial recognition systems. In the statistical decision theoretic approach, the decision boundaries are determined by the probability distributions of the samples belonging to each class or estimated directly from the data without calculating the probability distributions. 28 However, in some sense, most of the approaches in statistical pattern recognition attempt to implement the Bayes decision rule, as the Bayes discriminant method based on Bayes theory theoretically has the minimum error rate compared with other classification methods. 29

If the feature vector of the sample to be identified is $x$, the probability of the sample belonging to aerosol category $\omega_i$ among all $n$ types of aerosols can be described as

$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\,P(\omega_i)}{\sum_{j=1}^{n} P(x \mid \omega_j)\,P(\omega_j)}, \tag{9}$$

where $P(\omega_i)$ represents the prior probability, which can be assigned as equal for all types of aerosols, calculated statistically, or even determined from experience, and $P(x \mid \omega_i)$ represents the probability distribution of the aerosol category $\omega_i$. Burton et al. 30 and Cattrall et al. 16 pointed out that the characteristic sample database of each kind of aerosol can be considered to obey a multidimensional Gaussian distribution,

$$P(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}}\exp\left[-\frac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i)\right], \tag{10}$$

where $d$ is the dimension of the feature vector, and $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix of category $\omega_i$. As can be seen, if one estimates the parameters of the multidimensional Gaussian distribution from the experimental data, $P(x \mid \omega_i)$ can be obtained from Eq. (10). Then, $P(\omega_i \mid x)$, the relative probability of the aerosol sample belonging to class $\omega_i$, can be obtained subsequently.

Decision rules can be set in two ways. One is to select the maximum relative probability as the result of aerosol classification. That is, if $P(\omega_i \mid x) = \max_{j=1,2,\ldots,n} P(\omega_j \mid x)$, then $x \in \omega_i$. However, the crosstalk between two different types of aerosols may be very strong in some cases. The other is to set a threshold if a rejection decision is allowed. If the relative probability of the aerosol sample belonging to class $\omega_i$ is larger than the threshold, the aerosol sample is assigned to this kind of aerosol; otherwise, it is rejected.
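The two decision rules can be illustrated with a small Bayes classifier that uses Gaussian class-conditional densities. This is a sketch under stated assumptions, not the authors' implementation: it is restricted to two features (depolarization ratio and lidar ratio) so that the 2x2 covariance matrix can be inverted in closed form, and the class means, covariances, and equal priors in `CLASSES` are hypothetical values chosen only for illustration.

```python
import math


def gaussian_pdf_2d(x, mean, cov):
    """Density of a 2-D Gaussian with covariance ((a, b), (b, c))."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mu)^T Sigma^-1 (x - mu) via the closed-form 2x2 inverse.
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def posteriors(x, classes):
    """Bayes rule: relative probability of each class given the feature vector x."""
    joint = {name: p["prior"] * gaussian_pdf_2d(x, p["mean"], p["cov"])
             for name, p in classes.items()}
    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}


def classify(x, classes, threshold=None):
    """Maximum-probability rule; with a threshold, reject unless the winner exceeds it."""
    post = posteriors(x, classes)
    best = max(post, key=post.get)
    if threshold is not None and post[best] <= threshold:
        return None, post  # rejection decision
    return best, post


# Hypothetical class models for (depolarization ratio, lidar ratio in sr).
CLASSES = {
    "pure dust": {"prior": 1 / 3, "mean": (0.32, 45.0), "cov": ((0.001, 0.0), (0.0, 25.0))},
    "urban": {"prior": 1 / 3, "mean": (0.05, 60.0), "cov": ((0.0005, 0.0), (0.0, 64.0))},
    "fresh smoke": {"prior": 1 / 3, "mean": (0.06, 68.0), "cov": ((0.0005, 0.0), (0.0, 64.0))},
}

label_clear, post_clear = classify((0.31, 46.0), CLASSES, threshold=0.55)
label_ambiguous, post_ambiguous = classify((0.055, 64.0), CLASSES, threshold=0.55)
print(label_clear, label_ambiguous)
```

The second sample lies midway between two overlapping classes, so neither relative probability exceeds the threshold and the sample is rejected, whereas the maximum-probability rule alone would still assign it to one of them.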
Based on the preceding analysis, a refined pattern recognition characteristic database can be obtained from the atmospheric backscatter lidar observations according to the method introduced in Sec. 3.1. Then, it is divided into two parts: one is treated as the database and the other is used as a set of training data, as shown in Fig. 1. A satisfactory self-validation is obtained by choosing a proper decision rule, and this decision rule combined with the pattern recognition characteristic database forms the classifier. The aerosol classification can then be made. As some samples are wrongly classified, the confidence level of the classification results can be evaluated by

$$C(\omega_i) = \frac{P(\omega_i, \omega_i)}{\sum_{j=1}^{n} P(\omega_i, \omega_j)}, \tag{11}$$

where $C(\omega_i)$ represents the classification confidence level of aerosol category $\omega_i$ and $P(\omega_i, \omega_j)$ represents the relative probability of samples from class $\omega_j$ being classified into class $\omega_i$. It is worth noting that although classification accuracy is the primary evaluation criterion for the classification results, the confidence levels should be taken into account as well.
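A confidence level of this kind can be computed from a table of self-validation rates. The sketch below assumes that the confidence level of a category is the share of all relative probability assigned to that category that comes from samples truly belonging to it; this reading, the function name, and the toy rates are assumptions made for illustration.

```python
def confidence_levels(rates):
    """Confidence level per category from self-validation rates.

    rates[true_class][assigned_class] is the relative probability that a
    sample of true_class is classified as assigned_class.
    """
    classes = list(rates)
    levels = {}
    for assigned in classes:
        # Total relative probability assigned to this category from all true classes.
        total = sum(rates[true].get(assigned, 0.0) for true in classes)
        own = rates[assigned].get(assigned, 0.0)
        levels[assigned] = own / total if total > 0 else float("nan")
    return levels


# Toy rates for two categories with mutual crosstalk.
rates = {
    "urban": {"urban": 0.90, "fresh smoke": 0.10},
    "fresh smoke": {"urban": 0.20, "fresh smoke": 0.80},
}
levels = confidence_levels(rates)
print(levels)
```

In this toy case urban has the higher self-validation accuracy (0.90 versus 0.80) but the lower confidence level, because more samples of the other category are wrongly assigned to it.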

Analysis of Computer Simulation
Computer simulation for the proposed pattern recognition model of aerosol identification has been carried out and the results will be shown in this section.

Aerosol categories
Aerosols can be categorized in different ways, and different databases can be built up correspondingly. The aerosol categories should be chosen according to the actual regional situation, and the number of aerosol categories should be neither too small nor too large. In this study, aerosols are divided into eight categories: ice particles, pure dust, dust mix, maritime, marine pollution, urban aerosol, biomass burning, and fresh smoke, following Burton et al. 30

Selection of aerosol feature vector
For classification purposes, it is important that some optical characteristics of aerosols, such as the depolarization ratio, spectral depolarization ratio, backscatter color ratio, and lidar ratio, do not change with the concentration of aerosols. 11,16,30 One or two of these four characteristics usually differ markedly among different kinds of aerosols. Therefore, they are of great significance in the classification of aerosols.
As for quantitative analysis, a large number of measurements of known aerosol models were carried out to obtain the optical characteristics of various kinds of aerosols. Therefore, the aerosol database can be built up with atmospheric backscatter lidars. After applying dimension-reduction analysis to the aerosol coarse optical property database as described previously, we found that four optical characteristic parameters, namely the particle linear depolarization ratio (at 532 nm), lidar ratio (at 532 nm), backscatter color ratio, and spectral depolarization ratio, have high C-correlation, and that the correlation between any pair of these four features is quite low. Thus, the feature vector for aerosol classification can be selected as follows:

$$x = \left(\delta_{a}^{532},\ S_{a}^{532},\ C_{a},\ R_{a}\right)^{T}. \tag{12}$$

Then, the four-dimensional feature vector space of aerosol particles can be constructed. Thus, we can focus on these four characteristics specifically in the process of establishing the sample database. The database can be stored for later use.

Characteristics sample database for aerosol identification
The characteristics sample database for aerosol identification used in the computer simulation is constructed from the experimental data of Burton et al., 30 which were obtained in field missions over North America. The database contains 10,000 samples of ice particles, 3000 samples of pure dust, 75,000 samples of dust mix, 13,000 maritime samples, 11,000 samples of marine pollution, 85,000 samples of urban aerosols, 22,000 samples of biomass burning, and 39,000 samples of fresh smoke. The projection distribution of the four-dimensional feature space for aerosol classification is shown in Fig. 3, 7 where (a) represents the projection in 532-nm depolarization-532-nm lidar ratio space, (b) the projection in 532-nm depolarization-backscatter color ratio space, (c) the projection in 532-nm depolarization-depolarization spectral ratio space, (d) the projection in 532-nm lidar ratio-backscatter color ratio space, (e) the projection in 532-nm lidar ratio-depolarization spectral ratio space, and (f) the projection in backscatter color ratio-depolarization spectral ratio space.

Self-Validation Accuracy
The aerosol classification system operates in two steps: training (learning) and classification (testing). Two self-validations were carried out in the computer simulations to test the validity and stability of the pattern recognition model for aerosol identification with atmospheric backscatter lidars in these two processes, respectively.

Fig. 3 Aerosol database used in computer simulation: (a) projection in 532-nm depolarization-532-nm lidar ratio space, (b) projection in 532-nm depolarization-backscatter color ratio space, (c) projection in 532-nm depolarization-depolarization spectral ratio space, (d) projection in 532-nm lidar ratio-backscatter color ratio space, (e) projection in 532-nm lidar ratio-depolarization spectral ratio space, and (f) projection in backscatter color ratio-depolarization spectral ratio space.
In the process of classifier training, the strict self-validation was carried out to determine the decision rule. As shown in Fig. 1, the characteristics sample database was divided into two parts: one was treated as the pattern recognition database and the other was used as training data. We adopted the k-fold cross-validation method 31 because the number of pure dust sample points is relatively small. First, the characteristics sample database was divided randomly into 50 parts; second, a decision function was selected; third, one part was picked as the training data and the remaining 49 parts were used as the pattern recognition database to train the classifier. This was repeated 50 times, with a different part picked as the training data each time. In this way, the accuracy of the classifier with a certain decision rule can be calculated by recording and summing the results of the 50 cycles.
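The k-fold procedure can be sketched as follows. The held-out part corresponds to what the text calls the training data, and the remaining parts form the pattern recognition database. To keep the example short, a nearest-class-mean rule on a single feature stands in for the Bayes classifier, and the toy depolarization values, the function names, and the choice of 5 folds instead of 50 are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Randomly split sample indices into k disjoint parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, k, fit, predict, seed=0):
    """Accumulate correct classifications over k cycles and return the accuracy."""
    correct = 0
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        database = [i for i in range(len(samples)) if i not in held]
        model = fit([samples[i] for i in database], [labels[i] for i in database])
        correct += sum(predict(model, samples[i]) == labels[i] for i in held_out)
    return correct / len(samples)


def fit_class_means(xs, ys):
    """Stand-in classifier: store the mean feature value of each class."""
    sums = {}
    for x, y in zip(xs, ys):
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}


def predict_nearest_mean(model, x):
    """Assign the class whose mean is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))


# Toy 532-nm depolarization ratios for two well-separated categories.
samples = [0.30 + 0.005 * i for i in range(10)] + [0.02 + 0.004 * i for i in range(10)]
labels = ["pure dust"] * 10 + ["maritime"] * 10
accuracy = cross_validate(samples, labels, k=5, fit=fit_class_means,
                          predict=predict_nearest_mean)
print(accuracy)
```

Each sample is held out exactly once, so the summed result uses every sample for validation while never validating a sample against a database that contains it.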
At first, we adopted the first decision rule to design the classifier; that is, we calculated the relative probability of the aerosol sample belonging to each category and assigned the sample to the category with the maximal probability. The detailed analysis results of strict self-validation are shown in Fig. 4(a). The vertical axis in the figure represents the types of aerosol tested in the self-validation, and the horizontal axis represents the categories into which the samples are classified. The light and dark colors represent the corresponding probabilities. For example, in the row labeled "ice" on the vertical axis, the first dark blue square, labeled "ice" on the horizontal axis, represents samples of ice that are still classified as ice, and the darkness of the square represents the corresponding probability as a proportion of the total ice sample points, according to the color bar at the bottom of the figure. The second, light blue square, labeled "pure dust" on the horizontal axis, represents samples of ice that are mistakenly classified as pure dust, and the rest can be read in the same way. Some of the self-validation accuracies are marked in the figure, but probabilities of less than 1% are not marked. From the results of self-validation, we find that the reidentification of urban aerosol is the most difficult, and only about 89% of the sample points can be correctly distinguished. The main reason is that the overlap of urban aerosol with other aerosol categories is larger than that of the others, especially in the 532-nm depolarization-backscatter color ratio space [see Fig. 3(b)] and the 532-nm lidar ratio-backscatter color ratio space [see Fig. 3(d)], where urban aerosol can hardly be separated from other kinds of aerosols. Meanwhile, the accuracies of marine pollution and fresh smoke are relatively low, too. This is mainly because the feature vectors of these two categories overlap with those of urban aerosol, so the crosstalk among them is quite serious. The reidentification accuracies of the other five types of aerosols, such as ice and dust mix, reach 94% or more. Pure dust is reidentified almost without error.
Obviously, the crosstalk between different types of aerosols lowers the confidence of the classification results. Adopting a rejection decision with a threshold, as described in Sec. 3.2, may make the classification results more confident. After analysis, a threshold of 55% was adopted in the classifier. That is, we use the k-fold cross-validation method introduced earlier and calculate the relative probability of the aerosol sample belonging to each kind of aerosol. Then, the aerosol sample is assigned to a kind of aerosol if the corresponding probability is over 55%. The detailed analysis results of self-validation with a proper threshold are shown in Fig. 4(b).
From the self-validation result shown in Fig. 4(b), one can see that the reidentification accuracies of the eight types of aerosols are lower than the self-validation results without a threshold, but the crosstalk is reduced at the same time. Further analysis shows that setting a threshold has quite a small effect on the validation accuracies of ice, pure dust, dust mix, and maritime samples, which means that the classification results of these four types of aerosols are relatively confident. Thus, it is necessary to set a threshold if a confident result is required. The best threshold should be determined from repeated experiments according to practical needs.
After the classifier was designed, the generalized self-validation was carried out in the classification mode. The testing data are simulated samples, which have the same distribution as the pattern recognition characteristic database but different values. We again set the threshold at 55%, and the generalized self-validation accuracies of the eight types of aerosols are shown in Fig. 5(a). We carried out the simulation four times, and the average is calculated and marked in the figure. It can be seen that the results of strict self-validation and generalized self-validation are very close. Therefore, we believe that the results of self-validation are close to a stable state. Thus, a change in the values of the sample points does not greatly affect the reidentification results as long as the distribution of the database is the same. The crosstalk between the types of aerosol in the generalized self-validation is similar to that in the strict self-validation; thus, we do not discuss the crosstalk in the generalized self-validation further.
In addition, we carried out a sensitivity analysis by perturbing each sample point in the pattern recognition characteristic database 1000 times within different measurement uncertainties and using the results as testing data. The classification accuracies of the eight categories of aerosols are shown in Fig. 5(b). From the results, we can see that the effects of measurement uncertainties on the classification accuracies of ice, dust mix, maritime, and biomass burning are relatively small. Nevertheless, the classification accuracies of pure dust and fresh smoke are affected relatively seriously by the measurement uncertainties. On the whole, the effects of measurement uncertainties of less than 15% are quite acceptable.
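A perturbation test of this kind can be sketched as below. The noise model is an assumption of this sketch: each feature is scaled by an independent factor drawn uniformly within plus or minus the relative uncertainty. The threshold classifier and the two sample points are hypothetical stand-ins for the trained classifier and the database.

```python
import random


def perturbed_accuracy(samples, labels, classify, rel_uncertainty,
                       n_trials=1000, seed=0):
    """Fraction of perturbed copies of each sample that keep the correct label."""
    rng = random.Random(seed)
    correct = total = 0
    for x, y in zip(samples, labels):
        for _ in range(n_trials):
            # Scale every feature by an independent relative error (assumed uniform).
            noisy = tuple(v * (1.0 + rng.uniform(-rel_uncertainty, rel_uncertainty))
                          for v in x)
            correct += classify(noisy) == y
            total += 1
    return correct / total


def toy_classifier(x):
    """Hypothetical stand-in rule on the depolarization ratio only."""
    return "pure dust" if x[0] > 0.2 else "other"


# Hypothetical samples: (depolarization ratio, lidar ratio in sr).
samples = [(0.30, 45.0), (0.05, 60.0)]
labels = ["pure dust", "other"]
for uncertainty in (0.0, 0.15, 0.5):
    acc = perturbed_accuracy(samples, labels, toy_classifier, uncertainty, n_trials=200)
    print(uncertainty, acc)
```

Plotting the accuracy of each category against the relative uncertainty gives a curve of the type discussed above, from which an acceptable uncertainty level can be read off.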
However, it should be pointed out that a dual-wavelength polarized HSRL is the best choice for obtaining these four characteristic parameters of aerosol at the same time. Yet, the dual-wavelength polarized HSRL has not been widely adopted because of current limits in research and technology. At present, a single-wavelength HSRL (at 532 nm) and other atmospheric remote sensing equipment (such as a single-wavelength polarization Mie scattering lidar) are usually used together to obtain all of these optical characteristic parameters.

Behavior in a Reduced Dimension Status
As most lidars currently in use are not dual-wavelength polarized HSRLs and a joint measurement with different equipment is also hard to carry out, it is difficult to obtain all four optical characteristics (depolarization ratio $\delta_a$, spectral depolarization ratio $R_a$, backscatter color ratio $C_a$, and lidar ratio $S_a$) at the same time. As mentioned earlier, a dual-wavelength Mie lidar with polarization sensitivity at 532 nm can only retrieve the depolarization ratio $\delta_a$ and the color ratio $C_a$, a single-wavelength polarized HSRL can only retrieve the depolarization ratio $\delta_a$ and lidar ratio $S_a$, and a single-wavelength polarized Mie lidar can only retrieve the depolarization ratio $\delta_a$. These limitations make it very difficult to classify aerosols with atmospheric backscatter lidars. Although the pattern recognition model proposed in this paper can be applied both to dual-wavelength polarized HSRLs and to other lidar configurations, the accuracy for the latter differs from that of a dual-wavelength polarized HSRL. The results of self-validation without a decision threshold for configurations other than the dual-wavelength polarized HSRL (dual-wavelength Mie lidar with polarization sensitivity at 532 nm, single-wavelength polarized HSRL, and single-wavelength polarized Mie lidar) are shown in Fig. 6.
Compared to the self-validation of the dual-wavelength polarized HSRL shown in Fig. 4(a), the reidentification accuracies in a reduced dimension status are lower and the crosstalk becomes worse. Therefore, we decided to adopt a rejection decision by setting a threshold. A threshold of 50% was set by balancing the self-validation accuracies against the crosstalk between the types of aerosols. The detailed results of self-validation are shown in Fig. 7.
According to the self-validation results, although a single-wavelength polarized HSRL can only obtain the depolarization ratio $\delta_a$ and lidar ratio $S_a$, the distributions of various types of aerosols are relatively independent in the two-dimensional space consisting of depolarization ratio and lidar ratio. Thus, the identification accuracies of several aerosol types are still quite high: the reidentification accuracies of ice, pure dust, dust mix, and maritime are all over 90%, and that of pure dust is up to 99%. However, the reidentification accuracies of the other four types of aerosols are quite low, and the crosstalk between marine pollution and fresh smoke as well as between urban aerosol and biomass burning is quite serious. It seems hard to distinguish urban aerosol from biomass burning using only the data from a single-wavelength polarized HSRL, because these two almost overlap in the 532-nm depolarization-532-nm lidar ratio space, as shown in Fig. 3(a). Marine pollution and fresh smoke are also hard to distinguish, as their overlapping area in the four-dimensional optical feature space is quite large, too.
Compared with the single-wavelength polarized HSRL, the dual-wavelength polarized Mie lidar has a weaker ability to recognize the various types of aerosols. It distinguishes only pure dust and dust mix well. Even so, the crosstalk between pure dust and ice particles is quite serious, up to 16.5%, as the overlap of pure dust and ice particles is rather large in the 532-nm depolarization-color ratio space, as shown in Fig. 3(b). Although the self-validation accuracies of maritime, marine pollution, and fresh smoke are all over 70%, the crosstalk between them and other categories is very serious, which makes the confidence of the identification results very low. However, the combination of maritime and marine pollution has relatively little crosstalk with the others. Therefore, if we classify aerosols into five categories using the data from a dual-wavelength Mie lidar (polarized at the 532-nm channel), the results are quite acceptable.
As for the single-wavelength polarized Mie lidar, its ability to classify aerosols is even weaker. It can only reidentify pure dust and dust mix, and the crosstalk between ice particles and pure dust is even more serious than for the dual-wavelength polarized Mie lidar. It cannot distinguish the remaining types of aerosols from each other very well using only the data from a single-wavelength polarized lidar, so we consider the remaining aerosol types (maritime, marine pollution, urban, biomass burning, and fresh smoke) as one category. With a threshold of 50%, the classification results are also acceptable.
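One way to obtain a class model for a reduced configuration, if each category is described by a Gaussian in the four-dimensional feature space, is to keep only the entries of the mean vector and covariance matrix that belong to the retrievable features (the marginal of a Gaussian is again a Gaussian). The sketch below illustrates this; the feature order, the mapping in `CONFIGURATIONS`, and the numerical values are assumptions made for illustration, not the procedure used for the reported results.

```python
# Assumed feature order: depolarization ratio, lidar ratio, color ratio,
# spectral depolarization ratio.
CONFIGURATIONS = {
    "dual-wavelength polarized HSRL": [0, 1, 2, 3],
    "single-wavelength polarized HSRL": [0, 1],
    "dual-wavelength polarized Mie lidar": [0, 2],
    "single-wavelength polarized Mie lidar": [0],
}


def marginalize(mean, cov, keep):
    """Mean vector and covariance matrix restricted to the features in keep."""
    sub_mean = [mean[i] for i in keep]
    sub_cov = [[cov[i][j] for j in keep] for i in keep]
    return sub_mean, sub_cov


# Hypothetical four-dimensional model of one aerosol category.
mean = [0.32, 45.0, 1.6, 0.9]
cov = [
    [0.0010, 0.010, 0.001, 0.000],
    [0.0100, 25.00, 0.100, 0.050],
    [0.0010, 0.100, 0.040, 0.002],
    [0.0000, 0.050, 0.002, 0.010],
]

for name, keep in CONFIGURATIONS.items():
    sub_mean, sub_cov = marginalize(mean, cov, keep)
    print(name, sub_mean)
```

The reduced models can then be used with the same decision rules, which makes the loss of separability caused by the missing features directly comparable across configurations.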

Discussions
According to the results of computer simulation, we can conclude that the self-validation accuracies of the dual-wavelength polarized HSRL are relatively high and the crosstalk is quite low. The classification accuracies stay relatively stable when the measurement uncertainties are less than 15%, so the stability of the model has been demonstrated by the generalized self-validation and sensitivity analysis. The analysis of the behavior in the reduced dimension status demonstrates the generalization ability of this model, which means that it can be applied to different polarization lidar configurations. As the crosstalk between marine pollution and fresh smoke as well as that between urban aerosol and biomass burning is quite serious, aerosols can be divided into six categories quite credibly by a single-wavelength polarized HSRL. Similarly, a dual-wavelength Mie lidar with polarization sensitivity at the 532-nm channel can classify aerosols into five categories at an acceptable confidence level. The ability of a single-wavelength polarized lidar to classify aerosols is quite weak, but it can distinguish ice particles, pure dust, and dust mix from other aerosols using the method proposed in this paper. It should be noted that the database used is not based on a real dual-wavelength polarized HSRL. In fact, the HSRL used by Burton et al. 30 in the field missions adopts HSRL technology only at the 532-nm channel, and the 1064-nm channel is just a standard polarized Mie lidar that needs a prior assumption of the lidar ratio before retrieval. 32 If data from a real dual-wavelength polarized HSRL can be obtained, the self-validation accuracy would be higher and the crosstalk would be suppressed as well.

Fig. 7 The details of self-validation analysis for configurations other than the dual-wavelength polarized HSRL, with decision threshold: (a) the self-validation accuracies analysis for dual-wavelength Mie scattering lidar (polarized at the 532-nm channel), (b) the self-validation accuracies analysis for single-wavelength polarized HSRL, and (c) the self-validation accuracies analysis for single-wavelength polarized Mie scattering lidar.
It should also be mentioned that the aerosol database used in the computer simulation is obtained by simulation, combining the data sets that were acquired from the 18 field missions conducted by NASA LaRC over North America with the previous research of Burton et al. 30 Although the database is representative to a certain extent, it may differ from the actual conditions in other countries. Also, the classification of aerosols may differ between applications, e.g., the space-based lidar, CALIOP, adopts a classification of six categories, which is quite different from the classification categories used by Burton et al. 30 Therefore, the validation of the model's universality is carried out using the data from CALIOP in Sec. 5.

Analysis of the Application to Cloud-Aerosol Lidar with Orthogonal Polarization
CALIOP was launched in April 2006 aboard the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observation (CALIPSO) satellite, which is a joint mission between NASA and the French space agency (CNES). CALIOP is a dual-wavelength Mie lidar with polarization sensitivity (532-nm channel), and its products can be employed in various applications such as atmospheric studies and earth observation. 32 The retrieval algorithm of CALIOP is similar to that of a common elastic lidar in that a priori assumptions are needed. The aerosol layers are assigned to one of six aerosol types (desert dust, biomass burning, clean continental, polluted continental, marine, and polluted dust), each having a characteristic lidar ratio that is mainly based on the cluster analysis of the AERONET data set. 33,34 CALIOP can therefore provide the vertical structure and properties of thin clouds and aerosols on a global scale, and the data are available on the NASA website. 35 In this study, the layer-integrated particulate depolarization ratio, layer-integrated particulate color ratio, layer-integrated attenuated backscatter coefficient (532 nm), and feature classification flags in the level 2, version 3 layer products, namely the 5-km aerosol layer products, were used to verify the proposed aerosol classification model. The data used are limited to the latitude-longitude grid of 3°N to 54°N and 73°E to 136°E, covering China and the surrounding areas, over the year 2014.

Data Quality Screening Strategy
As an estimated lidar ratio is used in the retrieval algorithm of CALIOP, the retrieval results may be unreliable. In order to include only well-defined aerosol layers, a quality filter was used in the data processing. 36 The cloud-aerosol-discrimination (CAD) score was adopted to assess the uncertainty of the cloud-aerosol discrimination algorithm. The standard CAD score ranges from −100 (most confidently aerosol) to 100 (most confidently cloud), but layers with a CAD score between −20 and 20 are usually the result of erroneous layer detection contaminated by noise. Therefore, a CAD score filter was set to keep aerosol layers with a CAD score between −100 and −20. In addition, bit 13 of the feature classification flags was required to be 1, which means that the subtype classification is confident.
As the lidar ratio is estimated with assumptions, the initial lidar ratio may be adjusted in the retrieval processing, which usually occurs for complex features and induces instabilities in the algorithm and larger uncertainties in the retrieved extinction. Therefore, a second quality filter keeps only aerosol layers having extinction QC flag values of 0 or 1.
The third screening filter excludes samples where the uncertainty of the aerosol layer-integrated particulate depolarization ratio or of the layer-integrated particulate color ratio is 99.99. An uncertainty of 99.99 is a flag value assigned by the extinction retrieval algorithm when the error estimates become unstable and the calculated uncertainty grows excessively large. The data quality screening filters are shown in detail in Table 1.
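The three screening filters above can be sketched as a single boolean mask. This is only an illustration under stated assumptions: the function name and argument names are hypothetical and do not correspond to field names in the CALIOP products, the CAD bounds are taken as inclusive, and "bit 13" is assumed to be counted from 1 (so it is tested with the mask `1 << 12`).

```python
import numpy as np

def screen_layers(cad_score, ext_qc, ipdr_unc, ipcr_unc, fcf):
    """Return a boolean mask of aerosol layers that pass all screening filters.

    All arguments are hypothetical per-layer arrays of equal length:
    CAD score, extinction QC flag, IPDR uncertainty, IPCR uncertainty,
    and the integer feature classification flag.
    """
    cad_score = np.asarray(cad_score)
    ext_qc = np.asarray(ext_qc)
    ipdr_unc = np.asarray(ipdr_unc, dtype=float)
    ipcr_unc = np.asarray(ipcr_unc, dtype=float)
    fcf = np.asarray(fcf, dtype=np.int64)

    # Filter 1a: confident aerosol layers (bounds assumed inclusive).
    cad_ok = (cad_score >= -100) & (cad_score <= -20)
    # Filter 1b: subtype classified with confidence.
    # Assumption: bit 13 is 1-based, i.e. the bit with value 1 << 12.
    subtype_ok = ((fcf >> 12) & 1) == 1
    # Filter 2: extinction QC flag equal to 0 or 1.
    qc_ok = np.isin(ext_qc, [0, 1])
    # Filter 3: drop layers whose uncertainty carries the 99.99 flag value.
    unc_ok = ~np.isclose(ipdr_unc, 99.99) & ~np.isclose(ipcr_unc, 99.99)

    return cad_ok & subtype_ok & qc_ok & unc_ok
```

Applying the mask to the feature arrays (for example `ipdr[mask]`) leaves only the layers used to build the sample database.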

Selection of Aerosol Feature Vector
Similar to the computer simulation in Sec. 4, the particulate color ratio and depolarization ratio were selected in the feature subset as they are relevant to the classification. Considering that the SNR of satellite-based lidar is relatively low, the integrated particulate depolarization ratio (IPDR) and integrated particulate color ratio (IPCR) are used for aerosol classification instead.
According to the retrieval algorithm, we can see that the integrated attenuated backscatter at 532 nm ($\gamma'_{532}$, or IAB) is helpful for aerosol classification. $\gamma'_{532}$ is highly correlated with the classification and only weakly correlated with IPDR and IPCR, so $\gamma'_{532}$ is selected into the feature subset:

$$\gamma'_{532} = \int_{\mathrm{layer}} \beta(z)\, T^{2}(z)\, \mathrm{d}z, \qquad (15)$$

where $\beta$ is the total (molecular + aerosol) backscatter coefficient and $T$ is the atmospheric transmittance due to both the molecules and aerosols. Thus, the feature vector for aerosol classification can be selected as follows:

$$\mathbf{x} = (\delta_{p,\mathrm{layer}},\ \chi_{p,\mathrm{layer}},\ \gamma'_{532})^{T}. \qquad (16)$$

Then, the 3-D feature vector space can be obtained by analyzing the CALIOP data for the year 2014. There are about 130,000 samples of clean marine, 85,000 samples of desert dust, 35,000 samples of polluted continental, 6700 samples of clean continental, 146,000 samples of polluted dust, and 53,000 samples of biomass burning. The projection distribution of the 3-D feature space for aerosol classification is shown in Fig. 8, where (a) represents the projection in IPDR-IPCR space, (b) represents the projection in IPDR-IAB space, and (c) represents the projection in IPCR-IAB space.
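As a rough numerical illustration of the third feature, the layer-integrated attenuated backscatter can be approximated by trapezoidal integration of β(z)T²(z) over the layer. The sketch below assumes that the transmittance is supplied through a one-way optical depth τ with T = exp(−τ); the function names, argument names, and units are hypothetical and are not taken from the CALIOP processing chain, which delivers this quantity directly in the layer products.

```python
import numpy as np

def integrated_attenuated_backscatter(z_km, beta, tau):
    """Trapezoidal estimate of the layer integral of beta(z) * T(z)**2.

    z_km : altitudes inside the layer (km), monotonically increasing
    beta : total backscatter coefficient at each altitude (km^-1 sr^-1)
    tau  : one-way optical depth between the lidar and each altitude
           (assumption: T = exp(-tau))
    """
    z = np.asarray(z_km, dtype=float)
    beta = np.asarray(beta, dtype=float)
    tau = np.asarray(tau, dtype=float)
    attenuated = beta * np.exp(-2.0 * tau)  # beta * T^2
    # Explicit trapezoid rule: mean of neighbouring samples times bin width.
    return float(np.sum(0.5 * (attenuated[1:] + attenuated[:-1]) * np.diff(z)))

def feature_vector(ipdr, ipcr, iab):
    """Assemble the 3-D feature vector (IPDR, IPCR, IAB) for one layer."""
    return np.array([ipdr, ipcr, iab], dtype=float)
```

Stacking one such vector per screened layer gives the N×3 sample matrix whose projections are plotted in Fig. 8.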

Identification Results
The analysis of the application to CALIOP is quite similar to the computer simulations in Sec. 4. We use the data from 2014 as a database to design the classifier, and the k-fold cross-validation method is again adopted in the classifier design because the sample points of clean continental are relatively few. At first, we adopt the first decision rule to design the classifier; the results of strict self-validation accuracies of the six types of aerosols without a rejection decision and the detailed analysis results are shown in Fig. 9.
The reidentification of clean marine, desert dust, clean continental, and polluted dust is relatively acceptable; in particular, the self-validation accuracy of clean continental aerosol is over 95%. However, the crosstalk between different types of aerosols is quite serious at the same time. The crosstalk between polluted continental and biomass burning is up to 58%, which means that it is fairly difficult to distinguish between polluted continental and biomass burning. When a rejection decision is adopted and the threshold is optimized according to the self-validation, the detailed results are shown in Fig. 10. Comparing the results shown in Figs. 9 and 10, we can conclude that the application to CALIOP can reidentify clean marine, desert dust, clean continental, and polluted dust quite well with a relatively high confidence level. However, the serious crosstalk between polluted continental and biomass burning leads to a rejection decision for most of the polluted continental and biomass burning layers.
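The interplay between k-fold cross-validation, crosstalk, and a rejection threshold can be sketched as follows. This is not the decision rule of the paper, which is defined in an earlier section; it assumes, for illustration only, a Gaussian class-conditional Bayes classifier that rejects a sample when its largest posterior probability falls below the threshold. All function names are hypothetical, and the data are synthetic rather than CALIOP samples.

```python
import numpy as np

def fit_gaussians(X, y, n_classes):
    """Estimate prior, mean, and covariance for each class."""
    params = []
    for c in range(n_classes):
        Xc = X[y == c]
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        params.append((len(Xc) / len(X), Xc.mean(axis=0), cov))
    return params

def posteriors(X, params):
    """Posterior probability of each class for every sample."""
    log_p = np.empty((len(X), len(params)))
    for c, (prior, mean, cov) in enumerate(params):
        diff = X - mean
        maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        _, logdet = np.linalg.slogdet(cov)
        log_p[:, c] = np.log(prior) - 0.5 * (maha + logdet)
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def classify(X, params, threshold=0.0):
    """Pick the most probable class; label -1 (rejected) below the threshold."""
    p = posteriors(X, params)
    labels = p.argmax(axis=1)
    labels[p.max(axis=1) < threshold] = -1
    return labels

def kfold_confusion(X, y, n_classes, k=5, threshold=0.0, seed=0):
    """Row-normalised confusion matrix from k-fold cross-validation.

    Rows are true classes; the last column is the rejected fraction.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    counts = np.zeros((n_classes, n_classes + 1))
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        params = fit_gaussians(X[train], y[train], n_classes)
        pred = classify(X[test], params, threshold)
        for true_label, pred_label in zip(y[test], pred):
            counts[true_label, pred_label] += 1  # -1 lands in the last column
    return counts / counts.sum(axis=1, keepdims=True)

# Synthetic demo: classes 0 and 2 overlap strongly, class 1 is well separated.
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0, 0.0], [4.0, 4.0, 4.0], [0.5, 0.5, 0.5]])
X = np.vstack([rng.normal(m, 1.0, size=(300, 3)) for m in means])
y = np.repeat(np.arange(3), 300)
cm_no_reject = kfold_confusion(X, y, 3)
cm_reject = kfold_confusion(X, y, 3, threshold=0.8)
```

In this toy setting the off-diagonal entries of `cm_no_reject` play the role of crosstalk. Raising the threshold moves ambiguous samples of the two overlapping classes into the rejection column instead of assigning them to the wrong class, which mirrors the behavior described for polluted continental and biomass burning.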

Discussion
According to the analysis of the application to CALIOP, an acceptable classification result for clean marine, desert dust, clean continental, and polluted dust can be achieved, and the self-validation accuracies of desert dust and clean continental are over 80%, but the crosstalk between polluted continental and biomass burning is too serious for them to be distinguished. The main reason is that the two aerosol models (polluted continental and biomass burning) used in CALIOP have similar compositions. 34 Moreover, the lidar ratios assigned to these two kinds of aerosols are similar: 70 sr at 532 nm and 40 sr at 1064 nm for smoke, and 70 sr at 532 nm and 30 sr at 1064 nm for polluted continental. 31 Thus, the overlapping area of polluted continental and smoke is very large in the optical feature space. On the other hand, the CALIOP retrieval algorithm uses a decision tree, which takes into account not only the measured optical features but also aerosol location, height, and surface type to classify aerosol layers into six types. Therefore, the serious crosstalk between polluted continental and smoke is not a surprise. Since polluted continental and biomass burning almost overlap in the current optical feature space and cannot be separated through these optical features alone, we combine them under the label "urban" to perform the classification. That is, we classify aerosol samples into five categories (clean marine, desert dust, combined urban, clean continental, and polluted dust) according to the CALIOP data. The results of strict self-validation accuracies of classification into five categories with a rejection decision, after an optimized decision threshold is adopted, are shown in Fig. 11. As one can see, the reidentification results are quite acceptable when aerosols are classified into five categories.
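Combining polluted continental and biomass burning under the label "urban" amounts to a relabeling of the training samples before the classifier is redesigned. A minimal sketch is given below; the integer class indices are hypothetical and follow only the order of the lists in this snippet, not the subtype codes stored in the CALIOP feature classification flags.

```python
import numpy as np

# Hypothetical index order for the six CALIOP aerosol types.
CALIOP_TYPES = ["clean marine", "desert dust", "polluted continental",
                "clean continental", "polluted dust", "biomass burning"]
# Five categories after merging the two overlapping types into "urban".
MERGED_TYPES = ["clean marine", "desert dust", "urban",
                "clean continental", "polluted dust"]

def merge_labels(labels):
    """Map six-type indices to five-type indices."""
    new_index = {name: i for i, name in enumerate(MERGED_TYPES)}
    lookup = np.array([
        new_index["urban"]
        if name in ("polluted continental", "biomass burning")
        else new_index[name]
        for name in CALIOP_TYPES
    ])
    return lookup[np.asarray(labels)]
```

The relabeled samples can then be passed through the same cross-validation and threshold optimization as in the six-type case.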

Summary and Conclusions
A pattern recognition model for aerosol identification with atmospheric backscatter lidars is studied, and the feasibility of using lidars to detect the components of aerosols is discussed in this paper. This model has good generalization ability and can be applied to various databases and classifications of aerosols. The process of building the characteristic sample database for aerosol classification, the aerosol optical characteristics vector, and the pattern recognition model are described in detail. Meanwhile, computer simulation for the proposed pattern recognition model of aerosol identification has been carried out. According to the results of self-validation, the model has good stability when the number of sample points in the aerosol database is large enough. Reidentification accuracies and crosstalk between each type of aerosol particles were analyzed, and the role of the threshold for aerosol classification in suppressing the crosstalk was examined and confirmed.
In addition, the applicability of this model in a reduced dimension status is analyzed in detail. From this analysis, we can conclude that single-wavelength polarized HSRL has a better ability to identify the components of aerosols than dual-wavelength polarized Mie lidar, and that single-wavelength polarized Mie lidar has the weakest ability among these three kinds of lidars. Single-wavelength polarized HSRL has a better capacity for the reidentification of ice particles, pure dust, dust mix, and maritime. Dual-wavelength polarized Mie lidar has a good ability to distinguish ice particles, pure dust, and dust mix, but single-wavelength polarized Mie lidar can only reidentify pure dust and dust mix well. This analysis also helps in understanding the main optical characteristics that contribute to classifying different kinds of aerosols.
The application to CALIOP was then carried out and analyzed in detail to illustrate the generalization ability of the model proposed in this paper. Desert dust and clean continental can be reidentified correctly with high confidence, but the crosstalk between polluted continental and biomass burning is too serious for them to be distinguished, as they share many similar characteristics. When we label polluted continental and biomass burning as one category, we can classify aerosols into five categories quite acceptably.
In short, the pattern recognition model for aerosol classification with atmospheric backscatter lidars studied in this paper has good generalization ability and good performance. It thus provides an alternative method for aerosol classification. At the same time, the large advantages of polarized HSRL, especially dual-wavelength polarized HSRL, in aerosol classification are highlighted by the analysis of this model in the reduced dimension status.

Fig. 1 Principle diagram of aerosol classification with atmospheric remote sensing lidars based on pattern recognition.

Fig. 4 Details of the strict self-validation results: (a) the accuracies of strict self-validation without decision threshold and (b) the accuracies of strict self-validation with decision threshold.

Fig. 5 Results of the generalized self-validation and sensitivity analysis: (a) the generalized selfvalidation accuracies of eight types of aerosols and (b) classification results of sensitivity analysis with different uncertainties.

Fig. 6 Details of the self-validation analysis for nondual-wavelength polarized high-spectral-resolution lidar (HSRL) without decision threshold: (a) the self-validation accuracies analysis for dual-wavelength Mie scattering lidar (polarized at the 532-nm channel), (b) the self-validation accuracies analysis for single-wavelength polarized HSRL, and (c) the self-validation accuracies analysis for single-wavelength polarized Mie scattering lidar.

Fig. 9 Details of the self-validation results without decision threshold using CALIOP data.

Fig. 11 Details of the self-validation analysis of classifying CALIOP data into five categories with decision threshold.

Table 1 The data quality screening strategy of Cloud-Aerosol Lidar with Orthogonal Polarization.