Adversarial autoencoder for detecting anomalies in soldered joints on printed circuit boards

Abstract. The inspection of solder joints on printed circuit boards is a difficult task because defects inside the joints cannot be observed directly. In addition, because anomalous samples are rarely obtained in a general anomaly detection situation, many methods use only normal samples in the learning phase. However, sometimes a small number of anomalous samples are available for learning. We propose a method that improves performance by using a small number of anomalous samples for training in such situations. Specifically, our proposal is an anomaly detection method using an adversarial autoencoder (AAE) and Hotelling's T-squared distribution. First, the AAE learns, from a large number of normal samples and a small number of anomalous samples, solder-joint features that follow the standard Gaussian distribution. Then, the anomaly score of a solder joint is calculated by Hotelling's T-squared method from the features learned by the AAE. Finally, anomaly detection is performed by thresholding this anomaly score. In experiments, we show that our method performs anomaly detection with few false positives in such situations. Moreover, we confirm that our method outperforms a conventional method using handcrafted features and a one-class support vector machine.


Introduction
Inspection of the solder joints on a printed circuit board (PCB) is challenging because defects inside the joints cannot be observed directly: the solder joints are sandwiched between the PCB and an integrated circuit (IC) chip. To solve this problem, automated x-ray inspection, which enables nondestructive inspection, is generally employed. 1,2 In our method, we employ automated x-ray inspection that collects sliced images of the solder joints by x-ray computed tomography (CT) scans on the x-ray inspection machine and detects defects in the solder joints.
In recent years, automatic visual inspection systems using machine learning, especially deep learning, have been studied as a way to classify normal and anomalous samples. This is motivated by the fact that inspection by human experts is problematic: fatigue may cause an expert to miss anomalous samples. One of the most popular anomaly detection methods using machine learning is the one-class support vector machine (OCSVM). 3 This method requires handcrafted features designed by human experts in advance. The extracted features are input to the trained OCSVM, and samples are classified by its output. In this case, the OCSVM is trained with only normal samples, but it has the disadvantage that the features must be designed by human experts in advance and the feature extraction method must be redesigned whenever the product specification changes. With deep learning methods, product images are input directly to neural networks, so feature extraction by human experts is not required. Therefore, even if the product specification changes, only network retraining is required; thus the operating cost can be greatly reduced. One typical anomaly detection approach using deep learning is to classify normal and anomalous samples with a binary classifier. 4,5 However, in anomaly detection for industrial products, it is difficult to guarantee enough anomalous product samples for training the classifier because defects rarely occur on the production line. Therefore, anomaly detection is generally performed using only normal data. 3,6 Nevertheless, because a small number of anomalous samples is sometimes available for the learning phase, performance can be expected to improve by adding anomalous samples to the training dataset. In our method, normal samples as well as a small number of anomalous samples are used for learning.
Specifically, our method extracts features that follow the standard Gaussian distribution with an adversarial autoencoder (AAE) 7 from such imbalanced samples. Anomaly scores are then calculated from the features by Hotelling's T-squared method, 8 and each solder joint is classified by an anomaly score threshold. In experiments, we show that our method is superior to a method using handcrafted features and an OCSVM on imbalanced samples. Our contribution is a method that detects defects from a large number of normal samples and a small number of anomalous samples during the quality inspection of industrial products.

Related Work
Recently, the x-ray CT method has mainly been used to detect anomalies in PCB solder joints because they cannot be observed directly. The x-rays pass through the PCB because it consists of materials with low atomic weight, whereas solder joints are imaged because they have high atomic weight. 9 For example, the solder ball portion of the solder joints is represented as voxel data, obtained from two-dimensional x-ray CT images taken from multiple directions, to capture the condition of the solder joints. 10 The voxel data are input to a three-dimensional convolutional neural network and classified by the output of the network. However, in typical anomaly detection tasks, a neural-network classifier requires both normal and anomalous samples for the training stage, and its prediction performance is unstable for unknown anomalous samples not seen during training. Therefore, training methods that can produce satisfactory classification results with only normal samples or a small number of anomalous samples are needed.
A previously developed anomaly detection method uses an OCSVM in the latent space of extracted features. However, this has some disadvantages: the feature extraction method must be designed beforehand, and the features must be redesigned for every target. To solve this problem, an autoencoder, 11 a neural-network model, extracts the features in the latent space from the input samples automatically. Anomaly detection methods using an autoencoder are based either on the reconstruction error 6 or on a normal-condition model in the latent space. 12 The proposed method belongs to the latter approach. In the former approach, the networks are usually trained with only normal samples. As a consequence, the networks can reconstruct normal samples with small reconstruction errors, whereas anomalous samples cannot be reconstructed and their reconstruction errors become large. Hence, the samples are classified by a threshold on the reconstruction error. In the latter approach, a normal model is defined in the latent space, and the likelihood of an input sample under this model is calculated to classify it. In Ref. 12, test samples are classified by thresholds not only on the reconstruction error but also on the likelihood under a Gaussian distribution of the features extracted by the AAE. Our method differs in that it thresholds anomaly scores calculated by Hotelling's T-squared method rather than the reconstruction error and likelihood.

X-Ray Computed Tomography
Because the solder joints sandwiched between the PCB and the IC chip cannot be inspected directly, we obtain sliced images of the solder joints with x-ray CT. When the IC chip and the PCB are joined, many solder joints are formed. Our approach is to detect each solder joint and cut out these regions in advance to capture sliced images of each solder joint. λ sliced images are taken from each solder joint, and we define these sliced images as the sample for one solder joint. Hence, on a PCB containing anomalous solder joints, only those joints are treated as anomalous samples. An overview of the method for capturing sliced images of a solder joint is shown in Fig. 1. We took eight sliced images from one solder joint. Each is assigned a layer number corresponding to its image layer.
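The per-joint sample assembly described above can be sketched as follows; a minimal illustration in which the array shapes (eight slices of 64 × 64 pixels, taken from the experimental setup) and the function name are our own assumptions:

```python
import numpy as np

def assemble_sample(sliced_images):
    """Stack the sliced images of one solder joint into a single
    multi-channel sample of shape (height, width, lambda)."""
    return np.stack(sliced_images, axis=-1)

# Illustrative: lambda = 8 sliced images of 64 x 64 pixels for one joint.
slices = [np.zeros((64, 64), dtype=np.float32) for _ in range(8)]
sample = assemble_sample(slices)
print(sample.shape)  # (64, 64, 8)
```

Each solder joint thus becomes one multi-channel sample, so a whole PCB yields as many samples as it has detected joints.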
Examples of the captured sliced images of normal samples are shown in Fig. 2(a), and Fig. 2(b) shows anomalous samples.

Hotelling's T-Squared Method
The number of anomalous samples is much smaller than the number of normal samples; thus, the normal model is defined from only normal samples or with a small number of anomalous samples. Assume that the normal model generated from the dataset Z = (z_1, z_2, ..., z_n), where each z = (z^1, z^2, ..., z^d)^T ∈ R^d, is represented by the parameter θ. The negative log-likelihood a(z') of an unknown sample z' is defined as an anomaly score:

a(z') = -ln q(z' | θ).  (1)

In the normal model q(z|θ), the probability density of the normal samples is high and that of the anomalous samples is low. Therefore, the anomaly scores of the former are low and those of the latter are high, and normal and anomalous samples can be classified by a threshold on the anomaly score. Hotelling's T-squared distribution is an anomaly detection method that can be applied to a dataset following a Gaussian distribution. Here, a(z') is calculated as Eq. (2) using the two parameters of the Gaussian distribution, mean vector μ and variance-covariance matrix Σ:

a(z') = (d/2) ln(2π) + (1/2) ln|Σ| + (1/2) (z' - μ)^T Σ^{-1} (z' - μ).  (2)

The last term of Eq. (2) is the squared Mahalanobis distance up to a factor of 1/2. Moreover, if μ = 0 and Σ = I, the dataset follows the standard Gaussian distribution, and a(z') is calculated as

a(z') = (d/2) ln(2π) + (1/2) z'^T z'.  (3)

The last term of Eq. (3) is the squared Euclidean norm of z' up to a factor of 1/2. In Hotelling's T-squared method, the quadratic form z'^T z' follows the chi-square distribution with d degrees of freedom and a scale factor of 1.
The chi-square distribution with d = 16 is shown in Fig. 3. The graph shows the likelihood of the a(z') value of a sample z' drawn from the normal model following the standard Gaussian distribution. When the a(z') value is high, the probability of the sample being normal is low; therefore, the sample can be regarded as anomalous. Hence, normal and anomalous samples can be classified by predetermining an upper-tail probability on the graph and setting a one-dimensional threshold.

Fig. 3 Plot of the chi-square distribution with degree of freedom d = 16. The vertical axis is the density of the distribution, and the horizontal axis is the a(z') value.
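The scoring and thresholding above can be sketched in a few lines; a minimal illustration using scipy's chi-square quantile function, in which the upper-tail probability alpha and the example vectors are our own illustrative choices:

```python
import numpy as np
from scipy.stats import chi2

def anomaly_score(z):
    """Anomaly score of a latent vector z assumed to follow the standard
    Gaussian: the squared Euclidean norm z'^T z', which is chi-square
    distributed with d degrees of freedom."""
    z = np.asarray(z, dtype=float)
    return float(z @ z)

d = 16
alpha = 0.01  # illustrative upper-tail probability
threshold = chi2.ppf(1.0 - alpha, df=d)

z_near = np.zeros(d)       # near the mode of the normal model
z_far = np.full(d, 3.0)    # far from the mode

print(anomaly_score(z_near) > threshold)  # False -> classified normal
print(anomaly_score(z_far) > threshold)   # True  -> classified anomalous
```

Setting alpha fixes the expected fraction of normal samples that exceed the threshold, which is the one-dimensional decision rule described above.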

Adversarial Autoencoder
Although images are high-dimensional data, they can be compressed to lower-dimensional features in the latent space because normal samples are assumed to share common features. An autoencoder is a neural-network model for low-dimensional feature extraction. It is composed of two networks: an encoder (En) and a decoder (De). En is trained to extract features as latent vector z ∼ q(z) from input x ∼ p_data(x), where p_data(x) is the data distribution of the input samples. De is trained to reconstruct the input x from z. The loss function is the reconstruction error:

L_AE = E_{x ∼ p_data(x)} [ ||x - De(En(x))||^2 ].  (4)

Principal component analysis 13 is another conventional dimensionality reduction method, but it can only map linearly from the high-dimensional space to the low-dimensional latent space. The autoencoder enables nonlinear mapping through activation functions and deep layers. This lets the model extract more representative features of complex structured data because the projection functions En and De are more flexible.
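The reconstruction loss of Eq. (4) can be written out directly; a minimal numpy sketch in which the function name, batch shape, and toy reconstruction are our own illustrative assumptions:

```python
import numpy as np

def reconstruction_loss(x, x_recon):
    """Squared reconstruction error ||x - De(En(x))||^2 per sample,
    averaged over the batch, as in Eq. (4)."""
    x = np.asarray(x, dtype=float)
    x_recon = np.asarray(x_recon, dtype=float)
    per_sample = np.sum((x - x_recon) ** 2, axis=tuple(range(1, x.ndim)))
    return float(np.mean(per_sample))

batch = np.ones((4, 64, 64, 8))                  # illustrative batch
perfect = reconstruction_loss(batch, batch)       # 0.0
noisy = reconstruction_loss(batch, batch * 0.9)   # positive
print(perfect, noisy)
```

Normal samples, being well reconstructed, drive this loss toward zero, which is what makes the latent features representative of the normal class.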
Although low-dimensional features of input samples can be acquired by the autoencoder, the distribution of the features in the latent space cannot be specified. Therefore, to apply Hotelling's T-squared method described in Sec. 3.2 to the distribution of the features extracted by the autoencoder, we employ an AAE consisting of the autoencoder and discriminator networks shown in Fig. 4. The AAE can match the distribution of the latent space to an arbitrary distribution in an adversarial manner. 7 To incorporate Hotelling's T-squared method into the deep generative model, we train the AAE with an adversarial loss between the distribution of the encoded latent vectors and the standard Gaussian distribution. Furthermore, we assume the real-world situation in which a large number of normal samples and a small number of anomalous samples are available. Adversarial training with such imbalanced samples causes the normal samples to be mapped to the high-density region of the standard Gaussian distribution and the anomalous samples to the low-density region. This means that the AAE constructs a normal model that follows the standard Gaussian distribution in the latent space, so Hotelling's T-squared method can be applied there. The arbitrary distribution is chosen as the standard Gaussian distribution to simplify the anomaly score calculation described in Sec. 3.2.
The discriminator is trained to determine whether the input vector is sampled from latent distribution q(z) or from standard Gaussian distribution p(z). In contrast, the En is trained to make q(z) approximate p(z). These actions are called adversarial training and are defined by the loss function in Eq. (5), where the En is trained to minimize and the discriminator D is trained to maximize the function V, and E denotes the expectation over the subscripted distribution:

min_En max_D V(D, En) = E_{z ∼ p(z)} [log D(z)] + E_{x ∼ p_data(x)} [log(1 - D(En(x)))].  (5)

The discriminator D updates its parameters to output D(z) = 1 when the input vector z is sampled from p(z) and D(z) = 0 when z is sampled from q(z). Therefore, when the discriminator maximizes Eq. (5), it can determine whether the input z is sampled from p(z) or q(z). The loss function of the discriminator, L_D, follows from Eq. (5):

L_D = -E_{z ∼ p(z)} [log D(z)] - E_{x ∼ p_data(x)} [log(1 - D(En(x)))].  (6)

In contrast, Eq. (5) is minimized when the En approximates q(z) to p(z) sufficiently well, and the loss function of the En, L_En, follows from Eq. (5):

L_En = E_{x ∼ p_data(x)} [log(1 - D(En(x)))].  (7)

Fig. 4 Architecture of an AAE consisting of an autoencoder and a discriminator. In the autoencoder, the En extracts latent vector z from input images x sampled from p_data(x), and the De reconstructs x from z. The discriminator determines whether the input is sampled from standard Gaussian distribution p(z) or latent distribution q(z).

To summarize, the proposed method proceeds as follows:
1. Each solder joint on a PCB is detected, and λ sliced images are captured by x-ray CT on an x-ray inspection machine.
2. The λ sliced images are combined into one sample, and the sample is input to the AAE network as λ channels. The AAE is trained with a large number of normal samples and a small number of anomalous samples.
3. Test samples are input to the trained AAE, and the latent vector is obtained from the output of the En. The anomaly score for each latent vector is calculated by Hotelling's T-squared method.
4. Normal and anomalous samples are classified by setting an anomaly score threshold.
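The two adversarial losses, Eqs. (6) and (7), can be sketched given the discriminator's output probabilities; a minimal numpy illustration in which the function names, the epsilon for numerical stability, and the example probabilities are our own assumptions:

```python
import numpy as np

def discriminator_loss(d_prior, d_latent, eps=1e-12):
    """L_D of Eq. (6): push D toward 1 on prior samples z ~ p(z) and
    toward 0 on encoded samples En(x); d_* are D's output probabilities."""
    d_prior = np.asarray(d_prior, dtype=float)
    d_latent = np.asarray(d_latent, dtype=float)
    return float(-np.mean(np.log(d_prior + eps))
                 - np.mean(np.log(1.0 - d_latent + eps)))

def encoder_adversarial_loss(d_latent, eps=1e-12):
    """L_En of Eq. (7): the encoder minimizes log(1 - D(En(x))) by
    making encoded samples look like prior samples to D."""
    d_latent = np.asarray(d_latent, dtype=float)
    return float(np.mean(np.log(1.0 - d_latent + eps)))

# A confident discriminator: high on prior samples, low on encoded ones.
good_d = discriminator_loss(d_prior=[0.9, 0.95], d_latent=[0.1, 0.05])
# A fooled discriminator: low on prior samples, high on encoded ones.
fooled_d = discriminator_loss(d_prior=[0.1, 0.05], d_latent=[0.9, 0.95])
print(good_d < fooled_d)  # True
```

Note how the encoder's objective rewards fooling the discriminator: L_En drops as D(En(x)) rises, which is exactly the pressure that pulls q(z) toward p(z).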

Experiments
The architecture used in the experiments is shown in Fig. 5. Each sample consisted of eight layer images, as shown in Fig. 1. We resized the sliced images to 64 × 64 pixels and input λ = 8 sliced images to the AAE network as eight channels.
We compared our method with a method using handcrafted features and an OCSVM to show that the features extracted automatically by the AAE are superior to features designed by human experts for machine-learning classification. The handcrafted features were four-dimensional: the substrate area, head-in-pillow area, circularity, and luminance ratio.
The experimental results are shown in Table 1. In this table, the result for handcrafted features + OCSVM was obtained using all normal samples for training the OCSVM; Table 2 shows results of training the OCSVM with fewer normal samples. These results show that accuracy improved as the number of training samples increased; however, it remained inferior to the proposed method even when all normal samples were used for training the OCSVM. The inputs to the AAE network were 64 × 64 × 8. We used a batch size of 64 and 100 epochs for the AAE, and an OCSVM with a radial basis function kernel and γ = 0.11. Our code is available at https://github.com/rearwist3/aae_solder_tf. We chose 100 epochs empirically by observing the performance of the model every 20 epochs up to 200 epochs. Figure 6(a) contains all of the results, and Fig. 6(b) omits the results at 20 epochs to show the details of the false positive rate (FPR) from 40 to 200 epochs. Because a low FPR was obtained at 100 and 120 epochs with 10 anomalous training samples, we chose 100 epochs. The computation time of the learning phase through 100 epochs was ∼80 min on an RTX 2080 Ti GPU.
We set the threshold to achieve a 100% true positive rate (TPR) in both models to avoid classifying anomalous samples as normal. The FPR of handcrafted features + OCSVM was 1.10% after training with 3,510,000 normal samples. In contrast, AAE + Hotelling's T-squared method could be trained with only 40,000 normal and 10 anomalous samples and classified normal and anomalous samples with fewer false positives. To verify the results, we selected 10 anomalous training samples at random three times and trained the network with each dataset for 100 epochs. The mean and standard deviation of the resulting FPR were 0.07 ± 0.01%.

We also show the results when training the network with 0, 20, 50, and 100 anomalous samples to demonstrate that including anomalous samples in the training dataset improves anomaly detection performance and to find the optimal balance between normal and anomalous training samples. The resulting FPRs were 5.15%, 0.93%, 0.10%, and 1.25%, respectively. These results confirm that including anomalous samples in the training dataset is effective, and the case with 10 anomalous training samples achieved the best performance with the fewest anomalous training samples.

Moreover, we compared our method with classification by a binary classifier, a typical anomaly detection method using deep learning. This experiment shows the effectiveness of the proposed method when a sufficient number of anomalous samples for training a classifier cannot be guaranteed. Table 3 shows the result when the binary classifier is trained with a large number of normal and a small number of anomalous samples. The classifier could not classify normal and anomalous samples when the numbers of normal and anomalous training samples were imbalanced. Table 3 also shows the result for the binary classifier when the numbers of normal and anomalous samples are balanced.
In this experiment, we reduced the number of normal samples to match the number of anomalous samples (undersampling) and then trained the classifier. Neither result was as good as that of the proposed method; we thus conclude that the proposed method is effective when the number of anomalous samples is small.
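The evaluation protocol above, setting the threshold so that every anomalous test sample is caught (100% TPR) and then measuring the FPR on normal test samples, can be sketched as follows; the scores and the function name are our own illustrative assumptions, not the paper's actual data:

```python
import numpy as np

def fpr_at_full_tpr(normal_scores, anomalous_scores):
    """Set the threshold to the minimum anomalous score so every
    anomalous sample is flagged (100% TPR), then report the fraction
    of normal samples wrongly flagged (the FPR)."""
    normal_scores = np.asarray(normal_scores, dtype=float)
    anomalous_scores = np.asarray(anomalous_scores, dtype=float)
    threshold = anomalous_scores.min()
    return float(np.mean(normal_scores >= threshold))

# Illustrative anomaly scores: most normal samples score low.
normal = [1.0, 2.0, 3.0, 28.0]
anomalous = [25.0, 40.0, 60.0]
print(fpr_at_full_tpr(normal, anomalous))  # 0.25
```

Under this protocol, a better-separated score distribution directly translates into a lower FPR, which is the quantity compared across methods in Tables 1 to 3.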

Conclusion
In this paper, we proposed a method for inspecting solder joints on PCBs by anomaly detection using an AAE. We captured sliced images of solder joints using x-ray CT, and features of the sliced images following the standard Gaussian distribution were extracted by the AAE. Defects were detected by applying Hotelling's T-squared method to these features. Experimental results showed that the AAE could classify normal and anomalous samples with few false positives even when the number of training samples was small. However, when compressing high-dimensional data to a low-dimensional space, the number of latent dimensions required for the full expression of the high-dimensional data depends on the inputs, and we need to select the optimal number of latent dimensions for every dataset. Statistical methods to optimize the number of latent dimensions will be studied in future work.

Table 3 The results when the samples were classified by the binary classifier. The first row denotes the results of training the classifier with the imbalanced dataset (without undersampling). The second row denotes the results of undersampling the dataset (with undersampling).