Improved Hough transform by modeling context with conditional random fields for partially occluded pedestrian detection

Abstract. Traditional Hough transform-based methods detect objects by casting votes to object centroids from object patches. It is difficult to disambiguate object patches from the background by a classifier without contextual information, as an image patch only carries partial information about the object. To leverage the contextual information among image patches, we capture the contextual relationships on image patches through a conditional random field (CRF) with latent variables denoted by locality-constrained linear coding (LLC). The strength of the pairwise energy in the CRF is measured using a Gaussian kernel. In the training stage, we modulate the visual codebook by learning the CRF model iteratively. In the test stage, the binary labels of image patches are jointly estimated by the CRF model. Image patches labeled as the object category cast weighted votes for object centroids in an image according to the LLC coefficients. Experimental results on the INRIA pedestrian, TUD Brussels, and Caltech pedestrian datasets demonstrate the effectiveness of the proposed method compared with other Hough transform-based methods.


Introduction
Pedestrian detection is a fundamental challenge in computer vision due to great variation in appearance, changes in illumination, poor resolution, and partial occlusions. The general framework of pedestrian detection can be decomposed into three modules: (i) generate the region proposals that represent object hypotheses in a test image, (ii) classify the region proposals, and (iii) refine the region proposals to obtain accurate localization of pedestrians.
Hough transform-based methods 1-10 constitute one widely used framework. The applicability of the Hough transform framework can be attributed to its robustness against partial occlusions, as indicated in Refs. 1 and 3-5. Another attractive property of the Hough transform is its simplicity. The Hough transform framework for pedestrian detection includes three primary steps: (i) construct a visual codebook, (ii) cast probabilistic votes for the object center into a Hough image according to the codebook using the voting elements of the test image, and (iii) search for maxima in the Hough image as object hypotheses. Although some Hough transform methods demonstrate the significance of the visual codebook and voting weights 1,2,4 for detection performance, none use contextual information. Voting elements, which denote the image patches classified into object categories, cast probabilistic votes into a Hough image.
However, an image patch contains only partial information about an object, and its appearance is highly variable. Thus, it is difficult to disambiguate object patches from background patches by a classifier at the local level, and detection performance can be reduced by the noisy votes cast by background patches. Fortunately, conditional random field (CRF) frameworks modeling context have achieved impressive performance for semantic segmentation, [11][12][13][14][15] image classification, 16 saliency detection, 17 and object detection. 18 The CRF distribution can be formulated by a probabilistic graphical model in which variables are interdependent rather than independent. Given an image, CRF inference is performed by a maximum a posteriori (MAP) or maximum posterior marginal criterion, and all patches can be classified into an object category or background simultaneously. In other words, the CRF model uses whole-image information instead of local information to obtain all patch labels.
In this paper, we build a CRF model that regards the locality-constrained linear coding (LLC) 19 code of a local feature as a latent variable, which is more informative than the corresponding local feature. In addition, we apply a Gaussian kernel to neighboring features to measure the strength of the pairwise energy in the CRF framework. In the training stage, we iteratively modulate the codebook and CRF model parameters by a max-margin approach with a maximum-likelihood criterion. Furthermore, to learn the spatial-occurrence distribution of the codebook, offset vectors from the local features to the object center in a training image are assigned to matching codewords. In the detection stage, all image patches are classified into the object category or background simultaneously by CRF inference, and the patches classified into the object category are used as voting elements in the Hough transform. Each voting element casts weighted votes into the Hough image according to its LLC coefficients on the codewords, and the use of LLC reduces the reconstruction error of representing the voting element by a linear combination of codewords. 20 This may result in better balanced probabilistic votes than uniform votes in the Hough image. Maxima in the Hough image, in which all votes accumulate, are regarded as object hypotheses. The proposed method makes three main contributions:
• It models the contextual relationships among image patches with a CRF whose latent variables are LLC codes.
• It optimizes the codebook through CRF learning.
• It casts weighted votes into the Hough image by the encoding strategy.
We evaluated our method on the INRIA pedestrian, TUD Brussels, and Caltech pedestrian datasets. This work balances speed, accuracy, and simplicity. Experiments demonstrated the effectiveness of the proposed method compared with other Hough transform-based methods, benefiting from the contextual information in images and the weighted Hough voting strategy. The rest of the paper is structured as follows. We review the literature on Hough transform methods, encoding methods, and CRFs in Sec. 2. We describe our method for pedestrian detection in Sec. 3. We evaluate the proposed method on several challenging datasets in Sec. 4, and we provide our conclusions in Sec. 5.

Related Work
In this section, we first discuss the Hough transform-based methods for pedestrian detection and then briefly describe encoding methods and CRF that are related to the proposed method.
In recent years, methods based on the Hough transform framework have driven progress in pedestrian detection. The majority of Hough transform methods focus on codebook learning, voting element generation, and hypothesis search. 4,5 The implicit shape model (ISM), 1 from which many other Hough transform-based methods derive, constructs a visual codebook by clustering local features in an unsupervised manner. Gall and Lempitsky 2 proposed the Hough forest to build decision trees in a supervised manner, where the set of leaves can be regarded as a discriminative codebook that produces probabilistic votes with better voting performance. Barinova et al. 4 proposed an MAP inference method, rather than nonmaximum suppression (NMS), to seek the maxima in the Hough image. Wang et al. 5 proposed a structured Hough transform method that incorporates depth-dependent contexts into a codebook-based pedestrian detection model.
Cabrera and López-Sastre 6 proposed a boosted Hough forest, in which decision trees are trained in a stage-wise fashion to optimize a global loss function. Liu et al. 9 proposed a pair Hough model (PHM) for detecting objects, whose voting elements are extracted from interest points to handle object rotation. In a study by Liu et al., 10 extremely randomized trees (ERTs) were constructed from the features of soft-labeled training blobs, and a Hough image was accumulated by votes from features based on the soft-labeled ERTs. Different from these Hough transform methods, the proposed method regards LLC codes as hidden variables in a unified CRF framework that exploits the contextual information between neighboring image patches, from which the visual codebook and the CRF parameters are learned in a supervised manner.

Encoding Methods
Many approaches for encoding local features (image patches) have been proposed. 19,20,40 Lazebnik et al. 40 proposed spatial pyramid matching (SPM), a simple and computationally efficient extension of an orderless bag-of-features image representation. Yang et al. 20 developed an extension of the SPM method, called ScSPM, for nonlinear codes. Wang et al. 19 proposed LLC in place of the vector quantization (VQ) coding in traditional SPM, utilizing the locality constraint to project each local feature into its local coordinate system. Moreover, dictionary learning plays a significant role in encoding. 17,41 Bach et al. 41 demonstrated that better results can be obtained when the dictionary is adapted to the specific task. Yang and Yang 17 proposed a top-down saliency model that jointly learns a discriminative dictionary and a CRF to improve sparse coding (SC). However, the codebooks optimized in these methods are utilized for image classification or saliency detection rather than Hough transform-based pedestrian detection.
LLC can represent local features by codewords with a lower reconstruction error than VQ 42 and SC. 20 This property of LLC motivated us to utilize the code coefficients of a voting element as codeword weights to cast better balanced votes in the Hough image.

Locality-constrained linear coding
Feature encoding decomposes a local feature x into a linear combination of codewords over a predefined codebook C = [c_1, c_2, …, c_M] ∈ R^{N×M}, where c_i denotes the i'th codeword, which is N-dimensional. While the SC 20 method applies a sparsity constraint to select similar codewords for a local feature from the codebook, the LLC method 19 incorporates a locality constraint, which necessarily leads to sparsity but not vice versa. The visual information of image patches contained in the codebook is transferred into the latent variables of the CRF model by the LLC code, which is more informative than the local feature itself. The LLC code of a local feature x is obtained by solving the following optimization problem:

L(x, C) = arg min_l ‖x − Cl‖² + λ‖d ⊙ l‖²,  s.t. 1⊤l = 1,  (1)

where ⊙ denotes element-wise multiplication, λ is used to control the locality constraint, l is the vector of weights corresponding to the codewords, and d ∈ R^M is the locality adaptor that corresponds to the similarities between the codewords and the local feature x. Specifically,

d = exp[dist(x, C)/σ],  (2)

where dist(x, C) = [dist(x, c_1), …, dist(x, c_M)]⊤, dist(x, c_i) denotes the Euclidean distance between x and c_i, and σ denotes the weight-decay speed of the locality adaptor. Note that the LLC code in Eq. (1) is not sparse in the sense of the ℓ0 norm, but it is sparse in the sense that the solution has few significant values. In the LLC method, the solution of the optimization problem can be computed analytically as

l̃ = [(C − 1x⊤)(C − 1x⊤)⊤ + λ diag(d)] \ 1,  (3)
l = l̃ / (1⊤l̃),  (4)

where (C − 1x⊤)(C − 1x⊤)⊤ denotes the data covariance matrix, \ denotes matrix left division, and / denotes division. Equation (4) normalizes the code so that its entries sum to one.
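To make the encoding step concrete, the closed-form solution of Eqs. (3) and (4) can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions not fixed by the text: the codebook is stored with rows as codewords, and the values of λ and σ are illustrative, not the paper's settings.

```python
import numpy as np

def llc_code(x, C, lam=1e-4, sigma=1.0):
    """Encode feature x (N,) over codebook C (M, N), rows as codewords.

    Solves l = argmin ||x - C^T l||^2 + lam * ||d . l||^2, s.t. 1^T l = 1,
    via the analytical solution of Eqs. (3)-(4).
    """
    d = np.exp(np.linalg.norm(C - x, axis=1) / sigma)  # locality adaptor, Eq. (2)
    Ci = C - x                                         # shifted codebook (C - 1 x^T)
    cov = Ci @ Ci.T                                    # data covariance matrix
    l = np.linalg.solve(cov + lam * np.diag(d), np.ones(len(C)))  # Eq. (3)
    return l / l.sum()                                 # Eq. (4): sum-to-one
```

Because of the locality adaptor d, codewords far from x receive a large penalty, so the resulting code concentrates its weight on the few nearest codewords.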

Conditional Random Field
A CRF is a flexible framework for modeling contextual information that can be grouped into three levels: pixels, patches, and objects. It is widely used for image semantic segmentation and patch-level labeling [11][12][13][14][15],18 by addressing computer vision problems with CRF inference. Kumar and Hebert 18 proposed the discriminative random field, which inherits the CRF concept for labeling man-made structures at the patch level. To disambiguate local image information, He et al. 11 proposed a multiscale CRF with three separate components at different scales for image semantic segmentation. Quattoni et al. 16 proposed a hidden-state CRF for image classification that models the latent structure of the input domain via intermediate hidden variables. Toyoda and Hasegawa 12 proposed a CRF incorporating local and global image information, so that global consistency of layouts is achieved from a global viewpoint. Shotton et al. 13 proposed a CRF model for semantic segmentation that uses a texture-layout filter incorporating texture, layout, and contextual information. To overcome the excessive boundary smoothing of semantic segmentation with an adjacency CRF structure, Krähenbühl and Koltun 14 proposed a fully connected CRF that establishes pairwise potentials, consisting of a linear combination of Gaussian kernels, on all pairs of pixels in the image. Chen et al. 15 proposed the DeepLab system, which couples a fully connected CRF with a deep convolutional network-based pixel-level classifier and exploits long-range dependencies to capture fine edge details. Yang and Yang 17 proposed a top-down saliency model by constructing a CRF upon the SC of image patches; the codebook was optimized by jointly learning the CRF model. To speed up the saliency detection procedure, Yang and Xiong 43 proposed a saliency detection method combining LLC and CRF. While these saliency detection methods use the CRF to generate saliency maps directly, the proposed method builds the CRF model to obtain Hough voting elements.

The CRF 13,18 defines a conditional distribution over the labels Y = {y_i}_{i∈S} given the observations X = {x_i}_{i∈S}, which can be written as

P(Y|X) = (1/Z) exp{−[Σ_{i∈S} ϕ_i(y_i|X) + α Σ_{i∈S} Σ_{j∈N_i} ϕ_ij(y_i, y_j|X)]},  (5)

where Z is a normalizing constant known as the partition function, ϕ_i and ϕ_ij are the unary and pairwise potentials, respectively, S is a set of sites that refers to elements (pixels or patches) in an image, N_i is the set of neighbors of site i, and α is a coefficient that modulates the effect of the pairwise potential ϕ_ij. In general, the unary potential ϕ_i denotes the penalty assigned by a local classifier applied to an image patch while ignoring its neighbors. The pairwise potential ϕ_ij is a penalty on label inconsistency, reflecting the assumption that neighboring pixels or patches should be classified into the same object category.
Our Method
Our pedestrian detection system consists of two modules: (i) a CRF model with latent variables denoted by the LLC codes of image patches; the visual codebook can be optimized by learning this model, and a spatial-occurrence distribution specifying where each codeword may be found on the object can then be learned; and (ii) a Hough voting module; patch labels are jointly estimated in a test image by CRF inference, and the patches classified into the object category are voting elements that cast weighted votes into the Hough image. Maxima in the Hough image are regarded as object hypotheses. An overview of the detection procedure is shown in Fig. 1.

Conditional Random Field Model
We exploit the contextual information in an image through a CRF model that uses LLC codes as latent variables and applies a Gaussian kernel to measure the strength of the pairwise energy. The conditional distribution of the labels is

P(Y|L; υ) = (1/Z) exp{−E(Y, L; υ)},  (6)

with the energy function

E(Y, L; υ) = Σ_{i∈S} φ_i(y_i, l_i) + Σ_{i∈S} Σ_{j∈N_i} φ_ij(y_i, y_j, l_i, l_j),  (7)

where S is a set of sites that refers to patches in an image and N_i is the set of neighbors of site i. The unary energy φ_i is measured by the total contribution of the sparse codes, φ_i = −y_i υ_1⊤l_i, where υ_1 ∈ R^M is a weight vector and M denotes the number of codewords. The pairwise energy is φ_ij = υ_2 G(l_i, l_j) μ(y_i, y_j), where the scalar υ_2 measures the weight of the pairwise energy term, G(l_i, l_j) is a Gaussian kernel measuring the strength of the pairwise energy, and μ is an indicator function equal to 1 for different labels and 0 otherwise. The Gaussian kernel is defined as

G(l_i, l_j) = exp(−‖l_i − l_j‖²/θ),  (8)

where l_i and l_j denote the LLC codes of the neighboring local features x_i and x_j, respectively, and the degree of similarity is controlled by the parameter θ.
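As a toy illustration of how a labeling is scored under this energy, the sketch below evaluates the unary and Gaussian-kernel pairwise terms directly. Binary labels y_i ∈ {−1, +1}, the hand-built neighborhood, and the parameter values are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def crf_energy(y, L, neighbors, v1, v2, theta=1.0):
    """E(Y, L; v) = sum_i -y_i v1^T l_i
                    + sum_i sum_{j in N_i} v2 G(l_i, l_j) [y_i != y_j]."""
    unary = sum(-y[i] * v1 @ L[i] for i in range(len(y)))
    pair = sum(v2 * np.exp(-np.sum((L[i] - L[j]) ** 2) / theta)  # Gaussian kernel
               for i in range(len(y)) for j in neighbors[i]
               if y[i] != y[j])                                  # indicator mu
    return unary + pair
```

Note that splitting two neighbors with very similar LLC codes into different labels incurs a large pairwise penalty, which is exactly the smoothing behavior the model relies on to suppress isolated noisy patch labels.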
Like most CRF models, [11][12][13] the energy function is linear in the parameters υ = [υ_1, υ_2], but it is nonlinear in the codebook C, which is implicitly involved through L(x, C) in Sec. 2.2. This nonlinear parametrization makes the model challenging to learn. We discuss the learning approach in Sec. 3.2.

Joint CRF and Codebook Learning
Following Yang and Yang's 17 method, we learn the CRF parameters and the codebook jointly under the CRF model. Let X = {X^(k)}, k = 1, …, K, be a set of K training images and Y = {Y^(k)} be the corresponding set of labels. We aim to estimate the CRF parameter vector υ and the codebook C by maximizing the joint likelihood of the training data,

{υ*, C*} = arg max_{υ, C∈C} Π_k P(Y^(k)|L^(k); υ),  (9)

where L^(k) ≜ L[X^(k), C] and C is the convex set of feasible codebooks, specified by the constraint of Eq. (10) (typically a unit bound on each codeword norm). Evaluating the partition function Z of Eq. (6) is an NP-hard problem. Referring to the max-margin CRF learning approach, 44 we instead look for the optimal weights υ and codebook C that assign the training labels Y^(k) a probability greater than or equal to that of any other labeling Y of instance k,

P(Y^(k)|L^(k); υ) ≥ P(Y|L^(k); υ), ∀Y.  (11)

The partition function Z can be canceled from both sides of the constraints [Eq. (11)], and we express the constraints in terms of energies,

E[Y^(k), L^(k); υ] ≤ E[Y, L^(k); υ].  (12)

Moreover, we desire the energy of the ground truth, E[Y^(k), L^(k); υ], to be lower than the energy E[Y, L^(k); υ] of any other label configuration on the training data by a margin, which yields the new constraint set of Eq. (13). Therefore, the optimal weights υ and the codebook C can be learned by minimizing the following objective function:

min_{υ, C∈C} Σ_k l_k(υ, C) + γ‖υ‖²,  (15)

where l_k(υ, C) ≜ E[Ŷ^(k), L^(k); υ] − E[Y^(k), L^(k); υ] and γ controls the regularization of the weights υ.
The above objective function is optimized by a stochastic gradient descent algorithm, which is summarized in Algorithm 1.
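Algorithm 1 is not reproduced in this excerpt, but the core of one stochastic update can be sketched: exhaustively find the most violated labeling on a toy graph, then take a subgradient step on the unary weights. The {−1, +1} label set, the learning rate, the brute-force search (feasible only on tiny graphs), and updating only υ_1 are simplifying assumptions for illustration, not the paper's procedure.

```python
import itertools
import numpy as np

def energy(y, L, nbrs, v1, v2, theta=1.0):
    """E[Y, L; v]: unary -y_i v1^T l_i plus Gaussian-kernel pairwise terms."""
    unary = sum(-y[i] * v1 @ L[i] for i in range(len(y)))
    pair = sum(v2 * np.exp(-np.sum((L[i] - L[j]) ** 2) / theta)
               for i in range(len(y)) for j in nbrs[i] if y[i] != y[j])
    return unary + pair

def most_violated(y_true, L, nbrs, v1, v2):
    # argmin_Y E[Y, L; v] - Delta[Y, Y^(k)] over all labelings (cutting plane)
    return min(itertools.product([-1, 1], repeat=len(y_true)),
               key=lambda y: energy(y, L, nbrs, v1, v2)
                             - sum(a != b for a, b in zip(y, y_true)))

def sgd_step(y_true, L, nbrs, v1, v2, lr=0.1):
    y_hat = most_violated(y_true, L, nbrs, v1, v2)
    if y_hat == tuple(y_true):
        return v1                          # no constraint is violated
    # subgradient of E[Y_true] - E[Y_hat] with respect to v1 (unary part)
    g = sum((-yt + yh) * l for yt, yh, l in zip(y_true, y_hat, L))
    return v1 - lr * g
```

After a few such steps on separable toy data, the ground-truth labeling becomes the minimizer of the margin-augmented energy, i.e., no constraint remains violated.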

Learning the Spatial-Occurrence Distribution
In this section, we learn the nonparametric spatial-occurrence distribution P_C for each codeword of the optimized codebook C, which is used to cast votes into the Hough image in the test stage. An occurrence represents an image patch of the training images that matches a codeword. As in other Hough transform methods, 1,4,5 a codeword represents a specific object part whose position relative to the object center is uncertain. Each codeword corresponds to a set of occurrences in the training images.
As shown in Algorithm 2, we iterate over all training images to match the codewords to local features. Here, we activate the codewords whose similarity exceeds a matching threshold of 0.7 (discussed in Sec. 4.1). For every codeword, we store all occurrence positions, which reflect its spatial distribution over the object area, in a nonparametric form (as a list of occurrences).
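A minimal sketch of this occurrence-list construction follows. Algorithm 2 itself is not reproduced here, so the choice of cosine similarity as the matching measure and the data layout are assumptions made for illustration.

```python
import numpy as np

def learn_occurrences(patches, codebook, threshold=0.7):
    """For every codeword, store offsets of matching training patches
    relative to the object center (a nonparametric occurrence list).

    patches:  iterable of (feature, offset) pairs, where
              offset = patch position - object center.
    codebook: (M, N) array with rows as codewords.
    """
    occurrences = {m: [] for m in range(codebook.shape[0])}
    for feat, offset in patches:
        # cosine similarity of the patch feature to every codeword
        sims = codebook @ feat / (
            np.linalg.norm(codebook, axis=1) * np.linalg.norm(feat) + 1e-12)
        for m in np.flatnonzero(sims > threshold):   # activated codewords
            occurrences[m].append(offset)
    return occurrences
```

Each list grows with the training set, so at test time a codeword's votes are simply replayed from its stored offsets.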

Weighted Hough Voting Strategy
The visual codebook C was optimized by learning the CRF model iteratively (Sec. 3.2), its spatial-occurrence distribution was learned in Sec. 3.3, and voting elements are obtained by CRF inference in the test image. We now describe the Hough voting procedure based on the CRF model, which regards the LLC code of an image patch as a latent variable. A flowchart of the detection procedure is shown in Fig. 1. Each voting element casts weighted votes into the Hough image according to its LLC code. To locate the objects in the test image, maxima in the Hough image are regarded as object hypotheses. Moreover, to handle scale variations, the test image is resized by a set of scale factors, and hypotheses are computed independently in the Hough images at each scale.
Different from other Hough transform approaches, 1,2,4-6,8-10 our Hough voting procedure is cast into a probabilistic framework with a coding strategy. Let x be the local feature observed at location l̃ in the test image. By matching it to the visual codebook, a set of valid interpretations c_i with probabilities p(c_i|x, l̃) can be obtained. If a codeword matches, it casts votes for different object positions. That is, for every c_i, votes for several object categories O_n and positions h can be cast according to the learned spatial-occurrence distribution p(O_n, h|c_i, l̃). The voting probability of a local feature can be formally expressed by the following marginalization:

p(O_n, h|x, l̃) = Σ_i p(O_n, h|x, c_i, l̃) p(c_i|x, l̃),  (16)

for i = 1, …, M, where M is the number of codewords. Since the unknown local feature x has been replaced by a known interpretation c_i in the test image, the first term can be considered independent of x. Also, local features matched to the codebook are independent of their location. Thus, the equation reduces to

p(O_n, h|x, l̃) = Σ_i p(O_n, h|c_i, l̃) p(c_i|x).  (17)

A stored occurrence at offset y_l̃ and scale s_l̃ leads a feature observed at position y and scale s to vote for the object center at

y_vote = y − y_l̃(s/s_l̃),  (20)
s_vote = s/s_l̃.  (21)

Thus, the voting probability p(h|O_n, c_i, l̃) is obtained by summing the votes for all stored observations from the learned occurrence distribution P_C. The ensemble of all such votes is used to obtain a nonparametric probability density estimate for the position of the object center.
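The scale-adapted weighted voting above can be sketched as a simple accumulator update: each voting element spreads its LLC coefficients over the occurrence lists of the matched codewords. The grid resolution, the even split of vote mass across a codeword's occurrences, and the data layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cast_votes(hough, pos, scale, llc, occurrences):
    """Accumulate weighted votes for the object center into `hough` (H, W).

    pos:   (y, x) location of the voting element in the test image.
    llc:   LLC coefficients over the M codewords (voting weights).
    occurrences: {codeword: [((dy, dx), s_occ), ...]} learned offsets.
    """
    for m, weight in enumerate(llc):
        if weight <= 0:
            continue
        occs = occurrences.get(m, [])
        for (dy, dx), s_occ in occs:
            # scale-adapted vote position, Eqs. (20)-(21)
            vy = int(round(pos[0] - dy * scale / s_occ))
            vx = int(round(pos[1] - dx * scale / s_occ))
            if 0 <= vy < hough.shape[0] and 0 <= vx < hough.shape[1]:
                hough[vy, vx] += weight / len(occs)  # split vote mass evenly
```

Running this over all voting elements yields the Hough image whose maxima serve as object hypotheses.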
The probability p(c_i|x) of a match between a local feature and a codeword is obtained according to the LLC algorithm 19 described above. In other words, the LLC code l = L(x, C) provides the weighted probabilities for Hough voting.
Next, maxima are sought as object hypotheses in the Hough voting space, in which all votes are accumulated. The search process includes two stages. We first accumulate the voting probabilities in a three-dimensional Hough space and find maxima as candidates. We then employ the mean-shift algorithm 1 to refine the locations of the hypotheses. Intuitively, the probability p(O_n, h) of an object hypothesis is obtained by summing the individual voting probabilities over all observations,

p(O_n, h) = Σ_k p(O_n, h|x_k, l̃_k) p(x_k, l̃_k),  (22)

for k = 1, …, K, where K is the number of local features in the test image and p(x_k, l̃_k) is the probability of local feature (x_k, l̃_k) being sampled for object O_n located at h. Nonetheless, small shape deformations must be tolerated to be robust to intraclass variations of the object. Thus, the mean-shift framework 1 is formulated with the following kernel density estimate over the cast votes h_j:

p̂(O_n, h) = (1/V_b) Σ_k Σ_j p(O_n, h_j|x_k, l̃_k) G[(h − h_j)/b],  (23)

where the Gaussian kernel G is a radially symmetric, nonnegative function, centered at zero and integrating to one, b is the kernel bandwidth, and V_b is its volume. The mean-shift search using this formulation quickly converges to local modes of the underlying distribution. Moreover, the search procedure can be interpreted as kernel density estimation for the position of the object center.
Object candidates with high scores are usually close to each other in the Hough image. This may cause the same object to correspond to multiple candidates, resulting in false positives. To reduce this redundancy, we adopt NMS on the overlapping object hypotheses, fixing the intersection over union (IoU) threshold for NMS at 0.7.
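This NMS step with the stated IoU threshold of 0.7 can be sketched as greedy suppression; the (x1, y1, x2, y2, score) box format is an assumption for illustration.

```python
def nms(boxes, iou_thresh=0.7):
    """Greedy non-maximum suppression over (x1, y1, x2, y2, score) boxes."""
    def iou(a, b):
        # intersection rectangle, clipped to zero when boxes do not overlap
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-12)

    keep = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        # keep the box only if it does not heavily overlap a stronger one
        if all(iou(box, k) <= iou_thresh for k in keep):
            keep.append(box)
    return keep
```

With the relatively high threshold of 0.7, only near-duplicate hypotheses are suppressed, so nearby but distinct pedestrians are preserved.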

Datasets
To evaluate the effectiveness of the proposed method in different scenes, we chose three publicly available pedestrian datasets, namely, INRIA pedestrian, TUD Brussels, and Caltech pedestrian. Pedestrians in these datasets are mostly upright but exhibit different degrees of occlusion, pose and scale changes, and variations in background and illumination.

INRIA Pedestrian
The INRIA pedestrian dataset consists of 614 training images and 288 test images. It is challenging due to the variability of pedestrian poses, illumination changes, and highly cluttered backgrounds (mountains, buildings, vehicles, etc.).

TUD Brussels
The TUD Brussels dataset contains 508 images (one pair per second) at a resolution of 640 × 480, which are recorded from a car driving in the inner city of Brussels.This dataset is challenging due to partial occlusion, cluttered backgrounds (e.g., poles, parked cars, buildings, and crowds), and numerous small-scale pedestrians.

Caltech Pedestrian
The Caltech pedestrian dataset and its associated benchmark are among the most popular pedestrian detection datasets. The dataset consists of about 10 h of video (30 frames per second) collected from a vehicle driving through urban traffic. Every frame has been densely annotated with the bounding boxes of pedestrian instances; in total, there are 350,000 bounding boxes of about 2300 unique pedestrians labeled in 250,000 frames. The pedestrians appear in many positions and orientations and against varied backgrounds. In the reasonable evaluation setting, performance is evaluated on pedestrians over 50 pixels tall with no or partial occlusion.

Experiment Procedure
All experiments are carried out on a workstation equipped with a Titan Xp GPU and an Intel Xeon(R) CPU E5-2620 v4 @ 2.10 GHz. The evaluation tool is based on the code from the official websites of Caltech and PASCAL VOC. Bounding boxes of objects are predicted in an image at test time. By default, a predicted bounding box is considered a positive when its IoU with a ground-truth bounding box exceeds 0.5, and the rest are considered negatives. We use precision-recall (PR) curves to evaluate the pedestrian datasets. 4,26,28 Following Refs. 9 and 28, we use average precision (AP), which denotes the area under the PR curve, to measure detection performance on these datasets. The AP was calculated in accordance with the criteria of PASCAL VOC.
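The AP computation can be sketched as the area under the interpolated PR curve; the all-point interpolation of the post-2010 PASCAL VOC protocol is assumed here (the older 11-point variant differs slightly).

```python
import numpy as np

def average_precision(recall, precision):
    """AP = area under the interpolated precision-recall curve."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # make precision monotonically non-increasing (interpolation step)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]       # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

For example, a detector reaching precision 1.0 at recall 0.5 and precision 0.5 at recall 1.0 scores an AP of 0.75 under this protocol.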
We densely extract scale-invariant feature transform (SIFT) features from images with a step length of 16 pixels. The codebook is optimized by training the CRF model with 12 iterations. The matching threshold is set to 0.7 for learning the spatial-occurrence distribution of the optimized codebook C (Sec. 3.3). The number K of LLC neighbors is set to 20, and the codebook size M is set to 512. Implemented on a CPU to detect pedestrians in the Caltech pedestrian dataset, the Hough transform-based ISM 1 and Barinova et al.'s method 4 require 0.48 and 0.55 s per image, respectively, whereas the proposed method requires 0.62 s per image. Our method thus requires only 0.14 s more per image than ISM, mainly because it benefits from the efficient LLC 19 and the inference algorithms in the CRF model.

Result Analysis
Figure 2 shows the PR curves of our method compared to conventional pedestrian detection approaches (HOG, 21 FPDW, 23 CrossTalk, 25 LatSvm-V2, 22 ACF, 30 Roerei, 26 MT-DPM, 27 and NAMC 32 ) on the INRIA pedestrian, TUD Brussels, and Caltech pedestrian datasets according to the reasonable setting. The APs of these methods are shown in Table 1. It can be observed that our method obtains clear improvements over the Hough transform-based methods 1,4,9 on these datasets. This is mainly attributable to two properties of our method that address two challenging problems in the INRIA, TUD, and Caltech datasets: (i) the proposed method relies on image patches; hence, it can cope with the partial occlusions that are common in pedestrian datasets and (ii) the CRF model can effectively reduce the voting noise generated by cluttered backgrounds.
We further evaluated the proposed method on three subsets of the Caltech pedestrian dataset according to its evaluation settings ("Occ = none," "Occ = partial," and "Occ = heavy"), in which pedestrians are fully visible, 65% to 100% visible, and 20% to 65% visible, respectively. Table 2 shows that our method achieved APs of 66.4%, 47.3%, and 25.5% on these respective evaluation settings, a clear improvement over the Hough transform-based methods 1,4,9.
For the TUD Brussels dataset, we masked ground-truth objects with proportions of 20%, 40%, and 60% from the left side to the right, owing to the absence of occlusion information in this dataset. As shown in Fig. 3, our method achieves clear improvements at these masked proportions compared to the Hough transform-based ISM 1 and Barinova et al.'s 4 method.
In addition, we verified the significance of codebook optimization, codebook size, number of LLC neighbors, and weighted voting strategy on detection performance.

Impact of the codebook optimization
We initialized the codebook with the K-means clustering algorithm and then optimized it by learning the CRF model; the codebook optimization is driven by top-down prior knowledge in a supervised manner. As shown in Fig. 4(a), detection performance improved rapidly in the first several iterations and converged after 12 iterations. The stochastic nature of the learning algorithm caused some performance perturbation in a few iterations.

Impact of the matching threshold
At test time, the occurrence distributions of the codebook C are used to cast votes into the Hough image for pedestrian detection; thus, they significantly affect the detection performance of the proposed method. Learning the occurrence distributions mainly depends on the matching threshold, which represents the required similarity between a codeword and an object patch of a training image. Intuitively, the occurrence distributions may be corrupted by noise when the matching threshold is set to a relatively low value. On the contrary, the occurrence distributions are likely to lack some important occurrences when the matching threshold is set to a relatively high value. To find the optimal matching threshold, we evaluated the detection performance with different values of the matching threshold. Figure 4(b) shows the detection results on the INRIA pedestrian and TUD Brussels datasets with different values of the matching threshold. Our method achieved a relatively high AP when the matching threshold was 0.7.

Impact of the LLC parameter K
To focus on the impact of the number K of LLC neighbors, the codebook size was fixed at 512.As shown in Fig. 4(c), detection performance improved dramatically when K was <15, and it converged when K was >20.The experimental results show that the number of LLC neighbors had a great impact on detection performance.

Impact of the codebook size
To investigate the impact of codebook size on detection performance, we compared detection performance with codebook sizes of 256 and 512, with the parameter K of LLC fixed at 20.As shown in Table 3, the AP was 92.6% when M ¼ 256 on the INRIA pedestrian dataset and 94.4% when M ¼ 512.The AP was 62.7% when M ¼ 256 on the TUD Brussels dataset and 67.1% when M ¼ 512.We found that M ¼ 512 gives better detection results than M ¼ 256.

Performance of the weighted voting strategy
For the weighted voting strategy (Sec. 3.4), we used the LLC coefficients instead of uniform weights as the voting weights on codewords. The codebook size was fixed at 512, and the parameter K of LLC was fixed at 20. As shown in Table 4, the APs of weighted voting were 4.0% and 2.9% higher than those of uniform voting on the INRIA pedestrian and TUD Brussels datasets, respectively.

Effectiveness of the CRF model using the deep convolutional features
To investigate the effectiveness of the CRF model in detecting pedestrians using deep convolutional features, we capture contextual relationships among the high-quality object candidates provided by the RPN + BF method. 36 The region-of-interest (RoI) features of size 512 × 7 × 7 are extracted from the object candidates in the feature maps, as in Ref. 36. Each object candidate is regarded as a node in a fully connected CRF model. The unary potential of the CRF model is the cost of the confidence score of an object candidate output by RPN + BF, which denotes the inverse likelihood of an object candidate taking the pedestrian label. The pairwise potential relies on the RoI features of a pair of object candidates and measures the cost of similar object candidates taking different labels (i.e., the binary labels pedestrian and background), as in Refs. 48 and 49. We feed the RoI features of the object candidates of all test images into the CRF model. Finally, the marginal probability distributions of all object candidates are obtained simultaneously using mean field inference in the CRF model. The PR curves are obtained by utilizing the marginal probabilities of the pedestrian label as confidence scores, rather than the initial confidence scores provided by RPN + BF. As shown in Fig. 5, the CRF model achieved APs of 98.7% and 93.2% on the INRIA and Caltech datasets, respectively, improvements of 1.3% and 2.2% over RPN + BF.

$E[Y^{(k)}; L^{(k)}, \upsilon] \leq E[Y; L^{(k)}, \upsilon] - \Delta[Y, Y^{(k)}]$. (13)

The margin function $\Delta[Y, Y^{(k)}] = \sum_{i=1}^{m} I[y_i, y_i^{(k)}]$, where $I$ is an indicator function equal to 1 for different labels. There are an exponential number of constraints with respect to the labeling $Y^{(k)}$ for each training image. Inspired by the cutting plane algorithm,45 the most violated constraint can be found by solving

$\hat{Y}^{(k)} = \arg\min_Y E[Y; L^{(k)}, \upsilon] - \Delta[Y, Y^{(k)}]$. (14)
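Because the Hamming margin decomposes over nodes, the most-violated-constraint search above reduces to ordinary energy minimization with shifted unary costs. A minimal sketch with hypothetical names; only the unary adjustment is shown, and any pairwise terms pass through unchanged:

```python
import numpy as np

def loss_augmented_unaries(unary, y_true):
    """Margin rescaling with a Hamming loss: since the margin adds 1 for
    every mislabeled node, minimizing E[Y] minus the margin is plain
    inference after lowering each *incorrect* label's unary cost by 1."""
    aug = unary - 1.0                           # favor every label...
    aug[np.arange(len(unary)), y_true] += 1.0   # ...except the true one
    return aug
```

With unary-only energies, the most violated labeling is simply `aug.argmin(axis=1)`; with pairwise terms, the same shifted unaries are fed to the usual CRF inference routine.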

Fig. 1
Fig. 1 Overview of the detection procedure. Local features (image patches) are densely extracted from the input image and encoded by LLC as latent variables in the CRF model; the codebook, as a visual dictionary, represents a set of object parts; all patches in the input image are classified into the object category or background simultaneously by CRF inference. The label field indicates the set of category labels on all image patches. Image patches classified into the object category are regarded as voting elements. A voting element casts weighted votes into the Hough image according to its LLC code. The Hough image is accumulated from the votes of the voting elements. Maxima in the Hough image are regarded as object hypotheses. Best viewed in color.

Fig. 2
Fig. 2 Detection performance comparisons of our method and other methods on the (a) INRIA, (b) TUD Brussels, and (c) Caltech pedestrian datasets according to the reasonable setting. Best viewed in color.

Fig. 3
Fig. 3 Detection performance comparisons of our method and other methods on the TUD Brussels dataset with several masked proportions (none, 20%, 40%, and 60%). Our method achieved APs of 67.1%, 57.9%, 45.5%, and 29.6% on these respective masked proportions, which shows clear improvements over the other Hough transform-based methods.

5 Conclusion
In this work, we proposed a pedestrian detection method that integrates context modeling and a weighted voting strategy in a unified Hough transform framework. Noisy votes from background patches are reduced by exploiting contextual information among the image patches of an image. The coding coefficients based on the optimized codebook contribute to casting well-balanced votes in the Hough image. The experimental results on the INRIA pedestrian, TUD Brussels, and Caltech pedestrian datasets demonstrated the effectiveness of the proposed method compared with other Hough transform-based methods. In future work, we intend to exploit contextual information across multiple images for pedestrian detection, since the contextual information exploited in this work comes from a single image only.

Fig. 5
Fig. 5 Detection performance comparisons on the (a) INRIA and (b) Caltech pedestrian datasets according to the reasonable setting. Best viewed in color.

• It jointly classifies all image patches into the object category or background according to the CRF model, which includes patch-level contextual constraints.
$p(O_n, h \mid x, \tilde{l}) = \sum_i p(O_n, h \mid c_i, \tilde{l})\, p(c_i \mid x)$, (17)

$p(O_n, h \mid x, \tilde{l}) = \sum_i p(h \mid O_n, c_i, \tilde{l})\, p(O_n \mid c_i, \tilde{l})\, p(c_i \mid x)$, (18)

where $p(h \mid O_n, c_i, \tilde{l})$ is the voting probability for an object position given its category label $O_n$, codeword $c_i$, and location $\tilde{l}$. The probability $p(O_n \mid c_i, \tilde{l})$ denotes the confidence that the codeword is matched on the object category $O_n$ against the background. Finally, $p(c_i \mid x)$ denotes the probability that local feature $x$ matches codeword $c_i$. The object scale is regarded as a third dimension in the voting space. If a local feature extracted from location $(x, y, s)$ matches a codeword that has been observed at position $(x_{\tilde{l}}, y_{\tilde{l}}, s_{\tilde{l}})$ on a training image, it votes for the coordinates $x_{\text{vote}} = x - x_{\tilde{l}}\,(s / s_{\tilde{l}})$.

Algorithm 1 Joint CRF and codebook learning
1: Input: X (training images) and Y (patch labels); C^(0) (initial dictionary); υ^(0) (initial CRF weight vector); T (number of iterations); K (number of training images).
2: Output: the codebook C and the weight υ.
3: for t = 1 to T do
⋮
10:   for j = 1 to J do  // J local features in image X^(k)
11:     Let x_j be the local feature at location (l_x, l_y, l_s).
12:     for m = 1 to M do
13:       if similarity(c_m, x_j) ≥ t then
14:         // Record an occurrence of codeword c_m
15:         U[m] = U[m] ∪ (o_x − l_x, o_y − l_y, l_s)
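A minimal sketch of the scale-adaptive voting step described above, with hypothetical names. Each stored occurrence is assumed here to hold the feature's offset from the object centroid, so the centroid vote subtracts the scaled offset; Algorithm 1's centroid-minus-feature convention is the negation and works the same way with the sign flipped:

```python
def ism_votes(feature_loc, matches, occurrences):
    """Scale-adaptive centroid voting. Each stored occurrence (dx, dy, s_o)
    is assumed to be the feature's offset from the object centroid at
    training scale s_o, so a feature at (x, y, s) votes for the centroid
    (x - dx * s/s_o, y - dy * s/s_o) at relative scale s/s_o."""
    x, y, s = feature_loc
    votes = []
    for ci, p_match in matches:          # p_match plays the role of p(c_i | x)
        occs = occurrences.get(ci, [])
        for dx, dy, s_o in occs:
            scale = s / s_o
            w = p_match / len(occs)      # spread the match weight uniformly
            votes.append((x - dx * scale, y - dy * scale, scale, w))
    return votes
```

The returned 3-D votes (x, y, scale, weight) are then accumulated into the Hough space, and its maxima become object hypotheses.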

Table 1
Performance comparison in terms of AP (%) on the INRIA, TUD Brussels, and Caltech pedestrian datasets according to the reasonable setting.

Table 2
Detection performance comparisons of our method and other methods on three Caltech evaluation settings ("Occ = none," "Occ = partial," and "Occ = heavy").

Table 3
Performance comparison in terms of codebook size M on the TUD Brussels and INRIA pedestrian datasets. Note: The bold values denote the best detection performances in terms of AP.

Table 4
Performance comparison in terms of voting strategies on the TUD Brussels and INRIA pedestrian datasets. Note: The bold values denote the best detection performances in terms of AP.