## 1.

## Introduction

Pedestrian detection is a fundamental challenge in computer vision due to great variation in appearance, changes in illumination, poor resolution, and partial occlusions. The general framework of pedestrian detection can be decomposed into three modules: (i) generate the region proposals that represent object hypotheses in a test image, (ii) classify the region proposals, and (iii) refine the region proposals to obtain accurate localization of pedestrians.

In the past years, the Hough transform framework has attracted considerable attention for pedestrian detection.^{1–10} Its applicability can be attributed to its robustness against partial occlusions, as indicated in Refs. 1 and 3–5. Another attractive property of the Hough transform is its simplicity. The Hough transform framework for pedestrian detection includes three primary steps: (i) construct a visual codebook, (ii) cast probabilistic votes for the object center into a Hough image according to the codebook using voting elements of the test image, and (iii) search for maxima in the Hough image as object hypotheses. Although some Hough transform methods demonstrate the significance of the visual codebook and voting weights^{1,2,4} for detection performance, none use contextual information. Voting elements, which denote the image patches classified into object categories, cast probabilistic votes into a Hough image.

However, the image patch contains only partial information about an object, and its appearance is highly variable. Thus, it is difficult to disambiguate object patches from background patches by a classifier at the local level. Therefore, detection performance can be reduced due to noisy votes cast by background patches. Fortunately, conditional random field (CRF) frameworks modeling context have achieved an impressive performance for semantic segmentation,^{11–15} image classification,^{16} saliency detection,^{17} and object detection.^{18} The CRF distribution can be formulated by a probabilistic graphical model, in which variables are interdependent rather than independent. Given an image, CRF inference is performed by a maximum a posteriori (MAP) or maximum posterior marginal criterion, and all patches can be classified into an object category or background simultaneously. In other words, the CRF model uses whole image information instead of local information to obtain all patch labels.

In this paper, we build a CRF model that regards the locality-constrained linear coding (LLC)^{19} code of a local feature as a latent variable, which is more informative than the corresponding local feature. In addition, we apply a Gaussian kernel to neighboring features to measure the strength of pairwise energy in the CRF framework. In the training stage, we iteratively modulate the codebook and CRF model parameters by a max-margin approach with a maximum-likelihood criterion. Furthermore, to learn the spatial-occurrence distribution of the codebook, offset vectors of the local feature to its object center in a training image are assigned to matching codewords. In the detection stage, all image patches are classified into an object category or background simultaneously by CRF inference, and the patches classified into an object category are used as voting elements in the Hough transform. The voting element casts weighted votes into the Hough image according to its LLC coefficients on codewords, and the use of LLC enables us to reduce the reconstruction error for representing the voting element by a linear combination of codewords.^{20} This may result in more balanced probabilistic votes than uniform votes in the Hough image. Maxima are regarded as object hypotheses in the Hough image, in which all votes accumulate. The proposed method makes three main contributions:

• It optimizes the codebook through CRF learning.

• It casts weighted votes into the Hough image by the encoding strategy.

• It jointly classifies all image patches into an object category or background according to the CRF model, which includes patch-level contextual constraints.

We evaluated our method on the INRIA pedestrian, TUD Brussels, and Caltech pedestrian datasets. This work strikes a balance among speed, accuracy, and simplicity. Experiments demonstrated the effectiveness of the proposed method compared with other Hough transform-based methods, benefiting from the contextual information in images and the weighted Hough voting strategy. The rest of the paper is structured as follows. We review literature on the Hough transform methods, encoding methods, and CRF in Sec. 2. We describe our method for pedestrian detection in Sec. 3. We evaluate the proposed method on several challenging datasets in Sec. 4, and we provide our conclusions in Sec. 5.

## 2.

## Related Work

In this section, we first discuss the Hough transform-based methods for pedestrian detection and then briefly describe encoding methods and CRF that are related to the proposed method.

## 2.1.

### Hough Transform Methods

There is extensive literature dedicated to pedestrian detection.^{21–39} Here, we review the methods based on the Hough transform framework^{1,2,4–6,8–10} that are most relevant to our work.

In the past years, applications of the methods based on the Hough transform framework have resulted in progress in pedestrian detection. The majority of Hough transform methods focus on codebook learning, voting element generation, and hypotheses search. The advantage of the Hough transform methods is that they can detect pedestrians with low computational cost due to their simple structure^{9} and can also locate a partially occluded pedestrian in an image using a small set of local patches.^{1,3–5} The implicit shape model (ISM)^{1} has been widely extended by other Hough transform-based methods; it constructs a visual codebook by clustering local features in an unsupervised manner. Gall and Lempitsky^{2} proposed the Hough forest to build decision trees in a supervised manner, where a set of leaves can be regarded as a discriminative codebook that produces probabilistic votes with better voting performance. Barinova et al.^{4} proposed an MAP inference method rather than nonmaximum suppression (NMS) to seek the maxima in the Hough image. Wang et al.^{5} proposed a structured Hough transform method that incorporates depth-dependent contexts into a codebook-based pedestrian detection model. Cabrera and López-Sastre^{6} proposed a boosted Hough forest, in which decision trees are trained in a stage-wise fashion to optimize a global loss function. Liu et al.^{9} proposed a pair Hough model (PHM) for detecting objects, whose voting elements were extracted from interest points to handle the rotation of objects. In a study by Liu et al.,^{10} extremely randomized trees (ERTs) were constructed from features of soft-labeled training blobs, and a Hough image was accumulated by votes from features based on the soft-labeled ERTs.
Different from other Hough transform methods, the proposed method regards LLC codes as hidden variables in a unified CRF framework that exploits the contextual information between neighboring image patches, from which the visual codebook and CRF parameters are learned in a supervised manner.

## 2.2.

### Encoding Methods

Many approaches for encoding local features (image patches) have been proposed.^{19}^{,}^{20}^{,}^{40} Lazebnik et al.^{40} proposed spatial pyramid matching (SPM), which is a simple and computationally efficient extension of an orderless bag-of-features image representation. Yang et al.^{20} developed an extension of the SPM method called ScSPM for nonlinear codes. Wang et al.^{19} proposed LLC in place of the vector quantization (VQ) coding in traditional SPM utilizing the locality constraint to project each local feature into its local coordinate system. Moreover, dictionary learning plays a significant role in encoding.^{17}^{,}^{41} Bach et al.^{41} demonstrated that better results can be obtained when dictionary is modulated to the specific task. Yang and Yang^{17} proposed a top-down saliency model that jointly learns a discriminative dictionary and a CRF to improve sparse coding (SC). However, codebooks optimized in these methods are utilized for image classification or saliency detection rather than Hough transform-based pedestrian detection.

The LLC can represent local features by codewords with lower reconstruction error than VQ^{42} and SC.^{20} This property of LLC motivated us to utilize the code coefficients of a voting element as codeword weights to cast better balanced votes in the Hough image.

## 2.2.1.

#### Locality-constrained linear coding

Feature encoding decomposes a local feature $\mathbf{x}$ into a linear combination of codewords over the predefined codebook $\mathbf{C}=[{\mathbf{c}}_{1},{\mathbf{c}}_{2},\dots ,{\mathbf{c}}_{M}]\in {\mathbb{R}}^{N\times M}$, where ${\mathbf{c}}_{i}$ denotes the $i$’th codeword, which is $N$-dimensional. While the SC^{20} method applies a sparsity constraint to select similar codewords of local features from a codebook, the LLC method^{19} incorporates a locality constraint, which must lead to sparsity but not necessarily vice versa. The visual information of image patches contained in the codebook is transferred by the LLC into the latent variables of the CRF model, which are more informative than the raw local features. The LLC code of a local feature $\mathbf{x}$ is obtained by solving the following optimization problem:

## (1)

$$L(\mathbf{x},\mathbf{C})=\mathrm{arg}\text{\hspace{0.17em}}\underset{\mathbf{l}}{\mathrm{min}}{\Vert \mathbf{x}-\mathbf{Cl}\Vert}^{2}+\lambda {\Vert \mathbf{d}\odot \mathbf{l}\Vert}^{2}\phantom{\rule{0ex}{0ex}}\mathrm{s.t.}\text{\hspace{0.17em}\hspace{0.17em}}{\mathbf{1}}^{\top}\mathbf{l}=1,$$

where $\mathbf{l}$ is the LLC code, $\odot$ denotes element-wise multiplication, and $\mathbf{d}$ is the locality adaptor that weights each codeword according to its distance to $\mathbf{x}$. An approximated LLC solution can be derived analytically:

## (3)

$$\tilde{L}(\mathbf{x},\mathbf{C})=[(\mathbf{C}-{\mathbf{1}\mathbf{x}}^{\top}){(\mathbf{C}-{\mathbf{1}\mathbf{x}}^{\top})}^{\top}+\lambda \mathrm{diag}(\mathbf{d})]\setminus \mathbf{1},$$

where the solution is rescaled so that ${\mathbf{1}}^{\top}\mathbf{l}=1$.

## 2.3.

### Conditional Random Field

A CRF is a flexible framework for modeling contextual information that can be grouped into three levels: pixels, patches, and objects. It is widely used for image semantic segmentation and patch-level labeling^{11–15,18} by addressing computer vision problems with CRF inference. Kumar and Hebert^{18} proposed the discriminative random field, which inherits the CRF concept for labeling man-made structures at the patch level. To disambiguate local image information, He et al.^{11} proposed a multi-CRF with three separate components at different scales for image semantic segmentation. Quattoni et al.^{16} proposed a hidden-state CRF for image classification that models the latent structure of the input domain via intermediate hidden variables. Toyoda and Hasegawa^{12} proposed a CRF incorporating local and global image information; thus, global consistency of layouts is achieved from a global viewpoint. Shotton et al.^{13} proposed a CRF model for semantic segmentation that uses a texture-layout filter incorporating texture, layout, and contextual information. Owing to the need to solve excessive boundary smoothing for semantic segmentation using an adjacency CRF structure, Krähenbühl and Koltun^{14} proposed a fully connected CRF that establishes pairwise potentials consisting of a linear combination of Gaussian kernels on all pairs of pixels in the image. Chen et al.^{15} proposed a DeepLab system that utilizes a fully connected CRF coupled with a deep convolutional network-based pixel-level classifier as well as long-range dependencies to capture fine edge details. Yang and Yang^{17} proposed a top-down saliency model by constructing a CRF upon SC of image patches; the codebook was optimized by jointly learning the CRF model. To speed up the saliency detection procedure, Yang and Xiong^{43} proposed a saliency detection method combining LLC and CRF.
While these saliency detection methods use CRF to generate saliency maps directly, the proposed method builds the CRF model to obtain Hough voting elements.

The CRF^{13}^{,}^{18} is a conditional distribution over the labels $\mathbf{Y}={\{{y}_{i}\}}_{i\in S}$ given the observations $\mathbf{X}={\{{\mathbf{x}}_{i}\}}_{i\in S}$, which can be written as

## (5)

$$P(\mathbf{Y}|\mathbf{X})=\frac{1}{Z}\text{\hspace{0.17em}}\mathrm{exp}\{\sum _{i\in S}{\varphi}_{i}({y}_{i}|\mathbf{X})+\alpha \sum _{i\in S}\sum _{j\in {N}_{i}}{\varphi}_{ij}({y}_{i},{y}_{j}|\mathbf{X})\},$$

where $Z$ is the partition function, ${\varphi}_{i}$ and ${\varphi}_{ij}$ are the unary and pairwise potentials, ${N}_{i}$ is the neighborhood of node $i$, and $\alpha$ weights the pairwise term.

## 3.

## Our Method

Our pedestrian detection system consists of two modules: (i) a CRF model with latent variables denoted by LLC codes of image patches. The visual codebook is optimized by learning this model, and a spatial-occurrence distribution is then learned to specify where each codeword may be found on the object. (ii) A Hough voting module. Patch labels are jointly estimated in a test image by CRF inference, and the patches classified into the object category are voting elements that cast weighted votes into the Hough image. Maxima in the Hough image are regarded as object hypotheses. An overview of the detection procedure is shown in Fig. 1.

## 3.1.

### Conditional Random Field Model

We exploit the contextual information in an image by a CRF model that uses LLC codes as latent variables and applies a Gaussian kernel to measure the strength of pairwise energy. This model is used for two purposes: (i) to optimize the codebook by learning the CRF model and (ii) to jointly classify image patches into the object category or background by CRF inference. To reduce Hough image noise resulting from background patches, image patches classified into the object category are used as voting elements (Sec. 3.4).

Yang and Yang^{17} developed a CRF model upon SC of image patches for saliency detection. Inspired by this CRF model, we build a CRF framework for modeling the context constraint that uses a Gaussian kernel to measure the local feature similarity between neighboring nodes for the pairwise energy:

## (6)

$$P[\mathbf{Y}|L(\mathbf{X},\mathbf{C}),\mathbf{\upsilon}]=\frac{1}{Z}{e}^{-E[L(\mathbf{X},\mathbf{C}),\mathbf{Y},\mathbf{\upsilon}]},$$

where the energy function is defined as

## (7)

$$E(\mathbf{L},\mathbf{Y},\mathbf{\upsilon})=\sum _{i\in S}{\phi}_{i}({\mathbf{l}}_{i},{y}_{i},{\mathbf{\upsilon}}_{1})+\sum _{i\in S}\sum _{j\in {N}_{i}}{\phi}_{ij}({\mathbf{l}}_{i},{\mathbf{l}}_{j},{y}_{i},{y}_{j},{\mathbf{\upsilon}}_{2}),$$

where the pairwise energy is weighted by the Gaussian kernel

## (8)

$$G({\mathbf{l}}_{i},{\mathbf{l}}_{j})=\mathrm{exp}(-\frac{{|{\mathbf{l}}_{i}-{\mathbf{l}}_{j}|}^{2}}{2{\theta}^{2}}),$$

where $\theta$ is the kernel bandwidth. Like most CRF models,^{11–13} the energy function is linear with the parameter $\mathbf{\upsilon}=[{\mathbf{\upsilon}}_{1};{\mathbf{\upsilon}}_{2}]$, but it is nonlinear with the codebook $\mathbf{C}$, which is implicitly defined by $L(\mathbf{x},\mathbf{C})$ in Sec. 2.2. This nonlinear parametrization makes it challenging to learn the model. We discuss the learning approach in Sec. 3.2.
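To make the energy concrete, the following minimal sketch evaluates an energy of the form of Eq. (7) on a 4-connected patch grid. The unary term is a simple linear score of the LLC code (a stand-in for the learned ${\mathbf{\upsilon}}_{1}$ potential), and the pairwise term is a Gaussian-weighted Potts penalty built from Eq. (8); this is one plausible instantiation, not necessarily the paper's exact parametrization:

```python
import numpy as np

def gaussian_affinity(li, lj, theta=0.5):
    # Eq. (8): Gaussian kernel on neighboring LLC codes
    return np.exp(-np.sum((li - lj) ** 2) / (2.0 * theta ** 2))

def crf_energy(L, Y, w, alpha=1.0, theta=0.5):
    """L: (H, W, M) LLC codes on a patch grid; Y: (H, W) labels in {0, 1};
    w: (M,) unary weight vector (a stand-in for v1)."""
    H, W, _ = L.shape
    unary, pairwise = 0.0, 0.0
    for i in range(H):
        for j in range(W):
            score = float(w @ L[i, j])
            # lower energy when the linear score agrees with the label
            unary += -score if Y[i, j] == 1 else score
            for di, dj in ((1, 0), (0, 1)):  # 4-connected grid, each edge once
                ni, nj = i + di, j + dj
                if ni < H and nj < W:
                    # similar neighboring codes pay a penalty for disagreeing labels
                    pairwise += gaussian_affinity(L[i, j], L[ni, nj], theta) * (
                        Y[i, j] != Y[ni, nj])
    return unary + alpha * pairwise
```

A labeling that flips a single patch inside a region of similar codes incurs extra pairwise energy, which is precisely the contextual smoothing the model exploits.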

## 3.2.

### Joint CRF and Codebook Learning

Following Yang and Yang’s^{17} method, we learn the CRF parameters and codebook in accordance with the CRF model. Let $\mathcal{X}={\{{\mathbf{X}}^{(k)}\}}_{k=1}^{K}$ be a set of $K$ training images and $\mathcal{Y}={\{{\mathbf{Y}}^{(k)}\}}_{k=1}^{K}$ be the corresponding set of labels. We aim to estimate the CRF parameter vector $\mathbf{\upsilon}$ and the codebook $\mathbf{C}$ by maximizing the joint likelihood of the training data

## (9)

$$\underset{\mathbf{\upsilon}\in {\mathbb{R}}^{M+1},\mathbf{C}\in \mathcal{C},{\mathbf{L}}^{(k)}}{\mathrm{max}}\prod _{k=1}^{K}P\{{\mathbf{Y}}^{(k)}|L[{\mathbf{X}}^{(k)},\mathbf{C}],\mathbf{\upsilon}\},$$

where the codebook is constrained to the set

## (10)

$$\mathcal{C}=\{\mathbf{C}\in {\mathbb{R}}^{N\times M},{\Vert {\mathbf{c}}_{i}\Vert}_{2}\le 1,\forall i=1,2,\dots ,M\}.$$

The evaluation of the partition function $Z$ of Eq. (6) is an NP-hard problem. Referring to the max-margin CRF learning approach,^{44} we look for the optimal weights $\mathbf{\upsilon}$ and codebook $\mathbf{C}$ that assign the training labels ${\mathbf{Y}}^{(k)}$ a probability greater than or equal to that of any other labeling $\mathbf{Y}$ of instance $k$

## (11)

$$P[{\mathbf{Y}}^{(k)}|{\mathbf{L}}^{(k)},\mathbf{\upsilon}]\ge P[\mathbf{Y}|{\mathbf{L}}^{(k)},\mathbf{\upsilon}]\phantom{\rule[-0.0ex]{1.0em}{0.0ex}}\forall \mathbf{Y}\setminus {\mathbf{Y}}^{(k)}\phantom{\rule[-0.0ex]{1.0em}{0.0ex}}\forall k.$$

The partition function $Z$ can be canceled from both sides of the constraints [Eq. (11)], and we express the constraints in terms of energies

## (12)

$$E[{\mathbf{Y}}^{(k)},{\mathbf{L}}^{(k)},\mathbf{\upsilon}]\le E[\mathbf{Y},{\mathbf{L}}^{(k)},\mathbf{\upsilon}].$$

Moreover, we desire the energy of the ground truth $E[{\mathbf{Y}}^{(k)},{\mathbf{L}}^{(k)},\mathbf{\upsilon}]$ to be lower than the energy $E[\mathbf{Y},{\mathbf{L}}^{(k)},\mathbf{\upsilon}]$ of any other label configuration on the training data by a margin. Thus, we have a new constraint set

## (13)

$$E[{\mathbf{Y}}^{(k)},{\mathbf{L}}^{(k)},\mathbf{\upsilon}]\le E[\mathbf{Y},{\mathbf{L}}^{(k)},\mathbf{\upsilon}]-\mathrm{\Delta}[\mathbf{Y},{\mathbf{Y}}^{(k)}].$$

The margin function is $\mathrm{\Delta}[\mathbf{Y},{\mathbf{Y}}^{(k)}]=\sum _{i=1}^{m}\mathrm{I}[{y}_{i},{y}_{i}^{(k)}]$, where $\mathrm{I}$ is an indicator function equal to 1 when the two labels differ. There is an exponential number of constraints with respect to the labelings $\mathbf{Y}$ for each training image. Inspired by the cutting plane algorithm,^{45} the most violated constraint can be found by solving

## (14)

$${\widehat{\mathbf{Y}}}^{(k)}=\mathrm{arg}\text{\hspace{0.17em}}\underset{\mathbf{Y}}{\mathrm{min}}\text{\hspace{0.17em}}E[\mathbf{Y},{\mathbf{L}}^{(k)},\mathbf{\upsilon}]-\mathrm{\Delta}[\mathbf{Y},{\mathbf{Y}}^{(k)}].$$

Therefore, the optimal weight $\mathbf{\upsilon}$ and the codebook $\mathbf{C}$ can be learned by minimizing the following objective function:

## (15)

$$\underset{\mathbf{\upsilon},\mathbf{C}\in \mathcal{C}}{\mathrm{min}}\frac{\gamma}{2}{\Vert \mathbf{\upsilon}\Vert}^{2}+\sum _{k=1}^{K}{\ell}^{k}(\mathbf{\upsilon},\mathbf{C}),$$

where ${\ell}^{k}(\mathbf{\upsilon},\mathbf{C})$ denotes the structured loss on training instance $k$ and $\gamma$ is a regularization parameter. The above objective function is optimized by a stochastic gradient descent algorithm, which is summarized in Algorithm 1.

## Algorithm 1

Joint CRF and codebook learning

1: Input: $\mathcal{X}$ (training images) and $\mathcal{Y}$ (patch labels); ${\mathbf{C}}^{(0)}$ (initial codebook); ${\mathbf{\upsilon}}^{(0)}$ (initial CRF weight vector); $T$ (number of iterations); $K$ (number of training images).

2: Output: the codebook $\mathbf{C}$ and the weight $\mathbf{\upsilon}$.

3: for $t=1$ to $T$ do

4: Permute training samples $(\mathcal{X},\mathcal{Y})$

5: for $k=1$ to $K$ do

6: Evaluate the latent variables ${\mathbf{l}}_{i}$ by Eq. (1)

7: Solve the most violated labeling ${\widehat{\mathbf{Y}}}^{(k)}$ by Eq. (14)

8: Update the weight ${\mathbf{\upsilon}}^{t}$ and codebook ${\mathbf{C}}^{t}$ by the loss function ${\ell}^{k}(\mathbf{\upsilon},\mathbf{C})$

9: end for

10: end for
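Steps 7 and 8 of Algorithm 1 can be sketched as follows for a deliberately simplified, unary-only model: with no pairwise term, the loss-augmented inference of Eq. (14) decomposes per patch, and the update is a subgradient step on the structured hinge loss with respect to $\mathbf{\upsilon}$ only (the gradient with respect to the codebook, which requires differentiating through the LLC code, is omitted here):

```python
import numpy as np

def most_violated_labeling(L, Y_gt, v):
    """Eq. (14) for a unary-only model: per-patch minimization of
    energy minus margin. L: (m, M) codes; Y_gt: (m,); v: (M,)."""
    m = L.shape[0]
    Y_hat = np.empty(m, dtype=int)
    for i in range(m):
        s = float(v @ L[i])
        e = np.array([s, -s])      # energies of labels 0 and 1
        e[1 - Y_gt[i]] -= 1.0      # margin bonus for the disagreeing label
        Y_hat[i] = int(np.argmin(e))
    return Y_hat

def sgd_step(L, Y_gt, v, lr=0.1, gamma=1e-3):
    """One pass of Algorithm 1, line 8: subgradient of the structured
    hinge loss, i.e., d/dv of E(Y_gt) - E(Y_hat) plus l2 regularization."""
    Y_hat = most_violated_labeling(L, Y_gt, v)
    # signed code sums: d/dv of the unary energy under each labeling
    g_gt = sum(-L[i] if Y_gt[i] == 1 else L[i] for i in range(len(Y_gt)))
    g_hat = sum(-L[i] if Y_hat[i] == 1 else L[i] for i in range(len(Y_hat)))
    grad = gamma * v + (g_gt - g_hat)
    return v - lr * grad
```

Repeating `sgd_step` drives the energy of the ground-truth labeling below that of the most violated labeling, after which the loss subgradient vanishes and only the regularizer acts.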

## 3.3.

### Learning the Spatial-Occurrence Distribution

In this section, we learn the nonparametric spatial-occurrence distribution ${P}_{C}$ for each codeword of the optimized codebook $\mathbf{C}$, which can be used to cast votes into the Hough image in the test stage. An occurrence represents an image patch of the training images, which matches a codeword. As in the other Hough transform methods,^{1}^{,}^{4}^{,}^{5} a codeword represents a specific object part whose position relative to the object center is uncertain. Each codeword corresponds to a set of occurrences in the training images.

As shown in Algorithm 2, we perform an iteration over all training images to match the codewords to local features. Here, we activate the codewords whose similarity exceeds a matching threshold of 0.7 (discussed in Sec. 4.1). For every codeword, we store all occurrence positions that reflect its spatial distribution over the object area in a nonparametric form (as a list of occurrences).

## Algorithm 2

Learning the spatial-occurrence distribution

1: Input: $\mathcal{X}$ (training images); $K$ (number of training images); $\mathbf{C}$ (the codebook learned in Algorithm 1); $M$ (number of codewords).

2: Output: the occurrences $U$.

3: //$U[m]$, a list of occurrences, denotes the spatial distribution of codeword ${\mathbf{c}}_{m}$ in a nonparametric manner.

4: for $m=1$ to $M$ do

5: $U[m]=\emptyset$ //Initialize occurrences for codeword ${\mathbf{c}}_{m}$.

6: end for

7: for $k=1$ to $K$ do

8: Let (${o}_{x}$, ${o}_{y}$) be the object center.

9: Extract local features in image ${\mathbf{X}}^{(k)}$.

10: for $j=1$ to $J$ do //$J$ local features in image ${\mathbf{X}}^{(k)}$.

11: Let ${\mathbf{x}}_{j}$ be the local feature at location (${l}_{x}$, ${l}_{y}$, ${l}_{s}$).

12: for $m=1$ to $M$ do

13: if $\mathrm{similarity}({\mathbf{c}}_{m},{\mathbf{x}}_{j})\ge t$ then

14: //Record an occurrence of codeword ${\mathbf{c}}_{m}$

15: $U[m]=U[m]\cup \{({o}_{x}-{l}_{x},{o}_{y}-{l}_{y},{l}_{s})\}$

16: end if

17: end for

18: end for

19: end for
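Algorithm 2 can be sketched as follows, assuming a cosine measure for the similarity function and a hypothetical per-image container layout (both are illustrative choices; the excerpt does not pin down the similarity measure or data layout):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def learn_occurrences(images, codebook, t=0.7):
    """images: list of dicts {'center': (ox, oy),
    'features': [(descriptor, lx, ly, ls), ...]};
    codebook: (N, M) array with codewords as columns."""
    M = codebook.shape[1]
    U = [[] for _ in range(M)]          # one occurrence list per codeword
    for img in images:
        ox, oy = img["center"]
        for x, lx, ly, ls in img["features"]:
            for m in range(M):
                # activate codewords whose similarity exceeds the threshold
                if cosine_similarity(codebook[:, m], x) >= t:
                    # store the offset from the patch to the object center
                    U[m].append((ox - lx, oy - ly, ls))
    return U
```

Each list `U[m]` is exactly the nonparametric spatial distribution that the voting stage later samples from.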

## 3.4.

### Weighted Hough Voting Strategy

The visual codebook $\mathbf{C}$ was optimized by iteratively learning the CRF model (Sec. 3.2), and its spatial-occurrence distribution was learned (Sec. 3.3); voting elements are obtained by CRF inference in the test image. We now describe the Hough voting procedure based on the CRF model that regards the LLC code of an image patch as a latent variable. A flowchart of the detection procedure is shown in Fig. 1. Each voting element casts weighted votes into the Hough image according to its LLC code. To locate the objects in the test image, maxima in the Hough image are regarded as object hypotheses. Moreover, to handle scale variations, a test image is resized by a set of scale factors, and hypotheses are computed independently in the Hough images at each scale.

Different from other Hough transform approaches,^{1,2,4–6,8–10} our Hough voting procedure is cast into a probabilistic framework with a coding strategy. Let $\mathbf{x}$ be the local feature observed at location $\tilde{\mathbf{l}}$ in the test image. By matching it to the visual codebook, a set of valid interpretations ${\mathbf{c}}_{i}$ with probabilities $p({\mathbf{c}}_{i}|\mathbf{x},\tilde{\mathbf{l}})$ can be obtained. If a codeword matches, it casts votes for different object positions. That is, for every ${\mathbf{c}}_{i}$, votes for several object categories ${O}_{n}$ and a position $\mathbf{h}$ can be obtained according to the learned spatial-occurrence distribution $p({O}_{n},\mathbf{h}|{\mathbf{c}}_{i},\tilde{\mathbf{l}})$. The voting probability of a local feature can be formally expressed by the following marginalization:

## (16)

$$p({O}_{n},\mathbf{h}|\mathbf{x},\tilde{\mathbf{l}})=\sum _{i}p({O}_{n},\mathbf{h}|\mathbf{x},{\mathbf{c}}_{i},\tilde{\mathbf{l}})p({\mathbf{c}}_{i}|\mathbf{x},\tilde{\mathbf{l}}),$$

where we assume that, given a matched codeword, the vote no longer depends on the local feature itself and that the match probability does not depend on the location:

## (17)

$$p({O}_{n},\mathbf{h}|\mathbf{x},\tilde{\mathbf{l}})=\sum _{i}p({O}_{n},\mathbf{h}|{\mathbf{c}}_{i},\tilde{\mathbf{l}})p({\mathbf{c}}_{i}|\mathbf{x}),$$

## (18)

$$=\sum _{i}p(\mathbf{h}|{O}_{n},{\mathbf{c}}_{i},\tilde{\mathbf{l}})p({O}_{n}|{\mathbf{c}}_{i},\tilde{\mathbf{l}})p({\mathbf{c}}_{i}|\mathbf{x}),$$

Thus, the voting probability $p(\mathbf{h}|{O}_{n},{\mathbf{c}}_{i},\tilde{\mathbf{l}})$ is obtained by summing the votes for all stored observations from the learned occurrence distribution ${P}_{C}$. The ensemble of all such votes is used to obtain a nonparametric probability density estimate for the position of the object center.
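A minimal sketch of this voting step: each codeword spreads its weight $p({\mathbf{c}}_{i}|\mathbf{x})$ uniformly over its stored occurrence offsets and accumulates into a 2-D Hough image (scale handling and the category prior $p({O}_{n}|{\mathbf{c}}_{i},\tilde{\mathbf{l}})$ are omitted, and nonpositive coefficients are simply skipped in this illustration):

```python
import numpy as np

def cast_votes(hough, loc, code, occurrences):
    """Accumulate the marginalization of Eq. (16) for one patch.
    hough: (H, W) accumulator; loc: (lx, ly) patch location;
    code: (M,) codeword weights p(c_i | x); occurrences: per-codeword
    lists of (dx, dy, ls) offsets, as produced by Algorithm 2."""
    H, W = hough.shape
    lx, ly = loc
    for m, w in enumerate(code):
        occ = occurrences[m]
        if w <= 0.0 or not occ:
            continue
        share = w / len(occ)   # uniform p(h | O_n, c_m, l~) over stored offsets
        for dx, dy, _ in occ:
            hx, hy = lx + dx, ly + dy
            if 0 <= hx < W and 0 <= hy < H:
                hough[hy, hx] += share
    return hough
```

Summing the shares over all patches reproduces the accumulation of votes that the maxima search then operates on.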

The probability $p({\mathbf{c}}_{i}|\mathbf{x})$ of a match between a local feature and codeword is obtained according to the LLC algorithm^{19} described above. In other words, the LLC code $\mathbf{l}=L(\mathbf{x},\mathbf{C})$ is regarded as weighted probabilities for Hough voting.
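The analytical LLC solution of Eq. (3) that produces these weighted probabilities can be sketched as follows; for simplicity the locality adaptor $\mathbf{d}$ is set to uniform here, whereas in the full LLC formulation it depends on the distance between $\mathbf{x}$ and each codeword:

```python
import numpy as np

def llc_code(x, C, lam=1e-6):
    """Approximate LLC code of Eq. (3). x: (N,) local feature;
    C: (N, M) codebook with codewords as columns."""
    B = C.T                                 # (M, N): codewords as rows
    M = B.shape[0]
    diff = B - x                            # row m holds c_m - x
    cov = diff @ diff.T                     # (C - 1 x^T)(C - 1 x^T)^T
    d = np.ones(M)                          # uniform locality adaptor (sketch)
    l = np.linalg.solve(cov + lam * np.diag(d), np.ones(M))
    return l / l.sum()                      # enforce the constraint 1^T l = 1
```

When the feature coincides with a codeword, nearly all of the coefficient mass concentrates on that codeword, which is why the coefficients serve naturally as match probabilities.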

Next, maxima are sought as object hypotheses in the Hough voting space, in which all votes are accumulated. The search process includes two stages. We first accumulate the voting probabilities in a three-dimensional Hough space and find maxima as candidates. We then employ the mean-shift algorithm^{1} to refine the locations of the hypotheses. Intuitively, the probability $p({O}_{n},\mathbf{h})$ of an object hypothesis is obtained by summing the individual voting probabilities over all observations, and we arrive at the following equation:

## (22)

$$p({O}_{n},\mathbf{h})=\sum _{k}p({O}_{n},\mathbf{h}|{\mathbf{x}}_{k},{\tilde{\mathbf{l}}}_{k})p({\mathbf{x}}_{k},{\tilde{\mathbf{l}}}_{k}),$$

The mean-shift search over this density^{1} is formulated with the following kernel density estimate:

## (23)

$$\widehat{p}({O}_{n},\mathbf{h})=\frac{1}{{V}_{b}}\sum _{k}\sum _{j}p({O}_{n},{\mathbf{h}}_{j}|{\mathbf{x}}_{k},{\tilde{\mathbf{l}}}_{k})G(\frac{\mathbf{h}-{\mathbf{h}}_{j}}{b}),$$

where $G$ is a kernel with bandwidth $b$ and ${V}_{b}$ is the kernel volume. Candidates with high scores are usually close to each other in the Hough image. This may lead to the same object corresponding to multiple candidates, resulting in false positives. To reduce redundancy, we adopt NMS on the overlapping object hypotheses. We fix the intersection over union (IoU) threshold for NMS at 0.7.
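The greedy NMS pass with the 0.7 IoU threshold can be sketched as:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.7):
    """Greedy NMS: keep the highest-scoring hypothesis, drop any later
    one that overlaps a kept box by more than the IoU threshold."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```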

## 4.

## Experiments

## 4.1.

### Datasets

To evaluate the effectiveness of the proposed method in different scenes, we choose three publicly available pedestrian datasets, namely, INRIA pedestrian, TUD Brussels, and Caltech pedestrian. Pedestrians in these datasets are mostly upright but exhibit different degrees of occlusion, pose and scale changes, and variations in background and illumination.

## 4.1.1.

#### INRIA Pedestrian

The INRIA pedestrian dataset consists of 614 training images and 288 test images, which is challenging due to the variability of pedestrian poses, illumination changes, and highly cluttered backgrounds (mountains, buildings, vehicles, etc.).

## 4.1.2.

#### TUD Brussels

The TUD Brussels dataset contains 508 images (one pair per second) at a resolution of $640\times 480$, which are recorded from a car driving in the inner city of Brussels. This dataset is challenging due to partial occlusion, cluttered backgrounds (e.g., poles, parked cars, buildings, and crowds), and numerous small-scale pedestrians.

## 4.1.3.

#### Caltech Pedestrian

The Caltech pedestrian dataset and its associated benchmark are among the most popular pedestrian detection datasets. It consists of about 10 h of video (30 frames per second) collected from a vehicle driving through urban traffic. Every frame in the Caltech dataset has been densely annotated with the bounding boxes of pedestrian instances. In total, there are 350,000 bounding boxes of about 2300 unique pedestrians labeled in 250,000 frames. The pedestrians in the Caltech pedestrian dataset appear in many positions and orientations against varied backgrounds. In the reasonable evaluation setting, the performance is evaluated on pedestrians over 50 pixels tall with no or partial occlusion.

## 4.2.

### Experiment Procedure

All experiments are carried out on a workstation equipped with a Titan Xp GPU and an Intel Xeon(R) CPU E5-2620 v4 @ 2.10 GHz. The evaluation tool is based on the codes from the official websites of Caltech and PASCAL VOC. Bounding boxes of objects are predicted in an image at test time. By default, predicted bounding boxes are considered positives when they overlap ground-truth bounding boxes with an IoU of more than 0.5, and the rest are considered negatives. We use precision-recall (PR) curves to evaluate the pedestrian datasets.^{4,26,28} Following Refs. 9 and 28, we use average precision (AP), which denotes the area under the PR curve, to measure detection performance on these datasets. The AP was calculated in accordance with the criteria of PASCAL VOC.
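The AP computation can be sketched in the continuous PASCAL VOC style (the numbers reported below come from the official evaluation code; this is only an illustration of the area-under-the-PR-curve criterion):

```python
import numpy as np

def average_precision(recall, precision):
    """PASCAL VOC-style AP: area under the PR curve after making the
    precision envelope monotonically decreasing in recall."""
    r = np.concatenate(([0.0], np.asarray(recall, float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, float), [0.0]))
    for i in range(len(p) - 2, -1, -1):       # precision envelope
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]        # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```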

We densely extract scale-invariant feature transform features from images with a step length of 16 pixels. The codebook is optimized by training the CRF model with 12 iterations. The matching threshold is set to 0.7 for learning the spatial-occurrence distribution of the optimized codebook $\mathbf{C}$ (Sec. 3.3). The number $K$ of LLC neighbors is set to 20, and the codebook size $M$ is set to 512. Implemented on a CPU to detect pedestrians in the Caltech pedestrian dataset, the Hough transform-based ISM^{1} and Barinova et al.’s method^{4} require 0.48 and 0.55 s per image, respectively, whereas the proposed method requires 0.62 s per image. Our method thus requires only 0.14 s more per image than ISM, mainly because it benefits from the efficient LLC^{19} and CRF inference algorithms.

## 4.3.

### Result Analysis

Figure 2 shows the PR curves of our method compared to conventional pedestrian detection approaches (HOG,^{21} FPDW,^{23} CrossTalk,^{25} LatSvm-V2,^{22} ACF,^{30} Roerei,^{26} MT-DPM,^{27} and NAMC^{32}) on the INRIA pedestrian, TUD Brussels, and Caltech pedestrian datasets according to the reasonable setting. The APs of these methods are shown in Table 1. It can be observed that our method obtained obvious improvements over the Hough transform-based methods^{1}^{,}^{4}^{,}^{9} on these datasets. This is mainly attributable to two properties of our method that solve two challenging problems in the INRIA, TUD, and Caltech datasets: (i) the proposed method relies on image patches; hence, it can cope with the partial occlusions that are common in pedestrian datasets and (ii) the CRF model can effectively reduce the voting noise generated by the cluttered background.

## Table 1

Performance comparison in terms of AP (%) on the INRIA, TUD Brussels, and Caltech pedestrian datasets according to the reasonable setting.

Method | INRIA | TUD | Caltech
---|---|---|---
HOG^{21} | 73.3 | 40.1 | 26.5
LatSvm-V2^{22} | 91.0 | 51.5 | 35.9
Roerei^{26} | 93.9 | 54.8 | 51.9
FPDW^{23} | 88.3 | 60.3 | 40.3
CrossTalk^{25} | 88.7 | 60.0 | 45.1
ACF^{30} | 90.6 | 63.6 | 47.9
NAMC^{32} | 91.7 | — | 66.7
ISM^{1} | 86.0 | 54.2 | 49.5
Barinova et al.^{4} | 90.2 | 58.4 | 57.3
PHM^{9} | 86.5 | — | —
Ours | 94.4 | 67.1 | 65.0

Note: The bold values denote the best detection performances in terms of AP.

We further evaluated the proposed method on three subsets of the Caltech pedestrian dataset according to its evaluation settings (“Occ = none,” “Occ = partial,” and “Occ = heavy”). Pedestrian visibility is 100%, 65% to 100%, and 20% to 65% under these three settings, respectively. Table 2 shows that our method achieved APs of 66.4%, 47.3%, and 25.5% on these respective evaluation settings. Our method shows obvious improvements over the Hough transform-based methods^{1,4,9} on these evaluation settings.

## Table 2

Detection performance comparisons of our method and other methods on three Caltech evaluation settings (“Occ = none,” “Occ = partial,” and “Occ = heavy”).

Method | Occ = none | Occ = partial | Occ = heavy
---|---|---|---
MT-DPM + Context^{27} | 65.6 | 16.3 | 7.7
NAMC^{32} | 69.4 | 22.7 | 3.9
DeepCascade^{33} | 71.6 | 26.9 | 5.3
SCF + AlexNet^{46} | 80.5 | 34.5 | 15.3
TA-CNN^{35} | 81.4 | 45.9 | 16.4
SA-FastRCNN^{37} | 91.3 | 44.5 | 14.4
DeepParts^{47} | 89.5 | 67.1 | 24.2
F-DNN + SS^{38} | 92.8 | 60.4 | 30.9
Ours | 66.4 | 47.3 | 25.5

Note: The bold values denote the best detection performances in terms of AP.

For the TUD pedestrian dataset, we masked ground-truth objects with proportions of 20%, 40%, and 60% from the left to right side, respectively, owing to an absence of occlusion information in this dataset. As shown in Fig. 3, our method has obvious improvements on these masked proportions compared to Hough transform-based ISM^{1} and Barinova et al.’s^{4} method.

In addition, we verified the significance of codebook optimization, codebook size, number of LLC neighbors, and weighted voting strategy on detection performance.

## 4.3.1.

#### Impact of the codebook optimization

We initialized the codebook with the K-means clustering algorithm and then optimized it by learning the CRF model; the optimization was driven by top-down prior knowledge in a supervised manner. As shown in Fig. 4(a), detection performance improved rapidly over the first several iterations and converged after 12 iterations. The stochastic nature of the learning algorithm caused minor performance perturbations in some iterations.
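The K-means initialization step can be sketched with a minimal numpy version (the subsequent CRF-driven codebook optimization is not shown, and all names here are illustrative):

```python
import numpy as np

def init_codebook(descriptors, M=512, iters=20, seed=0):
    """Initialize a visual codebook with plain K-means (a sketch;
    the paper's CRF-based codebook optimization is a separate step)."""
    rng = np.random.default_rng(seed)
    # Pick M random descriptors as the initial codewords.
    codebook = descriptors[rng.choice(len(descriptors), M, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest codeword.
        d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # Move each codeword to the mean of its assigned descriptors.
        for m in range(M):
            members = descriptors[labels == m]
            if len(members):
                codebook[m] = members.mean(0)
    return codebook
```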

## 4.3.2.

#### Impact of the matching threshold

At test time, the occurrence distributions of the codebook $C$ were used to cast votes into the Hough image for pedestrian detection; thus, they are critical to the detection performance of the proposed method. Learning the occurrence distributions mainly depends on the matching threshold, which represents the similarity between a codeword and an object patch of a training image. Intuitively, the occurrence distributions may be corrupted by noise when the matching threshold is set too low. Conversely, they are likely to miss some important occurrences when the threshold is set too high. To find the optimal matching threshold, we evaluated detection performance for different threshold values. Figure 4(b) shows the detection results on the INRIA pedestrian and TUD Brussels datasets for different values of the matching threshold. We found that our method achieved a relatively high AP when the matching threshold was 0.7.
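Learning occurrence distributions under a matching threshold can be sketched as follows; cosine similarity stands in for the patch-codeword similarity (an assumption), and the function name and data layout are illustrative:

```python
import numpy as np

def learn_occurrences(codebook, patches, offsets, thresh=0.7):
    """Record center-offset occurrences for each codeword (a sketch).
    A training patch contributes its offset to the object center to
    every codeword whose cosine similarity with it exceeds the
    matching threshold; cosine similarity is an assumed stand-in."""
    occurrences = [[] for _ in codebook]
    cb = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    for patch, off in zip(patches, offsets):
        p = patch / np.linalg.norm(patch)
        sims = cb @ p
        for m in np.flatnonzero(sims > thresh):
            occurrences[m].append(off)
    return occurrences
```

A low threshold lets weakly matching patches pollute a codeword's list of offsets; a high threshold starves the lists, which mirrors the trade-off described above.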

## 4.3.3.

#### Impact of the LLC parameter $K$

To focus on the impact of the number $K$ of LLC neighbors, the codebook size was fixed at 512. As shown in Fig. 4(c), detection performance improved dramatically when $K<15$ and converged when $K>20$. These results show that the number of LLC neighbors had a substantial impact on detection performance.
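Coding a descriptor over its $K$ nearest codewords can be sketched with the standard approximated LLC solver: keep only the $K$ nearest codewords and solve a small constrained least-squares problem. The regularization constant and function name are illustrative assumptions:

```python
import numpy as np

def llc_code(x, codebook, K=20, reg=1e-4):
    """Approximated LLC coding of descriptor x (a sketch): the K
    nearest codewords get coefficients from a regularized local
    least-squares fit; all other coefficients are zero."""
    d2 = ((codebook - x) ** 2).sum(1)
    idx = np.argsort(d2)[:K]             # K nearest codewords
    z = codebook[idx] - x                # shift codewords to the descriptor
    C = z @ z.T                          # local covariance
    C += reg * np.trace(C) * np.eye(K)   # regularize for numerical stability
    w = np.linalg.solve(C, np.ones(K))
    w /= w.sum()                         # enforce the sum-to-one constraint
    code = np.zeros(len(codebook))
    code[idx] = w
    return code
```

The resulting sparse code has at most $K$ nonzero entries, which become the voting weights in the weighted voting strategy below.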

## 4.3.4.

#### Impact of the codebook size

To investigate the impact of codebook size on detection performance, we compared codebook sizes of 256 and 512, with the LLC parameter $K$ fixed at 20. As shown in Table 3, on the INRIA pedestrian dataset the AP was 92.6% when $M=256$ and 94.4% when $M=512$; on the TUD Brussels dataset, the AP was 62.7% when $M=256$ and 67.1% when $M=512$. Thus, $M=512$ gave better detection results than $M=256$.

## Table 3

Performance comparison in terms of codebook size $M$ on the TUD Brussels and INRIA pedestrian datasets.

Method | TUD | INRIA |
---|---|---|
$M=256$ | 62.7 | 92.6 |
$M=512$ | **67.1** | **94.4** |

Note: The bold values denote the best detection performances in terms of AP.

## 4.3.5.

#### Performance of the weighted voting strategy

As for the weighted voting strategy (Sec. 3.4), we used the LLC coefficients instead of uniform weights as voting weights on codewords. The codebook size was fixed at 512. The parameter $K$ of LLC was fixed at 20. As shown in Table 4, the APs of the weighted voting were 4.0% and 2.9% higher, respectively, than the uniform voting on the INRIA pedestrian and TUD Brussels datasets.
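The weighted voting step can be sketched as follows; the data layout (a 2-D Hough accumulator and per-codeword lists of center offsets) and all names are illustrative assumptions:

```python
import numpy as np

def cast_votes(hough, patch_xy, code, occurrences):
    """Weighted voting (a sketch): a patch at patch_xy adds its LLC
    coefficient, rather than a uniform weight, at every object-center
    location predicted by each activated codeword's occurrences."""
    x, y = patch_xy
    for m in np.flatnonzero(code):           # only activated codewords vote
        for dx, dy in occurrences[m]:
            cx, cy = x + dx, y + dy          # predicted object center
            if 0 <= cy < hough.shape[0] and 0 <= cx < hough.shape[1]:
                hough[cy, cx] += code[m]     # LLC coefficient as vote weight
    return hough
```

With uniform voting, `code[m]` would be replaced by a constant; using the LLC coefficients lets well-matching codewords dominate the accumulator.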

## Table 4

Performance comparison in terms of voting strategies on the TUD Brussels and INRIA pedestrian datasets.

Method | TUD | INRIA |
---|---|---|
Uniform voting | 63.1 | 91.5 |
Weighted voting | **67.1** | **94.4** |

Note: The bold values denote the best detection performances in terms of AP.

## 4.3.6.

#### Effectiveness of the CRF model using the deep convolutional features

To investigate the effectiveness of the CRF model in detecting pedestrians using deep convolutional features, we captured contextual relationships among the high-quality object candidates provided by RPN + BF.^{36} Region of interest (RoI) features of size $512\times 7\times 7$ were extracted from the object candidates in the feature maps, as in Ref. 36. Each object candidate was treated as a node in a fully connected CRF model. The unary potential of the CRF model was the cost of the confidence score output by RPN + BF for an object candidate, which denotes the inverse likelihood of the candidate taking the pedestrian label. The pairwise potential relied on the RoI features of a pair of object candidates and measured the cost of assigning different labels (pedestrian or background) to similar candidates, as in Refs. 48 and 49. We fed the RoI features of the object candidates of all test images into the CRF model, and the marginal probability distributions of all candidates were obtained simultaneously by mean field inference. The PR curves were then computed from the marginal probabilities of the pedestrian label (used as confidence scores) rather than from the initial confidence scores provided by RPN + BF. As shown in Fig. 5, the CRF model achieved APs of 98.7% and 93.2% on the INRIA and Caltech datasets, respectively, improvements of 1.3% and 2.2% over RPN + BF.
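The rescoring idea can be sketched with a minimal mean-field loop over candidates; the cosine-similarity pairwise term, the weight `w`, and all names are illustrative assumptions rather than the paper's exact potentials:

```python
import numpy as np

def rescore_candidates(scores, feats, iters=10, w=1.0):
    """Mean-field rescoring of object candidates (a sketch).
    Unary: from the detector's pedestrian probabilities. Pairwise:
    candidates with similar RoI features are pulled toward agreeing
    labels, so lookalike candidates reinforce each other's scores."""
    eps = 1e-6
    p = np.clip(scores, eps, 1 - eps)
    unary = -np.log(np.stack([p, 1 - p], 1))  # labels: [pedestrian, bg]
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = np.clip(f @ f.T, 0, None)           # cosine similarity (assumed)
    np.fill_diagonal(sim, 0)                  # no self-messages
    q = np.exp(-unary)
    q /= q.sum(1, keepdims=True)              # initial marginals
    for _ in range(iters):
        msg = sim @ q                         # messages from similar candidates
        q = np.exp(-unary + w * msg)
        q /= q.sum(1, keepdims=True)
    return q[:, 0]                            # pedestrian-label marginals
```

The returned marginals replace the detector's raw confidence scores when computing the PR curve, mirroring the procedure described above.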

## 5.

## Conclusion

In this work, we propose a pedestrian detection method that integrates context modeling and a weighted voting strategy into a unified Hough transform framework. Noisy votes from background patches are reduced by exploiting contextual information among the image patches of an image, and the coding coefficients based on the optimized codebook cast better-balanced votes in the Hough image. Experimental results on the INRIA pedestrian, TUD Brussels, and Caltech pedestrian datasets demonstrated the effectiveness of the proposed method compared with other Hough transform-based methods. In future work, we intend to exploit contextual information across multiple images, since the contextual information exploited here comes from a single image only.

## Acknowledgments

This work was partially supported by the National Natural Science Foundation of China with Grant Nos. 61375008 and 61673274.

## References

## Biography

**Linfeng Jiang** received his BS degree in computer science from Chongqing University, Chongqing, China, in 2005, and his MS degree in computer science from Kunming University of Science and Technology, Kunming, China, in 2011. Currently, he is working toward his PhD in the Department of Automation, Shanghai Jiao Tong University (SJTU), Shanghai, China. He is interested in computer vision and probabilistic graphical theory for context modeling.

**Huilin Xiong** received his BSc and MSc degrees in mathematics from Wuhan University, Wuhan, China, in 1986 and 1989, respectively. He received his PhD in pattern recognition and intelligent control from the Institute of Pattern Recognition and Artificial Intelligence, Huazhong University of Science and Technology, Wuhan, China, in 1999. He joined SJTU, Shanghai, China, in 2007, and currently, he is a professor in the Department of Automation, SJTU. His research interests include pattern recognition, machine learning, and bioinformatics.