Distance canonical correlation analysis with application to an imaging-genetic study

Abstract. Distance correlation is a measure that can detect both linear and nonlinear associations. However, applying distance correlation to imaging genetic studies often requires multiple testing correction because of the large number of simultaneous inferences. As a result, the sensitivity of its detection may be low. We propose a new model, distance canonical correlation analysis (DCCA), which overcomes this problem by searching for a combination of features with the highest distance correlation. This is achieved by constructing a distance kernel function followed by solving a subsequent optimization problem. The ability to detect both linear and nonlinear associations makes DCCA suitable for analyzing complex multimodal and imaging-genetic associations. When applied to a brain imaging-genetic study from the Philadelphia Neurodevelopmental Cohort (PNC), DCCA detected several mental disorder-related gene pathways and brain networks. Experiments on brain connectivity found that the default mode network had strong nonlinear connections with other brain networks. When applied to the study of age effects, DCCA revealed that the connections of brain networks were relatively weak in younger groups but became stronger at older age stages. This indicates that adolescence is a vital stage for brain development. DCCA thus reveals a number of interesting findings and demonstrates a powerful new approach for analyzing multimodal brain imaging data.


Introduction
The brain is a complex organ, and investigating its development and relationship with genomics is of great interest. Advances in neuroimaging, e.g., functional magnetic resonance imaging (fMRI), and sequencing of genetic variations, e.g., single nucleotide polymorphisms (SNPs), have facilitated the analysis of the relationship between brain regions and genetic variations. fMRI detects changes in functional brain activity at each voxel, which can be clustered into regions of interest (ROIs). SNPs are important genetic factors underlying differences in phenotypes among human beings. Association analyses, e.g., canonical correlation analysis (CCA), 1 have been conducted to study brain connectivity and how genetic factors and endophenotypes interact. 2 However, these methods typically use Pearson correlation, which captures only linear relationships, while nonlinear correlations may exist among brain regions. 3 To address the limitation of Pearson correlation-based methods, Székely et al. 4 proposed a correlation measure, distance correlation, which evaluates the dependence between two single variables or two sets of variables. The property that distance correlation equals 0 if and only if two variables are independent enables it to detect both linear and nonlinear associations. Besides the ability to detect nonlinear correlations, the flexibility to handle both single-single and set-set feature correlations also helps distance correlation find many applications in imaging-genetic and brain connectivity studies. Geerligs et al. 5 investigated the dependence between different ROIs using multivariate distance correlation, and the results tended to be more robust than those obtained with Pearson correlation. Fang et al. 6 investigated complex imaging-genetic associations using projected distance correlation, which was more accurate and faster.
Székely and Rizzo 7 constructed a statistic to evaluate the statistical significance of the distance correlation between two single or two sets of variables. Despite the well-constructed theoretical work, a challenge for applying distance correlation exists in multiple testing correction. Large-scale simultaneous inference testing, e.g., genome wide association study (GWAS), needs multiple testing correction, e.g., Bonferroni correction, 8 in order to prevent erroneous inferences. For distance correlation, the scale of simultaneous inference is $p \times q$ ($p$, $q$ are variable sizes of two datasets), which is much larger than that of GWAS, i.e., $p$. As a result, it might be difficult to detect significant variable-variable distance correlations due to the harsher testing correction. For testing the distance correlation between two subsets of variables, the scale of multiple inference testing is even larger, i.e., $2^{p+q}$, and consequently the detection of significant associations becomes even more difficult.
To address the challenge, we propose a new framework, distance canonical correlation analysis (DCCA), which overcomes the problem by searching for a combination of original features with the highest distance correlation. This is achieved by first constructing a distance kernel function and then solving a subsequent optimization problem. In this way, DCCA can detect both linear and nonlinear correlations and can identify a subset of features that are significantly correlated.
This work is an expansion of a preliminary work, "A hybrid correlation analysis with application to imaging genetics," 9 which was published in the proceedings of SPIE Medical Imaging 2018. This work refines the conference paper by adding more detailed procedures about the method and more applications on both the fusion of imaging genetics data and the fusion of multiple brain imaging data. The rest of this paper is organized as follows. Section 2 first introduces distance correlation with its pros and cons and then discusses how the proposed model, DCCA, can overcome the limitation. Section 3 presents a simulation test to verify the performance of DCCA. Section 4 presents the collection and preprocessing of data as well as the experiments of applying DCCA to detecting imaging genetic associations and brain connectivity study. Discussion and conclusions are in Sec. 5.

Distance Correlation
Distance correlation, proposed by Székely et al., 4 measures the dependence between two single variables or two sets of variables. Suppose we have two sets of random variables $x \in \mathbb{R}^p$ and $y \in \mathbb{R}^q$ (where $p$, $q$ represent the feature sizes of $x$, $y$, respectively) with characteristic functions $f_x$ and $f_y$. Variable dimensionality $p$, $q$ can either be 1 (two single variable case) or greater than 1 (two sets of variables case). The distance covariance between $x$ and $y$ is defined as
$$\mathrm{dCov}^2(x, y) = \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}} \frac{|f_{x,y}(t, s) - f_x(t) f_y(s)|^2}{|t|_p^{1+p} \, |s|_q^{1+q}} \, dt \, ds, \tag{1}$$
where $|\cdot|_p$, $|\cdot|_q$ denote the Euclidean norm in space $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively; $f_{x,y}$ denotes the joint characteristic function of $x$ and $y$; and $c_d = \pi^{(1+d)/2} / \Gamma[(1+d)/2]$ is the normalizing constant used by Székely et al. 4 The distance correlation between $x$ and $y$ is defined as
$$\mathrm{dCor}^2(x, y) = \frac{\mathrm{dCov}^2(x, y)}{\sqrt{\mathrm{dCov}^2(x, x) \, \mathrm{dCov}^2(y, y)}}. \tag{2}$$
It has been proved that distance correlation equals 0 iff $x$ and $y$ are independent, i.e.,
$$\mathrm{dCor}(x, y) = 0 \Leftrightarrow x \perp\!\!\!\perp y. \tag{3}$$
Distance correlation outperforms conventional Pearson correlation in that it can detect both linear and nonlinear associations due to Eq. (3).
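For sample data $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$, where $n$ denotes sample size, the empirical distance covariance between $X$ and $Y$ can be estimated as follows. First, we calculate the Euclidean distance between each sample pair
$$a_{i,j} = \sqrt{\sum_{k=1}^{p} (X_{ik} - X_{jk})^2}, \quad b_{i,j} = \sqrt{\sum_{k=1}^{q} (Y_{ik} - Y_{jk})^2}, \quad i, j = 1, 2, \cdots, n.$$
Second, U-centering is applied to the Euclidean distance $a_{i,j}$ as
$$A_{i,j} = \begin{cases} a_{i,j} - \dfrac{1}{n-2} \sum_{l=1}^{n} a_{i,l} - \dfrac{1}{n-2} \sum_{m=1}^{n} a_{m,j} + \dfrac{1}{(n-1)(n-2)} \sum_{m,l=1}^{n} a_{m,l}, & i \neq j, \\ 0, & i = j. \end{cases} \tag{4}$$
The U-centered $B_{i,j}$ can be calculated similarly, i.e., applying U-centering to Euclidean distance $b_{i,j}$. Then, the empirical distance correlation can be calculated as
$$\mathrm{dCor}_n^2(X, Y) = \frac{\mathrm{dCov}_n^2(X, Y)}{\sqrt{\mathrm{dCov}_n^2(X, X) \, \mathrm{dCov}_n^2(Y, Y)}}, \quad \mathrm{dCov}_n^2(X, Y) = \frac{1}{n(n-3)} \sum_{i \neq j} A_{i,j} B_{i,j}. \tag{5}$$
A statistic following a t-distribution provided by Székely and Rizzo 7 is used to evaluate the significance of distance correlation as
$$T = \sqrt{v - 1} \cdot \frac{\mathrm{dCor}_n^2(X, Y)}{\sqrt{1 - [\mathrm{dCor}_n^2(X, Y)]^2}}, \quad v = \frac{n(n-3)}{2}, \tag{6}$$
where, under independence, $T$ approximately follows a Student's t-distribution with $v - 1$ degrees of freedom.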
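The estimation steps above (pairwise distances, U-centering, the normalized inner product, and the t statistic) can be sketched in a few lines of plain Python. This is a minimal illustration under the standard Székely and Rizzo definitions, not the authors' implementation; the function names are hypothetical, and the sketch assumes n > 3 and non-constant data.

```python
import math

def pairwise_dist(rows):
    """Euclidean distance matrix; each element of `rows` is one sample."""
    return [[math.dist(r, s) for s in rows] for r in rows]

def u_center(d):
    """U-centering of a symmetric distance matrix (diagonal set to 0)."""
    n = len(d)
    row = [sum(r) for r in d]          # row sums equal column sums by symmetry
    total = sum(row)
    return [[0.0 if i == j else
             d[i][j] - (row[i] + row[j]) / (n - 2) + total / ((n - 1) * (n - 2))
             for j in range(n)] for i in range(n)]

def inner(a, b):
    """Normalized inner product of two U-centered matrices (needs n > 3)."""
    n = len(a)
    s = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n) if i != j)
    return s / (n * (n - 3))

def dcor_sq(x_rows, y_rows):
    """Bias-corrected squared distance correlation between two sample sets."""
    a = u_center(pairwise_dist(x_rows))
    b = u_center(pairwise_dist(y_rows))
    denom = math.sqrt(inner(a, a) * inner(b, b))
    return inner(a, b) / denom if denom > 0 else 0.0

def t_statistic(r, n):
    """t statistic and its degrees of freedom for a value r from dcor_sq."""
    v = n * (n - 3) / 2
    return math.sqrt(v - 1) * r / math.sqrt(1 - r * r), v - 1

# Illustrative data: a single feature and an exactly linear function of it.
x = [[float(i)] for i in range(12)]
y_lin = [[2.0 * r[0] + 1.0] for r in x]
r_lin = dcor_sq(x, y_lin)   # close to 1 for an exact linear relation
```

A p-value would follow from the upper tail of a t-distribution with the returned degrees of freedom, which requires a statistics library and is left out of this sketch.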

Kernel Methods
Kernel methods are also widely used when data have nonlinear relationships. Kernel methods map the original variable space $\mathbb{R}^p$ to a higher dimensional space $\mathbb{R}^P$ ($P$ can be either $\infty$ or a number greater than $p$) via a mapping function $\phi$ as
$$\phi : x \in \mathbb{R}^p \mapsto \phi(x) \in \mathbb{R}^P. \tag{7}$$
In order to reduce computational complexity and to avoid computing in $\mathbb{R}^\infty$, the kernel trick is used to compute with a kernel function instead of an explicit mapping function. A kernel function is defined as
$$k(x_1, x_2) := \langle \phi(x_1), \phi(x_2) \rangle_{\mathbb{R}^P}, \tag{8}$$
where $x_1$, $x_2 \in \mathbb{R}^p$ are two samples and $\langle \cdot, \cdot \rangle_{\mathbb{R}^P}$ denotes the inner product in $\mathbb{R}^P$ space.
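As a small illustration of the kernel trick, the sketch below uses the degree-2 homogeneous polynomial kernel on $\mathbb{R}^2$, chosen here only as a textbook example and not used in this paper. The kernel value computed directly from the inputs agrees with the inner product of the explicit feature maps in $\mathbb{R}^3$.

```python
import math

def phi(x):
    """Explicit feature map R^2 -> R^3 for the kernel k(x, z) = (x . z)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """Kernel value computed in the original space, without forming phi."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def dot(a, b):
    """Plain inner product of two equal-length tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, -1.0)
direct = poly2_kernel(x, z)        # works in R^2 only
explicit = dot(phi(x), phi(z))     # works in the mapped space R^3
print(direct, explicit)            # the two values agree up to rounding
```

For kernels whose feature space is infinite dimensional, such as the Gaussian kernel, only the first route is available, which is the practical motivation for the trick.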

Distance Canonical Correlation Analysis
Distance correlation provides a way to evaluate the dependence between two single variables or two sets of variables. Given two datasets $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$, it is of interest to identify which two single variables $x_1 \in \mathbb{R}^{n \times 1}$ and $y_1 \in \mathbb{R}^{n \times 1}$ are significantly dependent by computing their distance correlation. However, it may be difficult to detect significant distance correlations due to multiple testing correction. Multiple testing correction, e.g., Bonferroni correction, 8 is used to counteract the problem of multiple comparisons when conducting statistical inference on a large scale simultaneously, e.g., in GWAS. For GWAS, the scale of simultaneous inference is the variable/feature size $p$. For univariate distance correlation (distance correlation between two single variables), the scale of simultaneous inference is $p \times q$, which is much larger than that of GWAS, i.e., $p$.
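To make these scales concrete, the following sketch computes Bonferroni-corrected significance thresholds for the test counts discussed in this paper: $p$ tests (GWAS), $p \times q$ tests (feature pairs), and $2^{p+q}$ tests (pairs of feature subsets, as stated in the introduction). The values p = q = 100 and a family-wise level of 0.05 are illustrative assumptions only.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under Bonferroni correction."""
    return alpha / n_tests

alpha = 0.05          # assumed family-wise error level
p, q = 100, 100       # hypothetical feature sizes

gwas_tests = p                   # one test per feature
pairwise_tests = p * q           # one test per feature pair
subset_tests = 2 ** (p + q)      # one test per pair of feature subsets

print(bonferroni_threshold(alpha, gwas_tests))
print(bonferroni_threshold(alpha, pairwise_tests))
# The subset threshold is far too small to be useful, so report its log10.
log10_subset_threshold = math.log10(alpha) - math.log10(subset_tests)
print(log10_subset_threshold)
```

Under these assumed sizes the subset-level threshold lies dozens of orders of magnitude below the pairwise one, which is the difficulty that motivates searching for a single optimal combination of features instead of testing all subsets.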
Journal of Medical Imaging 026501-2 Apr-Jun 2019 • Vol. 6 (2)
In data applications, it is usually of interest to study groups of variables rather than a single feature. For example, complex phenotypes and diseases may be regulated by a group of genes and pathways. For brain imaging data, different brain regions function and harmonize in a connected network when performing a specific brain function. 10 Therefore, it is of interest to identify two subsets/groups of variables $X_{\mathrm{sub}} \in \mathbb{R}^{n \times r}$ ($1 \le r \le p$) and $Y_{\mathrm{sub}} \in \mathbb{R}^{n \times s}$ ($1 \le s \le q$) which are significantly dependent. However, the scale of simultaneous inference in this case is very large, i.e., $2^{p+q}$, making it more difficult to detect significantly dependent subsets. Motivated by the problem in detecting significant distance correlation, we develop a multivariate approach, namely DCCA, to seek the optimal combination of original variables with the highest distance correlations. Given two datasets $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$, distance CCA first projects original samples to a higher dimensional space as in the following procedure.
For any two single features $x_1$, $x_2 \in \mathbb{R}^{n \times 1}$ from data $X$, a distance kernel is defined as
[Eq. (9), truncated in the source: $k(x_1, x_2) := \sum_{i,j=1}^{n} \ldots$]
where $x_{*,i}$, $x_{*,j}$ denote the $i$'th and $j$'th elements of $x_*$ ($* = 1, 2$), respectively; and the corresponding mapping function is
[Eq. (10), missing in the source]
It is easy to check that Eq. (9) is a well-defined inner product in a reproducing kernel Hilbert space. With the distance kernel constructed, a multivariate method is used to find the optimal combination of original features/variables with the highest distance correlation by solving the optimization problem as
[Eq. (11), missing in the source]
where $K(X, Y)_{i,j} := k(x_i, y_j)$; $K(X, Y)_{i,j}$ denotes the $(i, j)$'th element of $K(X, Y)$; and $x_i$, $y_i$ denote the $i$'th column of $X$, $Y$, respectively. The detailed algorithm for the proposed model, DCCA, and the detailed procedures of solving the optimization problem [Eq. (11)] are described in Algorithm 1.
The framework of distance CCA is similar to that of kernel CCA, which is another nonlinear method, and therefore we call the constructed Gram matrix [Eqs. (9) and (10)] the "distance kernel." However, it is noteworthy that distance CCA differs from conventional kernel CCA and cannot be regarded as kernel CCA with a newly defined kernel function. For kernel CCA, there are a number of options for kernel functions, e.g., the Gaussian radial basis function kernel and the polynomial kernel. The choice of kernel function depends on the data distributions and the hidden relationship pattern within the data. The distance kernel function [Eqs. (9) and (10)] differs from a conventional kernel function in that the distance kernel retains the original feature information [for $X \in \mathbb{R}^{n \times p}$, the distance kernel operation gives $K(X, X) \in \mathbb{R}^{p \times p}$], while a conventional kernel function breaks the original feature structure [for $X \in \mathbb{R}^{n \times p}$, the kernel operation gives $K(X, X) \in \mathbb{R}^{n \times n}$]. Retaining the original feature structure enables distance CCA to perform feature selection, which can facilitate subsequent result interpretation. In comparison, it is difficult to interpret the result of kernel CCA since the original feature information is lost after kernel mapping.
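The feature-by-feature shape of the distance kernel matrix can be illustrated with a short sketch. Because the exact form of the distance kernel is not reproduced here, the code assumes, purely for illustration, that the kernel between two features is the normalized inner product of their U-centered pairwise distance matrices (i.e., an empirical distance covariance). The function names are hypothetical and this is not the authors' Algorithm 1.

```python
import random

def u_centered(col):
    """U-centered pairwise distance matrix of one feature (list of n values)."""
    n = len(col)
    d = [[abs(a - b) for b in col] for a in col]
    row = [sum(r) for r in d]
    total = sum(row)
    return [[0.0 if i == j else
             d[i][j] - (row[i] + row[j]) / (n - 2) + total / ((n - 1) * (n - 2))
             for j in range(n)] for i in range(n)]

def distance_gram(X, Y):
    """Assumed distance kernel matrix: entry (i, j) pairs feature i of X
    with feature j of Y, so the result is p x q rather than n x n."""
    n = len(X)
    xs = [u_centered([r[k] for r in X]) for k in range(len(X[0]))]
    ys = [u_centered([r[k] for r in Y]) for k in range(len(Y[0]))]

    def k(a, b):
        # Diagonals of U-centered matrices are zero, so summing all entries is safe.
        return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * (n - 3))

    return [[k(a, b) for b in ys] for a in xs]

rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(6)]   # n = 6, p = 3
Y = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(6)]   # n = 6, q = 2
K_xy = distance_gram(X, Y)   # 3 rows (features of X) by 2 columns (features of Y)
K_xx = distance_gram(X, X)   # 3 x 3, indexed by features rather than samples
```

Since rows and columns of these matrices are indexed by original features, weights learned on them can be read back as feature selections, which is the interpretability advantage described above.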

Simulation Test
To illustrate the strengths and limitations of our method, namely DCCA, we conducted a simulation study and compared the performance of DCCA with that of linear CCA. For performance comparison, two aspects were considered: correlation detection and feature selection.

Synthetic Data
We employed a latent variable model, 11 also used in Refs. 12 and 13, to simulate two correlated datasets $X \in \mathbb{R}^{n \times p}$, $Y \in \mathbb{R}^{n \times q}$, where $n$ represents sample/subject size, and $p$, $q$ represent feature size. Suppose we have two latent variables $u_1 \in \mathbb{R}^{n \times 1}$, $u_2 \in \mathbb{R}^{n \times 1}$, and $u_1$, $u_2$ are correlated. The correlation between data $X$ and $Y$ can be generated by loading $u_1$ and $u_2$ as follows:
$$X = u_1 \alpha_1^{\top} + E_X, \quad Y = u_2 \alpha_2^{\top} + E_Y, \tag{12}$$
where $E_X \in \mathbb{R}^{n \times p}$ and $E_Y \in \mathbb{R}^{n \times q}$ are background Gaussian noise, and $\alpha_1 \in \mathbb{R}^{p \times 1}$ and $\alpha_2 \in \mathbb{R}^{q \times 1}$ are loading vectors of latent variables.
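A minimal sketch of this data-generating step is given below. It assumes independent zero-mean Gaussian noise with a common standard deviation for every entry of the noise matrices; the function name `simulate` and the parameter `noise_sd` are hypothetical and are not taken from the cited works.

```python
import random

def simulate(u1, u2, alpha1, alpha2, noise_sd, rng):
    """Load latent variables onto features and add Gaussian background noise.

    u1, u2         : lists of n latent values (one per subject)
    alpha1, alpha2 : loading vectors of length p and q
    Returns X (n x p) and Y (n x q) as lists of rows.
    """
    X = [[u * a + rng.gauss(0.0, noise_sd) for a in alpha1] for u in u1]
    Y = [[u * a + rng.gauss(0.0, noise_sd) for a in alpha2] for u in u2]
    return X, Y

rng = random.Random(42)
u1 = [rng.gauss(0.0, 1.0) for _ in range(5)]
u2 = list(u1)                      # perfectly correlated latents, for illustration
X, Y = simulate(u1, u2, [1.0, 0.0, 2.0], [0.0, 3.0], noise_sd=0.1, rng=rng)
```

Features with a zero loading contain noise only, so any dependence between X and Y is carried entirely by the features with nonzero loadings.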

Three Types of Data Dependence Scenarios
In order to perform a comprehensive comparison, three types of data dependence scenarios were considered, including independence, nonlinear dependence, and linear dependence, as shown in Figs. 1(a)-1(c). The correlation between data $X$ and $Y$ originates from the correlation between latent variables $u_1$, $u_2$. Therefore, the three types of correlation scenarios can be generated by enforcing different relationship patterns on $u_1$, $u_2$. Three relation patterns were used, i.e., independence, sine function, and linear function, as shown in Figs. 1(a)-1(c), respectively.
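The three relationship patterns between the latent variables can be sketched as follows. The sampling range of the first latent variable and the exact sine and linear functions are assumptions made for illustration; the settings behind Fig. 1 are not specified in the text.

```python
import math
import random

def latent_pair(scenario, n, rng):
    """Draw latent variables (u1, u2) under one dependence scenario."""
    u1 = [rng.uniform(-math.pi, math.pi) for _ in range(n)]   # assumed range
    if scenario == "independent":
        u2 = [rng.uniform(-math.pi, math.pi) for _ in range(n)]
    elif scenario == "sine":
        u2 = [math.sin(u) for u in u1]     # nonlinear dependence
    elif scenario == "linear":
        u2 = list(u1)                      # linear dependence (identity map)
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    return u1, u2

rng = random.Random(0)
pairs = {name: latent_pair(name, 200, rng)
         for name in ("independent", "sine", "linear")}
```

Each pair can then be loaded onto the observed features through the latent variable model of Eq. (12).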

Results of Simulation Test
In each scenario, we implemented both CCA and our model, DCCA, to detect both the correlation and the true correlated features between two datasets. Note that loading vectors $\alpha_1$, $\alpha_2$ were sparse vectors in our experiment setting, i.e., most of the elements were zeros. The numbers of features, i.e., $p$, $q$, were 100 in our setting, among which only 20 features were set as true correlated features. That is to say, the length of loading vectors $\alpha_1 \in \mathbb{R}^{p \times 1}$ and $\alpha_2 \in \mathbb{R}^{q \times 1}$ was 100, and only 20 of their elements were nonzeros, as shown at the top of Fig. 2. An ideal method should be able to accurately detect the cross-data correlation and also the 20 true correlated features.
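The feature-selection aspect of this comparison can be made concrete with a small sketch that builds a sparse loading vector (100 entries, 20 nonzero) and scores a set of selected features against the true support. The precision and recall metrics and the range of the nonzero loadings are illustrative assumptions, not the evaluation reported in the figures.

```python
import random

def sparse_loading(p, n_active, rng):
    """Loading vector of length p with exactly n_active nonzero entries."""
    active = set(rng.sample(range(p), n_active))
    return [rng.uniform(0.5, 1.0) if k in active else 0.0 for k in range(p)]

def selection_scores(true_loading, selected):
    """Precision and recall of selected feature indices against the true support."""
    truth = {k for k, a in enumerate(true_loading) if a != 0.0}
    chosen = set(selected)
    hits = len(truth & chosen)
    precision = hits / len(chosen) if chosen else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

rng = random.Random(1)
alpha1 = sparse_loading(100, 20, rng)
support = [k for k, a in enumerate(alpha1) if a != 0.0]
print(selection_scores(alpha1, support))   # a perfect selection scores (1.0, 1.0)
```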
The results are shown in Figs

Brain Imaging Data and Brain Connectivity
DCCA was then applied to a brain development study focused on two experiments. The first studies imaging-genetic associations (Sec. 4.3), and the second studies the connections between different brain subnetworks or subdomains, e.g., the default mode network (DMN), and how these connections change across different age stages (Sec. 4.5). The imaging-genetic study analyzes the correlation between fMRI data, which detect changes in functional brain activity at the voxel level, and SNP data. SNPs are important genetic factors underlying differences in phenotypes among human beings. Genetic factors may function as a complicated group, e.g., a protein-protein interaction network or a gene pathway, when regulating a certain phenotype or disease. Similarly, neurons and brain regions also function and harmonize in a connected network when performing a specific brain function. 10 Therefore, distance CCA, which seeks the optimal combinations of features with the strongest cross-data associations, might be superior in detecting group-group nonlinear associations between brain imaging scans and genetic factors.

Data Collection and Preprocessing
The Philadelphia Neurodevelopmental Cohort (PNC) 15 is a large-scale collaborative study between the Brain Behavior Laboratory at the University of Pennsylvania and the Children's Hospital of Philadelphia. The data include fMRI and SNP data of adolescents aged from 8 to 21 years. The fMRI data were collected during a resting state from 857 subjects. After the collection of raw fMRI data, SPM12 16 was used to conduct motion correction, spatial normalization, spatial smoothing with a 3×3-mm Gaussian kernel, and multiple regression to mitigate the influence of motion. Finally, 264 ROIs (containing 21,384 voxels) were extracted based on the Power coordinates 17 with a sphere radius parameter of 5 mm. SNP data were collected from 7863 subjects based on four platforms: Illumina Human610Quadv1, HumanHap550v1, HumanHap550v3, and HumanOmniExpress. SNPs with >5% missing values were deleted, and the remaining missing values were imputed using Plink. 18,19 Then, the SNPs within gene bodies were kept, resulting in 95,639 SNPs.
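The missingness filter described above can be sketched on a toy genotype matrix. The data layout (one row per subject, `None` marking a missing call) and the function name are assumptions for illustration; the actual preprocessing was carried out with Plink, whose commands are not reproduced here.

```python
def filter_snps(genotypes, max_missing=0.05):
    """Drop SNP columns whose missing fraction exceeds max_missing.

    genotypes : list of subjects, each a list of SNP calls (None = missing)
    Returns the kept column indices and the filtered matrix.
    """
    n_subjects = len(genotypes)
    n_snps = len(genotypes[0])
    keep = [j for j in range(n_snps)
            if sum(row[j] is None for row in genotypes) / n_subjects <= max_missing]
    return keep, [[row[j] for j in keep] for row in genotypes]

# Toy data: 40 subjects, 3 SNPs with 0, 1 (2.5%), and 4 (10%) missing calls.
toy = [[0, 1, 2] for _ in range(40)]
toy[0][1] = None
for i in range(4):
    toy[i][2] = None
kept, filtered = filter_snps(toy)   # the third SNP exceeds the 5% limit
```

Missing calls that remain after filtering would still need imputation, which this sketch does not attempt.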

Imaging-Genetic Associations
In order to implement distance CCA, the subjects having both fMRI and SNP data were further extracted, resulting in 855 subjects. For fMRI data, the stimulus-on versus stimulus-off contrast was obtained from the raw resting-state time series data. To find the interactions that are more related to mental disorders, SNPs located in genes associated with brain disorders were kept, where the brain disorders included schizophrenia, bipolar disorder, depression, attention-deficit/hyperactivity disorder, and post-traumatic stress disorder. Finally, 736 genes containing 21,487 SNPs were left for further analysis.
When applied to detect the group associations between fMRI and SNPs, distance CCA identified a subset of 45 genes and a subset of 15 ROIs that were strongly correlated. The distance correlation between the identified ROIs and genes was 0.2047 with a p-value of 6.58e−30 [calculated based on Eq. (6)]. In comparison, the largest single ROI-gene distance correlation was 0.1759 with a p-value of 1.29e−18. This demonstrated that distance covariance-based CCA can find a pair of variable groups with an enhanced distance correlation and significance level. The lists of the identified genes and ROIs are in Tables 1 and 2, respectively. The locations of the identified ROIs are further visualized in Fig. 5 using the BrainNet Viewer toolbox. 14,20
After that, gene enrichment analysis was conducted to reveal the underlying biological functions of the identified genes. Ten pathways were selected with a screening of q-value <0.05 (q-value represents the multiple testing corrected p-value), and the pathways together with their corresponding q-values are listed in Table 3. P-values are calculated using the hypergeometric test based on the numbers of genes in the particular biological pathway and the identified gene set. The q-values are then calculated by correcting the p-values using multiple testing correction, e.g., Bonferroni correction, 8 based on the false discovery rate method. Among the identified pathways, pathways "neurodegenerative diseases," "oxidative damage," and "deregulated CDK5 triggers multiple neurodegenerative pathways in Alzheimer's disease models" have been reported to be related to neuron activities and brain development. Pathway "neurodegenerative diseases" is related to the death of neurons and corticobasal degeneration, which might further lead to the progressive dysfunction in the brain and a number of mental disorders. 21 Pathway "oxidative damage," which is related to cell signaling, may lead to cell damage and the death of neurons. 22 It may be related to the pathogenesis of several neurodegenerative diseases, including Parkinson's disease, 23 depression, 24 and Alzheimer's disease. 25 For pathway "deregulated CDK5 triggers multiple neurodegenerative pathways in Alzheimer's disease models," abnormal CDK5 may result in unregulated activation of the cell cycle, 26 which might further lead to the death of neurons. 27 Mental disorders, such as Alzheimer's disease, may occur if CDK5 is deregulated. 28 The interactions of pathways "neurodegenerative diseases" and "deregulated CDK5 triggers multiple neurodegenerative pathways in Alzheimer's disease models" are visualized in Figs. 6 and 7, respectively. Figure 6 was plotted using Cytoscape software, 29 which is an open source platform for visualizing complex networks. Figure 7 was generated using the Reactome pathway database. 30
For brain imaging data, as shown in Table 2, the majority (13/15) of the detected ROIs are from three brain subdomains: sensorimotor network (SM), DMN, and visual network (VIS). SM is related to the coordination of the body when performing motor tasks. 31 DMN is the dominant network when subjects are in resting state, mind-wandering, or not involved in a specific task. Dysfunction within the DMN has been associated with several mental disorders, 32,33 e.g., schizophrenia, depression, and autism. Associations between DMN and genetic factors exist according to a multivariate study of schizophrenia subjects scanned during the resting state. 34
Fig. 3 Performance comparison between CCA and DCCA [nonlinear dependence scenario: Fig. 1(b)].
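The pathway enrichment step described in this section (a hypergeometric test per pathway followed by false discovery rate correction) can be sketched as follows. The choice of the 736 analyzed genes as the background universe, the pathway size, and the overlap count are hypothetical values for illustration, and the Benjamini-Hochberg procedure is assumed as the false discovery rate method; none of these details are confirmed by the text.

```python
from math import comb

def hypergeom_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn without
    replacement from n_universe genes, n_pathway of which are in the pathway."""
    total = comb(n_universe, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):            # from the largest p-value down
        i = order[rank - 1]
        running = min(running, pvalues[i] * m / rank)
        qvalues[i] = running
    return qvalues

# Hypothetical pathway: 30 of 736 background genes, 6 among the 45 selected genes.
p_enrich = hypergeom_pvalue(736, 30, 45, 6)
q_demo = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

A pathway would then be reported when its q-value falls below the 0.05 screening level used above.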

Functional Connectivity Between Brain Subnetworks
For the brain functional connectivity (FC) study, we selected five brain subnetworks or subdomains and then applied distance CCA to investigate the connections between each subnetwork pair and to study the age effects on the connections. Resting-state fMRI was used in this experiment, and data were preprocessed using the group ICA of fMRI toolbox 35 for independent component analysis (ICA). 36 The five brain subnetworks include SM, VIS, cognitive control network (CCN), auditory network (AUD), and DMN, and the corresponding locations in the brain are shown in Fig. 8.
In order to investigate both linear and nonlinear connections of the brain, we applied both linear CCA and distance CCA to the PNC data, and the results are shown in Fig. 9. The results were based on a 10-fold cross-validation, in which each time five folds were used as training data and the remaining five folds were used as testing data. It is noteworthy that the metric of distance correlation is different from that of linear Pearson correlation, e.g., distance correlation = 0.4 ⇎ Pearson correlation = 0.4. Nevertheless, distance correlation reflects the relative

Age Effects on Brain FC
It is of interest to investigate how brain connectivity changes during adolescence and how it changes across different age stages, e.g., children and young adults, which may further contribute to the study of normal and pathological brain development. Three age groups, 8 to 11 years, 13 to 16 years, and 18 to 22 years, were selected and then distance CCA was applied to each age group to analyze brain network connections. Subjects aged 12 and 17 years were not included in the experiments in order to get a clear boundary between different age groups. The connections between brain subnetworks for each age group are shown in Fig. 10. From Fig. 10, the patterns of the connections are different between different age groups. For instance, the connections between different brain networks are relatively weaker at age 8 to 11 but become relatively stronger at age 13 to 16 and age 18 to 22. It demonstrates that different brain regions become more and more connected during adolescence, which may be a result of the training and development of the brain during multiple types of brain activities. Moreover, it seems that the connections between CCN and SM are weak across all three age groups, which indicates that the connection between CCN and SM may be weak at the adolescent stage.
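The age grouping used in this experiment can be expressed as a small helper that assigns subjects to the three groups and leaves out ages 12 and 17. Ages are assumed to be recorded in whole years, and the names below are hypothetical.

```python
AGE_GROUPS = {"8-11": (8, 11), "13-16": (13, 16), "18-22": (18, 22)}

def assign_age_group(age):
    """Return the group label for an age in whole years, or None if excluded."""
    for name, (low, high) in AGE_GROUPS.items():
        if low <= age <= high:
            return name
    return None   # ages 12 and 17 (and anything outside 8 to 22) are left out

def split_by_age(ages):
    """Map each group label to the indices of the subjects in that group."""
    groups = {name: [] for name in AGE_GROUPS}
    for idx, age in enumerate(ages):
        name = assign_age_group(age)
        if name is not None:
            groups[name].append(idx)
    return groups

groups = split_by_age([9, 12, 14, 17, 20])
```

Distance CCA would then be run separately on the subjects indexed by each group.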

Discussion and Conclusion
In this work, we proposed a new model, DCCA, which overcomes the limitation of distance correlation in detecting significant associations when the feature size is large. Conventional distance correlation analysis needs large-scale multiple testing when testing feature-feature associations simultaneously. The proposed model, DCCA, addresses the problem by searching for a combination of original features with the highest distance correlation. This is achieved by first constructing a distance kernel function and then solving an optimization problem. The ability to detect nonlinear group-group associations makes DCCA more suitable for analyzing complex multi-omics and imaging-genetic associations, in which both genetic factors and brain ROIs may work as groups when regulating a phenotype or performing a specific brain function. When applied to the imaging-genetic association study, DCCA detected a strong correlation between a subset of genes and a subset of brain ROIs with an improved significance level. Several neuron degeneration and mental disorder-related pathways were enriched from the identified genes after gene enrichment analysis, which demonstrated the biological significance of our findings. In addition, DCCA found several mental disorder-related brain networks which had been reported in the existing literature. Experiments on brain connectivity also yielded several new findings using DCCA. The DMN, which is considered to be distinct from other brain domains/networks, may have strong nonlinear connections with other brain networks according to the results of DCCA.
Fig. 7 The interaction mechanisms of the pathway "deregulated CDK5 triggers multiple neurodegenerative pathways in Alzheimer's disease models."
When applied to analyzing each age group, DCCA reveals that younger groups (8 to 11 years) exhibit weak connections of brain networks, while the connections become strong at older age stages (13 to 16 and 18 to 22 years), which may be a result of brain development. The discoveries of imaging genetic associations and brain connections verified the performance of DCCA.
Besides the examples in this study, DCCA may find more applications in multi-imaging and multi-omics studies, where identifying correlations between multiple datasets is a common challenge.
Fig. 8 The sagittal, coronal, and axial views of brain functional network domains extracted via group ICA. The names of the brain network domains are: SM, AUD, VIS, DMN, CCN, and salience network (SAL).

Disclosures
The authors have no relevant financial interests in the paper and no other potential conflicts of interest to disclose.