The brain is a complex organ, and investigating its development and its relationship with genomics is of great interest. Advances in neuroimaging, e.g., functional magnetic resonance imaging (fMRI), and in the sequencing of genetic variations, e.g., single nucleotide polymorphisms (SNPs), have facilitated the analysis of the relationship between brain regions and genetic variations. fMRI detects changes in functional brain activity at each voxel, and voxels can be clustered into regions of interest (ROIs). SNPs are important genetic factors underlying differences in phenotypes among human beings. Association analyses, e.g., canonical correlation analysis (CCA),1 have been conducted to study brain connectivity and how genetic factors and endophenotypes interact.2 However, these methods typically use Pearson correlation, which captures only linear relationships, while nonlinear correlations may exist among brain regions.3
To address the limitation of Pearson correlation-based methods, Székely et al.4 proposed a correlation measurement, distance correlation, which evaluates the dependence between two single variables or two sets of variables. The property that distance correlation equals 0 if and only if two variables are independent enables it to detect both linear and nonlinear associations. Besides the ability to detect nonlinear correlations, the flexibility to detect both single–single feature correlations and set–set feature correlations also helps distance correlation find many applications in imaging genetics and brain connectivity studies. Geerligs et al.5 investigated the dependence between different ROIs using multivariate distance correlation, and the results tended to be more robust than those obtained with Pearson correlation. Fang et al.6 investigated complex imaging genetics associations using projected distance correlation, which was more accurate and faster.
Székely and Rizzo7 constructed a statistic to evaluate the statistical significance of the distance correlation between two single variables or two sets of variables. Despite this well-constructed theoretical work, a challenge for applying distance correlation exists in multiple testing correction. Large-scale simultaneous inference testing, e.g., a genome-wide association study (GWAS), needs multiple testing correction, e.g., Bonferroni correction,8 in order to prevent erroneous inferences. For distance correlation, the scale of simultaneous inference is $p \times q$ ($p$, $q$ are the variable sizes of the two datasets), which is much larger than that of GWAS, i.e., $p$. As a result, it might be difficult to detect significant variable–variable distance correlations due to the harsher testing correction. For testing the distance correlation between two subsets of variables, the scale of multiple inference testing is even larger, i.e., $2^{p} \times 2^{q}$, and consequently the detection of significant associations becomes even more difficult.
To address this challenge, we propose a new framework, distance canonical correlation analysis (DCCA), which overcomes the problem by searching for a combination of original features with the highest distance correlation. This is achieved by first constructing a distance kernel function and then solving a subsequent optimization problem. In this way, DCCA can detect both linear and nonlinear correlations and can identify a subset of features that are significantly correlated.
This work is an expansion of a preliminary work, “A hybrid correlation analysis with application to imaging genetics,”9 which was published in the proceedings of SPIE Medical Imaging 2018. This work refines the conference paper by adding more detailed procedures for the method and more applications to both the fusion of imaging genetics data and the fusion of multiple brain imaging data. The rest of this paper is organized as follows. Section 2 first introduces distance correlation with its pros and cons and then discusses how the proposed model, DCCA, can overcome its limitation. Section 3 presents a simulation experiment to verify the performance of DCCA. Section 4 presents the collection and preprocessing of data as well as the experiments applying DCCA to the detection of imaging genetic associations and to a brain connectivity study. Discussion and conclusions are in Sec. 5.
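Distance correlation, proposed by Székely et al.,4 measures the dependence between two single variables or two sets of variables. Suppose we have two sets of random variables $X \in \mathbb{R}^{p}$ and $Y \in \mathbb{R}^{q}$ (where $p$, $q$ represent the feature sizes of $X$, $Y$, respectively) with characteristic functions $f_X$ and $f_Y$ and joint characteristic function $f_{X,Y}$. Variable dimensionality $p$, $q$ can either be 1 (two single variable case) or greater than 1 (two sets of variables case). The distance covariance between $X$ and $Y$ is defined as
$$\mathcal{V}^{2}(X,Y) = \frac{1}{c_p c_q}\int_{\mathbb{R}^{p+q}} \frac{\left|f_{X,Y}(t,s) - f_X(t)\,f_Y(s)\right|^{2}}{\|t\|^{1+p}\,\|s\|^{1+q}}\,dt\,ds, \tag{1}$$
where $c_d = \pi^{(1+d)/2}/\Gamma\left((1+d)/2\right)$.
The distance correlation between $X$ and $Y$ is defined as
$$\mathcal{R}^{2}(X,Y) = \frac{\mathcal{V}^{2}(X,Y)}{\sqrt{\mathcal{V}^{2}(X,X)\,\mathcal{V}^{2}(Y,Y)}} \tag{2}$$
when $\mathcal{V}^{2}(X,X)\,\mathcal{V}^{2}(Y,Y) > 0$, and 0 otherwise.
It has been proved that distance correlation equals 0 iff $X$ and $Y$ are independent, i.e.,
$$\mathcal{R}(X,Y) = 0 \iff X \text{ and } Y \text{ are independent}. \tag{3}$$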
Distance correlation outperforms conventional Pearson correlation in that it can detect both linear and nonlinear associations due to Eq. (3).
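The contrast with Pearson correlation can be illustrated on a toy example. The sketch below is not taken from the paper: it uses the simple double-centered (V-statistic) sample distance correlation of Székely et al. for two scalar variables, and the data ($y = x^2$ on a symmetric grid) are chosen purely for illustration.

```python
import numpy as np

def double_center(d):
    """Subtract row and column means and add back the grand mean."""
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def sample_dcor(x, y):
    """Sample distance correlation of two 1-D arrays (double-centered form)."""
    a = double_center(np.abs(x[:, None] - x[None, :]))
    b = double_center(np.abs(y[:, None] - y[None, :]))
    dcov2 = max(float((a * b).mean()), 0.0)          # squared distance covariance
    dvar2 = float((a * a).mean() * (b * b).mean())   # product of squared distance variances
    return 0.0 if dvar2 == 0.0 else float(np.sqrt(dcov2 / np.sqrt(dvar2)))

# Illustrative data: a purely nonlinear (quadratic) dependence.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

pearson = float(np.corrcoef(x, y)[0, 1])
dcor = sample_dcor(x, y)
print(f"Pearson correlation:  {pearson:.3f}")
print(f"Distance correlation: {dcor:.3f}")
```

Because the grid is symmetric around zero, the Pearson correlation vanishes even though $y$ is a deterministic function of $x$, while the distance correlation stays clearly above zero.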
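For sample data $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$, where $n$ denotes sample size, the empirical distance covariance between $X$ and $Y$ can be estimated as follows. First, we calculate the Euclidean distance between each sample pair
$$a_{kl} = \|X_k - X_l\|_2, \quad b_{kl} = \|Y_k - Y_l\|_2, \quad k, l = 1, \ldots, n,$$
where $X_k$ and $Y_k$ denote the $k$'th samples (rows) of $X$ and $Y$.
Second, U-centering is applied to the Euclidean distance $a_{kl}$ as
$$\tilde{A}_{kl} = \begin{cases} a_{kl} - \dfrac{1}{n-2}\sum\limits_{i=1}^{n} a_{il} - \dfrac{1}{n-2}\sum\limits_{j=1}^{n} a_{kj} + \dfrac{1}{(n-1)(n-2)}\sum\limits_{i,j=1}^{n} a_{ij}, & k \neq l, \\ 0, & k = l. \end{cases}$$
The U-centered $\tilde{B}_{kl}$ can be calculated similarly, i.e., by applying U-centering to the Euclidean distance $b_{kl}$. Then, the empirical distance correlation can be calculated as
$$\mathcal{R}^{*}_{n}(X,Y) = \frac{(\tilde{A}\cdot\tilde{B})}{\sqrt{(\tilde{A}\cdot\tilde{A})\,(\tilde{B}\cdot\tilde{B})}}, \qquad (\tilde{A}\cdot\tilde{B}) = \frac{1}{n(n-3)}\sum_{k \neq l}\tilde{A}_{kl}\,\tilde{B}_{kl}.$$
A statistic following a $t$-distribution, provided by Székely and Rizzo,7 is used to evaluate the significance of distance correlation as
$$T = \sqrt{\nu - 1}\,\frac{\mathcal{R}^{*}_{n}}{\sqrt{1 - (\mathcal{R}^{*}_{n})^{2}}}, \qquad \nu = \frac{n(n-3)}{2}, \tag{6}$$
which approximately follows a Student $t$-distribution with $\nu - 1$ degrees of freedom under the null hypothesis of independence.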
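The estimation steps above (pairwise distances, U-centering, bias-corrected distance correlation, and the $t$-statistic) can be sketched in a few lines. This is a minimal illustration following the standard definitions of Székely and Rizzo rather than the authors' code; the function names are hypothetical, and the $p$-value uses a normal approximation to the $t$-distribution, which is an assumption that is reasonable only because the degrees of freedom grow like $n^2/2$.

```python
import math
import numpy as np

def pairwise_dist(z):
    """Euclidean distance between every pair of samples (rows) of z."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def u_center(d):
    """U-center a distance matrix; the diagonal is set to zero."""
    n = d.shape[0]
    row = d.sum(axis=1, keepdims=True) / (n - 2)
    col = d.sum(axis=0, keepdims=True) / (n - 2)
    total = d.sum() / ((n - 1) * (n - 2))
    a = d - row - col + total
    np.fill_diagonal(a, 0.0)
    return a

def dcor_t_test(x, y):
    """Return (bias-corrected distance correlation, t-statistic, approx. p-value)."""
    a = u_center(pairwise_dist(x))
    b = u_center(pairwise_dist(y))
    n = a.shape[0]
    scale = n * (n - 3)
    ab = (a * b).sum() / scale
    aa = (a * a).sum() / scale
    bb = (b * b).sum() / scale
    r = float(ab / math.sqrt(aa * bb))
    nu = n * (n - 3) / 2.0
    t = math.sqrt(nu - 1.0) * r / math.sqrt(1.0 - r * r)
    # Assumption: upper-tail normal approximation instead of the exact t tail.
    p_value = 0.5 * math.erfc(t / math.sqrt(2.0))
    return r, t, p_value

# Illustrative data (not from the paper): y depends on x only through a square.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y_dep = x[:, :1] ** 2 + 0.1 * rng.normal(size=(200, 1))
y_ind = rng.normal(size=(200, 2))

r_dep, t_dep, p_dep = dcor_t_test(x, y_dep)
r_ind, t_ind, p_ind = dcor_t_test(x, y_ind)
print(f"dependent:   R* = {r_dep:.3f}, T = {t_dep:.2f}, p = {p_dep:.2e}")
print(f"independent: R* = {r_ind:.3f}, T = {t_ind:.2f}, p = {p_ind:.2e}")
```

Note that the bias-corrected estimate can be slightly negative for independent data, which is expected behaviour of the U-centered estimator.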
Kernel methods are also widely used when data have nonlinear relationships. Kernel methods map the original variable space $\mathbb{R}^{d}$ to a higher dimensional space $\mathbb{R}^{D}$ ($D$ can be either $\infty$ or a number greater than $d$) via a mapping function $\phi$ as
$$\phi: x \in \mathbb{R}^{d} \mapsto \phi(x) \in \mathbb{R}^{D}.$$
In order to reduce computational complexity and to avoid computing in $\mathbb{R}^{D}$, the kernel trick is used to compute with a kernel function $K$ instead of an explicit mapping function $\phi$. A kernel function is defined as
$$K(x, x') = \langle \phi(x), \phi(x') \rangle.$$
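As a small illustration of the kernel trick, the sketch below checks that a homogeneous degree-2 polynomial kernel gives the same value as an inner product after an explicit mapping. The kernel and the mapping are textbook examples chosen for illustration; they are not the kernel used by DCCA.

```python
import numpy as np

def explicit_map(x):
    """phi(x): all pairwise products x_i * x_j, flattened (dimension d**2)."""
    return np.outer(x, x).ravel()

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel, K(x, z) = (x . z)**2."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

via_mapping = float(np.dot(explicit_map(x), explicit_map(z)))  # inner product in R^9
via_kernel = poly2_kernel(x, z)                                # computed in R^3 only
print(via_mapping, via_kernel)  # both equal 20.25
```

The kernel evaluation never forms the 9-dimensional vectors, which is the point of the trick when the mapped space is very large or infinite dimensional.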
Distance Canonical Correlation Analysis
Distance correlation provides a way to evaluate the dependence between two single variables or two sets of variables. Given two datasets $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$, it is of interest to identify which two single variables $x_i$ and $y_j$ are significantly dependent by computing their distance correlation. However, it may be difficult to detect significant distance correlations due to multiple testing correction. Multiple testing correction, e.g., Bonferroni correction,8 is used to counteract the problem of multiple comparisons when conducting a large number of statistical inferences simultaneously, e.g., in GWAS. For a GWAS, the scale of simultaneous inference is the variable/feature size $p$. For univariate distance correlation (distance correlation between two single variables), the scale of simultaneous inference is $p \times q$, which is much larger than that of GWAS, i.e., $p$.
In data applications, it is usually of interest to study groups of variables rather than a single feature. For example, complex phenotypes and diseases may be regulated by a group of genes and pathways. For brain imaging data, different brain regions function and harmonize in a connected network when performing a specific brain function.10 Therefore, it is of interest to identify two subsets/groups of variables, one from each dataset, which are significantly dependent. However, the scale of simultaneous inference in this case is very large, i.e., $2^{p} \times 2^{q}$, making it more difficult to detect significantly dependent subsets.
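To give a feel for these scales, the sketch below compares Bonferroni thresholds for the three settings. It assumes a family-wise level of 0.05 and borrows the feature sizes from the application section (264 ROIs and 736 genes) purely as an illustration; the subset count assumes every subset of each dataset is a candidate, and it is handled on a log scale because the number itself is astronomically large.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under Bonferroni correction."""
    return alpha / n_tests

alpha = 0.05        # assumed family-wise error level
p, q = 264, 736     # illustrative feature sizes (ROIs, genes)

n_single = p                                     # GWAS-like: one test per feature
n_pairs = p * q                                  # variable-variable distance correlations
log10_n_subsets = (p + q) * math.log10(2.0)      # log10 of 2**p * 2**q subset pairs

print(f"single features: {n_single} tests, threshold {bonferroni_threshold(alpha, n_single):.2e}")
print(f"feature pairs:   {n_pairs} tests, threshold {bonferroni_threshold(alpha, n_pairs):.2e}")
print(f"subset pairs:    about 10^{log10_n_subsets:.0f} tests, "
      f"threshold about 10^{math.log10(alpha) - log10_n_subsets:.0f}")
```

Under these assumptions the per-test threshold for subset pairs is far below anything a finite sample could reach, which is the motivation for searching for a single optimal combination instead of testing all candidates.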
Motivated by the problem of detecting significant distance correlations, we develop a multivariate approach, namely DCCA, to seek the optimal combination of original variables with the highest distance correlation. Given two datasets $X$ and $Y$, distance CCA first projects the original samples to a higher dimensional space as in the following procedure.
For any two single features $x_i$, $x_j$ from data $X$, a distance kernel is defined as
It is easy to check that Eq. (9) is a well-defined inner product in a reproducing kernel Hilbert space. With the distance kernel constructed, a multivariate method is used to find the optimal combination of original features/variables with the highest distance correlation by solving the optimization problem as
The detailed algorithm for the proposed model, DCCA, and the detailed procedures of solving the optimization problem [Eq. (11)] are described in Algorithm 1.
Algorithm 1 Algorithm for DCCA.
|1: Input: datasets $X$, $Y$; initial loading vectors|
|2: Output: optimal loading vectors|
|3: Construct distance kernel Gram matrices|
|8: Solve optimization problem [Eq. (11)]|
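One possible reading of this procedure can be sketched as follows. Everything in the block is an assumption made for illustration and is not the authors' Algorithm 1: each feature is embedded as its flattened U-centered distance matrix, the Gram matrices are the resulting feature-by-feature inner products, and the optimization is taken to be a ridge-regularized CCA-type problem solved by whitening and a singular value decomposition. The names `feature_embedding`, `whiten`, `dcca_sketch`, and the `ridge` parameter are hypothetical, and no sparsity constraint is included.

```python
import numpy as np

def u_center(d):
    """U-center a pairwise distance matrix (zero diagonal by convention)."""
    n = d.shape[0]
    row = d.sum(axis=1, keepdims=True) / (n - 2)
    col = d.sum(axis=0, keepdims=True) / (n - 2)
    total = d.sum() / ((n - 1) * (n - 2))
    a = d - row - col + total
    np.fill_diagonal(a, 0.0)
    return a

def feature_embedding(data):
    """Map each standardized feature (column) to its flattened U-centered distance matrix."""
    n, p = data.shape
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    phi = np.empty((p, n * n))
    for j in range(p):
        d = np.abs(z[:, j][:, None] - z[:, j][None, :])
        phi[j] = u_center(d).ravel()
    return phi / np.sqrt(n * (n - 3))

def whiten(k, ridge):
    """Cholesky factor of a ridge-regularized Gram matrix."""
    size = k.shape[0]
    return np.linalg.cholesky(k + ridge * np.trace(k) / size * np.eye(size))

def dcca_sketch(x, y, ridge=1e-2):
    """Assumed problem: max u' K_xy v  s.t.  u' K_xx u = v' K_yy v = 1."""
    phi_x, phi_y = feature_embedding(x), feature_embedding(y)
    k_xy = phi_x @ phi_y.T                      # p x q cross Gram matrix
    l_x = whiten(phi_x @ phi_x.T, ridge)        # p x p Gram matrix of X features
    l_y = whiten(phi_y @ phi_y.T, ridge)        # q x q Gram matrix of Y features
    # Whitened cross matrix L_x^{-1} K_xy L_y^{-T}; its top singular pair solves the problem.
    m = np.linalg.solve(l_y, np.linalg.solve(l_x, k_xy).T).T
    left, sing, right_t = np.linalg.svd(m)
    u = np.linalg.solve(l_x.T, left[:, 0])
    v = np.linalg.solve(l_y.T, right_t[0])
    return u, v, float(sing[0])

# Illustrative data: only feature 1 of x and feature 0 of y are (nonlinearly) related.
rng = np.random.default_rng(0)
x = rng.normal(size=(80, 4))
y = rng.normal(size=(80, 3))
y[:, 0] = x[:, 1] ** 2 + 0.1 * rng.normal(size=80)

u, v, rho = dcca_sketch(x, y)
print("abs(u):", np.round(np.abs(u), 2))
print("abs(v):", np.round(np.abs(v), 2))
print("objective:", round(rho, 3))
```

Because the Gram matrices are indexed by features, the loading vectors have one weight per original variable, so large weights can be read directly as selected features. The embedding stores a $p \times n^2$ array, so this sketch is only practical for small $n$ and $p$.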
The framework of distance CCA is similar to that of kernel CCA, which is another nonlinear method, and therefore we call the constructed Gram matrix [Eqs. (9) and (10)] “distance kernel.” However, it is noteworthy that distance CCA differs from conventional kernel CCA and cannot be regarded as kernel CCA with a newly defined kernel function. For kernel CCA, there are a number of options for kernel functions, e.g., the Gaussian radial basis function kernel and the polynomial kernel. The choice of kernel function depends on the data distributions and the hidden relationship pattern within the data. The distance kernel function [Eqs. (9) and (10)] differs from a conventional kernel function in that the distance kernel retains the original feature information [for $X \in \mathbb{R}^{n \times p}$, the distance kernel operation yields a $p \times p$ Gram matrix indexed by features], while a conventional kernel function breaks the original feature structure [for $X \in \mathbb{R}^{n \times p}$, the kernel operation yields an $n \times n$ Gram matrix indexed by samples]. Retaining the original feature structure enables distance CCA to perform feature selection, which can facilitate subsequent result interpretation. In comparison, it is difficult to interpret the result of kernel CCA since the original feature information is lost after kernel mapping.
To illustrate the strengths and limitations of our method, DCCA, we conducted a simulation study and compared the performance of DCCA with that of linear CCA. For the performance comparison, two aspects were considered: correlation detection and feature selection.
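We employed a latent variable model,11 also used in other works,12,13 to simulate two correlated datasets $X \in \mathbb{R}^{n \times p}$, $Y \in \mathbb{R}^{n \times q}$, where $n$ represents the sample/subject size, and $p$, $q$ represent the feature sizes. Suppose we have two latent variables, one for each dataset, and the two latent variables are correlated. The correlation between data $X$ and $Y$ can then be generated through a pair of loading vectors as follows: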
Three Types of Data Dependence Scenarios
In order to perform a comprehensive comparison, three types of data dependence scenarios were considered, including independence, nonlinear dependence, and linear dependence. The correlation between data $X$ and $Y$ originates from the correlation between the two latent variables. Therefore, the three types of correlation scenarios can be generated by enforcing different relationship patterns on the latent variables. Three relation patterns were used, i.e., independence, sine function, and linear function, as shown in Figs. 1(a)–1(c), respectively.
Results of Simulation Test
In each scenario, we implemented both CCA and our model, DCCA, to detect both the correlation and the true correlated features between the two datasets. Note that the loading vectors were sparse in our experiment setting, i.e., most of their elements were zeros. The numbers of features, i.e., $p$, $q$, were 100 in our setting, among which only 20 features were set as true correlated features. That is to say, the length of each loading vector was 100, and only 20 of its elements were nonzero, as shown at the top of Fig. 2. An ideal method should be able to accurately detect the cross-data correlation and also the 20 true correlated features.
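A data generator in the spirit of this setup might look like the sketch below. Only the feature sizes (100 per dataset) and the number of true correlated features (20) come from the text; the sample size, noise level, sine frequency, the unit-valued loadings, and all names (`simulate`, `z_x`, `z_y`, `u`, `v`) are assumptions introduced for illustration.

```python
import numpy as np

def simulate(scenario, n=200, p=100, q=100, n_active=20, noise=0.5, seed=0):
    """Simulate two datasets driven by one latent variable each.

    scenario: "independent", "sine", or "linear" relation between the latents.
    """
    rng = np.random.default_rng(seed)
    z_x = rng.normal(size=n)                      # latent variable behind X
    if scenario == "independent":
        z_y = rng.normal(size=n)
    elif scenario == "sine":
        z_y = np.sin(3.0 * z_x)                   # assumed frequency
    elif scenario == "linear":
        z_y = z_x.copy()
    else:
        raise ValueError(f"unknown scenario: {scenario}")

    # Sparse loading vectors: only the first n_active entries are nonzero.
    u = np.zeros(p)
    v = np.zeros(q)
    u[:n_active] = 1.0
    v[:n_active] = 1.0

    x = np.outer(z_x, u) + noise * rng.normal(size=(n, p))
    y = np.outer(z_y, v) + noise * rng.normal(size=(n, q))
    return x, y, u, v

for name in ("independent", "sine", "linear"):
    x, y, u, v = simulate(name)
    r = np.corrcoef(x[:, 0], y[:, 0])[0, 1]       # Pearson between two active features
    print(f"{name:12s} Pearson(x_0, y_0) = {r:+.3f}")
```

With this construction the sine scenario carries a strong dependence between the active features but almost no linear correlation, which is the situation in which a Pearson-based method is expected to struggle.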
The results are shown in Figs. 2–4 for the three scenarios. In each figure, the top two subfigures represent the ground truth of the true correlated features, and the bottom four subfigures represent the features identified by CCA and DCCA, respectively. From Fig. 2, when the two datasets are independent, both CCA and DCCA detect only a weak correlation (CCA: 0.0739 versus DCCA: 0.1019) and neither method identifies the true correlated features. From Fig. 4, when the two datasets follow a linear relationship, both CCA and DCCA detect a strong correlation (CCA: 0.9807 versus DCCA: 0.9525) and both methods accurately identify the true correlated features. From Fig. 3, when the two datasets follow a nonlinear relationship, CCA cannot detect the correlation (CCA: 0.0886) and cannot identify the true correlated features. In comparison, DCCA detects the nonlinear correlation (DCCA: 0.6772) and also the true correlated features. The results in the three scenarios, i.e., Figs. 1–4, indicate that DCCA outperforms conventional CCA in detecting nonlinear correlations and the associated features, while performing comparably in the linear case.
Application to Brain Imaging Data
Brain Imaging Data and Brain Connectivity
DCCA was then applied to a brain development study in two experiments. The first studies imaging-genetic associations (Sec. 4.3), and the second studies the connections between different brain subnetworks or subdomains, e.g., the default mode network (DMN), and how these connections change across age stages (Sec. 4.5). An imaging-genetic study analyzes the correlation between fMRI data, which capture changes in functional brain activity at the voxel level, and SNP data. SNPs are important genetic factors underlying differences in phenotypes among human beings. Genetic factors may function as a complicated group, e.g., a protein–protein interaction network or a gene pathway, when regulating a certain phenotype or disease. Similarly, neurons and brain regions also function and harmonize in a connected network when performing a specific brain function.10 Therefore, distance CCA, which seeks the optimal combinations of features with the strongest cross-data associations, might be superior in detecting group–group nonlinear associations between brain imaging scans and genetic factors.
Data Collection and Preprocessing
The Philadelphia Neurodevelopmental Cohort (PNC)15 is a large-scale collaborative study between the Brain Behavior Laboratory at the University of Pennsylvania and the Children’s Hospital of Philadelphia. The data include fMRI and SNP data of adolescents aged from 8 to 21 years. The fMRI data were collected during a resting state from 857 subjects. After the collection of raw fMRI data, SPM1216 was used to conduct motion correction, spatial normalization, spatial smoothing with a 3×3-mm Gaussian kernel, and multiple regression to mitigate the influence of motion. Finally, 264 ROIs (containing 21,384 voxels) were extracted based on the Power coordinates17 with a sphere radius parameter of 5 mm. SNP data were collected from 7863 subjects based on four platforms: Illumina Human610Quadv1, HumanHap550v1, HumanHap550v3, and HumanOmniExpress. SNPs with too many missing values were deleted, and the remaining missing values were imputed using PLINK.18,19 Then, the SNPs within gene bodies were kept, resulting in 95,639 SNPs.
In order to implement distance CCA, the subjects having both fMRI and SNP data were further extracted, resulting in 855 subjects. For fMRI data, the stimulus-on versus stimulus-off contrast was obtained from the raw resting-state time series data. To find the interactions that are more related to mental disorders, SNPs located in genes associated with brain disorders were kept, where the brain disorders included schizophrenia, bipolar disorder, depression, attention-deficit/hyperactivity disorder, and post-traumatic stress disorder. Finally, 736 genes containing 21,487 SNPs were left for further analysis.
When applied to detect the group associations between fMRI and SNPs, distance CCA identified a subset of 45 genes and a subset of 15 ROIs that were strongly correlated. The distance correlation between the identified ROIs and genes was 0.2047, with a $p$-value calculated based on Eq. (6). In comparison, the largest single ROI–gene distance correlation was 0.1759, with a less significant $p$-value. This demonstrated that distance covariance-based CCA can find a pair of variable groups with an enhanced distance correlation and significance level. The lists of the identified genes and ROIs are in Tables 1 and 2, respectively. The locations of the identified ROIs are further visualized in Fig. 5 using the BrainNet Viewer toolbox.14,20
Table 1 The genes identified by DCCA.
Table 2 The identified brain ROIs. X, Y, Z represent ROI coordinates in the Montreal Neurological Institute (MNI) space.
|X||Y||Z||ROI name||Suggested system|
|13||75||Postcentral gyrus||Sensory/somatomotor hand|
|29||71||Precentral gyrus||Sensory/somatomotor hand|
|44||57||Precentral gyrus||Sensory/somatomotor hand|
|75||Precentral gyrus||Sensory/somatomotor hand|
|66||25||Precentral gyrus||Sensory/somatomotor mouth|
|65||20||Superior temporal gyrus||Auditory|
|13||55||38||Superior frontal gyrus||Default mode|
|55||39||Superior frontal gyrus||Default mode|
|6||64||22||Medial frontal gyrus||Default mode|
|65||Middle temporal gyrus||Default mode|
|52||7||Middle temporal gyrus||Default mode|
|51||17||Superior frontal gyrus||Salience|
After that, gene enrichment analysis was conducted to reveal the underlying biological functions of the identified genes. Ten pathways were selected by screening on the $q$-value (the $q$-value represents the multiple-testing-corrected $p$-value), and the pathways together with their corresponding $p$- and $q$-values are listed in Table 3. The $p$-values are calculated using the hypergeometric test based on the numbers of genes in the particular biological pathway and in the identified gene set. The $q$-values are then calculated by correcting the $p$-values with a multiple testing correction based on the false discovery rate method (rather than, e.g., Bonferroni correction8). Among the identified pathways, the pathways “neurodegenerative diseases,” “oxidative damage,” and “deregulated CDK5 triggers multiple neurodegenerative pathways in Alzheimer’s disease models” have been reported to be related to neuron activities and brain development. The pathway “neurodegenerative diseases” is related to the death of neurons and corticobasal degeneration, which might further lead to progressive dysfunction in the brain and a number of mental disorders.21 The pathway “oxidative damage,” which is related to cell signaling, may lead to cell damage and the death of neurons.22 It may be related to the pathogenesis of several neurodegenerative diseases, including Parkinson’s disease,23 depression,24 and Alzheimer’s disease.25 For the pathway “deregulated CDK5 triggers multiple neurodegenerative pathways in Alzheimer’s disease models,” abnormal CDK5 may result in unregulated activation of the cell cycle,26 which might further lead to the death of neurons.27 Mental disorders, such as Alzheimer’s disease, may occur if CDK5 is deregulated.28 The interactions of the pathways “neurodegenerative diseases” and “deregulated CDK5 triggers multiple neurodegenerative pathways in Alzheimer’s disease models” are visualized in Figs. 6 and 7, respectively.
Figure 6 was plotted using Cytoscape,29 an open source platform for visualizing complex networks. Figure 7 was generated using the Reactome pathway database.30
Table 3 Gene enrichment analysis of the identified genes. Q-values represent multiple-testing-corrected p-values.
|Pathway||Source||p-value||q-value|
|Chk1/Chk2(Cds1)-mediated inactivation of cyclin B:Cdk1||Reactome||0.00032||0.012|
|Activation of BAD and translocation to mitochondria||Reactome||0.00044||0.013|
|Deregulated CDK5 triggers multiple neurodegenerative pathways in Alzheimer’s disease models||Reactome||0.00063||0.013|
|Activation of BH3-only proteins||Reactome||0.0018||0.023|
|Class I PI3K signaling events mediated by Akt||PID||0.0024||0.027|
|LKB1 signaling events||PID||0.0036||0.029|
|Intrinsic pathway for apoptosis||Reactome||0.0036||0.029|
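The way such $p$- and $q$-values are typically obtained can be sketched as follows. This is a generic illustration, not the enrichment tool used in the study: the background size, pathway size, and overlap are hypothetical numbers (only the gene-set size of 45 comes from the text), and the Benjamini-Hochberg procedure is assumed as the false discovery rate method.

```python
from math import comb

def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` genes from `population`, `successes` of which are in the pathway."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                  # walk from largest to smallest p-value
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals

# Hypothetical example: 20,000 background genes, a 50-gene pathway,
# and 3 pathway genes among the 45 identified genes.
p_enrich = hypergeom_sf(3, population=20000, successes=50, draws=45)
print(f"enrichment p-value: {p_enrich:.2e}")

# Hypothetical p-values for four pathways and their q-values.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
print("q-values:", [round(qv, 4) for qv in example_q])
```

The monotone running minimum is what makes several pathways share the same $q$-value, as also seen for some rows of Table 3.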
For brain imaging data, as shown in Table 2, the majority (13/15) of the detected ROIs are from three brain subdomains: the sensorimotor network (SM), the DMN, and the visual network (VIS). SM is related to the coordination of the body when performing motor tasks.31 DMN is the dominant network when subjects are in the resting state, mind-wandering, or not involved in a specific task. Dysfunction within the DMN has been associated with several mental disorders,32,33 e.g., schizophrenia, depression, and autism. Associations between DMN and genetic factors exist according to a multivariate study of schizophrenia subjects scanned during the resting state.34
Functional Connectivity Between Brain Subnetworks
For the brain FC study, we selected five brain subnetworks or subdomains and then applied distance CCA to investigate the connections between each subnetwork pair and to study the age effects on the connections. Resting-state fMRI was used in this experiment, and data were preprocessed using the Group ICA of fMRI Toolbox (GIFT)35 for independent component analysis (ICA).36 The five brain subnetworks include SM, VIS, the cognitive control network (CCN), the auditory network (AUD), and DMN, and the corresponding locations in the brain are shown in Fig. 8.
In order to investigate both linear and nonlinear connections of the brain, we applied both linear CCA and distance CCA to the PNC data, and the results are shown in Fig. 9. The results were based on a 10-fold cross-validation, in which each time five folds were used as training data and the remaining five folds were used as testing data. It is noteworthy that the metric of distance correlation is different from that of linear Pearson correlation, e.g., distance correlation lies in $[0, 1]$ whereas Pearson correlation lies in $[-1, 1]$. Nevertheless, distance correlation reflects the relative strength of the dependence between two variables. From Fig. 9, strong linear connections are detected between each pair of the SM, VIS, CCN, and AUD networks, while the linear connections between DMN and the other networks are weak. Research32 has shown that DMN may have strong intrinsic connections while the connections between DMN and the remaining networks are weak in the resting state, which is consistent with the result of linear CCA. In comparison, distance CCA detected stronger DMN-SM, DMN-CCN, and DMN-AUD connections, which might be a new finding.
Age Effects on Brain FC
It is of interest to investigate how brain connectivity changes during adolescence and across different age stages, e.g., between children and young adults, which may further contribute to the study of normal and pathological brain development. Three age groups, 8 to 11 years, 13 to 16 years, and 18 to 22 years, were selected, and distance CCA was then applied to each age group to analyze brain network connections. Subjects aged 12 and 17 years were not included in the experiments in order to get a clear boundary between the age groups. The connections between brain subnetworks for each age group are shown in Fig. 10. From Fig. 10, the connection patterns differ between age groups. For instance, the connections between different brain networks are relatively weak at age 8 to 11 but become relatively stronger at age 13 to 16 and age 18 to 22. This suggests that different brain regions become increasingly connected during adolescence, which may be a result of the training and development of the brain during multiple types of brain activities. Moreover, the connections between CCN and SM appear weak across all three age groups, suggesting that this connection may remain weak throughout adolescence.
Discussion and Conclusion
In this work, we proposed a new model, DCCA, which overcomes the limitation of distance correlation in detecting significant associations when the feature size is large. Conventional distance correlation analysis needs large-scale multiple testing when testing feature–feature associations simultaneously. The proposed model, DCCA, addresses the problem by searching for a combination of original features with the highest distance correlation. This is achieved by first constructing a distance kernel function and then solving an optimization problem. The ability to detect nonlinear group–group associations makes DCCA more suitable for analyzing complex multi-omics and imaging-genetic associations, in which both genetic factors and brain ROIs may work as groups when regulating a phenotype or performing a specific brain function.
When applied to the imaging-genetic association study, DCCA detected a strong correlation between a subset of genes and a subset of brain ROIs with an improved significance level. Several pathways related to neuron degeneration and mental disorders were enriched in the identified genes after gene enrichment analysis, which supports the biological significance of our findings. In addition, DCCA found several mental disorder-related brain networks that had been reported in the existing literature. Experiments on brain connectivity also yielded several new findings using DCCA. The DMN, which is considered to be distinct from other brain domains/networks, may have strong nonlinear connections with other brain networks according to the results of DCCA. When applied to each age group, DCCA revealed that the youngest group (8 to 11 years) exhibits weak connections between brain networks, while the connections become stronger at older age stages (13 to 16 and 18 to 22 years), which may be a result of brain development. The discoveries of imaging genetic associations and brain connections support the performance of DCCA. Besides the examples in this study, DCCA may find more applications in multi-imaging and multi-omics studies, where identifying correlations between multiple datasets is a common challenge.
The authors have no relevant financial interests in the paper and no other potential conflicts of interest to disclose.
The authors would like to thank the NIH (P30 GM122734, R01 GM109068, R01 MH104680, R01 MH107354, P20 GM103472, R01 EB020407, and R01 EB006841) and NSF (#1539067) for partial support.
Wenxing Hu received his BS degree in applied mathematics from Xi’an Jiaotong University, China, 2011. Now, he is a PhD student in biomedical engineering, Tulane University, USA. His research interests include machine learning and deep learning, dimension reduction, correlation analysis, and multi-omics data integration.
Aiying Zhang received her BS degree in statistics from the University of Science and Technology of China. She is now a PhD student in the Department of Biomedical Engineering, Tulane University. Her research interests mainly focus on graphical models (directed and undirected) with applications in multi-omics data integration.
Biao Cai received his BS and MS degrees in biomedical engineering from Tianjin University, China, in 2013 and 2016, respectively. Now, he is a PhD student in biomedical engineering, Tulane University, USA. His research interests include dictionary learning and time-varying graphical LASSO, dynamic function network connectivity, and brain development.
Vince Calhoun is currently president for the MRN, and a distinguished professor in the ECE Department, University of New Mexico. He has published over 600 journal articles. His work includes ICA-based fMRI analysis, and data fusion of multimodal-imaging and genetics data. He leads an NIH P20 COBRE grant on multimodal imaging of mental disorders and an NSF EPSCoR grant focused on brain imaging and epigenetics of adolescent development. He is a fellow of the American Association for the Advancement of Science, the American Institute of Biomedical and Medical Engineers, the American College of Neuropsychopharmacology, and the International Society of Magnetic Resonance in Medicine.
Yu-Ping Wang received his BS degree from Tianjin University in 1990, and his MS and PhD degrees from Xian Jiaotong University in 1993 and 1996, respectively. He is currently a professor of biomedical engineering at Tulane University. His research interests include computer vision, signal processing, and machine learning with applications to biomedical imaging and bioinformatics, where he has published about 200 publications. He has served on numerous NSF/NIH review panels, and as editor for several journals.