A new statistical method offers a more efficient way to uncover biologically meaningful changes in genomic data that span multiple conditions, such as cell types or tissues.
Studies of the whole genome produce vast amounts of data, ranging from millions of individual DNA sequences, to information about where and how many of the thousands of genes are expressed, to the location of functional elements throughout the genome. Due to the volume and complexity of the data, comparing different biological conditions or studies performed by different laboratories can be statistically difficult.
“The difficulty with multiple conditions is analyzing the data together in a way that is both statistically meaningful and computationally efficient. Existing methods are computationally intensive or produce results that are difficult to interpret biologically. We have developed a method called CLIMB that improves on existing methods, is computationally efficient, and provides biologically interpretable results. We test the method on three types of genomic data collected from hematopoietic cells, which are related to blood stem cells, but the method could also be used in analyses of other ‘omic’ data.”
Qunhua Li, Associate Professor of Statistics, Penn State
The researchers describe the CLIMB (Composite LIkelihood eMpirical Bayes) method in an article appearing online Nov. 12 in the journal Nature Communications.
“In experiments where there is so much information from relatively few people, it helps to use information as efficiently as possible,” said Hillary Koch, a graduate student at Penn State at the time of the research and now a senior statistician at Moderna. “There are statistical advantages to being able to look at everything together and even using information from related experiments. CLIMB enables us to do just that.”
The CLIMB method uses principles from two traditional techniques to analyze data across multiple conditions. One technique uses a series of pairwise comparisons between conditions, but becomes increasingly difficult to interpret as additional conditions are added.
Another technique combines each subject’s pattern of activity across conditions into an “association vector.” For a gene, for example, the vector records whether it is upregulated, downregulated, or unaltered in each of many cell types. The association vector directly reflects the pattern of condition specificity and is easy to interpret. However, because many different combinations are possible even with just a few conditions, the calculations are extremely computationally intensive. To get around this problem, methods that rely on this approach alone make simplifying assumptions about the data, and those assumptions are not always correct.
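To give a sense of why the number of combinations grows so quickly, here is a minimal Python sketch. It assumes, as in the example above, that each condition takes one of three labels (down, unchanged, up); the function names are hypothetical and are not taken from the CLIMB software.

```python
from itertools import product

# Assumed encoding for illustration: -1 = downregulated,
# 0 = unaltered, +1 = upregulated.
LABELS = (-1, 0, 1)


def association_vectors(n_conditions):
    """Enumerate every possible association vector for n_conditions."""
    return list(product(LABELS, repeat=n_conditions))


def count_association_vectors(n_conditions):
    """Count the possible vectors without enumerating them."""
    return len(LABELS) ** n_conditions


# Three conditions already allow 27 distinct patterns.
small = association_vectors(3)
print(len(small))

# With 17 conditions the count is 3**17, far too many to list one by one.
print(count_association_vectors(17))
```

Under this encoding the count is three raised to the number of conditions, which is why a handful of cell types is manageable but dozens are not.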
“CLIMB uses aspects of both of these approaches,” said Koch. “We ultimately analyze association vectors, but we first use pairwise analysis to identify in advance the patterns that are likely to exist. Rather than making assumptions about the data, we use the pairwise information to eliminate combinations that the data do not strongly support. This dramatically reduces the space of possible patterns across conditions that would otherwise make the calculations so intense.”
After compiling the reduced set of possible association vectors, the method groups subjects that follow the same pattern across conditions. For example, the results could show researchers sets of genes that are collectively upregulated in some cell types but downregulated in others.
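The two steps described above, pruning the candidate patterns with pairwise information and then grouping subjects by pattern, can be illustrated with a toy Python sketch. This is not the CLIMB algorithm, which relies on statistical model fitting. Here the pairwise results are simply assumed to be given as sets of supported label pairs, and `pairwise_support`, `prune_candidates`, `group_by_pattern`, and the gene names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product

# Assumed encoding: -1 = down, 0 = unaltered, +1 = up.
LABELS = (-1, 0, 1)


def prune_candidates(n_conditions, pairwise_support):
    """Keep only vectors whose every pair of entries is supported.

    pairwise_support maps (i, j), with i < j, to the set of
    (label_i, label_j) pairs that a pairwise analysis found plausible.
    This toy version enumerates all vectors, so it only suits small
    n_conditions.
    """
    pairs = list(combinations(range(n_conditions), 2))
    kept = []
    for vec in product(LABELS, repeat=n_conditions):
        if all((vec[i], vec[j]) in pairwise_support[(i, j)] for i, j in pairs):
            kept.append(vec)
    return kept


def group_by_pattern(assignments):
    """Group subjects (e.g. genes) that share an association vector."""
    groups = defaultdict(list)
    for subject, vec in assignments.items():
        groups[tuple(vec)].append(subject)
    return dict(groups)


# Hypothetical pairwise results for three conditions.
pairwise_support = {
    (0, 1): {(1, 1), (0, 0)},
    (0, 2): {(1, -1), (0, 0)},
    (1, 2): {(1, -1), (0, 0)},
}
candidates = prune_candidates(3, pairwise_support)

# Hypothetical assignment of genes to the surviving patterns.
assignments = {"geneA": (1, 1, -1), "geneB": (0, 0, 0), "geneC": (1, 1, -1)}
groups = group_by_pattern(assignments)
```

In this example only two of the 27 possible patterns survive the pruning, and the two genes that are up in the first two conditions and down in the third end up in the same group.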
The researchers tested their method on data collected from experiments using a technology called RNA-seq, which can measure the amount of RNA made from all of the genes expressed in a cell, to investigate whether specific genes help determine what cell types the hematopoietic stem cell eventually transforms into.
“Compared to the popular pairwise method, our results are more specific,” Li said. “Our gene list is more concise and biologically relevant.”
While the traditional pairwise method identified six to seven thousand genes of interest, CLIMB produced a much narrower list of two to three thousand genes, with at least a thousand of those genes identified in both analyses.
“The different blood cell types have a variety of functions (some become red blood cells and some become immune cells), and we wanted to know which genes were more likely to be involved in determining each cell type,” said Ross Hardison, T. Ming Chu Professor of Biochemistry and Molecular Biology at Penn State. “The CLIMB approach pulled out some important genes; some of them we already knew, and others add to our knowledge.”
Researchers also used CLIMB on data obtained from another experimental technology, ChIP-seq, which can identify where along the genome certain proteins bind to DNA. They studied whether the binding of a protein called CTCF, a transcription factor that helps establish interactions required for gene regulation in the nucleus, varies across 17 cell populations all derived from the same hematopoietic stem cell. The CLIMB analysis identified different categories of CTCF-bound sites, some showing the role of this transcription factor in all blood cells and others showing a role in specific cell types.
Finally, the team examined data from another experimental technology called DNase-seq, which can identify the locations of regulatory regions, to compare the accessibility of chromatin, a complex of DNA and proteins, in 38 human cell types.
“For all three tests, we wanted to see if our results had biological relevance, so we compared our results to independent data, such as studies of high-throughput sequencing of histone modifications and transcription factor footprinting,” said Koch. “Our results are consistent with each of these other methods. Next, we want to improve the computational speed of our method and increase the number of conditions it can handle. For example, chromatin accessibility data is available for many more cell types, so we’d like to expand the scope of CLIMB.”
In addition to Li, Koch, and Hardison, the research team includes Cheryl Keller, Guanjue Xiang, and Belinda Giardine from Penn State, Feipeng Zhang from Xi’an Jiaotong University in China, and Yicheng Wang from the University of British Columbia in Canada. This research was supported by the National Institutes of Health, including the National Institute of General Medical Sciences, the National Human Genome Research Institute, and the National Institute of Diabetes and Digestive and Kidney Diseases.
Koch, H., et al. (2022) CLIMB: High-dimensional association detection in large-scale genomic data. Nature Communications. doi.org/10.1038/s41467-022-34360-z.