Statistical Design and the Analysis of Gene Expression Microarray Data

Cui, Xinping
University of California - Davis
The goal of this project is to develop computationally efficient statistical and data mining methods and software for integrating, modeling and analyzing biological data generated at the genome scale (denoted as omic data). These methods are intended to facilitate the understanding of how genetics and environment interact to influence phenotypes of crops and plants, how the dynamic mechanisms of molecular networks control gene, environment and phenotype cross talk, and how microbes interact with the environment and/or host crops and plants.

The specific objectives and expected outputs are as follows: (1) using "omic" phenotypic data and genome-wide genetic markers to build multi-level multivariate statistical models for genomic selection and genome-wide association studies. Ancillary biological information, such as linkage disequilibrium, the distribution of QTL, marker density and subpopulation structure, will also be incorporated in the model building process. These models can then be used to predict phenotypes based solely on whole-genome genotypes in related populations, facilitating selection for major genes in plant breeding. This work will also advance several domains in statistics, including variable selection, multiple hypothesis testing, and Bayesian and empirical Bayes methods.
(2) using known biological mechanisms to develop partial differential equations (PDE) for modeling the molecular networks that control plant growth processes at temporal and spatial resolutions relevant to the dynamics of gene, environment and phenotype cross talk. Parameters in these PDE models will be estimated from experimental data. Such predictive modeling of plant growth can be used to predict how a plant will react to environmental stress and/or to perturbations in its genome, allowing us to change internal processes in plants to alter their biological behavior. This work will also advance several domains in mathematics and statistics, including numerical solution of nonlinear differential equations with algebraic constraints, optimization methods, and nonparametric estimation theory and application.
(3) developing tri-clustering of trivariate scatter surfaces and extending it to multi-dimensional clustering of multivariate data for elucidating interactions between microbes and their environment and/or host. This work will also advance multivariate statistics theory and application, unsupervised data mining, and data visualization.

Non-Technical Summary:
It is increasingly clear that the traditional methods of crop improvement are not sufficient to provide enough staple food grains for the constantly growing world population. This situation is projected to worsen by 2050, especially in the context of climate change. Fresh water shortages and the loss of natural forests to agriculture are predicted to grow substantially worldwide in the coming decades. An array of innovative "omic" tools, comprising genomics, proteomics, metabolomics, phenomics, epigenomics and metagenomics, can make promising contributions to solving these problems, and biologically meaningful, computationally efficient statistical methods are indispensable components of those tools. While omics data are accumulating at an exponential rate, their interpretation lags far behind. Omics data, inherently associated with extremely large volume, high dimensionality and significant noise, have now outstripped the capability of traditional statistical methods and data visualization tools. The project will address the need for timely development of automated approaches to omics data analysis.

This project will help accelerate discovery and enhance first-hand knowledge of how plants function, how various processes control plant growth and development, and how plants respond to environmental stress. With such knowledge, we can provide leadership in developing and implementing novel transgenic control strategies in crops and plants, in applying the plants' own defense mechanisms to enhance resistance to environmental stress and/or diseases, and in rapidly developing new disease-resistant crop varieties with improved quality and productivity. This work will also help advance knowledge of microbe-host and microbe-environment interactions. With such knowledge, we can provide leadership in the development of farming practices that take advantage of the natural alliances among microbes and plants.

For objective (1), a multivariate linear mixed effects model will be developed for multivariate genome-wide association studies (GWAS) and genomic selection, which utilize multivariate phenotypes (phenomics). First, I will investigate dimension reduction techniques such as principal component analysis (PCA) to identify the number of dimensions of the phenotypes that hold significant genetic variation. Second, I will investigate variable selection methods such as least angle regression (LARS) for marker selection. Third, penalized approaches and priors will be incorporated in LARS to account for linkage disequilibrium (LD), the distribution of QTLs, marker density and subpopulation structure. For example, a penalty function can be used to penalize the difference between the genetic effects at adjacent, highly correlated SNPs, and subpopulation structure can be incorporated through prior distributions. Fourth, statistical estimation and multiple hypothesis testing procedures will be developed for the final multivariate linear mixed effects model.
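
As an illustration only, one common way to write a penalty on the difference between effects at adjacent, correlated SNPs is a weighted fused-lasso criterion. The form below is an assumption made for exposition, not the project's stated model: it treats a single trait, omits the random effects of the mixed model, and the symbols y (phenotype vector for n individuals), X (n by p marker matrix), beta (marker effects), r_{j,j+1} (sample correlation between adjacent SNPs j and j+1, used as an LD weight) and the tuning parameters lambda_1, lambda_2 are all introduced here.

```latex
\hat{\beta}
  = \arg\min_{\beta \in \mathbb{R}^{p}}
    \frac{1}{2}\,\lVert y - X\beta \rVert_2^2
    + \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert
    + \lambda_2 \sum_{j=1}^{p-1} \lvert r_{j,j+1} \rvert \,
      \lvert \beta_j - \beta_{j+1} \rvert ,
  \qquad \lambda_1, \lambda_2 \ge 0 .
```

Under this sketch, the first penalty term selects markers, while the second shrinks the effects of neighboring SNPs toward each other more strongly when their LD is high; setting lambda_2 = 0 recovers the ordinary lasso, whose solution path LARS can compute.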

For objective (2), we will develop a second-order Integro-Differential Equation (IDE) containing two parameters to model pollen tube tip growth. We will first prove the existence and uniqueness of the solution of the IDE based on its connection to the semilinear elliptic equation. For the specific form of IDE considered in this work, we will develop a parametric parameter estimation method. We will then develop a general Orthogonal Condition (OC) nonparametric method that can be used to estimate parameters for various forms of IDE.

For objective (3), we will first investigate clustering of 3-dimensional scatter surfaces that reside in a trivariate cube. The quality index we used for bi-clustering of scatter plots can be used to compare any two trivariate distributions corresponding to two cells in the cube. We can therefore extend the bi-clustering algorithm to tri-clustering by starting with one of the three dimensions and finding seeds along that dimension, then expanding the seeds along the second dimension to generate bi-cluster seeds, and finally expanding the bi-cluster seeds along the third dimension. The same procedure extends naturally to multi-dimensional clustering of multivariate data with more than three dimensions. The special challenge we will face, however, is the visualization of the clustering results. Our strategy will therefore be to perform the multidimensional clustering but view only two or three dimensions at a time. We will develop a few painting indices to allow the clustering results to be visualized with heat maps.

Overall, the soundness of each statistical procedure will be evaluated both by statistical theory and by computer simulation. The final statistical framework will then be applied to experimental data for validation and refinement. Statistical software will be disseminated freely and promptly to research communities through ( oftware). I will also actively interact with AES-CE colleagues and CE specialists for method validation and modification.
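
The seed-and-expand procedure described for objective (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's software: the function names, the cube layout (cube[i][j][k] holds a list of 3-D points) and the tolerance are hypothetical, and the distance between cell centroids is only a stand-in for the data-depth-based quality index, whose exact form is not given here.

```python
from itertools import product


def centroid(points):
    """Mean of a list of 3-D points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(3))


def dissimilarity(cell_a, cell_b):
    """Stand-in quality index (assumption): Euclidean distance between centroids."""
    ca, cb = centroid(cell_a), centroid(cell_b)
    return sum((a - b) ** 2 for a, b in zip(ca, cb)) ** 0.5


def grow_tricluster(cube, seed, tol):
    """Grow one tri-cluster from a seed cell (i0, j0, k0).

    cube[i][j][k] is a list of 3-D points (one scatter per cell).
    Returns the index sets kept along the three dimensions.
    """
    i0, j0, k0 = seed
    ref = cube[i0][j0][k0]
    n_i, n_j, n_k = len(cube), len(cube[0]), len(cube[0][0])

    def close(i, j, k):
        return dissimilarity(cube[i][j][k], ref) <= tol

    # Step 1: find seeds along the first dimension.
    rows = [i for i in range(n_i) if close(i, j0, k0)]
    # Step 2: expand along the second dimension to get bi-cluster seeds.
    cols = [j for j in range(n_j) if all(close(i, j, k0) for i in rows)]
    # Step 3: expand the bi-cluster seeds along the third dimension.
    layers = [
        k for k in range(n_k)
        if all(close(i, j, k) for i, j in product(rows, cols))
    ]
    return rows, cols, layers


def make_cell(center):
    """Two toy points scattered around (center, center, center)."""
    return [(center, center, center), (center + 0.2, center, center)]


# Toy 3 x 3 x 2 cube: cells with i in {0, 1}, j in {0, 2}, k = 0 share one
# distribution; every other cell is far away.
demo_cube = [
    [
        [make_cell(0.0) if (i in (0, 1) and j in (0, 2) and k == 0)
         else make_cell(10.0)
         for k in range(2)]
        for j in range(3)
    ]
    for i in range(3)
]

result = grow_tricluster(demo_cube, seed=(0, 0, 0), tol=1.0)
print(result)  # ([0, 1], [0, 2], [0])
```

Comparing every cell against the seed cell keeps the sketch short; a fuller version would compare cells pairwise and try several seeds. Adding a fourth dimension would only require one more expansion step of the same form.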

2012/01 TO 2012/12
OUTPUTS: Activity: (a) My lab has developed a new data-depth-based co-clustering statistical method and theory for biclustering bivariate scatter plots in bioinformatics, such as microarray gene expression data and high-throughput proteomics data. We also proposed novel painting metrics and constructed heat maps to allow visualization of the co-clusters. (b) My lab collaborated with Dr. Jeske and Dr. Mark Hoddle on a co-clustering method that clusters spatial data using a generalized linear mixed model, with application to integrated pest management. (c) My lab has extended our GeMS SNP caller for single-sample high-throughput sequencing (HTS) data to the MetaGeMS SNP caller for multiple-sample HTS data. We have been performing simulation studies to evaluate the sensitivity and specificity of our method. (d) My lab has developed a new statistical method with an ultra-efficient computing algorithm for robust and accurate base calling on HTS data, which in turn will improve downstream SNP calling accuracy.
Event: (a) I gave an invited talk, "Base calling and SNP calling on next generation sequencing data", at the summer 2012 Joint Statistical Meetings. (b) I organized an invited session, "Statistical and Computational Challenges in Metagenomic Analysis of Next-generation Sequencing Data", at the summer 2012 Joint Statistical Meetings. (c) I organized an invited session, "Recent advances in statistical methods for genetic and genomic studies", at the summer 2012 Second Joint Biostatistics Symposium. (d) I gave a workshop, "Polymorphism detection using microarray data and next generation sequencing data with application to eQTL mapping and genome-wide association studies", at Sun Yat-sen University and Nanjing Agricultural University in China in summer 2012. (e) I gave a poster presentation, "Co-clustering Scatter Plots Using Data Depth Measures", at the Sixth International Workshop on Machine Learning in Systems Biology, sponsored by the International Society for Computational Biology. (f) My student received travel funding and gave an oral presentation, "GeMS: HTS SNP calling which accounts for sample preparation errors", at the European ISCB Student Council Symposium, sponsored by the International Society for Computational Biology.
Service: (a) We collaborated with Dr. Harkamal Walia from the Department of Agronomy and Horticulture at the University of Nebraska-Lincoln and applied the robustified projection pursuit method developed by my lab to detect single feature polymorphisms in wheat. (b) We continued collaborating with the UC Davis Genome Center and performed microarray-based single position polymorphism (SPP) detection on lettuce. (c) We continued collaborating with Prof. Hailing Jin from the UCR Department of Plant Pathology to study the function of the targets of pathogen-regulated small RNAs via microarray and RNA-seq analysis.
Products: We developed the new statistical package "scatter-plot-biclustering", which has been disseminated through the website ftware/scatter-plot-biclustering
Dissemination: The workshops given at Sun Yat-sen University and Nanjing Agricultural University in China in summer 2012 are outreach activities.
PARTICIPANTS: Individuals: Gabriel Murillo: Role: graduate student researcher. He is the key developer of the GeMS and MetaGeMS packages and performed all the simulation studies and real data analysis. Training and Professional Development: Gabriel received a BS in Mathematics and is currently working on his Ph.D. in applied statistics. This research provided him with multi-disciplinary training experience in statistics, biology and computer science.
Zhanpan: Role: graduate student researcher. He is the key developer of the co-clustering method for bivariate scatter plot data matrices and of the model-based spatial co-clustering method. Training and Professional Development: This research provides Zhanpan with an outstanding interdisciplinary training opportunity involving modern statistics, biology and computer science.
Partner Organization: UC Davis Genome Center and Seed Biotechnology Center.
Collaborator: James Borneman, Department of Plant Pathology; Hailing Jin, Department of Botany and Plant Sciences; Mark Hoddle, Department of Entomology; Harkamal Walia, Department of Agronomy and Horticulture, University of Nebraska-Lincoln.
TARGET AUDIENCES: Nothing significant to report during this reporting period.
PROJECT MODIFICATIONS: This project is going to expire in September 2013. A renewal will be submitted that extends the work from developing statistical tools for microarray data analysis to developing statistical tools for omics data analysis.

IMPACT: Change in knowledge: (1) Existing SNP callers for high-throughput sequencing data consider only allele errors due to base calling and alignment. However, our close examination revealed that genomic sample preparation errors can also have a significant impact on the power and accuracy of SNP detection on single-sample HTS data, and the problem becomes worse for multiple-sample HTS data. Therefore, any SNP caller should carefully consider as many sources of error as possible. (2) In statistics, data depth has traditionally been used in outlier detection and classification; we brought a new application of data depth to the clustering of high-dimensional data. (3) Model-based spatial co-clustering has not been well studied in integrated pest management. Combining the spatial co-clustering technique with a statistical inference method makes the assessment of pest density more accurate.
Change in action: (1) Initially, we used popular tools such as SAMtools, GATK and FreeBayes for SNP calling on high-throughput sequencing data. Since the successful development of our new SNP caller, we have been using it for all of our SNP calling projects. We have also worked to disseminate our new knowledge and product, and so change other users' practice, by publishing papers, giving presentations at conferences and making our software freely available online. (2) Realizing that data depth has a new application in clustering, we developed new statistical theory for this application. (3) With model-based spatial co-clustering, locally infested regions within crop orchards can be identified. Treating only the infested regions instead of the whole orchard can reduce pest management costs and minimize potential hazards to the environment.

Funding Source
Nat'l. Inst. of Food and Agriculture
Whole Genome Sequencing