ADAPTIVE REPRODUCIBLE HIGH-DIMENSIONAL NONLINEAR INFERENCE FOR BIG BIOLOGICAL DATA

Objective

Big data is now ubiquitous in every field of modern scientific research. Many contemporary applications, such as the recent National Microbiome Initiative (NMI), greatly demand highly flexible statistical machine learning methods that can produce both interpretable and reproducible results. Thus, it is of paramount importance to identify crucial causal factors that are responsible for the response from a large number of available covariates, which can be statistically formulated as false discovery rate (FDR) control in general high-dimensional nonlinear models. Despite the widespread application of shotgun metagenomic studies, most existing investigations concentrate on the study of bacterial organisms. However, viruses and virus-host interactions play important roles in controlling the functions of microbial communities. In addition, viruses have been shown to be associated with complex diseases. Yet, investigations into the roles of viruses in human diseases are significantly underdeveloped. The objective of this proposal is to develop mathematically rigorous and computationally efficient approaches to deal with highly complex big data and to apply these approaches to solve fundamental and important biological and biomedical problems. There are four interrelated aims. In Aim 1, we will theoretically investigate the power of the recently proposed model-free knockoffs (MFK) procedure, which has been theoretically justified to control FDR in arbitrary models and arbitrary dimensions. We will also theoretically justify the robustness of MFK with respect to misspecification of the covariate distribution. These studies will lay the foundations for our developments in the other aims. In Aim 2, we will develop deep learning approaches to predict viral contigs with higher accuracy, integrate our new algorithm with MFK to achieve FDR control for virus motif discovery, and investigate the power and robustness of our new procedure. In Aim 3, we will take into account the virus-host motif interactions and adapt our algorithms and theories in Aim 2 for predicting virus-host infectious interaction status. In Aim 4, we will apply the methods developed in the first three aims to analyze the shotgun metagenomics data sets in ExperimentHub to identify viruses and virus-host interactions associated with several diseases at some target FDR level. Both the algorithms and results will be disseminated through the web. The results from this study will be important for metagenomics studies under a variety of environments.

Investigators

Fan, Yingying

Institution

University of Southern California

Start date

2018

End date

2022

Funding Source

National Institute of General Medical Sciences

Project number

1R01GM131407-01

Accession number

131407