Publications

Critical assessment of variant prioritization methods for rare disease diagnosis within the Rare Genomes Project

Published in medRxiv, 2023

A major obstacle faced by rare disease families is obtaining a genetic diagnosis. The average “diagnostic odyssey” lasts over five years, and causal variants are identified in under 50% of cases. The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing (GS) for diagnosis and gene discovery. Families are consented for sharing of sequence and phenotype data with researchers, allowing development of a Critical Assessment of Genome Interpretation (CAGI) community challenge, placing variant prioritization models head-to-head in a real-life clinical diagnostic setting.

Recommended citation: Stenton, Sarah L., et al. "Critical assessment of variant prioritization methods for rare disease diagnosis within the Rare Genomes Project." medRxiv (2023).

Download Here

The Impact of Genomic Variation on Function (IGVF) Consortium

Published in arXiv, 2023

Our genomes influence nearly every aspect of human biology from molecular and cellular functions to phenotypes in health and disease. Human genetics studies have now associated hundreds of thousands of differences in our DNA sequence (“genomic variation”) with disease risk and other phenotypes, many of which could reveal novel mechanisms of human biology and uncover the basis of genetic predispositions to diseases, thereby guiding the development of new diagnostics and therapeutics. Yet, understanding how genomic variation alters genome function to influence phenotype has proven challenging. To unlock these insights, we need a systematic and comprehensive catalog of genome function and the molecular and cellular effects of genomic variants. Toward this goal, the Impact of Genomic Variation on Function (IGVF) Consortium will combine approaches in single-cell mapping, genomic perturbations, and predictive modeling to investigate the relationships among genomic variation, genome function, and phenotypes. Through systematic comparisons and benchmarking of experimental and computational methods, we aim to create maps across hundreds of cell types and states describing how coding variants alter protein activity, how noncoding variants change the regulation of gene expression, and how both coding and noncoding variants may connect through gene regulatory and protein interaction networks. These experimental data, computational predictions, and accompanying standards and pipelines will be integrated into an open resource that will catalyze community efforts to explore genome function and the impact of genetic variation on human biology and disease across populations.

Recommended citation: Consortium, I. G. V. F. "The Impact of Genomic Variation on Function (IGVF) Consortium." arXiv preprint arXiv:2307.13708 (2023).

Download Here

Leveraging Structure for Improved Classification of Grouped Biased Data

Published in AAAI, 2023

We consider semi-supervised binary classification for applications in which data points are naturally grouped (e.g., survey responses grouped by state) and the labeled data is biased (e.g., survey respondents are not representative of the population). The groups overlap in the feature space and consequently the input-output patterns are related across the groups. To model the inherent structure in such data, we assume the partition-projected class-conditional invariance across groups, defined in terms of the group-agnostic feature space. We demonstrate that under this assumption, the group carries additional information about the class, over the group-agnostic features, with provably improved area under the ROC curve. Further assuming invariance of partition-projected class-conditional distributions across both labeled and unlabeled data, we derive a semi-supervised algorithm that explicitly leverages the structure to learn an optimal, group-aware, probability-calibrated classifier, despite the bias in the labeled data. Experiments on synthetic and real data demonstrate the efficacy of our algorithm over suitable baselines and ablative models, spanning standard supervised and semi-supervised learning approaches, with and without incorporating the group directly as a feature.

Recommended citation: Zeiberg, Daniel, Shantanu Jain, and Predrag Radivojac. "Leveraging structure for improved classification of grouped biased data." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37. No. 9. 2023.

Download Here

Multi-objective prioritization of genes for high-throughput functional assays towards improved clinical variant classification

Published in Pacific Symposium on Biocomputing, 2023

The accurate interpretation of genetic variants is essential for clinical actionability. However, a majority of variants remain of uncertain significance. Multiplexed assays of variant effects (MAVEs) can help provide functional evidence for variants of uncertain significance (VUS) at the scale of entire genes. Although the systematic prioritization of genes for such assays has been of great interest from the clinical perspective, existing strategies have rarely emphasized this motivation. Here, we propose three objectives for quantifying the importance of genes, each satisfying a specific clinical goal: (1) Movability scores to prioritize genes with the most VUS moving to non-VUS categories, (2) Correction scores to prioritize genes with the most pathogenic and/or benign variants that could be reclassified, and (3) Uncertainty scores to prioritize genes with VUS for which variant pathogenicity predictors used in clinical classification exhibit the greatest uncertainty. We demonstrate that existing approaches are sub-optimal when considering these explicit clinical objectives. We also propose a combined weighted score that optimizes the three objectives simultaneously and finds optimal weights to improve over existing approaches. Our strategy generally results in better performance than existing knowledge-driven and data-driven strategies and yields gene sets that are clinically relevant. Our work has implications for systematic efforts that aim to iterate between predictor development, experimentation and translation to the clinic.

Recommended citation: Chen, Yile, et al. "Multi-objective prioritization of genes for high-throughput functional assays towards improved clinical variant classification." PACIFIC SYMPOSIUM ON BIOCOMPUTING 2023: Kohala Coast, Hawaii, USA, 3–7 January 2023. 2022.

Download Here

Classification in biological networks with hypergraphlet kernels

Published in Bioinformatics, 2020

Biological and cellular systems are often modeled as graphs in which vertices represent objects of interest (genes, proteins, drugs) and edges represent relational ties between these objects (binds-to, interacts-with, regulates). This approach has been highly successful owing to the theory, methodology and software that support analysis and learning on graphs. Graphs, however, suffer from information loss when modeling physical systems due to their inability to accurately represent multi-object relationships. Hypergraphs, a generalization of graphs, provide a framework to mitigate information loss and unify disparate graph-based methodologies. We present a hypergraph-based approach for modeling biological systems and formulate vertex classification, edge classification and link prediction problems on (hyper)graphs as instances of vertex classification on (extended, dual) hypergraphs. We then introduce a novel kernel method on vertex- and edge-labeled (colored) hypergraphs for analysis and learning. The method is based on exact and inexact (via hypergraph edit distances) enumeration of hypergraphlets; i.e., small hypergraphs rooted at a vertex of interest. We empirically evaluate this method on fifteen biological networks and show its potential use in a positive-unlabeled setting to estimate the interactome sizes in various species.

Recommended citation: Lugo-Martinez, Jose, et al. "Classification in biological networks with hypergraphlet kernels." Bioinformatics 37.7 (2021): 1000-1007.

Download Here

Github Repo

Fast Nonparametric Estimation of Class Proportions in the Positive-Unlabeled Classification Setting

Published in AAAI, 2020

Estimating class proportions has emerged as an important direction in positive-unlabeled learning. Well-estimated class priors are key to accurate approximation of posterior distributions and are necessary for the recovery of true classification performance. While significant progress has been made in the past decade, there remains a need for accurate strategies that scale to big data. Motivated by this need, we propose an intuitive and fast nonparametric algorithm to estimate class proportions. Unlike any of the previous methods, our algorithm uses a sampling strategy to repeatedly (1) draw an example from the set of positives, (2) record the minimum distance to any of the unlabeled examples, and (3) remove the nearest unlabeled example. We show that the point of sharp increase in the recorded distances corresponds to the desired proportion of positives in the unlabeled set and train a deep neural network to identify that point. Our distance-based algorithm is evaluated on forty datasets and compared to all currently available methods. We provide evidence that this new approach results in the most accurate performance and can be readily used on large datasets.
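The three-step sampling loop described in the abstract can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the function name `distance_curve`, the use of Euclidean distance, sampling positives with replacement, and the synthetic two-cluster data are all assumptions made for this example. The neural network that the paper trains to locate the point of sharp increase is omitted; the sketch only produces the sequence of recorded distances.

```python
import numpy as np

def distance_curve(positives, unlabeled, rng=None):
    """Hypothetical helper: record nearest-unlabeled distances while
    removing the matched unlabeled example at every step."""
    rng = np.random.default_rng(rng)
    positives = np.asarray(positives, dtype=float)
    remaining = np.asarray(unlabeled, dtype=float).copy()
    curve = []
    while len(remaining) > 0:
        # (1) draw a positive example (assumed: uniformly, with replacement)
        p = positives[rng.integers(len(positives))]
        # (2) record the minimum distance to any remaining unlabeled example
        dists = np.linalg.norm(remaining - p, axis=1)  # assumed: Euclidean
        nearest = int(np.argmin(dists))
        curve.append(dists[nearest])
        # (3) remove the nearest unlabeled example
        remaining = np.delete(remaining, nearest, axis=0)
    return np.array(curve)

# Synthetic illustration (made-up data): 30% of the unlabeled set is
# drawn from the same distribution as the labeled positives.
rng = np.random.default_rng(0)
positives = rng.normal(0.0, 1.0, size=(100, 2))
unlabeled = np.vstack([
    rng.normal(0.0, 1.0, size=(60, 2)),   # hidden positives
    rng.normal(8.0, 1.0, size=(140, 2)),  # negatives, far from the positives
])
curve = distance_curve(positives, unlabeled, rng=rng)
```

In this toy setting the recorded distances stay small while hidden positives remain in the unlabeled set and grow once only negatives are left, which is the qualitative behavior the abstract describes. Real data would rarely be this cleanly separated.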

Recommended citation: Zeiberg, Daniel, Shantanu Jain, and Predrag Radivojac. "Fast nonparametric estimation of class proportions in the positive-unlabeled classification setting." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.

Download Here

Github Repo

Machine learning for patient risk stratification for acute respiratory distress syndrome

Published in PLOS ONE, 2019

Existing prediction models for acute respiratory distress syndrome (ARDS) require manual chart abstraction and have only fair performance, limiting their suitability for driving clinical interventions. We sought to develop a machine learning approach for the prediction of ARDS that (a) leverages electronic health record (EHR) data, (b) is fully automated, and (c) can be applied at clinically relevant time points throughout a patient’s stay.

Recommended citation: Zeiberg, Daniel, et al. "Machine learning for patient risk stratification for acute respiratory distress syndrome." PloS one 14.3 (2019): e0214465.

Download Here