440 Huntington Avenue
310F West Village H
Boston, MA 02115
ATTN: Olga Vitek, 202 WVH
360 Huntington Avenue
Boston, MA 02115
- Statistical and computational methods for systems-wide molecular investigations of biological organisms
- PhD in Statistics, Purdue University
- MS in Mathematical Statistics, Purdue University
- BS, University of Geneva, Switzerland
Olga Vitek joined Northeastern University in 2014 with a joint appointment in the College of Science and the Khoury College of Computer Sciences. She was previously named the Sy and Laurie Sternberg Interdisciplinary Associate Professor at Northeastern University.
Prior to joining Northeastern, she was an assistant professor and then a tenured associate professor at Purdue University, with a joint appointment in the Department of Statistics and the Department of Computer Science (2006-2014). She interned at Eli Lilly & Company in Indianapolis and held a postdoctoral associate position in the Aebersold Lab at the Institute for Systems Biology in Seattle.
Vitek’s work develops statistical and computational methods for systems-wide molecular investigations of biological organisms. Her group works with high-throughput large-scale investigations in quantitative genomics, proteomics, metabolomics, and ionomics. This research relies on mass spectrometry and other complementary technologies to characterize the components of the biological systems, their functional interactions, and their relevance to disease. The goal of Vitek’s research is to provide statistical and computational methods and open-source software for design of these experiments, and for accurate and objective interpretation of the resulting large and complex datasets.
Vitek is a recipient of the National Science Foundation CAREER Award. During her time at Purdue University, she was a University Faculty Scholar, as well as recognized with an Outstanding Assistant Professor Teaching Award, a Graduate Student Mentoring Award, and a Teaching for Tomorrow Award. She serves on the board of directors of the U.S. Human Proteome Organization.
MSstats and Cardinal: Next Generation Statistical Mass Spectrometry in R
To provide open-source, interoperable, and extensible statistical software for quantitative mass spectrometry, which enables experimentalists and developers of statistical methods to rapidly respond to changes in the evolving biotechnological landscape.
Summer School: Big Data and Statistics for Bench Scientists
Northeastern University proposes to organize a Summer School ‘Big Data and Statistics for Bench Scientists.’ The Summer School will train life scientists and computational scientists in designing and analyzing large-scale experiments relying on proteomics, metabolomics, and other high-throughput biomolecular assays. The training will enhance the effectiveness and reproducibility of biomedical research, such as discovery of diagnostic biomarkers for early diagnosis of disease, or prognostic biomarkers for predicting therapy response.
Northeastern University requests funds for a Summer School, entitled Big Data and Statistics for Bench Scientists. The target audience for the School is graduate and post-graduate life scientists who work primarily in wet labs and who generate large datasets. Unlike other educational efforts that emphasize genomic applications, this School targets scientists working with other experimental technologies. Mass spectrometry-based proteomics and metabolomics are our main focus; however, the School is also appropriate for scientists working with other assays, e.g. nuclear magnetic resonance spectroscopy (NMR), protein arrays, etc. This large community has been traditionally under-served by educational efforts in computation and statistics. This proposal aims to fill this void. The Summer School is motivated by the feedback from smaller short courses previously co-organized or co-instructed by the PI, and will cover theoretical and practical aspects of design and analysis of large-scale experimental datasets. The Summer School will have a modular format, with eight 20-hour modules scheduled in two parallel tracks during two consecutive weeks. Each module can be taken independently. The planned modules are (1) Processing raw mass spectrometric data from proteomic experiments using Skyline, (2) Beginner’s R, (3) Processing raw mass spectrometric data from metabolomic experiments using OpenMS, (4) Intermediate R, (5) Beginner’s guide to statistical experimental design and group comparison, (6) Specialized statistical methods for detecting differentially abundant proteins and metabolites, (7) Statistical methods for discovery of biomarkers of disease, and (8) Introduction to systems biology and data integration. Each module will introduce the necessary statistical and computational methodology, and contain extensive practical hands-on sessions.
Each module will be organized by instructors with extensive interdisciplinary teaching experience, and supported by several teaching assistants. We anticipate the participation of 104 scientists, each taking on average 2 modules. Funding is requested for three yearly offerings of the School, and includes funds to provide US participants with 62 travel fellowships per year, and 156 registration fee waivers per module. All the course materials, including videos of the lectures and of the practical sessions, will be publicly available free of charge.
ABI Innovation: Scalable and Agile Analysis of Mass Spectrometry Experiments
Mass spectrometry is a diverse and versatile technology for high-throughput functional characterization of proteins, small molecules and metabolites in complex biological mixtures. The technology evolves rapidly and generates datasets of increasing complexity and size. This rapid evolution must be matched by an equally fast evolution of statistical methods and tools developed for analysis of these data. Ideally, new statistical methods should leverage the rich resources available from over 12,000 packages implemented in the R programming language and its Bioconductor project. However, technological limitations now hinder their adoption for mass spectrometric research. In response, the project ROCKET builds an enabling technology for working with large mass spectrometric datasets in R, and rapidly developing new algorithms, while benefiting from advancements in other areas of science. It also offers an opportunity for the recruitment and retention of Native American students to work with R-based technology and research, and helps prepare them for a career in STEM.
Instead of implementing yet another data processing pipeline, ROCKET builds an enabling technology for extending the scalability of R, and streamlining manipulations of large files in complex formats. First, to address the diversity of the mass spectrometric community, ROCKET supports scaling down analyses (i.e., working with large data files on relatively inexpensive hardware without fully loading them into memory), as well as scaling up (i.e., executing a workflow on a cloud or on a multiprocessor). Second, ROCKET generates an efficient mixture of R and target code which is compiled in the background for the particular deployment platform. By ensuring compatibility with mass spectrometry-specific open data storage standards, supporting multiple hardware scenarios, and generating optimized code, ROCKET enables the development of general analytical methods. Therefore, ROCKET aims to democratize access to R-based data analysis for a broader community of life scientists, and create a blueprint for a new paradigm for R-based computing with large datasets. The outcomes of the project will be documented and made publicly available at https://olga-vitek-lab.khoury.northeastern.edu/.
This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.
MSstatsQC 2.0: R/Bioconductor package for statistical quality control of mass spectrometry-based proteomic experiments
E. Dogu, S. Mohammad-Taheri, R. Olivella, F. Marty, I. Lienert, L. Reiter, E. Sabidó, O. Vitek. “MSstatsQC 2.0: R/Bioconductor package for statistical quality control of mass spectrometry-based proteomic experiments”. Journal of Proteome Research, 18:678, 2019.
MSstatsQC is an R/Bioconductor package for statistical monitoring of longitudinal system suitability and quality control in mass spectrometry-based proteomics. MSstatsQC was initially designed for targeted selected reaction monitoring experiments. This paper presents an extension, MSstatsQC 2.0, that supports experiments with global data-dependent and data-independent acquisition. The extension implements data processing and analyses that are specific to these acquisition types. It relies on state-of-the-art methods of statistical process control to detect deviations from optimal performance of various metrics (such as intensity and retention time of chromatographic peaks) and to summarize the results across multiple metrics and analytes. Additionally, the web-based graphical user interface MSstatsQCgui, implemented as a separate R/Bioconductor package, provides a user-friendly way to visualize and report the results from MSstatsQC 2.0.
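The idea of monitoring a metric over successive runs can be illustrated with a basic Shewhart-style control chart. This is a minimal Python sketch, not the MSstatsQC implementation or its API: the function names, the example retention times, and the three-sigma rule are all assumptions chosen for illustration. Limits are estimated from a guide set of runs assumed to be in control, and later runs falling outside the limits are flagged.

```python
import numpy as np

def control_limits(guide_set, n_sigma=3.0):
    """Estimate lower limit, center line, and upper limit from in-control runs."""
    guide = np.asarray(guide_set, dtype=float)
    center = guide.mean()
    sd = guide.std(ddof=1)  # sample standard deviation of the guide set
    return center - n_sigma * sd, center, center + n_sigma * sd

def out_of_control(values, lower, upper):
    """Return the indices of runs whose metric falls outside the limits."""
    values = np.asarray(values, dtype=float)
    return np.flatnonzero((values < lower) | (values > upper))

# Hypothetical retention times (minutes) of one peptide across QC runs.
guide_runs = [20.0, 20.2, 19.8, 20.1, 19.9]
new_runs = [20.1, 19.7, 20.9, 20.0, 19.2]

lower, center, upper = control_limits(guide_runs)
flagged = out_of_control(new_runs, lower, upper)
```

In practice, each metric and analyte would get its own chart, and the abstract describes summarizing results across them; that step is not shown here.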
Ness R.O., Sachs K., Mallick P., Vitek O. (2017) A Bayesian Active Learning Experimental Design for Inferring Signaling Networks. In: Sahinalp S. (eds) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science, vol 10229. Springer, Cham
Machine learning methods for learning network structure, applied to quantitative proteomics experiments, reverse-engineer intracellular signal transduction networks. They provide insight into the rewiring of signaling within the context of a disease or a phenotype. To learn the causal patterns of influence between proteins in the network, the methods require experiments that include targeted interventions that fix the activity of specific proteins. However, the interventions are costly and add experimental complexity.
We describe an active learning strategy for selecting optimal interventions. Our approach takes as inputs pathway databases and historic datasets, expresses them in the form of prior probability distributions on network structures, and selects interventions that maximize their expected contribution to structure learning. Evaluations on simulated and real data show that the strategy reduces the detection error of validated edges as compared to an unguided choice of interventions, and avoids redundant interventions, thereby increasing the effectiveness of the experiment.
Galitzine C., Beltran P.M.J., Cristea I.M., Vitek O. (2018) Statistical Inference of Peroxisome Dynamics. In: Raphael B. (eds) Research in Computational Molecular Biology. RECOMB 2018. Lecture Notes in Computer Science, vol 10812. Springer, Cham
The regulation of organelle abundance sustains critical biological processes, such as metabolism and energy production. Biochemical models mathematically express these temporal changes in terms of reactions, and their rates. The rate parameters are critical components of the models, and must be experimentally inferred. However, the existing methods for rate inference are limited, and not directly applicable to organelle dynamics.
This manuscript introduces a novel approach that integrates modeling, inference and experimentation, and incorporates biological replicates, to accurately infer the rates. The approach relies on a biochemical model in the form of a stochastic differential equation, and on a parallel implementation of inference with a particle filter. It also relies on a novel microscopy workflow that monitors organelles over long periods of time in cell culture. Evaluations on simulated datasets demonstrated the advantages of this approach in terms of increased accuracy and shortened computation time. An application to imaging of peroxisomes determined that fission, rather than de novo generation, is predominant in maintaining the organelle level under basal conditions. This biological insight serves as a starting point for a system view of organelle regulation in cells.
Cyril Galitzine, Jarrett D. Egertson, Susan Abbatiello, Clark M. Henderson, Lindsay K. Pino, Michael MacCoss, Andrew N. Hoofnagle, Olga Vitek. Molecular & Cellular Proteomics, May 1, 2018, first published on February 9, 2018, 17 (5) 913-924.
The need for assay characterization is ubiquitous in quantitative mass spectrometry-based proteomics. Among many assay characteristics, the limit of blank (LOB) and limit of detection (LOD) are two particularly useful figures of merit. LOB and LOD are determined by repeatedly quantifying the observed intensities of peptides in samples with known peptide concentrations and deriving an intensity versus concentration response curve. Most commonly, a weighted linear or logistic curve is fit to the intensity-concentration response, and LOB and LOD are estimated from the fit. Here we argue that these methods inaccurately characterize assays where observed intensities level off at low concentrations, which is a common situation in multiplexed systems. This manuscript illustrates the deficiencies of these methods, and proposes an alternative approach based on nonlinear regression that overcomes these inaccuracies. We evaluated the performance of the proposed method using computer simulations and using eleven experimental data sets acquired in Data-Independent Acquisition (DIA), Parallel Reaction Monitoring (PRM), and Selected Reaction Monitoring (SRM) mode. When the intensity levels off at low concentrations, the nonlinear model shifts the estimates of LOB/LOD upwards, in some data sets by 20–40%. In the absence of such leveling off at low concentrations, the estimates of LOB/LOD obtained with nonlinear statistical modeling were identical to those of weighted linear regression. We implemented the nonlinear regression approach in the open-source R-based software MSstats, and advocate its general use for characterization of mass spectrometry-based assays.
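A response curve that levels off at low concentrations can be sketched as a "hockey stick": a flat noise floor followed by a linear rise. The Python sketch below fits such a curve by profile least squares over candidate change points. It is an illustration under stated assumptions, not the MSstats implementation: the functional form, the function name `fit_hockey_stick`, and the noiseless example data are hypothetical, and deriving LOB and LOD from the fitted curve is not shown.

```python
import numpy as np

def fit_hockey_stick(conc, intensity):
    """Fit intensity = floor + slope * max(conc - change, 0) by profile least squares."""
    conc = np.asarray(conc, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    best = None
    # Candidate change points: every observed concentration except the largest,
    # so that at least one point lies on the rising segment.
    for change in np.unique(conc)[:-1]:
        rise = np.maximum(conc - change, 0.0)
        design = np.column_stack([np.ones_like(conc), rise])
        # For a fixed change point the model is linear in (floor, slope).
        coef, *_ = np.linalg.lstsq(design, intensity, rcond=None)
        rss = float(np.sum((intensity - design @ coef) ** 2))
        if best is None or rss < best["rss"]:
            best = {"floor": float(coef[0]), "slope": float(coef[1]),
                    "change": float(change), "rss": rss}
    return best

# Hypothetical noiseless dilution series: flat at 100 up to concentration 4,
# then rising by 10 intensity units per unit of concentration.
conc = [0, 1, 2, 4, 8, 16, 32]
intensity = [100, 100, 100, 100, 140, 220, 380]
fit = fit_hockey_stick(conc, intensity)
```

A straight line fit to the same data would place its intercept below the true noise floor, which is the kind of inaccuracy the abstract attributes to linear fits when intensities level off.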
D. Guo, K. A. Bemis, C. Rawlins, J. Agar, and O. Vitek. “Unsupervised segmentation of mass spectrometric ion images characterizes morphology of tissues.” Bioinformatics. 2019
Mass spectrometry imaging (MSI) characterizes the spatial distribution of ions in complex biological samples such as tissues. Since many tissues have complex morphology, treatments and conditions often affect the spatial distribution of the ions in morphology-specific ways. Evaluating the selectivity and the specificity of ion localization and regulation across morphology types is biologically important. However, MSI lacks algorithms for segmenting images at both single-ion and spatial resolution.
This article contributes spatial-DGMM (spatial Dirichlet Gaussian mixture model), an algorithm and a workflow for the analysis of MSI experiments that detects components of single-ion images with homogeneous spatial composition. The approach extends Dirichlet Gaussian mixture models (DGMMs) to account for the spatial structure of MSI. Evaluations on simulated and experimental datasets with diverse MSI workflows demonstrated that spatial-DGMM accurately segments ion images, and can distinguish ions with homogeneous and heterogeneous spatial distribution. We also demonstrated that the extracted spatial information is useful for downstream analyses, such as detecting morphology-specific ions, finding groups of ions with similar spatial patterns, and detecting changes in chemical composition of tissues between conditions.
Michail Schwab, Sicheng Hao, Olga Vitek, James Tompkin, Jeff Huang, and Michelle A. Borkin. 2019. Evaluating Pan and Zoom Timelines and Sliders. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI '19). ACM, New York, NY, USA, Paper 556, 12 pages. DOI: https://doi.org/10.1145/3290605.3300786
Pan and zoom timelines and sliders help us navigate large time series data. However, designing efficient interactions can be difficult. We study pan and zoom methods via crowd-sourced experiments on mobile and computer devices, asking which designs and interactions provide faster target acquisition. We find that visual context should be limited for low-distance navigation, but added for far-distance navigation; that timelines should be oriented along the longer axis, especially on mobile; and that, as compared to default techniques, double click, hold, and rub zoom appear to scale worse with task difficulty, whereas brush and especially ortho zoom seem to scale better. Software and data used in this research are available as open source.
Protein biomarkers on tissue as imaged via MALDI mass spectrometry: A systematic approach to study the limits of detection.
van de Ven, S. M. W. Y., Bemis, K. D., Lau, K., Adusumilli, R., Kota, U., Stolowitz, M., et al. (2016). Protein biomarkers on tissue as imaged via MALDI mass spectrometry: A systematic approach to study the limits of detection. Proteomics, 16(11-12), 1660–1669. http://doi.org/10.1002/pmic.201500515
MALDI mass spectrometry imaging (MSI) is emerging as a tool for protein and peptide imaging across tissue sections. Despite extensive study, there does not yet exist a baseline study evaluating the potential capabilities of this technique to detect diverse proteins in tissue sections. In this study, we developed a systematic approach for characterizing MALDI-MSI workflows in terms of limits of detection, coefficients of variation, spatial resolution, and the identification of endogenous tissue proteins. Our goal was to quantify these figures of merit for a number of different proteins and peptides, in order to gain more insight into the feasibility of protein biomarker discovery efforts using this technique. Control proteins and peptides were deposited in serial dilutions on thinly sectioned mouse xenograft tissue. Using our experimental setup, coefficients of variation were <30% on tissue sections and spatial resolution was 200 μm (or greater). Limits of detection for proteins and peptides on tissue were in the micromolar to millimolar range. Protein identification was only possible for proteins present in high abundance in the tissue. These results provide a baseline for the application of MALDI-MSI towards the discovery of new candidate biomarkers and a new benchmarking strategy that can be used for comparing diverse MALDI-MSI workflows.
Statistical detection of differentially abundant ions in mass spectrometry-based imaging experiments with complex designs
K. A. Bemis, D. Guo, A. Harry, M. Thomas, I. Lanekoff, M. Stenzel-Poore, S. Stevens, J. Laskin, and O. Vitek. “Statistical detection of differentially abundant ions in mass spectrometry-based imaging experiments with complex designs.” International Journal of Mass Spectrometry. 2019
Mass Spectrometry Imaging (MSI) characterizes changes in chemical composition between regions of biological samples such as tissues. One goal of statistical analysis of MSI experiments is class comparison, i.e. determining analytes that change in abundance between conditions more systematically than as expected by random variation. To reach accurate and reproducible conclusions, statistical analysis must appropriately reflect the initial research question, the design of the MSI experiment, and all the associated sources of variation. This manuscript highlights the importance of following these general statistical principles. Using the example of two case studies with complex experimental designs, and with different strategies of data acquisition, we demonstrate the extent to which choices made at key points of this workflow impact the results, and provide suggestions for appropriate design and analysis of MSI experiments that aim at detecting differentially abundant analytes.
Tsung-Heng Tsai, Zhiqi Hao, Qiuting Hong, Benjamin Moore, Cinzia Stella, Jeffrey H. Zhang, Yan Chen, Michael Kim, Theo Koulis, Gregory A. Ryslik, Erik Verschueren, Fred Jacobson, William E. Haskins & Olga Vitek
Peptide mapping with liquid chromatography–tandem mass spectrometry (LC-MS/MS) is an important analytical method for characterization of post-translational and chemical modifications in therapeutic proteins. Despite its importance, there is currently no consensus on the statistical analysis of the resulting data. In this manuscript, we distinguish three statistical goals for therapeutic protein characterization: (1) estimation of site occupancy of modifications in one condition, (2) detection of differential site occupancy between conditions, and (3) estimation of combined site occupancy across multiple modification sites. We propose an approach, which addresses these goals in terms of summarizing the quantitative information from the mass spectra, statistical modeling, and model-based analysis of LC-MS/MS data. We illustrate the approach using an LC-MS/MS experiment from an antibody-drug conjugate and its monoclonal antibody intermediate. The performance was compared to a ‘naïve’ data analysis approach, by using computer simulation, evaluation of differential site occupancy in positive and negative controls, and comparisons of estimated site occupancy with orthogonal experimental measurements of N-linked glycoforms and total oxidation. The results demonstrated the importance of replicated studies of protein characterization, and of appropriate statistical modeling, for reproducible, accurate and efficient site occupancy estimation and differential analysis.
Kylie A. Bemis and Olga Vitek
We introduce matter, an R package for direct interactions with larger-than-memory datasets, stored in an arbitrary number of files of any size. matter is primarily designed for datasets in new and rapidly evolving file formats, which may lack extensive software support. matter enables a wide variety of data exploration and manipulation steps and is extensible to many bioinformatics applications. It supports reproducible research by minimizing the need to convert and store data in multiple formats. We illustrate the performance of matter in conjunction with the Bioconductor package Cardinal for analysis of high-resolution, high-throughput mass spectrometry imaging experiments.
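The general idea of working with a file without loading it fully into memory can be sketched in Python with `numpy.memmap`. This is an analogy only and does not use or reproduce the matter API: the file layout (a raw row-major float32 matrix), the function name `chunked_column_means`, and the chunk size are assumptions for illustration.

```python
import os
import tempfile
import numpy as np

def chunked_column_means(path, n_rows, n_cols, dtype=np.float32, chunk_rows=1000):
    """Compute column means of an on-disk matrix, reading chunk_rows rows at a time."""
    # The memory map exposes the file as an array; data is read only when sliced.
    data = np.memmap(path, dtype=dtype, mode="r", shape=(n_rows, n_cols))
    totals = np.zeros(n_cols, dtype=np.float64)
    for start in range(0, n_rows, chunk_rows):
        block = data[start:start + chunk_rows]
        totals += block.sum(axis=0, dtype=np.float64)
    del data  # release the map so the file can be removed afterwards
    return totals / n_rows

# Hypothetical intensity matrix written to a temporary raw binary file.
rng = np.random.default_rng(0)
matrix = rng.random((5000, 8)).astype(np.float32)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "intensities.bin")
    matrix.tofile(path)
    means = chunked_column_means(path, n_rows=5000, n_cols=8)
```

Only one chunk is resident at a time, so peak memory depends on `chunk_rows` rather than on the size of the file.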
From Correlation to Causality: Statistical Approaches to Learning Regulatory Relationships in Large-Scale Biomolecular Investigations
R. Ness, K. Sachs, O. Vitek. “From correlation to causality: statistical approaches to learning regulatory relationships in large-scale biomolecular investigations”. Journal of Proteome Research, in press, 2016.
Causal inference, the task of uncovering regulatory relationships between components of biomolecular pathways and networks, is a primary goal of many high-throughput investigations. Statistical associations between observed protein concentrations can suggest an enticing number of hypotheses regarding the underlying causal interactions, but when do such associations reflect the underlying causal biomolecular mechanisms? The goal of this perspective is to provide suggestions for causal inference in large-scale experiments, which utilize high-throughput technologies such as mass-spectrometry-based proteomics. We describe in nontechnical terms the pitfalls of inference in large data sets and suggest methods to overcome these pitfalls and reliably find regulatory associations.
K. D. Bemis, A. Harry, L. S. Eberlin, C. Ferreira, S. M. van de Ven, P. Mallick, M. Stolowitz, O. Vitek. “Cardinal: an R package for statistical analysis of mass spectrometry-based imaging experiments”. Bioinformatics, 31:2418, 2015.
Cardinal is an R package for statistical analysis of mass spectrometry-based imaging (MSI) experiments of biological samples such as tissues. Cardinal supports both Matrix-Assisted Laser Desorption/Ionization (MALDI) and Desorption Electrospray Ionization-based MSI workflows, and experiments with multiple tissues and complex designs. The main analytical functionalities include (1) image segmentation, which partitions a tissue into regions of homogeneous chemical composition, selects the number of segments and the subset of informative ions, and characterizes the associated uncertainty and (2) image classification, which assigns locations on the tissue to pre-defined classes, selects the subset of informative ions, and estimates the resulting classification error by (cross-) validation. The statistical methods are based on mixture modeling and regularization.
Probabilistic Segmentation of Mass Spectrometry (MS) Images Helps Select Important Ions and Characterize Confidence in the Resulting Segments
Kyle D. Bemis, April Harry, Livia S. Eberlin, Christina R. Ferreira, Stephanie M. van de Ven, Parag Mallick, Mark Stolowitz and Olga Vitek
Mass spectrometry imaging is a powerful tool for investigating the spatial distribution of chemical compounds in a biological sample such as tissue. Two common goals of these experiments are unsupervised segmentation of images into newly discovered homogeneous segments and supervised classification of images into predefined classes. In both cases, the important secondary goals are to characterize the uncertainty associated with the segmentation and with the classification and to characterize the spectral features that define each segment or class. Recent analysis methods have focused on the spatial structure of the data to improve results. However, they either do not address these secondary goals or do this with separate post hoc procedures.
We introduce spatial shrunken centroids, a statistical model-based framework for both supervised classification and unsupervised segmentation. It takes as input sets of previously detected, aligned, quantified, and normalized spectral features and expresses both the spatial and the multivariate nature of the data using probabilistic modeling. It selects informative subsets of spectral features that define each unsupervised segment or supervised class and quantifies and visualizes the uncertainty in spatial segmentations and in tissue classification. In the unsupervised setting, it also guides the choice of an appropriate number of segments. We demonstrate the usefulness of this framework in a supervised human renal cell carcinoma experimental dataset and several unsupervised experimental datasets, including a pig fetus cross-section, three rodent brains, and a controlled image with known ground truth. This framework is available for use within the open-source R package Cardinal as part of a full pipeline for the processing, visualization, and statistical analysis of mass spectrometry imaging experiments.
MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments
M. Choi, C.-Y. Chang, T. Clough, D. Broudy, T. Killeen, B. MacLean, O. Vitek. “MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments”. Bioinformatics, 30:2524, 2014.
MSstats is an R package for statistical relative quantification of proteins and peptides in mass spectrometry-based proteomics. Version 2.0 of MSstats supports label-free and label-based experimental workflows and data-dependent, targeted and data-independent spectral acquisition. It takes as input identified and quantified spectral peaks, and outputs a list of differentially abundant peptides or proteins, or summaries of peptide or protein relative abundance. MSstats relies on a flexible family of linear mixed models.
C.-Y. Chang, E. Sabidó, R. Aebersold, O. Vitek. “Targeted protein quantification using sparse reference labeling”. Nature Methods, 11:301, 2014.
Targeted proteomics is a method of choice for accurate and high-throughput quantification of predefined sets of proteins. Many workflows use isotope-labeled reference peptides for every target protein, which is time consuming and costly. We report a statistical approach for quantifying full protein panels with a reduced set of reference peptides. This label-sparse approach achieves accurate quantification while reducing experimental cost and time. It is implemented in the software tool SparseQuant.
Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size
D. Yu, W. Huber, O. Vitek. “Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size”. Bioinformatics, 29:1275, 2013.
RNA-seq experiments produce digital counts of reads that are affected by both biological and technical variation. To distinguish the systematic changes in expression between conditions from noise, the counts are frequently modeled by the Negative Binomial distribution. However, in experiments with small sample size, the per-gene estimates of the dispersion parameter are unreliable.
We propose a simple and effective approach for estimating the dispersions. First, we obtain the initial estimates for each gene using the method of moments. Second, the estimates are regularized, i.e. shrunk towards a common value that minimizes the average squared difference between the initial estimates and the shrinkage estimates. The approach does not require extra modeling assumptions, is easy to compute and is compatible with the exact test of differential expression.
We evaluated the proposed approach using 10 simulated and experimental datasets and compared its performance with that of the currently popular packages edgeR, DESeq, baySeq, BBSeq and SAMseq. For these datasets, the proposed approach, implemented as sSeq, performed favorably for experiments with small sample size in terms of sensitivity, specificity and computational time.
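The two steps described above, method-of-moments estimates followed by shrinkage towards a common value, can be sketched in Python as follows. This is a simplified illustration and not the sSeq implementation: the fixed shrinkage weight, the function names, and the example counts are assumptions. With a fixed weight, the target that minimizes the average squared difference between initial and shrunk estimates is simply the mean of the initial estimates, which is used as the default here.

```python
import numpy as np

def mom_dispersion(counts):
    """Per-gene method-of-moments dispersion (rows are genes, columns are replicates)."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)
    # Negative Binomial variance: var = mean + phi * mean**2, solved for phi.
    with np.errstate(divide="ignore", invalid="ignore"):
        phi = (var - mean) / mean**2
    phi = np.where(mean > 0, phi, 0.0)  # genes with all-zero counts get 0
    return np.clip(phi, 0.0, None)      # dispersion cannot be negative

def shrink(phi, weight=0.5, target=None):
    """Shrink initial estimates towards a common target by a fixed weight (assumed)."""
    phi = np.asarray(phi, dtype=float)
    if target is None:
        target = phi.mean()
    return (1.0 - weight) * phi + weight * target

# Hypothetical counts for three genes with three replicates each.
counts = [[10, 20, 30],
          [5, 5, 5],
          [0, 0, 0]]
initial = mom_dispersion(counts)
shrunk = shrink(initial, weight=0.5)
```

Shrinkage pulls the noisy per-gene estimates together, which is the motivation the abstract gives for experiments with few replicates.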
A. L. Oberg, O. Vitek. “Statistical design of quantitative mass spectrometry-based proteomic experiments”. Journal of Proteome Research, 8:2144, 2009
We review the fundamental principles of statistical experimental design, and their application to quantitative mass spectrometry-based proteomics. We focus on class comparison using Analysis of Variance (ANOVA), and discuss how randomization, replication and blocking help avoid systematic biases due to the experimental procedure, and help optimize our ability to detect true quantitative changes between groups. We also discuss the issues of pooling multiple biological specimens for a single mass analysis, and calculation of the number of replicates in a future study. When applicable, we emphasize the parallels between designing quantitative proteomic experiments and experiments with gene expression microarrays, and give examples from that area of research. We illustrate the discussion using theoretical considerations, and using real-data examples of profiling of disease.