440 Huntington Avenue
450 West Village H
Boston, MA 02115
ATTN: Wolfgang Gatterbauer, 202 WVH
360 Huntington Avenue
Boston, MA 02115-5000
- Data Management
- PhD in Computer Science, Technical University of Vienna
- MS in Technology & Policy and Electrical Engineering & Computer Science, Massachusetts Institute of Technology
- BS/MS in Mechanical Engineering, Technical University of Graz
Wolfgang Gatterbauer is an Associate Professor in the Khoury College of Computer Sciences. Prior to joining Northeastern University, he was an Assistant Professor in the Tepper school of Business at Carnegie Mellon University; and before that a PostDoc in the database group of University of Washington. His main research interests are data and information management. His work was published in venues such as SIGMOD, VLDB, AAAI, and WWW. He is a recipient of the NSF CAREER award and “best-of-conference” mentions from VLDB 2015 and SIGMOD 2017.
CAREER: Scaling Approximate Inference and Approximation – Aware Learning
CAREER: Scaling Approximate Inference and Approximation – Aware Learning
The last decade has seen an enormous increase in our ability to gather and manage large amounts of data; business, healthcare, education, economy, science, and almost every aspect of society are accumulating data at unprecedented levels. The basic premise is that by having more data, even if uncertain and of lower quality, we are also able to make better-informed decisions. To make any decisions, we need to perform “inference” over the data, i.e. to either draw new conclusions, or to find support for existing hypotheses, thus allowing us to favor one course of action over another. However, general reasoning under uncertainty is highly intractable, and many state-of-the-art systems today perform approximate inference by reverting to sampling. Thus for many modern applications (such as information extraction, knowledge aggregation, question-answering systems, computer vision, and machine intelligence), inference is a key bottleneck, and new methods for tractable approximate inference are needed.
This project addresses the challenge of scaling inference by generalizing two highly scalable approximate inference methods and complementing them with scalable methods for parameter learning that are “approximation-aware.” Thus, instead of treating the (i) learning and the (ii) inference steps separately, this project uses the approximation methods developed for inference also for learning the model. The research hypothesis is that this approach increases the overall end-to-end prediction accuracy while simultaneously increasing scalability. Concretely, the project develops the theory and a set of scalable algorithms and optimization methods for at least the following four sub-problems: (1) approximating general probabilistic conjunctive queries with standard relational databases; (2) learning the probabilities in uncertain databases based on feedback on rankings of output tuples from general queries; (3) approximating the exact probabilistic inference in undirected graphical models with linearized update equations; and (4) complementing the latter with a robust framework for learning linearized potentials from partially labeled data.
Maarten Van den Heuvel, Peter Ivanov, Wolfgang Gatterbauer, Floris Geerts, and Martin Theobald. 2019. Anytime Approxima- tion in Probabilistic Databases via Scaled Dissociations. In 2019 International Conference on Management of Data (SIGMOD ’19), June 30-July 5, 2019, Amsterdam, Netherlands. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3299869.3319900
Speeding up probabilistic inference remains a key challenge in probabilistic databases (PDBs) and the related area of statistical relational learning (SRL). Since computing proba- bilities for query answers is #P-hard, even for fairly simple conjunctive queries, both the PDB and SRL communities have proposed a number of approximation techniques over the years. The two prevalent techniques are either (i) MCMC- style sampling or (ii) branch-and-bound (B&B) algorithms that iteratively improve model-based bounds using a combination of variable substitution and elimination.
We propose a new anytime B&B approximation scheme that encompasses all prior model-based approximation schemes proposed in the PDB and SRL literature. Our approach relies on the novel idea of “scaled dissociation” which can improve both the upper and lower bounds of existing model- based algorithms. We apply our approach to the well-studied problem of evaluating self-join-free conjunctive queries over tuple-independent PDBs, and show a consistent reduction in approximation error in our experiments on TPC-H, Yago3, and a synthetic benchmark setting.
Li Chou, Wolfgang Gatterbauer, and Vibhav Gogate. 2018. Dissociation-based oblivious bounds for weighted model counting. UAI.
We consider the weighted model counting task which includes important tasks in graphical models, such as computing the partition function and probability of evidence as special cases. We propose a novel partition-based bounding algorithm that exploits logical structure and gives rise to a set of inequalities from which upper (or lower) bounds can be derived efficiently. The bounds come with optimality guarantees under certain conditions and are oblivious in that they require only limited observations of the structure and parameters of the problem. We experimentally compare our bounds with the mini-bucket scheme (which is also oblivious) and show that our new bounds are often superior and never worse on a wide variety of benchmark networks.
Siyuan Wu, Leong Hou U., Sourav S. Bhowmick, and Wolfgang Gatterbauer. 2018. PISTIS: A Conflict of Interest Declaration and Detection System for Peer Review Management. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). ACM, New York, NY, USA, 1713-1716. DOI: https://doi.org/10.1145/3183713.3193552
Detecting conflicts of interest (COIs) is key for guaranteeing the fairness of a peer-review process. In many conference management systems, the COIs of authors and reviewers are self-declared, and the declaration process is time consuming and potentially incomplete. To address this problem, we demonstrate a novel interactive system called PISTIS that assists the declaration process in a semi-automatic manner. Apart from keyword search and simple filtering, our system provides an interactive graphical interface that helps users explore potential COIs based on the heterogenous data sources. To simply the process of declaration, we also recommend latent COIs using a supervised ranking model that can be iteratively refined from the data collected from past declarations. We believe that PISTIS can be useful as an assistant tool in many real world conference management systems.
Jiao Y., Ravi R., Gatterbauer W. (2017) Algorithms for Automatic Ranking of Participants and Tasks in an Anonymized Contest. In: Poon SH., Rahman M., Yen HC. (eds) WALCOM: Algorithms and Computation. WALCOM 2017. Lecture Notes in Computer Science, vol 10167. Springer, Cham
We introduce a new set of problems based on the Chain Editing problem. In our version of Chain Editing, we are given a set of anonymous participants and a set of undisclosed tasks that every participant attempts. For each participant-task pair, we know whether the participant has succeeded at the task or not. We assume that participants vary in their ability to solve tasks, and that tasks vary in their difficulty to be solved. In an ideal world, stronger participants should succeed at a superset of tasks that weaker participants succeed at. Similarly, easier tasks should be completed successfully by a superset of participants who succeed at harder tasks. In reality, it can happen that a stronger participant fails at a task that a weaker participants succeeds at. Our goal is to find a perfect nesting of the participant-task relations by flipping a minimum number of participant-task relations, implying such a “nearest perfect ordering” to be the one that is closest to the truth of participant strengths and task difficulties. Many variants of the problem are known to be NP-hard.
We propose six natural k-near versions of the Chain Editing problem and classify their complexity. The input to a k-near Chain Editing problem includes an initial ordering of the participants (or tasks) that we are required to respect by moving each participant (or task) at most k positions from the initial ordering. We obtain surprising results on the complexity of the six k-near problems: Five of the problems are polynomial-time solvable using dynamic programming, but one of them is NP-hard.
Niccolò Meneghetti, Oliver Kennedy, and Wolfgang Gatterbauer. 2018. Learning From Query-Answers: A Scalable Approach to Belief Updating and Parameter Learning. ACM Trans. Database Syst. 43, 4, Article 17 (December 2018), 41 pages. DOI: https://doi.org/10.1145/3277503
Tuple-independent and disjoint-independent probabilistic databases (TI- and DI-PDBs) represent uncertain data in a factorized form as a product of independent random variables that represent either tuples (TI-PDBs) or sets of tuples (DI-PDBs). When the user submits a query, the database derives the marginal probabilities of each output-tuple, exploiting the underlying assumptions of statistical independence. While query processing in TI- and DI-PDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results. Addressing this problem is the main focus of this article. We first introduce Beta Probabilistic Databases (B-PDBs), a generalization of TI-PDBs designed to support both (i) belief updating and (ii) parameter learning in a principled and scalable way. The key idea of B-PDBs is to treat each parameter as a latent, Beta-distributed random variable. We show how this simple expedient enables both belief updating and parameter learning in a principled way, without imposing any burden on regular query processing. Building on B-PDBs, we then introduce Dirichlet Probabilistic Databases(D-PDBs), a generalization of DI-PDBs with similar properties. We provide the following key contributions for both B- and D-PDBs: (i) We study the complexity of performing Bayesian belief updates and devise efficient algorithms for certain tractable classes of queries; (ii) we propose a soft-EM algorithm for computing maximum-likelihood estimates of the parameters; (iii) we present an algorithm for efficiently computing conditional probabilities, allowing us to efficiently implement B- and D-PDBs via a standard relational engine; and (iv) we support our conclusions with extensive experimental results.
Niccolo' Meneghetti, Oliver Kennedy, and Wolfgang Gatterbauer. 2017. Beta Probabilistic Databases: A Scalable Approach to Belief Updating and Parameter Learning. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17). ACM, New York, NY, USA, 573-586. DOI: https://doi.org/10.1145/3035918.3064026
Tuple-independent probabilistic databases (TI-PDBs) handle uncertainty by annotating each tuple with a probability parameter; when the user submits a query, the database derives the marginal probabilities of each output-tuple, assuming input-tuples are statistically independent. While query processing in TI-PDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results. Addressing this problem is the main focus of this paper. We introduce Beta Probabilistic Databases (B-PDBs), a generalization of TI-PDBs designed to support both (i) belief updating and (ii) parameter learning in a principled and scalable way. The key idea of B-PDBs is to treat each parameter as a latent, Beta-distributed random variable. We show how this simple expedient enables both belief updating and parameter learning in a principled way, without imposing any burden on regular query processing. We use this model to provide the following key contributions: (i) we show how to scalably compute the posterior densities of the parameters given new evidence; (ii) we study the complexity of performing Bayesian belief updates, devising efficient algorithms for tractable classes of queries; (iii) we propose a soft-EM algorithm for computing maximum-likelihood estimates of the parameters; (iv) we show how to embed the proposed algorithms into a standard relational engine; (v) we support our conclusions with extensive experimental results.
Wolfgang Gatterbauer. 2017. The linearization of belief propagation on pairwise markov random fields. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI'17). AAAI Press 3747-3753.
Belief Propagation (BP) is a widely used approximation for exact probabilistic inference in graphical models, such as Markov Random Fields (MRFs). In graphs with cycles, however, no exact convergence guarantees for BP are known, in general. For the case when all edges in the MRF carry the same symmetric, doubly stochastic potential, recent works have proposed to approximate BP by linearizing the update equations around default values, which was shown to work well for the problem of node classification. The present paper generalizes all prior work and derives an approach that approximates loopy BP on any pairwise MRF with the problem of solving a linear equation system. This approach combines exact convergence guarantees and a fast matrix implementation with the ability to model heterogenous networks. Experiments on synthetic graphs with planted edge potentials show that the linearization has comparable labeling accuracy as BP for graphs with weak potentials, while speeding-up inference by orders of magnitude.
Dissociation and propagation for approximate lifted inference with standard relational database management systems
Gatterbauer, W. & Suciu, D. The VLDB Journal (2017) 26: 5. https://doi.org/10.1007/s00778-016-0434-5
Probabilistic inference over large data sets is a challenging data management problem since exact inference is generally #P-hard and is most often solved approximately with sampling-based methods today. This paper proposes an alternative approach for approximate evaluation of conjunctive queries with standard relational databases: In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known PTIME self-join-free conjunctive queries: A query is in PTIME if and only if our algorithm returns one single plan. Furthermore, our approach is a generalization of a family of efficient ranking methods from graphs to hypergraphs. We also adapt three relational query optimization techniques to evaluate all necessary plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers. We also note that the techniques developed in this paper apply immediately to lifted inference from statistical relational models since lifted inference corresponds to PTIME plans in probabilistic databases.