Recently introduced privacy legislation has aimed to restrict and control the amount of personal data published by companies and shared to third parties. Much of this real data is not only sensitive requiring anonymization, but also contains characteristic details from a variety of individuals. This diversity is desirable in many applications ranging from Web search to drug and product development. Unfortunately, data anonymization techniques have largely ignored diversity in its published result. This inadvertently propagates underlying bias in subsequent data analysis. We study the problem of finding a diverse anonymized data instance where diversity is measured via a set of diversity constraints. We formalize diversity constraints and study their foundations such as implication and satisfiability. We show that determining the existence of a diverse, anonymized instance can be done in PTIME, and we present the DIVA algorithm to compute a DIVerse and Anonymized relation. We conduct extensive experiments using real and synthetic data showing the effectiveness of our techniques, and improvement over existing baselines. Our work aligns with recent trends towards responsible data science by coupling diversity with privacy-preserving data publishing.
Pastwatch is a data summarization, explanation and visualization framework for the provenance of aggregate queries. Data provenance includes any information about the origin of a piece of data and the process that led to its creation. The provenance of a query over a database is the data in the database that contributed to the query answer. For aggregate queries that apply mathematical functions, such as sum and average, the provenance of a query answer usually contains a large number of database records which makes it difficult for a database user to explore and understand it. Pastwatch facilitates database access by providing provenance summarization of queries, which helps users to understand the query answers. No more comments but this is not the end of project.
Data cleaning is a pervasive problem for organizations as they try to reap value from their data. Recent advances in networking and cloud computing technology have fueled a new computing paradigm called Database-as-a-Service, where data management tasks are outsourced to large service providers. In this project, we consider a Data Cleaning-as-a-Service model that allows a client to interact with a data cleaning provider who hosts curated, and sensitive data. We present PACAS: a Privacy-Aware data Cleaning-As-a-Service model that facilitates interaction between the parties with client query requests for data, and a service provider using a data pricing scheme that computes prices according to data sensitivity. We propose new extensions to the model to define generalized data repairs that obfuscate sensitive data to allow data sharing between the client and service provider. We present a new semantic distance measure to quantify the utility of such repairs, and we re-define the notion of consistency in the presence of generalized values. The PACAS model uses (X, Y, L)-anonymity that extends existing data publishing techniques to consider the semantics in the data while protecting sensitive values. Our evaluation over real data show that PACAS safeguards semantically related sensitive values, and provides lower repair errors compared to existing privacy-aware cleaning techniques.
CurrentClean is a probabilistic system for the detection and cleaning of stale data. It learns spatio-temporal update patterns for values in a database via past update queries. CurrentClean applies inference rules to model the causal and co-occurrence update patterns seen in real data and estimates currency of values and recommends spatio-temporal-aware repairs for stale values. We applied several optimization techniques that improve the inference run-time in the system and we conducted extensive experiments and studied CurrentClean’s comparative accuracy to detect stale values in real data, as well as its repair effectiveness.
Formulating efficient SQL queries requires several cycles of tuning and execution, particularly for inexperienced users. We examine methods that can accelerate and improve this interaction by providing insights about SQL queries prior to execution. We achieve this by predicting properties such as the query answer size, its run-time, and error class. Unlike existing approaches, our approach does not rely on any statistics from the database instance or query execution plans. This is particularly important in settings with limited access to the database instance. Our approach is based on using data-driven machine learning techniques that rely on large query workloads to model SQL queries and their properties. We evaluate the utility of neural network models and traditional machine learning models. We use two real-world query workloads: the Sloan Digital Sky Survey (SDSS) and the SQLShare query workload. Empirical results show that the neural network models are more accurate in predicting the query error class, achieving a higher F-measure on classes with fewer samples as well as performing better on other problems such as run-time and answer size prediction. These results are encouraging and confirm that SQL query workloads and data-driven machine learning methods can be leveraged to facilitate query composition and analysis.