Western University Computer Science

PhD Defense

 

Xiang Li

Classification with Large Sparse Datasets: Convergence Analysis and Scalable Algorithms

 

Date: Monday, July 24, 2017
Time: 9:30 a.m.
Place: Middlesex College, Room 320
Supervisor: Dr. Charles Ling
Thesis Examiners: Dr. John Barron, Dr. Mike Bauer
Extra-Departmental Examiner: Dr. Xianbin Wang (ECE)
External Examiner: Dr. Xiaodan Zhu

 

Abstract:

Large and sparse datasets, such as user ratings over a large collection of items, are common in the big data era. Many applications need to classify users or items based on these high-dimensional and sparse data vectors, e.g., to predict the profitability of a product or the age group of a user. Linear classifiers are popular choices for classifying such datasets because of their efficiency. To classify large sparse data more effectively, the following important questions need to be answered.

1. Sparse data and convergence behavior. How do different properties of a dataset, such as the sparsity rate and the missing-data mechanism, affect the convergence behavior of classification?

2. Handling sparse data with non-linear models. How can the non-linear structures in large sparse data be learned efficiently during classification?

This thesis addresses these questions with empirical and theoretical analysis of large and sparse datasets. We begin by studying the convergence behavior of popular classifiers on large and sparse data. It is known that a classifier gains better generalization ability as it learns from more and more training examples, eventually converging to the best generalization performance attainable for a given data distribution. In this thesis, we focus on how the sparsity rate and the missing-data mechanism systematically affect such convergence behavior. Our study covers different types of classification models, including generative classifiers and discriminative linear classifiers. To explore the convergence behaviors systematically, we use synthetic data sampled from statistical models of real-world large sparse datasets, and we consider different data missing mechanisms that are common in practice. From the experiments, we make several useful observations about the convergence behavior of classifying large sparse data. Based on these observations, we further investigate the theoretical reasons behind them and arrive at a series of useful conclusions. For better applicability, we also provide practical guidelines for applying our results. Our study helps answer whether obtaining more data, or data with fewer missing values, is worthwhile in different situations, which is useful for efficient data collection and preparation.
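As a toy illustration of the kind of learning-curve study described above (not the thesis's actual experimental setup), the sketch below generates synthetic two-class Gaussian data, drops entries completely at random (MCAR) at a fixed sparsity rate, represents missing entries as zeros, and tracks the test accuracy of a simple logistic-regression classifier as the training set grows. All dimensions, rates, and learning parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, d=20, mu=0.5):
    """Two-class Gaussian data: class means are +mu and -mu in every dimension."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(0, 1, size=(n, d)) + np.where(y[:, None] == 1, mu, -mu)
    return X, y

def mcar(X, rate):
    """Missing Completely At Random: each entry is dropped (set to 0) with prob `rate`."""
    Xs = X.copy()
    Xs[rng.random(X.shape) < rate] = 0.0
    return Xs

def logistic_fit(X, y, lr=0.5, epochs=300):
    """Plain batch gradient descent on the logistic loss (illustration only)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y)) / len(y)
    return w

rate = 0.8                              # 80% of entries missing
Xte, yte = make_data(4000)
Xte = mcar(Xte, rate)                   # test data is sparse too

curve = []                              # learning curve: accuracy vs. training size
for n in (100, 400, 1600, 6400):
    Xtr, ytr = make_data(n)
    w = logistic_fit(mcar(Xtr, rate), ytr)
    curve.append(np.mean((Xte @ w > 0) == yte))
print([round(a, 3) for a in curve])
```

Repeating this sweep over different sparsity rates or missing mechanisms (e.g., making the drop probability depend on the entry's value, as in missing-not-at-random) is the kind of controlled comparison such a convergence study relies on.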

Despite being efficient, linear classifiers cannot learn non-linear structures in a dataset, such as low-rankness; as a result, their accuracy may suffer. Meanwhile, most non-linear methods, such as kernel machines, can hardly scale to very large and high-dimensional datasets. The third part of the thesis studies how to efficiently learn non-linear structures in large sparse data. Towards this goal, we develop novel scalable feature mappings that achieve better accuracy than linear classification. We demonstrate that the proposed methods not only outperform linear classification but are also scalable to large and sparse datasets, with moderate memory and computation requirements.
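One well-known example of a scalable non-linear feature mapping in this spirit (the thesis's own mappings are not reproduced here) is random Fourier features, which approximate an RBF kernel so that a linear model trained on the mapped features behaves like a kernel machine at a fraction of the cost. The sketch below compares a linear ridge classifier against the same classifier on random features, on a toy two-ring dataset that no linear boundary can separate; the values of D and gamma are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_circles(n):
    """Two concentric rings: a simple non-linear class structure."""
    y = rng.integers(0, 2, size=n)
    r = 1.0 + y + rng.normal(0, 0.1, size=n)
    t = rng.uniform(0, 2 * np.pi, size=n)
    return np.column_stack([r * np.cos(t), r * np.sin(t)]), y

def rff(X, W, b):
    """Random Fourier features: z(x)·z(y) approximates exp(-gamma * ||x - y||^2)."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

def ridge_fit(Z, y, lam=1e-3):
    """Closed-form ridge regression on +/-1 targets; the sign gives the class."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ (2.0 * y - 1.0))

Xtr, ytr = make_circles(1000)
Xte, yte = make_circles(1000)

# Linear baseline: a hyperplane cannot separate the two rings.
w_lin = ridge_fit(Xtr, ytr)
acc_lin = np.mean((Xte @ w_lin > 0) == yte)

# Non-linear via random features: sample the map once, then stay linear.
D, gamma = 300, 1.0
W = rng.normal(0, np.sqrt(2 * gamma), size=(2, D))
b = rng.uniform(0, 2 * np.pi, size=D)
w_rff = ridge_fit(rff(Xtr, W, b), ytr)
acc_rff = np.mean((rff(Xte, W, b) @ w_rff > 0) == yte)

print(f"linear: {acc_lin:.2f}  random features: {acc_rff:.2f}")
```

Because the mapping is applied row by row and the downstream model stays linear, both training cost and memory grow with the number of features D rather than with the number of training examples, which is what makes this family of methods attractive for large sparse data.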

To conclude, the main contribution of this thesis is to answer important questions about classifying large and sparse datasets. On the one hand, we study the convergence behavior of widely used classifiers under different data missing mechanisms; on the other, we develop efficient methods that learn the non-linear structures in large sparse data and improve classification accuracy. Overall, the thesis not only provides practical guidance on the convergence behavior of classifiers on large sparse datasets, but also develops highly efficient algorithms for classifying such datasets in practice.