## APPLICATIONS OF RANDOM MATRIX THEORY IN STATISTICS AND MACHINE LEARNING

ABSTRACT
We live in an age of big data. Analyzing modern data sets can be very difficult because they are usually massive, high-dimensional, and heterogeneous. Dealing with these features often plays a key role in modern statistical and machine learning research. This dissertation uses random matrix theory (RMT), a powerful mathematical tool, to study several important problems where the data are massive, high-dimensional, and sometimes heterogeneous.

The first chapter briefly introduces some basics of random matrix theory (RMT). We also cover some classical applications of RMT to statistics and machine learning.

The second chapter is about distributed linear regression, where we consider the ordinary least squares (OLS) estimators. Distributed statistical learning problems arise commonly when dealing with large datasets. In this setup, datasets are partitioned over machines, which compute locally, and communicate short messages. Communication is often the bottleneck. We study one-step and iterative weighted parameter averaging in statistical linear models under data parallelism. We do linear regression on each machine, send the results to a central server, and take a weighted average of the parameters. Optionally, we iterate, sending back the weighted average and doing local ridge regressions centered at it. How does this work compared to doing linear regression on the full data? Here we study the performance loss in estimation and test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training data size. We find the performance loss in one-step weighted averaging, and also give results for iterative averaging. We also find that different problems are affected differently by the distributed framework.
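The one-step scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the weights proportional to local sample size, the Gaussian design, and the names `local_ols` and `one_step_weighted_average` are hypothetical choices made here, not the weights or notation derived in the dissertation.

```python
import numpy as np

def local_ols(X, y):
    """OLS fit on the data held by a single machine."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def one_step_weighted_average(X, y, k, weights=None):
    """Split the rows over k machines, fit OLS locally, average the estimates."""
    X_parts = np.array_split(X, k)
    y_parts = np.array_split(y, k)
    betas = np.stack([local_ols(Xi, yi) for Xi, yi in zip(X_parts, y_parts)])
    if weights is None:
        # Assumption: weights proportional to local sample size.
        weights = np.array([len(yi) for yi in y_parts], dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to one
    return weights @ betas

# Simulated data with dimension comparable to sample size (assumed setup).
rng = np.random.default_rng(0)
n, p, k = 2000, 50, 4
beta_true = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)

beta_full = local_ols(X, y)                     # OLS on the full data
beta_dist = one_step_weighted_average(X, y, k)  # distributed estimate

err_full = np.sum((beta_full - beta_true) ** 2)
err_dist = np.sum((beta_dist - beta_true) ** 2)
print(f"estimation error, full data:   {err_full:.4f}")
print(f"estimation error, distributed: {err_dist:.4f}")
```

Comparing the two printed errors for different values of `k` and `p` gives a rough empirical picture of the performance loss that the chapter studies.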

The third chapter studies a fundamental problem in this area: how should ridge regression be done in a distributed computing environment? Ridge regression is an extremely popular method for supervised learning and has several optimality properties, so it is important to study. We study one-shot methods that construct weighted combinations of ridge regression estimators computed on each machine. By analyzing the mean squared error in a high-dimensional random-effects model where each predictor has a small effect, we discover several new phenomena. We also propose a new Weighted ONe-shot DistributEd Ridge regression (WONDER) algorithm. We test WONDER in simulation studies and using the Million Song Dataset as an example. There it can save at least 100x in computation time, while nearly preserving test accuracy.
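A one-shot weighted combination of local ridge estimators might look like the sketch below. It is not the WONDER algorithm, whose weights and tuning are not given in this abstract. The equal default weights, the `n * lam` scaling of the penalty, the simulated random-effects data, and the function names are all assumptions made for illustration.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimator; the n * lam penalty scaling is an assumption."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

def one_shot_ridge(X, y, k, lam, weights=None):
    """Fit ridge on each of k row blocks and return a weighted combination."""
    parts = zip(np.array_split(X, k), np.array_split(y, k))
    betas = np.stack([ridge(Xi, yi, lam) for Xi, yi in parts])
    if weights is None:
        # Placeholder: equal weights. Choosing the weights is the actual
        # problem studied in the chapter and is not reproduced here.
        weights = np.full(k, 1.0 / k)
    return np.asarray(weights, dtype=float) @ betas

# Random-effects style simulation: every predictor has a small effect.
rng = np.random.default_rng(1)
n, p, k, lam = 1200, 300, 4, 0.5
beta_true = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)

beta_one_shot = one_shot_ridge(X, y, k, lam)
mse_one_shot = np.mean((beta_one_shot - beta_true) ** 2)
mse_full = np.mean((ridge(X, y, lam) - beta_true) ** 2)
print(f"MSE, full-data ridge: {mse_full:.5f}")
print(f"MSE, one-shot ridge:  {mse_one_shot:.5f}")
```

Passing a different `weights` vector to `one_shot_ridge` makes it easy to explore how the combination weights affect the mean squared error.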

The fourth chapter addresses another issue with modern data sets: heterogeneity. Dimensionality reduction via PCA and factor analysis is an important tool of data analysis. A critical step is selecting the number of components. However, existing methods (such as the scree plot, likelihood ratio, parallel analysis, etc.) do not have statistical guarantees in the increasingly common setting where the data are heterogeneous, that is, where each noise entry can have a different distribution. To address this problem, we propose the Signflip Parallel Analysis (Signflip PA) method: it compares data singular values to those of “empirical null” data generated by flipping the sign of each entry randomly with probability one-half. We show that Signflip PA consistently selects factors above the noise level in high-dimensional signal-plus-noise models (including spiked models and factor models) under heterogeneous settings, where classical parallel analysis is no longer effective. To do this, we leverage recent breakthroughs in random matrix theory, such as dimension-free operator norm bounds and large deviations for the top eigenvalues of nonhomogeneous matrices. We also illustrate that Signflip PA performs well in numerical simulations and on empirical data examples.
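The sign-flipping idea can be illustrated with a short sketch. The sign-flip null follows the description above, but the selection rule used here (compare the k-th data singular value with a quantile of the k-th null singular value over several trials, and stop at the first failure) is an assumption borrowed from classical parallel analysis, not necessarily the exact procedure of the dissertation. The function name, its parameters, and the simulated data are hypothetical.

```python
import numpy as np

def signflip_pa(X, n_trials=20, quantile=0.95, rng=None):
    """Count components whose singular values exceed the sign-flip null."""
    rng = np.random.default_rng(rng)
    sv = np.linalg.svd(X, compute_uv=False)
    null_sv = np.empty((n_trials, sv.size))
    for t in range(n_trials):
        # Flip the sign of each entry independently with probability 1/2.
        signs = rng.choice([-1.0, 1.0], size=X.shape)
        null_sv[t] = np.linalg.svd(signs * X, compute_uv=False)
    thresholds = np.quantile(null_sv, quantile, axis=0)
    k = 0
    # Assumed rule: select components sequentially until one fails.
    while k < sv.size and sv[k] > thresholds[k]:
        k += 1
    return k

# Rank-2 signal plus heterogeneous noise (each entry has its own variance).
rng = np.random.default_rng(2)
n, p = 200, 100
U, _ = np.linalg.qr(rng.normal(size=(n, 2)))
V, _ = np.linalg.qr(rng.normal(size=(p, 2)))
signal = U @ np.diag([300.0, 200.0]) @ V.T
noise_sd = np.outer(rng.uniform(0.5, 1.5, size=n), rng.uniform(0.5, 1.5, size=p))
X = signal + noise_sd * rng.normal(size=(n, p))

n_selected = signflip_pa(X, rng=3)
print("selected components:", n_selected)
```

Because sign flips preserve the magnitude of every entry, the null data keep the entrywise noise profile of the original matrix while destroying the low-rank signal structure.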

CHAPTER ONE
Introduction
Random Matrix Theory (RMT) traces back to the early days of the statistical sciences in the 1920s (Wishart) and the development of quantum mechanics in the 1950s (Wigner). In quantum mechanics, the energy levels of a quantum system are described by eigenvalues of a Hermitian operator on a Hilbert space. Since the operator is infinite-dimensional, it is common to approximate the system by discretization. Hence, the limiting behavior of large-dimensional random matrices has attracted special interest among physicists working in quantum mechanics. For more work on applications of RMT in physics, one can refer to Mehta (2004).

Statistics has entered a new age in which an increasingly large volume of more complex data is being generated every day. This brings so-called high-dimensional data, which are frequently associated with new phenomena beyond the boundary of classical multivariate statistics. Hence, RMT has emerged as a particularly useful framework and mathematical tool for formulating and answering many theoretical questions associated with the analysis of modern high-dimensional data. We will not spend much time on rigorous definitions and mathematical details of RMT, since there are already many good references, including review papers such as Johnstone (2007) and Paul and Aue (2014), and textbooks such as Bai and Silverstein (2010), Anderson et al. (2010), and Yao et al. (2015)....

===================================================================
Item Type: Project Material  |  Size: 141 pages  |  Chapters: 1-5
===================================================================
