SEM217: Jeongyoun Ahn, University of Georgia

Tuesday, April 19th @ 11:00 AM-12:30 PM

Detecting Outliers in HDLSS Data

Jeongyoun Ahn, University of Georgia

Abstract: High-throughput data are usually the product of long and complex experiments in the laboratory or in the field. Because generating such data involves a multi-step process, concern about possible contamination is naturally more severe for high-dimensional data than for their low-dimensional counterparts. We propose a new two-stage procedure for detecting multiple outliers when the dimension of the data is much larger than the available sample size. In the first stage, the data are split into two disjoint sets, one containing non-outliers and the other containing the remaining observations, which are treated as potential outliers. In the second stage, a series of hypothesis tests is conducted to test the abnormality of each candidate outlier. A nonparametric test based on uniform random rotations is adopted for the hypothesis testing. The power of the proposed test is studied under a high-dimensional asymptotic framework, and its finite-sample exactness is established under mild conditions. Numerical studies based on simulated examples and face recognition data suggest that the proposed approach is superior to existing methods, especially in avoiding false identification of non-outliers.

Recording | Slides (PDF file)