Mia Hubert, Peter J. Rousseeuw and Wannes Van den Bossche
Department of Mathematics, KU Leuven, Belgium
Multivariate data are typically represented by a rectangular matrix (table) in which the rows are the objects (cases) and the columns are the variables (measurements). When there are many variables one often reduces the dimension by principal component analysis (PCA), which in its basic form is not robust to outliers. Much research has focused on handling rowwise outliers, i.e. rows that deviate from the majority of the rows in the data (for instance, they might belong to a different population). In recent years cellwise outliers have also received attention. These are suspicious cells (entries) that can occur anywhere in the table. Even a relatively small proportion of outlying cells can contaminate over half the rows, which causes rowwise robust methods to break down (Alqallaf et al., 2009).
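The arithmetic behind this propagation effect can be sketched in a few lines. The sketch assumes a simplified model that is not spelled out in the text above: each cell is outlying independently with the same probability eps, so a row with d variables contains at least one outlying cell with probability 1 - (1 - eps)^d. The function names are hypothetical and serve only as illustration.

```python
import math


def contaminated_row_fraction(eps: float, d: int) -> float:
    """Expected fraction of rows with at least one outlying cell.

    Assumes (for illustration only) that every cell is outlying
    independently with probability eps, and that each row has d cells.
    """
    return 1.0 - (1.0 - eps) ** d


def min_dimension_for_majority(eps: float) -> int:
    """Smallest d for which more than half of the rows are expected
    to contain at least one outlying cell, under the same assumption."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    d = math.ceil(math.log(0.5) / math.log(1.0 - eps))
    # Guard against the boundary case where the fraction equals 0.5 exactly.
    while contaminated_row_fraction(eps, d) <= 0.5:
        d += 1
    return d


if __name__ == "__main__":
    for d in (5, 10, 14, 50):
        frac = contaminated_row_fraction(0.05, d)
        print(f"eps = 0.05, d = {d:3d}: fraction of affected rows = {frac:.3f}")
    print("smallest d with a majority of affected rows:",
          min_dimension_for_majority(0.05))
```

Under this assumption, 5% outlying cells already affect more than half of the rows once the table has 14 or more variables, which is why methods that tolerate only a minority of bad rows can fail in moderate dimensions.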
In this paper a new PCA method is constructed that combines the strengths of two existing robust methods, DetectDeviatingCells (Rousseeuw and Van den Bossche, 2019) and ROBPCA, in order to be robust against both cellwise and rowwise outliers. The algorithm can also cope with missing values. To date it is the only PCA method that can deal with all three problems simultaneously. Its name, MacroPCA, stands for PCA allowing for Missings And Cellwise & Rowwise Outliers. Several simulations and real data sets illustrate its robustness. New residual maps are introduced, which help to determine which variables are responsible for the outlying behavior. The method is well suited for online process control. The function MacroPCA has been incorporated in the R package cellWise (Raymaekers et al., 2019), which also contains a vignette with real data examples.
Alqallaf, F., Van Aelst, S., Yohai, V., and Zamar, R.H. (2009). Propagation of outliers in multivariate data. The Annals of Statistics, 37, 311–331.
Raymaekers, J., Rousseeuw, P.J., Van den Bossche, W., and Hubert, M. (2019). cellWise: Analyzing Data with Cellwise Outliers. R package, CRAN.
Rousseeuw, P.J., and Van den Bossche, W. (2019). Detecting Deviating Data Cells. Technometrics, 60, 123–145.