Title: | Implementation of the DetMCD Algorithm (Robust and Deterministic Estimation of Location and Scatter) |
---|---|
Description: | Implementation of DetMCD, a new algorithm for robust and deterministic estimation of location and scatter. The benefits of robust and deterministic estimation are explained in Hubert, Rousseeuw and Verdonck (2012) <doi:10.1080/10618600.2012.672100>. |
Authors: | Vakili Kaveh [aut, cre], Mia Hubert [ths] |
Maintainer: | Vakili Kaveh <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.0.5 |
Built: | 2024-11-23 04:01:24 UTC |
Source: | https://github.com/cran/DetMCD |
This packages contains various robust and deterministic algorithms for data analysis.
Package: | DetMCD |
Type: | Package |
Version: | 0.0.1 |
Date: | 2012-09-19 |
Depends: | matrixStats, pcaPP (>= 1.8-1), robustbase, MASS |
License: | GPL (>= 2) |
LazyLoad: | yes |
Index:
DetMCD Robust and deterministic estimation of location and scatter. DetMCD-package Robust and Deterministic Algothms for Data Analysis plot.DetMCD Diagnostic plots for DetMCD DetMCD_CS Internal function for DetMCD DetMCD_RW Internal function for DetMCD DetMCD_SP Internal function for DetMCD xtractR_M Internal function for DetMCD quanff Internal function for DetMCD
Kaveh Vakili [aut, cre],
Maintainer: Kaveh Vakili <[email protected]>
Computes a robust and deterministic multivariate location and scatter
estimate with a high breakdown point, using the DetMCD
(Deterministic
Minimum Covariance Determinant) algorithm.
DetMCD(X,h=NULL,alpha=0.75,scale_est="Auto",tol=1e-07)
DetMCD(X,h=NULL,alpha=0.75,scale_est="Auto",tol=1e-07)
X |
a numeric matrix or data frame. Missing values (NaN's) and infinite values (Inf's) are allowed: observations (rows) with missing or infinite values will automatically be excluded from the computations. |
alpha |
Ignored if |
h |
numeric integer parameter controlling the size of the subsets over
which the determinant is minimized, i.e., |
scale_est |
a character string specifying the
variance functional. Possible values are "qn", "tau" and 'Auto".
Default value |
tol |
a small positive numeric value to be used for determining numerical 0. |
DetMCD computes the MCD estimator of a multivariate data set in a deterministic way. This estimator is given by the subset of h observations with smallest covariance determinant. The MCD location estimate is then the mean of those h points, and the MCD scatter estimate is their covariance matrix. The default value of h is roughly 0.75n (where n is the total number of observations), but the user may choose each value between n/2 and n. Based on the raw estimates, weights are assigned to the observations such that outliers get zero weight. The reweighted MCD estimator is then given by the mean and covariance matrix of the cases with non-zero weight.
To compute the MCD estimator, six initial robust h-subsets are constructed based on robust transformations of variables or robust and fast-to-compute estimators of multivariate location and shape. Then C-steps are applied on these h-subsets until convergence. Note that the resulting algorithm is not fully affine equivariant, but it is often faster than the FAST-MCD algorithm which is affine equivariant. Note that this function can not handle exact fit situations: if the raw covariance matrix is singular, the program is stopped. In that case, it is recommended to apply the FastMCD function.
The MCD method is intended for continuous variables, and assumes that the number of observations n is at least 5 times the number of variables p. If p is too large relative to n, it would be better to first reduce p by variable selection or robust principal components (see the functions PcaHubert).
A list with components:
raw.center |
The raw MCD location of the data. |
raw.cov |
The raw MCD covariance matrix (multiplied by a consistency factor). |
crit |
The determinant of the raw MCD covariance matrix. |
raw.rd |
The robust distance of each observation to the raw MCD center, relative to the raw MCD scatter estimate. |
raw.wt |
Weights based on the estimated raw covariance matrix 'raw.cov' and the estimated raw location 'raw.center' of the data. These weights determine which observations are used to compute the final MCD estimates. |
center |
The robust location of the data, obtained after reweighting. |
cov |
The robust covariance matrix, obtained after reweighting. |
h |
The number of observations that have determined the MCD estimator, i.e. the value of h. |
which.one |
The identifier of the initial shape estimate which led to the optimal result. |
best |
The subset of h points whose covariance matrix has minimal determinant. |
weights |
The finale vector of weights. |
rd |
The robust distance of each observation to the final, reweighted MCD center of the data, relative to the reweighted MCD scatter of the data. These distances allow us to easily identify the outliers. |
rew.md |
The Mahalanobis distance of each observation (distance from the classical center of the data, relative to the classical shape of the data). |
X |
Same as the X in the call to DetMCD, without rows containing missing or infinite values. |
alpha |
The vector of values of alpha used in the algorithm. |
scale_est |
The vector of scale estimators used in the estimates (one of |
Vakili Kaveh (includes section of the help file from the LIBRA implementation).
Hubert, M., Rousseeuw, P.J. and Verdonck, T. (2012), "A deterministic algorithm for robust location and scatter", Journal of Computational and Graphical Statistics, Volume 21, Number 3, Pages 618–637.
Verboven, S., Hubert, M. (2010). Matlab library LIBRA, Wiley Interdisciplinary Reviews: Computational Statistics, 2, 509–515.
## generate data set.seed(1234) # for reproducibility alpha<-0.5 n<-101 p<-5 #generate correlated data D<-diag(rchisq(p,df=1)) W<-matrix(0.9,p,p) diag(W)<-1 W<-D V<-chol(W) x<-matrix(rnorm(n*p),nc=p) x<-scale(x) result<-DetMCD(x,scale_est="tau",alpha=alpha) plot(result, which = "dd") #compare to robustbase: result<-DetMCD(x,scale_est="qn",alpha=alpha) resultsRR<-covMcd(x,nsamp='deterministic',scalefn=qn,alpha=alpha) #should be the same: result$crit resultsRR$crit #Example with several values of alpha: alphas<-seq(0.5,1,l=6) results<-DetMCD(x,scale_est="qn",alpha=alphas) plot(results, h.val = 2, which = "dd")
## generate data set.seed(1234) # for reproducibility alpha<-0.5 n<-101 p<-5 #generate correlated data D<-diag(rchisq(p,df=1)) W<-matrix(0.9,p,p) diag(W)<-1 W<-D V<-chol(W) x<-matrix(rnorm(n*p),nc=p) x<-scale(x) result<-DetMCD(x,scale_est="tau",alpha=alpha) plot(result, which = "dd") #compare to robustbase: result<-DetMCD(x,scale_est="qn",alpha=alpha) resultsRR<-covMcd(x,nsamp='deterministic',scalefn=qn,alpha=alpha) #should be the same: result$crit resultsRR$crit #Example with several values of alpha: alphas<-seq(0.5,1,l=6) results<-DetMCD(x,scale_est="qn",alpha=alphas) plot(results, h.val = 2, which = "dd")
Internal function. Computes the Csteps for the DetMCD algorithm.
DetMCD_CS(Data,scale_est,h,out1)
DetMCD_CS(Data,scale_est,h,out1)
Data |
a numeric matrix or data frame without missing values. |
scale_est |
a character string specifying the
variance functional. Possible values are |
h |
a vector of integers (between n/2 and n). |
out1 |
A list. Typically the result of a call to DetMCD_SP. |
returns a list.
Vakili Kaveh
Internal function. Carries the re-weighting part of the DetMCD algorithm.
DetMCD_RW(ll,hlst,Xw,out2,scale_est,alpha)
DetMCD_RW(ll,hlst,Xw,out2,scale_est,alpha)
ll |
integer in |
hlst |
a vector of integers between in (n/2,n). |
Xw |
a n by p data matrix. |
out2 |
a list. Typically the result of a call to |
scale_est |
a character string specifying the
variance functional. Possible values are |
alpha |
a vector of values in [1/2,1]. |
returns a list.
Vakili Kaveh
Internal function. Computes the starting points for the DetMCD algorithm.
DetMCD_SP(Data,scale_est,tol)
DetMCD_SP(Data,scale_est,tol)
Data |
a numeric matrix or data frame without missing values. |
scale_est |
a character string specifying the
variance functional. Possible values are |
tol |
a small positive numeric value to be used for determining numerical 0. |
returns a list.
Vakili Kaveh
Test function for the qn used in DetR.
inQn(x)
inQn(x)
x |
Vector of 2 or more numbers. Should contain no ties. |
the value of the qn estimator of scale.
Kaveh Vakili
see pcaPP::qn
and citation("pcaPP").
set.seed(123) #for reproductibility x<-rnorm(101) inQn(x) #should be the same: pcaPP::qn(x)
set.seed(123) #for reproductibility x<-rnorm(101) inQn(x) #should be the same: pcaPP::qn(x)
Shows the Mahalanobis distances based on robust and classical estimates of the location and the covariance matrix in different plots. The following plots are available:
index plot of the robust and mahalanobis distances
distance-distance plot
Chisquare QQ-plot of the robust and mahalanobis distances
plot of the tolerance ellipses (robust and classic)
Scree plot - Eigenvalues comparison plot
This function is a minimally modified adaptation of
"robustbase::covPlot"
. See citation("robustbase")
.
## S3 method for class 'DetMCD' plot(x,h.val=1, which = c("all", "dd", "distance", "qqchi2", "tolEllipsePlot", "screeplot"), classic = FALSE, ask = (which == "all" && dev.interactive()), cutoff = NULL, id.n, labels.id = rownames(x), cex.id = 0.75, label.pos = c(4,2), tol = 1e-07, ...)
## S3 method for class 'DetMCD' plot(x,h.val=1, which = c("all", "dd", "distance", "qqchi2", "tolEllipsePlot", "screeplot"), classic = FALSE, ask = (which == "all" && dev.interactive()), cutoff = NULL, id.n, labels.id = rownames(x), cex.id = 0.75, label.pos = c(4,2), tol = 1e-07, ...)
x |
For the |
h.val |
An integer in |
which |
string indicating which plot to show. See the
Details section for a description of the options. Defaults to |
.
classic |
whether to plot the classical distances too.
Defaults to |
.
ask |
logical indicating if the user should be asked
before each plot, see |
cutoff |
the cutoff value for the distances. |
id.n |
number of observations to be identified by a label. If
not supplied, the number of observations with distance larger than
|
labels.id |
vector of labels, from which the labels for extreme
points will be chosen. |
cex.id |
magnification of point labels. |
label.pos |
positioning of labels, for the left half and right
half of the graph respectively (used as |
tol |
tolerance to be used for computing the inverse, see
|
... |
Further arguments passed to the plot function. |
These functions produce several plots based on the robust and classical
location and covariance matrix. Which of them to select is specified
by the attribute which
. The plot
method for
"mcd"
objects is calling covPlot()
directly, whereas
covPlot()
should also be useful for plotting other (robust)
covariance estimates. The possible options are:
distance
index plot of the robust distances
dd
distance-distance plot
qqchi2
a qq-plot of the robust distances versus the quantiles of the chi-squared distribution
tolEllipsePlot
a tolerance ellipse plot, via tolEllipsePlot
screeplot
an eigenvalues comparison plot - screeplot
The Distance-Distance Plot, introduced by Rousseeuw and van Zomeren (1990), displays the robust distances versus the classical Mahalanobis distances. The dashed line is the set of points where the robust distance is equal to the classical distance. The horizontal and vertical lines are drawn at values equal to the cutoff which defaults to square root of the 97.5% quantile of a chi-squared distribution with p degrees of freedom. Points beyond these lines can be considered outliers.
P. J. Rousseeuw and van Zomeren, B. C. (1990). Unmasking Multivariate Outliers and Leverage Points. Journal of the American Statistical Association 85, 633–639.
P. J. Rousseeuw and K. van Driessen (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212–223.
data(Animals, package ="MASS") brain <- Animals[c(1:24, 26:25, 27:28),] detmcd <- DetMCD(log(brain)) plot(detmcd, which = "distance", classic = TRUE)# 2 plots plot(detmcd, which = "dd") plot(detmcd, which = "tolEllipsePlot", classic = TRUE) op <- par(mfrow = c(2,3)) plot(detmcd)## -> which = "all" (5 plots) par(op)
data(Animals, package ="MASS") brain <- Animals[c(1:24, 26:25, 27:28),] detmcd <- DetMCD(log(brain)) plot(detmcd, which = "distance", classic = TRUE)# 2 plots plot(detmcd, which = "dd") plot(detmcd, which = "tolEllipsePlot", classic = TRUE) op <- par(mfrow = c(2,3)) plot(detmcd)## -> which = "all" (5 plots) par(op)
Internal function. Converts alpha values to h values.
quanff(alpha,n,p)
quanff(alpha,n,p)
alpha |
a value in [1/2,1]. |
n , p
|
integers. |
returns an integer.
Vakili Kaveh
Hubert, M., Rousseeuw, P.J. and Verdonck, T. (2012), "A deterministic algorithm for robust location and scatter", Journal of Computational and Graphical Statistics, in press.
quanff(0.75,n=100,p=5);
quanff(0.75,n=100,p=5);
Internal function. Formats the output for the DetMCD algorithm.
xtractR_M(out2,X)
xtractR_M(out2,X)
out2 |
A list. Typically the result of a call to DetMCD_RW. |
X |
a numeric matrix or data frame without missing values. |
returns a list.
Vakili Kaveh