The noise component in model-based clustering.

2008 
Model-based cluster analysis is a statistical tool used to investigate group structures in data. Finite mixtures of Gaussian distributions are a popular device for modelling elliptically shaped clusters. Estimation of mixtures of Gaussians is usually based on the maximum likelihood method. However, for a wide class of finite mixtures, including Gaussians, maximum likelihood estimates are not robust: a small proportion of outliers in the data can lead to poor estimates and poor clustering. One way to deal with this is to add a "noise component", i.e. a mixture component that models the outliers. In this thesis we explore this approach through three contributions. First, Fraley and Raftery (1993) proposed a Gaussian mixture model with the addition of a uniform noise component supported on the data range. We generalize this approach by introducing a model that is a finite mixture of location-scale distributions mixed with a finite number of uniforms supported on disjoint subsets of the data range. We study identifiability and maximum likelihood estimation, and provide a computational procedure based on the EM algorithm. Second, Hennig (2004) proposed a model in which the noise component is represented by a fixed improper density that is constant on the real line, and showed that the resulting estimates are robust to extreme outliers. We define a maximum likelihood type estimator for such a model and study its asymptotic behaviour. We also provide a method for choosing the improper constant density, and a computational procedure based on the EM algorithm. The third contribution is an extensive simulation study in which we measure the performance of the previous two methods and of certain other robust methodologies proposed in the literature.
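To make the noise-component idea concrete, the following is a minimal sketch of EM for a one-dimensional Gaussian mixture with a single uniform noise component supported on the data range. It illustrates only the simplest case mentioned above; it is not the thesis's generalized model with several uniforms on disjoint subsets, nor its estimator for the improper constant density. The function names, the starting values, and the `min_sigma` floor on the standard deviations are assumptions made for this illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_gaussians_plus_uniform_noise(data, mus, sigmas, noise_weight=0.05,
                                    n_iter=100, min_sigma=1e-3):
    """EM for k Gaussians plus one uniform noise component on [min, max] of data.

    Hypothetical helper for illustration. Returns the Gaussian weights, means,
    standard deviations, the noise weight, and the responsibilities
    (last column = noise component).
    """
    lo, hi = min(data), max(data)
    if hi <= lo:
        raise ValueError("data must contain at least two distinct values")
    noise_density = 1.0 / (hi - lo)  # uniform density, held fixed throughout
    k, n = len(mus), len(data)
    mus, sigmas = list(mus), list(sigmas)
    weights = [(1.0 - noise_weight) / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            terms = [weights[j] * normal_pdf(x, mus[j], sigmas[j])
                     for j in range(k)]
            terms.append(noise_weight * noise_density)
            total = sum(terms)
            resp.append([t / total for t in terms])
        # M-step: weighted updates; the uniform has no parameters to update
        # because its support is fixed to the data range.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            if nj < 1e-12:  # component is empty; keep its old parameters
                continue
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), min_sigma)  # assumed variance floor
        noise_weight = sum(r[k] for r in resp) / n
    return weights, mus, sigmas, noise_weight, resp
```

A point is then labelled as noise when its largest responsibility is the one in the last column. The uniform density stays constant across iterations, so far-away points are absorbed by the noise component rather than pulling the Gaussian means and variances towards them.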