Mixture Model of Gaussian copulas to cluster mixed data
Description:
MixCluster clusters mixed data with a mixture model of Gaussian copulas. Copulas, and Gaussian copulas in particular, are powerful tools for modelling the distribution of multivariate data. For a mix of continuous, integer and ordinal variables (all of which have a cumulative distribution function), this mixture model captures intra-component dependencies as a Gaussian mixture does, so the correlations keep their standard meaning. At the same time, it preserves the standard margins associated with continuous, integer and ordinal features, namely the Gaussian, Poisson and ordered multinomial distributions. As a by-product, the model generalizes many well-known mixture models and provides visualization tools based on the model parameters. In practice, inference is Bayesian and is carried out with a Metropolis-within-Gibbs sampler.
Utility functions (summary and MixClusVisu) help interpret the results.
This section reproduces Example 2.2 of the preprint.
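To make the copula construction concrete, here is a minimal sketch in base R of how one mixture component could be simulated: latent Gaussian vectors are drawn with a given correlation matrix and then pushed through the quantile functions of the margins. It does not use the package. The correlation matrix, the margin parameters and all variable names are hypothetical values chosen for illustration.

```r
# Illustrative sketch (base R only); all parameter values are hypothetical.
set.seed(1)
n <- 500

# Latent correlation matrix of the Gaussian copula (hypothetical values)
R <- matrix(c(1.0, 0.8, 0.1,
              0.8, 1.0, 0.0,
              0.1, 0.0, 1.0), nrow = 3, byrow = TRUE)

# Draw latent Gaussian vectors whose correlation matrix is R
z <- matrix(rnorm(n * 3), nrow = n) %*% chol(R)

# Map to uniform scores, then through each marginal quantile function
u <- pnorm(z)
x_cont <- qnorm(u[, 1], mean = 2, sd = 1)       # Gaussian margin
x_int  <- qpois(u[, 2], lambda = 16)            # Poisson margin
x_bin  <- qbinom(u[, 3], size = 1, prob = 0.5)  # binary margin (two ordered levels)

sim <- data.frame(x_cont, x_int, x_bin)
round(cor(sim), 2)  # observed correlations reflect the latent dependencies
```

Each margin keeps its usual distribution, while the dependence between variables is governed entirely by the latent correlation matrix.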
First, the cluster analysis is performed with two components, then a model summary is given (partition and parameters). Finally, the cluster interpretation can be done based on the graphical and numerical presentations of the parameters.
Loadings
rm(list=ls())
set.seed(135)
library(MixCluster)
Loading required package: MCMCpack
Loading required package: coda
Loading required package: lattice
Loading required package: MASS
##
## Markov Chain Monte Carlo Package (MCMCpack)
## Copyright (C) 2003-2015 Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park
##
## Support provided by the U.S. National Science Foundation
## (Grants SES-0350646 and SES-0350613)
##
Loading required package: tmvtnorm
Loading required package: mvtnorm
Loading required package: stats4
Loading required package: gmm
Loading required package: sandwich
Loading required package: msm
Loading required package: plotrix
# Load a dataset simulated from a bi-component mixture model of Gaussian copulas
# (see Example 2.2 page 6)
# The first column indicates the class membership
# The last three columns are used for the clustering
data(simu)
Clustering of mixed variables
# Cluster analysis with the bi-component mixture model of Gaussian copulas,
# without constraints between the correlation matrices
res.mixclus <- MixClusClustering(simu[,-1], 2)
Confusion matrix
# Confusion matrix between the estimated (row) and the true (column) partition
table(res.mixclus@data@partition, simu[,1])
1 2
1 0 101
2 97 2
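Because cluster labels are arbitrary, the confusion matrix has to be read up to a permutation of the labels. The following sketch re-enters the counts printed above by hand (the object names are illustrative) and computes the misclassification rate under the better of the two possible label matchings.

```r
# Counts copied from the confusion matrix above (rows: estimated, columns: true)
tab <- matrix(c(0, 101,
                97, 2), nrow = 2, byrow = TRUE,
              dimnames = list(estimated = c("1", "2"), true = c("1", "2")))

# With two classes there are only two possible label matchings
agree_identity <- sum(diag(tab))               # estimated 1 = true 1, estimated 2 = true 2
agree_swapped  <- sum(tab[cbind(1:2, 2:1)])    # estimated 1 = true 2, estimated 2 = true 1

error_rate <- 1 - max(agree_identity, agree_swapped) / sum(tab)
error_rate
```

Under the swapped matching, 198 of the 200 individuals are correctly classified.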
Model Summary
# Summary of the model
summary(res.mixclus)
**************************************************************
DATA SET description:
* Number of individuals: 200
* Number of variables (continuous, integer, ordinal): 1, 1, 1
**************************************************************
MODEL description:
* Number of classes: 2
* Model: MixClusModel_hetero
* Log-likelihood: -975.908
* BIC: -1015.645
* ICL: -1016.189
**************************************************************
PARAMETERS description:
* Class 1
* Proportion 0.5049973
* Margins parameters of the continuous variables (Gaussian distribution):
Mean Sd
continuous 2.2 1
* Margins parameters of the integer variables (Poisson distribution):
Mean
integer 16
* Margins parameters of the ordinal variables (multinomial distribution):
Prob.0 Prob.1
binary 0.52 0.48
* Correlation matrix:
[,1] [,2] [,3]
[1,] 1.00 0.788 0.080
[2,] 0.79 1.000 -0.015
[3,] 0.08 -0.015 1.000
*****************************
* Class 2
* Proportion 0.4950027
* Margins parameters of the continuous variables (Gaussian distribution):
Mean Sd
continuous -1.9 1.1
* Margins parameters of the integer variables (Poisson distribution):
Mean
integer 5.1
* Margins parameters of the ordinal variables (multinomial distribution):
Prob.0 Prob.1
binary 0.49 0.51
* Correlation matrix:
[,1] [,2] [,3]
[1,] 1.00 -0.47 0.39
[2,] -0.47 1.00 0.24
[3,] 0.39 0.24 1.00
*****************************
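The reported BIC can be checked by hand from the log-likelihood. The sketch below assumes the convention BIC = log-likelihood - (number of parameters / 2) * log(n). The parameter count is our own tally from the summary, not a value printed by the package: per class, two Gaussian parameters, one Poisson mean, one free binary probability and three correlations, plus one free mixing proportion.

```r
loglik <- -975.908   # log-likelihood reported in the summary
n      <- 200        # number of individuals

# Parameter count (our own tally, assumed rather than reported by the package)
n_par_class <- 2 + 1 + 1 + 3        # Gaussian (mean, sd), Poisson, binary, correlations
n_par       <- 2 * n_par_class + 1  # two classes plus one free proportion

bic <- loglik - n_par / 2 * log(n)
round(bic, 3)
```

With 15 parameters this gives the same value as the BIC line of the summary, to three decimals. This suggests that the package reports criteria on the log-likelihood scale, where larger is better.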
Visualisation
We reproduce the elements of Figure 1.
# Visualisation
# Update of the results (computing the conditional expectations of the latent vectors
# related to the Gaussian copulas)
res.mixclus <- MixClusUpdateForVisu(res.mixclus)
# Scatterplot of the individuals (Figure 1.(a)) described by three variables:
# one continuous (abscissa), one integer (ordinate) and one binary (symbol).
# Colors indicate the component memberships
plot(simu[,2:3], col=simu[,1], pch=16+simu[,4], xlab=expression(x^1), ylab=expression(x^2))
# Scatterplot of the individuals in the first PCA-map of the first-component of the model
MixClusVisu(res.mixclus, class = 1, figure = "scatter", xlim=c(-10,4), ylim=c(-4,4))
scatterplot
# Correlation circle of the first PCA-map of the first-component of the model
MixClusVisu(res.mixclus, class = 1, figure = "circle")
correlation circle
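To give an idea of what a correlation circle represents, the sketch below runs a PCA directly on the rounded correlation matrix of Class 1 from the summary. It is not the code used by MixClusVisu, whose internals are not shown here. The variable order (continuous, integer, binary) is an assumption, and the object names are illustrative.

```r
# Correlation matrix of Class 1, rounded values copied from the summary
# (variable order is assumed, not confirmed by the output)
vars <- c("continuous", "integer", "binary")
R1 <- matrix(c(1.00,  0.79,   0.08,
               0.79,  1.00,  -0.015,
               0.08, -0.015,  1.00), nrow = 3, byrow = TRUE,
             dimnames = list(vars, vars))

# PCA of a correlation matrix is its eigen decomposition
eig <- eigen(R1, symmetric = TRUE)
explained <- eig$values / sum(eig$values)  # share of variance per axis

# Coordinates of the variables on the first two axes (correlation circle)
coords <- eig$vectors[, 1:2] %*% diag(sqrt(eig$values[1:2]))
dimnames(coords) <- list(vars, c("axis1", "axis2"))

round(explained, 2)
round(coords, 2)
```

Each variable lies inside the unit circle. Variables close to the circle are well represented on the first PCA map, and variables pointing in the same direction are positively correlated within the component.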