14  Cluster Validation

To close this Companion Book, we consider cluster validation measures. These can be classified into internal validation measures, such as the measures of fit we already considered, and external validation measures, which provide a way to compare cluster solutions to a given (known, or assumed known) reference.

In addition to measures of fit, we also consider indicators of the balance of cluster solutions, i.e., the evenness of the number of observations in each cluster. Such measures include entropy and Simpson’s index. Other measures, introduced in the GeoDa Cluster Book, are based on the spatial properties of the clusters. These include the join count ratio, an indicator of how many neighbors of each observation in a cluster are also members of the cluster. For a spatially compact cluster solution, this measure should equal one (except for boundary effects). For non-spatially constrained clusters, it indicates how closely they approximate a spatial solution.

For spatially constrained cluster solutions, compactness is a key characteristic. This can be quantified by means of the isoperimeter quotient (IPQ), the ratio of the area of a cluster shape to that of a circle with equal perimeter. A final measure of compactness introduced in the GeoDa Cluster Book is the diameter of the unweighted graph representation of the spatial weights matrix. To obtain a relative measure, the diameter is rescaled by the number of observations in the cluster. The latter measures are only applicable to spatially constrained clusters.

In addition, we also consider two classic indicators of external validity, i.e., the Adjusted Rand Index (ARI) of Hubert and Arabie (1985), based on counting pairs, and the Normalized Information Distance (NID), e.g., Vinh, Epps, and Bailey (2010), derived from measures of entropy.

A detailed coverage of these methods is contained in Chapter 12 of the GeoDa Cluster Book.

In addition to the usual numpy, pandas and geopandas, we need several specialized packages from scikit-learn and pygeoda to carry out the cluster analysis and implement the validation measures. As before, to carry out variable standardization we import StandardScaler from sklearn.preprocessing. The specific clustering methods are AgglomerativeClustering and KMeans from sklearn.cluster. The other clustering solutions are obtained with pygeoda. The external validation measures are contained in sklearn.metrics.adjusted_rand_score and sklearn.metrics.adjusted_mutual_info_score.

The new internal validation measures are based on pygeoda.spatial_validation. Several helper functions contained in the spatial-cluster-helper module extract the relevant information and present it as a pandas data frame: cluster_fragmentation, cluster_joincount, cluster_compactness and cluster_diameter. As before, we also use ensure_datasets, cluster_stats and cluster_fit.

We continue the empirical illustration with the Ceará example.

Required Packages

numpy, pandas, geopandas, sklearn.cluster, sklearn.preprocessing, sklearn.metrics, spatial_cluster_helper, pygeoda

Required Data Sets

ceara

14.1 Preliminaries

14.1.1 Import Required Modules

import geopandas as gpd
import pandas as pd
import numpy as np

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

from spatial_cluster_helper import ensure_datasets, cluster_stats, \
          cluster_fit, cluster_fragmentation, cluster_joincount, \
          cluster_compactness, cluster_diameter

import pygeoda

14.1.2 Load Data

We again read the ceara.shp file and carry out a quick check of the contents. For this sample data set, we use the argument encoding = 'utf-8' in the read_file function to account for the special characters in Brazilian Portuguese.

# Setting working folder:
#path = "/your/path/to/data/"
path = "./datasets/"
# Select the Ceará data:
shpfile = "ceara/ceara.shp"
# Load the data:
ensure_datasets(shpfile, folder_path = path)
dfs = gpd.read_file(path + shpfile, encoding = 'utf-8')
print(dfs.shape)
dfs.head(3)
(184, 36)
code7 mun_name state_init area_km2 state_code micro_code micro_name inc_mic_4q inc_zik_3q inc_zik_2q ... gdp pop gdpcap popdens zik_1q ziq_2q ziq_3q zika_d mic_d geometry
0 2300101.0 Abaiara CE 180.833 23 23019 19ª Região Brejo Santo 0.000000 0.0 0.00 ... 35974.0 10496.0 3.427 58.043 0.0 0.0 0.0 0.0 0.0 POLYGON ((5433729.65 9186242.97, 5433688.546 9...
1 2300150.0 Acarape CE 130.002 23 23003 3ª Região Maracanaú 6.380399 0.0 0.00 ... 68314.0 15338.0 4.454 117.983 0.0 0.0 0.0 0.0 1.0 POLYGON ((5476916.288 9533405.667, 5476798.561...
2 2300200.0 Acaraú CE 842.471 23 23012 12ª Região Acaraú 0.000000 0.0 1.63 ... 309490.0 57551.0 5.378 68.312 0.0 1.0 0.0 1.0 0.0 POLYGON ((5294389.783 9689469.144, 5294494.499...

3 rows × 36 columns

14.1.3 Variables

We continue with the same set of variables as in the previous Chapters, listed in Table 12.1, and combine them in the list varlist.

varlist = ['mobility', 'environ', 'housing', 'sanitation', 'infra', 'gdpcap']

14.1.4 Pygeoda Data Preparation

We follow the same steps as in Chapter 12 to set up a data set in the pygeoda internal format as ceara_g and to create queen contiguity spatial weights (queen_w). As in the previous Chapters, we create subsets with the relevant variables in both the pygeoda format (data_g) and as a GeoDataFrame (data).

ceara_g = pygeoda.open(dfs)
queen_w = pygeoda.queen_weights(ceara_g)
print(queen_w)
data = dfs[varlist]
data_g = ceara_g[varlist]
Weights Meta-data:
 number of observations:                  184
           is symmetric:                 True
               sparsity:  0.02953686200378072
        # min neighbors:                    1
        # max neighbors:                   13
       # mean neighbors:    5.434782608695652
     # median neighbors:                  5.0
           has isolates:                False

Finally, we set the number of clusters to 13. This differs from the empirical examples in Chapter 12 of the GeoDa Cluster Book due to the result for max-p obtained with pygeoda in Chapter 13. For consistency, we set the number of clusters for the other methods to that value.

n_clusters = 13

14.2 Cluster Solutions

Before considering the validation measures, we compute the cluster solutions for hierarchical clustering (using sklearn.AgglomerativeClustering), K-Means (using sklearn.KMeans), and the spatially constrained clustering methods using pygeoda. For details on the arguments and helper functions, see the relevant Chapters.

For each cluster, we extract the labels and the fit using the respective helper functions.

To illustrate the internal validity measures, we will focus on Ward’s agglomerative clustering as an example of a standard method and on AZP with an SCHC initial solution as an example of a spatially constrained cluster solution. For this, we also generate the cluster cardinalities. This is not illustrated for the other cluster solutions, but can be readily implemented.

14.2.1 Hierarchical Clustering

method = 'ward'
X = StandardScaler().fit_transform(data)

agg_clusters = AgglomerativeClustering(n_clusters = n_clusters, 
                    linkage = method, compute_distances = True)
agg_clusters.fit(X)
agg_labels = tuple(int(label) for label in agg_clusters.labels_)

agg_clusters_fit = cluster_fit(data = data, clustlabels = agg_clusters.labels_,
                 n_clusters = n_clusters, printopt = False)
agg_stats = cluster_stats(agg_labels)
 Labels  Cardinality
      0           16
      1           26
      2           29
      3            9
      4           14
      5            3
      6           26
      7           11
      8           20
      9            4
     10           15
     11           10
     12            1

14.2.2 K-Means Clustering

kmeans_clusters = KMeans(n_clusters = n_clusters, n_init = 150, 
                         random_state=123456789).fit(X) 
kmeans_labels = tuple(int(label) for label in kmeans_clusters.labels_)
kmeans_clusters_fit = cluster_fit(data = data, 
                 clustlabels = kmeans_clusters.labels_,
                 n_clusters = n_clusters, printopt = False)

14.2.3 SCHC with Ward’s Linkage

schc_clusters = pygeoda.schc(n_clusters, queen_w, data_g, "ward")
schc_labels = schc_clusters['Clusters']

14.2.4 SKATER

skater_clusters = pygeoda.skater(n_clusters, queen_w, data_g)
skater_labels = skater_clusters['Clusters']

14.2.5 REDCAP

redcap_clusters = pygeoda.redcap(n_clusters, queen_w, data_g, 
                        method = 'fullorder-wardlinkage')
redcap_labels = redcap_clusters['Clusters']

14.2.6 AZP with Simulated Annealing

azp_sa_clusters = pygeoda.azp_sa(n_clusters, queen_w, data_g, 
                        cooling_rate = 0.8, sa_maxit = 5)
azp_sa_labels = azp_sa_clusters['Clusters']

14.2.7 AZP with SCHC as Initial Solution

azp_schc_clusters = pygeoda.azp_sa(n_clusters, queen_w, data_g, 
                            cooling_rate = 0.8, sa_maxit = 5,
                            init_regions = schc_labels)
azp_schc_labels = azp_schc_clusters['Clusters']
azp_schc_stats = cluster_stats(azp_schc_labels)
 Labels  Cardinality
      1           89
      2           43
      3           15
      4           14
      5            6
      6            4
      7            4
      8            3
      9            2
     10            1
     11            1
     12            1
     13            1

14.2.8 Max-p Regions

maxp_sa_clusters = pygeoda.maxp_sa(queen_w, data_g, 
                            bound_variable = dfs['pop'], 
                            min_bound = dfs['pop'].sum()*0.05,
                            iterations = 9999,
                            cooling_rate = 0.9,
                            sa_maxit = 5)
maxp_sa_labels = maxp_sa_clusters['Clusters']

14.3 Internal Validation Measures

As mentioned, in addition to the classic measures of fit, we also consider fragmentation, the join count ratio, and, for spatially constrained cluster solutions, the compactness and diameter.

These measures are provided as attributes in the solution object created by pygeoda.spatial_validation. This requires the pygeoda data set, the cluster labels and the spatial weights as arguments.

We illustrate this for Ward’s agglomerative clustering, with agg_labels as the cluster labels, and for AZP-SCHC, with azp_schc_labels as the cluster labels. For both, the data set is ceara_g and the spatial weights are contained in queen_w.

We store the results in, respectively, agg_validation and azp_schc_validation. These objects will then be used as arguments to the helper functions.

agg_validation = pygeoda.spatial_validation(ceara_g, agg_labels, queen_w)
azp_schc_validation = pygeoda.spatial_validation(ceara_g, azp_schc_labels, 
                                                 queen_w)

14.3.1 Fragmentation

The fragmentation measures are computed from the makeup of the cluster components. A cluster solution is ideally balanced when each component has the same number of observations. This is quantified by means of entropy and its standardized counterpart, as well as by Simpson’s index and its standardized counterpart. For entropy, larger values suggest greater balance, whereas for Simpson’s index, it is the other way around.
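The overall measures can also be reproduced directly from the cluster cardinalities. The sketch below is a minimal pure-Python illustration, not the pygeoda implementation, and the function name fragmentation_measures is hypothetical: entropy is computed as -Σ pᵢ ln pᵢ and standardized by ln k (its maximum), while Simpson’s index is Σ pᵢ² and is standardized by multiplying by k (since its minimum is 1/k).

```python
import numpy as np

def fragmentation_measures(cardinalities):
    """Entropy and Simpson's index for a vector of cluster sizes.

    Hypothetical helper, for illustration only (not pygeoda)."""
    n = np.asarray(cardinalities, dtype=float)
    k = len(n)
    p = n / n.sum()                      # cluster shares
    entropy = -np.sum(p * np.log(p))     # larger = more balanced
    simpson = np.sum(p ** 2)             # smaller = more balanced
    return {
        "entropy": entropy,
        "std_entropy": entropy / np.log(k),   # 1 = perfectly balanced
        "simpson": simpson,
        "std_simpson": simpson * k,           # 1 = perfectly balanced
    }

# cardinalities of the Ward solution listed earlier
ward_sizes = [16, 26, 29, 9, 14, 3, 26, 11, 20, 4, 15, 10, 1]
frag = fragmentation_measures(ward_sizes)
```

Applied to the Ward cardinalities, this reproduces the overall (All) row reported by pygeoda.spatial_validation: entropy 2.3512 (standardized 0.9166) and Simpson 0.1063 (standardized 1.3816).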

The pygeoda.spatial_validation return object includes the fragmentation information in two attributes: fragmentation and cluster_fragmentation. The first contains the overall measures, as well as the number of clusters, in fragmentation.n, fragmentation.entropy, fragmentation.std_entropy, fragmentation.simpson and fragmentation.std_simpson. The second has the same information, organized as a list by cluster. It shows the within-cluster fragmentation for clusters that are not spatially constrained.

The cluster_fragmentation helper function takes a data frame with the labels and cardinalities (created by cluster_stats), and the cluster_fragmentation, fragmentation and spatially_constrained attributes from the validation object. The spatially_constrained flag is used to limit the fragmentation output to the totals only for spatially constrained clusters.

This is illustrated for Ward’s agglomerative clustering and AZP-SCHC. Note that for the latter, only the totals are given, since it is a spatially constrained solution.

The other cluster solutions can be analyzed in the same way.

14.3.1.1 Agglomerative clustering

agg_frag = cluster_fragmentation(agg_stats, 
                   agg_validation.cluster_fragmentation,
                   agg_validation.fragmentation, 
                   agg_validation.spatially_constrained)
Fragmentation
Label   N Sub  Entropy  Entropy*  Simpson  Simpson*
    0  16   9 1.751176  0.796994 0.250892  2.258026
    1  26  14 2.397937  0.908634 0.115385  1.615385
    2  29  14 2.425806  0.919194 0.109467  1.532544
    3   9  11 2.250260  0.938431 0.120000  1.320000
    4  14   9 1.923066  0.875225 0.187500  1.687500
    5   3   8 1.933810  0.929966 0.164444  1.315556
    6  26  10 2.168223  0.941647 0.132653  1.326531
    7  11   9 2.145842  0.976615 0.123967  1.115702
    8  20   6 1.609438  0.898244 0.240000  1.440000
    9   4   7 1.831020  0.940958 0.185185  1.296296
   10  15   0 0.000000  0.000000 0.000000  0.000000
   11  10   3 1.098612  1.000000 0.333333  1.000000
   12   1   0 0.000000  0.000000 0.000000  0.000000
  All 184     2.351159  0.916649 0.106274  1.381557

14.3.1.2 AZP-SCHC

azp_schc_frag = cluster_fragmentation(azp_schc_stats, 
                   azp_schc_validation.cluster_fragmentation,
                   azp_schc_validation.fragmentation, 
                   azp_schc_validation.spatially_constrained)
Fragmentation
Label   N Sub  Entropy  Entropy*  Simpson  Simpson*
  All 184     1.599116  0.623449 0.303521   3.94577

14.3.2 Join Count Ratio

The join count ratio is a spatial measure of the degree of internal connectedness in a cluster solution. It is computed for each cluster separately as well as for the cluster solution as a whole. It is the share of the neighbors of the observations in a cluster that are themselves members of that cluster.
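The underlying logic can be sketched on a toy neighbor structure (a hypothetical four-node path graph, not the Ceará weights; the function joincount_ratio is illustrative only): for each cluster, count the neighbor links of its members that stay inside the cluster and divide by the total number of neighbor links of those members.

```python
def joincount_ratio(neighbors, labels):
    """Per-cluster and overall join count ratio (illustrative sketch).

    neighbors: dict mapping observation id -> list of neighbor ids
    labels: dict mapping observation id -> cluster label
    """
    totals = {}   # cluster -> [neighbor links, same-cluster links]
    for i, nbrs in neighbors.items():
        c = labels[i]
        t = totals.setdefault(c, [0, 0])
        t[0] += len(nbrs)                              # all neighbor links
        t[1] += sum(labels[j] == c for j in nbrs)      # links inside cluster
    ratios = {c: joins / links for c, (links, joins) in totals.items()}
    all_links = sum(links for links, _ in totals.values())
    all_joins = sum(joins for _, joins in totals.values())
    return ratios, all_joins / all_links

# toy path graph 0 - 1 - 2 - 3, split into two two-node clusters
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labs = {0: "a", 1: "a", 2: "b", 3: "b"}
ratios, overall = joincount_ratio(nbrs, labs)
```

In this toy example each cluster keeps two of its three neighbor links internal, giving per-cluster and overall ratios of 2/3.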

The relevant measures are included in the joincount_ratio and all_joincount_ratio attributes of the pygeoda.spatial_validation solution object. The former is a list with k entries, which are themselves objects, containing attributes n, neighbors, join_count and ratio. The latter is the same for the overall cluster solution.

The result is provided by the cluster_joincount helper function. It takes the data frame with cluster cardinalities and the joincount_ratio and all_joincount_ratio attributes from the validation solution.

We use the same examples in the illustration below.

14.3.2.1 Agglomerative clustering

agg_jc = cluster_joincount(agg_stats, agg_validation.joincount_ratio,
                   agg_validation.all_joincount_ratio)
Join Count Ratio
Label   N  Neighbors  Join Count  Ratio
    0  16         96          20  0.208
    1  26        134          26  0.194
    2  29        158          64  0.405
    3   9         47           6  0.128
    4  14         80           8  0.100
    5   3         15           0  0.000
    6  26        158          30  0.190
    7  11         43           4  0.093
    8  20        116          18  0.155
    9   4         21          10  0.476
   10  15         83          18  0.217
   11  10         46           8  0.174
   12   1          3           0  0.000
  All 184       1000         212  0.212

14.3.2.2 AZP-SCHC

azp_schc_jc = cluster_joincount(azp_schc_stats, 
                   azp_schc_validation.joincount_ratio,
                   azp_schc_validation.all_joincount_ratio)
Join Count Ratio
Label   N  Neighbors  Join Count  Ratio
    0  89        474         342  0.722
    1  43        225         128  0.569
    2  15         96          42  0.438
    3  14         78          44  0.564
    4   6         37          10  0.270
    5   4         17           6  0.353
    6   4         21          10  0.476
    7   3         22           4  0.182
    8   2         12           2  0.167
    9   1          3           0  0.000
   10   1          3           0  0.000
   11   1          7           0  0.000
   12   1          5           0  0.000
  All 184       1000         588  0.588

14.3.3 Compactness

Compactness is a criterion that is only applicable to spatially constrained cluster solutions. It measures the ratio of the area of the cluster to that of a circle with the same perimeter. The compactness attribute of the spatial_validation object is a list with k items, each an object with attributes area, perimeter and isoperimeter_quotient. The closer the IPQ is to one, the more compact is the cluster shape.
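The IPQ itself follows directly from a cluster’s area A and perimeter P: a circle with perimeter P has area P²/4π, so IPQ = 4πA/P². A minimal sketch, not the pygeoda implementation (the function name isoperimeter_quotient mirrors the attribute but is hypothetical):

```python
import math

def isoperimeter_quotient(area, perimeter):
    """Ratio of a shape's area to that of a circle with the same perimeter."""
    return 4.0 * math.pi * area / perimeter ** 2

# a circle is maximally compact (IPQ = 1); a square comes in at pi/4
r, s = 1.0, 1.0
ipq_circle = isoperimeter_quotient(math.pi * r**2, 2 * math.pi * r)
ipq_square = isoperimeter_quotient(s**2, 4 * s)
```

Applying this formula to the area and perimeter columns reported below reproduces the IPQ column (e.g., roughly 0.539 for the cluster with area 6.17e+08 and perimeter 1.20e+05).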

The helper function cluster_compactness extracts this information. Its arguments are the data frame with cluster cardinalities (from cluster_stats), the compactness attribute, and the spatially_constrained attribute. Compactness is not relevant for clusters that are not spatially constrained, so the helper function yields an error message in that case.

We continue with the same two examples. Note that for Ward’s agglomerative clustering, an error message is generated.

14.3.3.1 Agglomerative clustering

agg_compactness = cluster_compactness(agg_stats, 
                                agg_validation.compactness,
                                agg_validation.spatially_constrained)
Error: Compactness is only applicable to spatially constrained clusters

14.3.3.2 AZP-SCHC

azp_schc_compactness = cluster_compactness(azp_schc_stats, 
                                azp_schc_validation.compactness,
                                azp_schc_validation.spatially_constrained)
Compactness
 Label  N         Area    Perimeter      IPQ
     0 89 8.672968e+10 1.536222e+07 0.004618
     1 43 2.797625e+10 6.198588e+06 0.009150
     2 15 1.053291e+10 2.220946e+06 0.026834
     3 14 1.138704e+10 2.122643e+06 0.031759
     4  6 4.038275e+09 9.257796e+05 0.059209
     5  4 3.085985e+09 5.499120e+05 0.128238
     6  4 1.779537e+09 4.240068e+05 0.124386
     7  3 9.969357e+08 3.168568e+05 0.124782
     8  2 1.474872e+09 3.229358e+05 0.177718
     9  1 6.171934e+08 1.199243e+05 0.539283
    10  1 7.908164e+07 5.056877e+04 0.388616
    11  1 8.449201e+08 1.739887e+05 0.350738
    12  1 4.124364e+08 1.255397e+05 0.328855

14.3.4 Diameter

The diameter of a spatially constrained cluster is an alternative measure of compactness, based on the network structure reflected in the spatial weights. The diameter of a cluster is the number of steps in the spatial weights graph that corresponds to the longest shortest path between any pair of observations (Newman 2018). Since this number will increase with cluster size, it is also standardized by dividing by the number of cluster members. Note that when a cluster is a singleton, the diameter is zero.
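The diameter of an unweighted graph can be obtained with a breadth-first search from each node, keeping the largest shortest-path length found. The sketch below is a minimal pure-Python illustration on a hypothetical five-node cluster subgraph, not the pygeoda implementation:

```python
from collections import deque

def graph_diameter(neighbors):
    """Longest shortest path (in steps) in an unweighted graph.

    neighbors: dict mapping node -> list of adjacent nodes.
    Assumes a connected graph, as holds for the members of a
    spatially constrained cluster."""
    def eccentricity(source):
        # BFS from source; returns the largest distance reached
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in neighbors[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        return max(dist.values())
    return max(eccentricity(node) for node in neighbors)

# toy cluster: a path 0 - 1 - 2 - 3 with a spur 4 attached to node 1
adjacency = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
diam = graph_diameter(adjacency)
ratio = diam / len(adjacency)   # rescaled by the number of cluster members
```

Here the longest shortest path runs from node 3 to node 0 (or 4) in three steps, so the diameter is 3 and the standardized ratio 3/5.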

The diameter attribute of the pygeoda.spatial_validation object is a list with k items, one for each cluster, as an object with attributes steps and ratio. The helper function cluster_diameter extracts this information as a data frame. It takes as arguments the cluster cardinalities (from cluster_stats), and the diameter and spatially_constrained attributes from the pygeoda.spatial_validation object.

As in the case of compactness, an error message is generated for clusters that are not spatially constrained.

Again, we use the same two examples.

14.3.4.1 Agglomerative clustering

agg_diam = cluster_diameter(agg_stats, agg_validation.diameter,
                            agg_validation.spatially_constrained)
Error: Diameter is only applicable to spatially constrained clusters

14.3.4.2 AZP-SCHC

azp_schc_diam = cluster_diameter(azp_schc_stats, azp_schc_validation.diameter,
                            azp_schc_validation.spatially_constrained)
Diameter
 Label  N  Steps    Ratio
     0 89     22 0.247191
     1 43     17 0.395349
     2 15      6 0.400000
     3 14      7 0.500000
     4  6      3 0.500000
     5  4      3 0.750000
     6  4      2 0.500000
     7  3      2 0.666667
     8  2      1 0.500000
     9  1      0 0.000000
    10  1      0 0.000000
    11  1      0 0.000000
    12  1      0 0.000000

14.3.5 Overall Comparison of Internal Validation Measures

We conclude our discussion of internal validation measures with an overview of the main non-spatial measures for all the cluster solutions considered above.

clusters = [
    agg_clusters_fit, kmeans_clusters_fit, schc_clusters, skater_clusters,
    redcap_clusters, azp_sa_clusters, azp_schc_clusters, maxp_sa_clusters
]
labels = [
    agg_labels, kmeans_labels, schc_labels, skater_labels,
    redcap_labels, azp_sa_labels, azp_schc_labels, maxp_sa_labels
]
label_names = [
    'Hierarchical', 'K-Means', 'SCHC', 'SKATER',
    'REDCAP', 'AZP', 'AZP_Initial', 'Max-p'
]

results = []

# Run pygeoda.spatial_validation for each label set
for cluster, label, name in zip(clusters, labels, label_names):
    result = pygeoda.spatial_validation(ceara_g, label, queen_w)
    
    try:
        wss = np.round(cluster['Total within-cluster sum of squares'], 4)
        bss_tss = np.round(cluster['The ratio of between to total sum of squares'], 4)
    except KeyError:
        try:
            wss = np.round(cluster["WSS"], 2)
            bss_tss = np.round(cluster["Ratio"], 2)
        except KeyError:
            wss = None
            bss_tss = None

    spatially_constrained = result.spatially_constrained
    all_join_count_ratio = np.round(result.all_joincount_ratio.ratio, 4)
    entropy = np.round(result.fragmentation.entropy, 4)
    simpson = np.round(result.fragmentation.simpson, 4)
    
    results.append({
        'Method': name,
        'Spat. Const.': spatially_constrained,
        'WSS': wss,
        'BSS/TSS': bss_tss,
        'Join Count': all_join_count_ratio,
        'Entropy': entropy,
        'Simpson': simpson
    })

validation = pd.DataFrame(results)
print(validation.to_string(index = False))
      Method  Spat. Const.      WSS  BSS/TSS  Join Count  Entropy  Simpson
Hierarchical         False 349.0000   0.6800       0.212   2.3512   0.1063
     K-Means         False 334.3800   0.7000       0.240   2.3213   0.1112
        SCHC          True 568.0173   0.4827       0.668   1.6394   0.2730
      SKATER          True 604.2209   0.4497       0.784   1.3661   0.3901
      REDCAP          True 562.7003   0.4875       0.660   1.6109   0.2769
         AZP          True 617.0010   0.4381       0.526   1.9521   0.2016
 AZP_Initial          True 538.3477   0.5097       0.588   1.5991   0.3035
       Max-p          True 745.2474   0.3213       0.496   2.4175   0.0959

14.4 External Validation Measures

External validation measures are designed to compare a cluster solution to a known truth, but they can also be employed to compare several cluster solutions to each other. The validation indices reveal how close the cluster solutions are. We consider two measures: the Adjusted Rand Index (ARI) and the Normalized Information Distance (NID).

14.4.1 Adjusted Rand Index (ARI)

The Adjusted Rand Index is based on counting how many pairs of observations are in the same grouping in two cluster solutions. It can be computed by sklearn.metrics.adjusted_rand_score. The two arguments are numpy arrays with the labels of the reference solution (first argument) and the labels of the cluster solution to be compared.
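The pair-counting logic behind the ARI can be sketched in a few lines. This is a simplified illustration, not sklearn’s implementation, and the function name adjusted_rand_index is hypothetical: build the contingency table between the two labelings, count the pairs that agree, and adjust that count for its expected value under random labeling.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting ARI: (index - expected) / (max - expected).

    Illustrative sketch; assumes non-degenerate labelings."""
    n = len(labels_a)
    # joint counts of (label_a, label_b) combinations
    contingency = Counter(zip(labels_a, labels_b))
    index = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)
```

Identical partitions score 1 even when the labels are permuted, while the index can turn negative for solutions that agree less than expected by chance.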

Note that we need to convert our label solutions to numpy arrays to make this work.

For example, the ARI between Ward’s agglomerative solution and K-Means is found by passing numpy arrays for agg_labels and kmeans_labels. The result, well below 0.5, suggests only a low degree of correspondence.

ari = adjusted_rand_score(np.array(agg_labels), np.array(kmeans_labels))
print(np.round(ari, 3))
0.392

We can now compute all pairwise indices with a simple loop. We first recreate the labels and label_names lists from above (so this can be run without the internal validation measures).

labels = [
    agg_labels, kmeans_labels, schc_labels, skater_labels,
    redcap_labels, azp_sa_labels, azp_schc_labels, maxp_sa_labels
]
label_names = [
    'Hierarchical', 'K-Means', 'SCHC', 'SKATER',
    'REDCAP', 'AZP', 'AZP_Initial', 'Max-p'
]

In a simple loop, we compute all pairwise indices and populate a matrix. This is then turned into a data frame and printed. Note that the matrix is symmetric and the diagonal values of 1.0 can be ignored.

h = len(labels)
allari = np.zeros((h, h))
for i in range(h):
    labi = np.array(labels[i])
    for j in range(h):
        labj = np.array(labels[j])
        allari[i,j] = adjusted_rand_score(labi, labj)
dfari = pd.DataFrame(allari, columns = label_names, index = label_names)
print(np.round(dfari, 3))
              Hierarchical  K-Means   SCHC  SKATER  REDCAP    AZP  \
Hierarchical         1.000    0.392  0.131   0.079   0.145  0.160   
K-Means              0.392    1.000  0.176   0.095   0.184  0.205   
SCHC                 0.131    0.176  1.000   0.424   0.918  0.504   
SKATER               0.079    0.095  0.424   1.000   0.384  0.258   
REDCAP               0.145    0.184  0.918   0.384   1.000  0.516   
AZP                  0.160    0.205  0.504   0.258   0.516  1.000   
AZP_Initial          0.155    0.165  0.735   0.452   0.727  0.577   
Max-p                0.089    0.091  0.194   0.173   0.185  0.218   

              AZP_Initial  Max-p  
Hierarchical        0.155  0.089  
K-Means             0.165  0.091  
SCHC                0.735  0.194  
SKATER              0.452  0.173  
REDCAP              0.727  0.185  
AZP                 0.577  0.218  
AZP_Initial         1.000  0.163  
Max-p               0.163  1.000  

As in Chapter 12 of the GeoDa Cluster Book, the matrix reveals a much closer correspondence among the non-spatial solutions on the one hand and among the spatial solutions on the other. The greatest correspondence is between SCHC and REDCAP, with an ARI of 0.918.

14.4.2 Normalized Information Distance (NID)

The second external validation measure is based on information-theoretic considerations, such as entropy. In Chapter 12 of the GeoDa Cluster Book, the Normalized Information Distance (NID) is introduced. A close counterpart can be computed by means of sklearn.metrics.adjusted_mutual_info_score. The arguments are the same as for ARI. However, in contrast to NID as presented in the GeoDa Cluster Book, a higher value for the adjusted mutual information score indicates closer similarity.
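The NID itself can be computed from the entropies of the two labelings and their mutual information as NID = 1 - I(A;B)/max(H(A), H(B)), so that 0 indicates identical partitions and 1 indicates no shared information. The sketch below is a minimal illustration of that formula (the function names are hypothetical; this is the plain NID, not the adjusted measure computed by sklearn):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    p = np.array(list(Counter(labels).values())) / len(labels)
    return -np.sum(p * np.log(p))

def normalized_information_distance(labels_a, labels_b):
    """NID = 1 - I(A;B) / max(H(A), H(B)); illustrative sketch."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    pa = {k: v / n for k, v in Counter(labels_a).items()}
    pb = {k: v / n for k, v in Counter(labels_b).items()}
    # mutual information from the joint and marginal label frequencies
    mi = sum((c / n) * np.log((c / n) / (pa[a] * pb[b]))
             for (a, b), c in joint.items())
    return 1.0 - mi / max(entropy(labels_a), entropy(labels_b))
```

As with the ARI sketch, relabeling does not matter: two identical partitions with permuted labels yield a distance of 0, and two independent partitions yield 1.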

We first illustrate this for Ward’s agglomerative clustering and K-Means, passing numpy arrays of agg_labels and kmeans_labels.

nid = adjusted_mutual_info_score(np.array(agg_labels), np.array(kmeans_labels))
print(np.round(nid, 3))
0.614

Finally, we run the same loop as for ARI to compute all pairwise NID scores.

allnid = np.zeros((h, h))
for i in range(h):
    labi = np.array(labels[i])
    for j in range(h):
        labj = np.array(labels[j])
        allnid[i,j] = adjusted_mutual_info_score(labi, labj)
dfnid = pd.DataFrame(allnid, columns = label_names, index = label_names)
print(np.round(dfnid, 3))
              Hierarchical  K-Means   SCHC  SKATER  REDCAP    AZP  \
Hierarchical         1.000    0.614  0.257   0.219   0.267  0.306   
K-Means              0.614    1.000  0.300   0.244   0.308  0.328   
SCHC                 0.257    0.300  1.000   0.548   0.897  0.581   
SKATER               0.219    0.244  0.548   1.000   0.496  0.425   
REDCAP               0.267    0.308  0.897   0.496   1.000  0.595   
AZP                  0.306    0.328  0.581   0.425   0.595  1.000   
AZP_Initial          0.317    0.322  0.727   0.545   0.727  0.646   
Max-p                0.177    0.191  0.388   0.423   0.372  0.445   

              AZP_Initial  Max-p  
Hierarchical        0.317  0.177  
K-Means             0.322  0.191  
SCHC                0.727  0.388  
SKATER              0.545  0.423  
REDCAP              0.727  0.372  
AZP                 0.646  0.445  
AZP_Initial         1.000  0.367  
Max-p               0.367  1.000  

As for ARI, we find the closest correspondence between SCHC and REDCAP.

14.5 Practice

We now have all the tools needed to compare the various cluster solutions obtained in earlier Chapters. In addition, when an administrative regionalization is available (e.g., subdistricts in a city), the various cluster solutions can be compared to that reference as the assumed truth.