import geopandas as gpd
import pandas as pd
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score
from spatial_cluster_helper import ensure_datasets, cluster_stats, \
cluster_fit, cluster_fragmentation, cluster_joincount, \
cluster_compactness, cluster_diameter
import pygeoda
14 Cluster Validation
To close this Companion Book, we consider cluster validation measures. These can be classified as either internal validation measures, such as the measures of fit we already considered, or external validation measures, which provide a way to compare cluster solutions to a given (known, or assumed known) reference.
In addition to measures of fit, we also consider indicators of the balance of cluster solutions, i.e., the evenness of the number of observations in each cluster. Such measures include entropy and Simpson’s index. Other measures, introduced in the GeoDa Cluster Book, are based on the spatial properties of the clusters. These include the join count ratio, an indicator of how many neighbors of each observation in a cluster are also members of the cluster. For a spatially compact cluster solution, this measure should equal one (except for boundary effects). For non-spatially constrained clusters, it indicates how closely they approximate a spatial solution.
For spatially constrained cluster solutions, compactness is a key characteristic. This can be quantified by means of the isoperimeter quotient (IPQ), the ratio of the area of a cluster shape to that of a circle with equal perimeter. A final measure of compactness introduced in the GeoDa Cluster Book is the diameter of the unweighted graph representation of the spatial weights matrix. To obtain a relative measure, the diameter is rescaled by the number of observations in the cluster. The latter measures are only applicable to spatially constrained clusters.
In addition, we also consider two classic indicators of external validity, i.e., the Adjusted Rand Index (ARI) of Hubert and Arabie (1985), based on counting pairs, and the Normalized Information Distance (NID), e.g., Vinh, Epps, and Bailey (2010), derived from measures of entropy.
A detailed coverage of these methods is contained in Chapter 12 of the GeoDa Cluster Book.
In addition to the usual numpy, pandas and geopandas, we need several specialized packages from scikit-learn and pygeoda to carry out the cluster analysis and implement the validation measures. As before, to carry out variable standardization we import StandardScaler from sklearn.preprocessing. The specific clustering methods are AgglomerativeClustering and KMeans from sklearn.cluster. The other clustering solutions are obtained with pygeoda. The external validation measures are contained in sklearn.metrics.adjusted_rand_score and sklearn.metrics.adjusted_mutual_info_score.
The new internal validation measures are based on pygeoda.spatial_validation. Several helper functions contained in the spatial_cluster_helper module extract the relevant information and present it as a pandas data frame: cluster_fragmentation, cluster_joincount, cluster_compactness and cluster_diameter. As before, we also use ensure_datasets, cluster_stats and cluster_fit.
We continue the empirical illustration with the Ceará example.
Packages: numpy, pandas, geopandas, sklearn.cluster, sklearn.preprocessing, sklearn.metrics, spatial_cluster_helper, pygeoda
Datasets: ceara
14.1 Preliminaries
14.1.1 Import Required Modules
14.1.2 Load Data
We again read the ceara.shp file and carry out a quick check of the contents. For this sample data, we use the argument encoding = 'utf-8' in the read_file function to account for the special characters in Brazilian Portuguese.
# Setting working folder:
#path = "/your/path/to/data/"
path = "./datasets/"
# Select the Ceará data:
shpfile = "ceara/ceara.shp"
# Load the data:
ensure_datasets(shpfile, folder_path = path)
dfs = gpd.read_file(path + shpfile, encoding = 'utf-8')
print(dfs.shape)
dfs.head(3)
(184, 36)
| | code7 | mun_name | state_init | area_km2 | state_code | micro_code | micro_name | inc_mic_4q | inc_zik_3q | inc_zik_2q | ... | gdp | pop | gdpcap | popdens | zik_1q | ziq_2q | ziq_3q | zika_d | mic_d | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2300101.0 | Abaiara | CE | 180.833 | 23 | 23019 | 19ª Região Brejo Santo | 0.000000 | 0.0 | 0.00 | ... | 35974.0 | 10496.0 | 3.427 | 58.043 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((5433729.65 9186242.97, 5433688.546 9... |
| 1 | 2300150.0 | Acarape | CE | 130.002 | 23 | 23003 | 3ª Região Maracanaú | 6.380399 | 0.0 | 0.00 | ... | 68314.0 | 15338.0 | 4.454 | 117.983 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | POLYGON ((5476916.288 9533405.667, 5476798.561... |
| 2 | 2300200.0 | Acaraú | CE | 842.471 | 23 | 23012 | 12ª Região Acaraú | 0.000000 | 0.0 | 1.63 | ... | 309490.0 | 57551.0 | 5.378 | 68.312 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | POLYGON ((5294389.783 9689469.144, 5294494.499... |
3 rows × 36 columns
14.1.3 Variables
We continue with the same set of variables as in the previous Chapters, listed in Table 12.1, and combine them in the list varlist.
varlist = ['mobility', 'environ', 'housing', 'sanitation', 'infra', 'gdpcap']
14.1.4 Pygeoda Data Preparation
We follow the same steps as in Chapter 12 to set up a data set in the pygeoda internal format as ceara_g and to create queen contiguity spatial weights (queen_w). As in the previous Chapters, we create subsets with the relevant variables in both the pygeoda format (data_g) and as a GeoDataFrame (data).
ceara_g = pygeoda.open(dfs)
queen_w = pygeoda.queen_weights(ceara_g)
print(queen_w)
data = dfs[varlist]
data_g = ceara_g[varlist]
Weights Meta-data:
number of observations: 184
is symmetric: True
sparsity: 0.02953686200378072
# min neighbors: 1
# max neighbors: 13
# mean neighbors: 5.434782608695652
# median neighbors: 5.0
has isolates: False
Finally, we set the number of clusters to 13. This differs from the empirical examples in Chapter 12 of the GeoDa Cluster Book due to the result for max-p obtained with pygeoda in Chapter 13. For consistency, we set the number of clusters for the other methods to that value.
n_clusters = 13
14.2 Cluster Solutions
Before considering the validation measures, we compute the cluster solutions for hierarchical clustering (using sklearn.AgglomerativeClustering), K-Means (using sklearn.KMeans), and the spatially constrained clustering methods using pygeoda. For details on the arguments and helper functions, see the relevant Chapters.
For each cluster solution, we extract the labels and the measures of fit using the respective helper functions.
To illustrate the internal validity measures, we will focus on Ward’s agglomerative clustering as an example of a standard method and on AZP with SCHC initial solution as an example of a spatially constrained cluster solution. For this, we also generate the cluster cardinalities. This is not illustrated for the other cluster solutions, but can be readily implemented.
14.2.1 Hierarchical Clustering
method = 'ward'
X = StandardScaler().fit_transform(data)
agg_clusters = AgglomerativeClustering(n_clusters = n_clusters,
linkage = method, compute_distances = True)
agg_clusters.fit(X)
agg_labels = tuple(int(label) for label in agg_clusters.labels_)
agg_clusters_fit = cluster_fit(data = data, clustlabels = agg_clusters.labels_,
                               n_clusters = n_clusters, printopt = False)
agg_stats = cluster_stats(agg_labels)
Labels Cardinality
0 16
1 26
2 29
3 9
4 14
5 3
6 26
7 11
8 20
9 4
10 15
11 10
12 1
14.2.2 K-Means Clustering
kmeans_clusters = KMeans(n_clusters = n_clusters, n_init = 150,
random_state=123456789).fit(X)
kmeans_labels = tuple(int(label) for label in kmeans_clusters.labels_)
kmeans_clusters_fit = cluster_fit(data = data,
clustlabels = kmeans_clusters.labels_,
n_clusters = n_clusters, printopt = False)
14.2.3 SCHC with Ward’s Linkage
schc_clusters = pygeoda.schc(n_clusters, queen_w, data_g, "ward")
schc_labels = schc_clusters['Clusters']
14.2.4 SKATER
skater_clusters = pygeoda.skater(n_clusters, queen_w, data_g)
skater_labels = skater_clusters['Clusters']
14.2.5 REDCAP
redcap_clusters = pygeoda.redcap(n_clusters, queen_w, data_g,
method = 'fullorder-wardlinkage')
redcap_labels = redcap_clusters['Clusters']
14.2.6 AZP with Simulated Annealing
azp_sa_clusters = pygeoda.azp_sa(n_clusters, queen_w, data_g,
cooling_rate = 0.8, sa_maxit = 5)
azp_sa_labels = azp_sa_clusters['Clusters']
14.2.7 AZP with SCHC as Initial Solution
azp_schc_clusters = pygeoda.azp_sa(n_clusters, queen_w, data_g,
cooling_rate = 0.8, sa_maxit = 5,
init_regions = schc_labels)
azp_schc_labels = azp_schc_clusters['Clusters']
azp_schc_stats = cluster_stats(azp_schc_labels)
Labels Cardinality
1 89
2 43
3 15
4 14
5 6
6 4
7 4
8 3
9 2
10 1
11 1
12 1
13 1
14.2.8 Max-p Regions
maxp_sa_clusters = pygeoda.maxp_sa(queen_w, data_g,
bound_variable = dfs['pop'],
min_bound = dfs['pop'].sum()*0.05,
iterations = 9999,
cooling_rate = 0.9,
sa_maxit = 5)
maxp_sa_labels = maxp_sa_clusters['Clusters']
14.3 Internal Validation Measures
As mentioned, in addition to the classic measures of fit, we also consider fragmentation, the join count ratio, and, for spatially constrained cluster solutions, the compactness and diameter.
These measures are provided as attributes in the solution object created by pygeoda.spatial_validation. This requires the pygeoda data set, the cluster labels and the spatial weights as arguments.
We illustrate this for Ward’s agglomerative clustering, with agg_labels as the cluster labels, and for AZP-SCHC, with azp_schc_labels as the cluster labels. For both, the data set is ceara_g and the spatial weights are contained in queen_w.
We store the results in, respectively, agg_validation and azp_schc_validation. These objects will then be used as arguments to the helper functions.
agg_validation = pygeoda.spatial_validation(ceara_g, agg_labels, queen_w)
azp_schc_validation = pygeoda.spatial_validation(ceara_g, azp_schc_labels,
queen_w)
14.3.1 Fragmentation
The fragmentation measures are computed from the makeup of the cluster components. A cluster solution is ideally balanced when each cluster contains the same number of observations. This is quantified by means of entropy and its standardized counterpart, as well as by Simpson’s index and its standardized counterpart. For entropy, larger values suggest greater balance, whereas for Simpson’s index, it is the other way around.
The pygeoda.spatial_validation return object includes the fragmentation information in two attributes: fragmentation and cluster_fragmentation. The first contains the overall measures, as well as the number of clusters, in fragmentation.n, fragmentation.entropy, fragmentation.std_entropy, fragmentation.simpson and fragmentation.std_simpson. The second has the same information, organized as a list by cluster. It shows the within-cluster fragmentation for clusters that are not spatially constrained.
The cluster_fragmentation helper function takes a data frame with the labels and cardinalities (created by cluster_stats), and the cluster_fragmentation, fragmentation and spatially_constrained attributes from the validation object. The spatially_constrained flag is used to limit the fragmentation output to the totals only for spatially constrained clusters.
This is illustrated for Ward’s agglomerative clustering and AZP-SCHC. Note that for the latter, only the totals are given, since it is a spatially constrained solution.
The other cluster solutions can be analyzed in the same way.
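The overall fragmentation measures can be reproduced directly from the cluster cardinalities. The sketch below is a hypothetical illustration of the logic, not part of the helper module (the actual values come from pygeoda.spatial_validation): entropy is standardized by its maximum ln(k), and Simpson's index by its minimum 1/k.

```python
import numpy as np

def fragmentation_sketch(cardinalities):
    """Overall entropy and Simpson's index from a vector of cluster sizes."""
    n = np.asarray(cardinalities, dtype=float)
    k = len(n)
    p = n / n.sum()                        # share of observations per cluster
    entropy = -np.sum(p * np.log(p))       # larger = more balanced
    std_entropy = entropy / np.log(k)      # rescaled by the maximum, ln(k)
    simpson = np.sum(p ** 2)               # smaller = more balanced
    std_simpson = simpson * k              # rescaled by the minimum, 1/k
    return entropy, std_entropy, simpson, std_simpson

# cardinalities of the Ward solution computed above
card = [16, 26, 29, 9, 14, 3, 26, 11, 20, 4, 15, 10, 1]
# matches the All row: entropy ~ 2.3512, Simpson ~ 0.1063
print([round(x, 4) for x in fragmentation_sketch(card)])
```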
14.3.1.1 Agglomerative clustering
agg_frag = cluster_fragmentation(agg_stats,
agg_validation.cluster_fragmentation,
agg_validation.fragmentation,
agg_validation.spatially_constrained)
Fragmentation
Label N Sub Entropy Entropy* Simpson Simpson*
0 16 9 1.751176 0.796994 0.250892 2.258026
1 26 14 2.397937 0.908634 0.115385 1.615385
2 29 14 2.425806 0.919194 0.109467 1.532544
3 9 11 2.250260 0.938431 0.120000 1.320000
4 14 9 1.923066 0.875225 0.187500 1.687500
5 3 8 1.933810 0.929966 0.164444 1.315556
6 26 10 2.168223 0.941647 0.132653 1.326531
7 11 9 2.145842 0.976615 0.123967 1.115702
8 20 6 1.609438 0.898244 0.240000 1.440000
9 4 7 1.831020 0.940958 0.185185 1.296296
10 15 0 0.000000 0.000000 0.000000 0.000000
11 10 3 1.098612 1.000000 0.333333 1.000000
12 1 0 0.000000 0.000000 0.000000 0.000000
All 184 2.351159 0.916649 0.106274 1.381557
14.3.1.2 AZP-SCHC
azp_schc_frag = cluster_fragmentation(azp_schc_stats,
azp_schc_validation.cluster_fragmentation,
azp_schc_validation.fragmentation,
azp_schc_validation.spatially_constrained)
Fragmentation
Label N Sub Entropy Entropy* Simpson Simpson*
All 184 1.599116 0.623449 0.303521 3.94577
14.3.2 Join Count Ratio
The join count ratio is a spatial measure of the degree of internal connectedness in a cluster solution. It is computed for each cluster separately as well as for the cluster solution as a whole, as the proportion of neighbors of the observations in a cluster that are themselves members of that cluster.
The relevant measures are included in the joincount_ratio and all_joincount_ratio attributes of the pygeoda.spatial_validation solution object. The former is a list with k entries, which are themselves objects, containing attributes n, neighbors, join_count and ratio. The latter is the same for the overall cluster solution.
The result is provided by the cluster_joincount helper function. It takes the data frame with cluster cardinalities and the joincount_ratio and all_joincount_ratio attributes from the validation solution.
We use the same examples in the illustration below.
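As a check on intuition, the overall ratio can be sketched directly from the labels and a neighbor structure. The function below is a hypothetical illustration, not the pygeoda implementation; neighbors is assumed to be a dictionary mapping each observation to the list of its contiguity neighbors.

```python
def joincount_ratio_sketch(labels, neighbors):
    """Fraction of neighbor relations (i, j) where j is in the same
    cluster as i, over all neighbor relations in the weights."""
    total = joins = 0
    for i, nbrs in neighbors.items():
        for j in nbrs:
            total += 1
            joins += int(labels[i] == labels[j])
    return joins / total

# four areas on a line, split into two clusters of two:
# only the middle relation crosses a cluster boundary
line_w = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(round(joincount_ratio_sketch([0, 0, 1, 1], line_w), 3))  # 0.667
```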
14.3.2.1 Agglomerative clustering
agg_jc = cluster_joincount(agg_stats, agg_validation.joincount_ratio,
agg_validation.all_joincount_ratio)
Join Count Ratio
Label N Neighbors Join Count Ratio
0 16 96 20 0.208
1 26 134 26 0.194
2 29 158 64 0.405
3 9 47 6 0.128
4 14 80 8 0.100
5 3 15 0 0.000
6 26 158 30 0.190
7 11 43 4 0.093
8 20 116 18 0.155
9 4 21 10 0.476
10 15 83 18 0.217
11 10 46 8 0.174
12 1 3 0 0.000
All 184 1000 212 0.212
14.3.2.2 AZP-SCHC
azp_schc_jc = cluster_joincount(azp_schc_stats,
azp_schc_validation.joincount_ratio,
azp_schc_validation.all_joincount_ratio)
Join Count Ratio
Label N Neighbors Join Count Ratio
0 89 474 342 0.722
1 43 225 128 0.569
2 15 96 42 0.438
3 14 78 44 0.564
4 6 37 10 0.270
5 4 17 6 0.353
6 4 21 10 0.476
7 3 22 4 0.182
8 2 12 2 0.167
9 1 3 0 0.000
10 1 3 0 0.000
11 1 7 0 0.000
12 1 5 0 0.000
All 184 1000 588 0.588
14.3.3 Compactness
Compactness is a criterion that is only applicable to spatially constrained cluster solutions. It is quantified by the isoperimeter quotient (IPQ), the ratio of the area of the cluster to the area of a circle with the same perimeter. The compactness attribute of the spatial_validation object is a list with k items, each an object with attributes area, perimeter and isoperimeter_quotient. The closer the IPQ is to one, the more compact is the cluster shape.
The helper function cluster_compactness extracts this information. Its arguments are the data frame with cluster cardinalities (from cluster_stats), the compactness attribute, and the spatially_constrained attribute. Compactness is not relevant for clusters that are not spatially constrained, in which case the helper function yields an error message.
We continue with the same two examples. Note that for Ward’s agglomerative cluster, an error message is generated.
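The IPQ itself is straightforward to compute from a cluster's area and perimeter. A minimal sketch (the function name ipq is ours): the area of a circle with perimeter P is P²/4π, so the quotient is 4πA/P².

```python
import math

def ipq(area, perimeter):
    """Isoperimeter quotient: cluster area relative to the area of a
    circle with the same perimeter; 1.0 corresponds to a circle."""
    return 4.0 * math.pi * area / perimeter ** 2

# a unit square is fairly compact: IPQ = pi / 4
print(round(ipq(1.0, 4.0), 3))  # 0.785
```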
14.3.3.1 Agglomerative clustering
agg_compactness = cluster_compactness(agg_stats,
agg_validation.compactness,
agg_validation.spatially_constrained)
Error: Compactness is only applicable to spatially constrained clusters
14.3.3.2 AZP-SCHC
azp_schc_compactness = cluster_compactness(azp_schc_stats,
azp_schc_validation.compactness,
azp_schc_validation.spatially_constrained)
Compactness
Label N Area Perimeter IPQ
0 89 8.672968e+10 1.536222e+07 0.004618
1 43 2.797625e+10 6.198588e+06 0.009150
2 15 1.053291e+10 2.220946e+06 0.026834
3 14 1.138704e+10 2.122643e+06 0.031759
4 6 4.038275e+09 9.257796e+05 0.059209
5 4 3.085985e+09 5.499120e+05 0.128238
6 4 1.779537e+09 4.240068e+05 0.124386
7 3 9.969357e+08 3.168568e+05 0.124782
8 2 1.474872e+09 3.229358e+05 0.177718
9 1 6.171934e+08 1.199243e+05 0.539283
10 1 7.908164e+07 5.056877e+04 0.388616
11 1 8.449201e+08 1.739887e+05 0.350738
12 1 4.124364e+08 1.255397e+05 0.328855
14.3.4 Diameter
The diameter of a spatially constrained cluster is an alternative measure of compactness, based on the network structure reflected in the spatial weights. The diameter of a cluster is the number of steps in the spatial weights graph that corresponds to the longest shortest path between any pair of observations (Newman 2018). Since this number will increase with cluster size, it is also standardized by dividing by the number of cluster members. Note that when a cluster is a singleton, the diameter will be zero.
The diameter attribute of the pygeoda.spatial_validation object is a list with k items, one for each cluster, as an object with attributes steps and ratio. The helper function cluster_diameter extracts this information as a data frame. It takes as arguments the cluster cardinalities (from cluster_stats), and the diameter and spatially_constrained attributes from the pygeoda.spatial_validation object.
As in the case of compactness, an error message is generated for clusters that are not spatially constrained.
Again, we use the same two examples.
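The underlying computation can be sketched with a breadth-first search from each member over the subgraph of the weights graph induced by the cluster. This is a hypothetical illustration; pygeoda computes the diameter internally.

```python
from collections import deque

def cluster_diameter_sketch(members, neighbors):
    """Longest shortest path (in steps) between any pair of cluster
    members, restricted to edges within the cluster."""
    member_set = set(members)
    diameter = 0
    for source in members:
        # BFS from source within the cluster subgraph
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v in member_set and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        diameter = max(diameter, max(dist.values()))
    return diameter

# four areas on a line: three steps from end to end
line_w = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(cluster_diameter_sketch([0, 1, 2, 3], line_w))  # 3
```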
14.3.4.1 Agglomerative clustering
agg_diam = cluster_diameter(agg_stats, agg_validation.diameter,
agg_validation.spatially_constrained)
Error: Diameter is only applicable to spatially constrained clusters
14.3.4.2 AZP-SCHC
azp_schc_diam = cluster_diameter(azp_schc_stats, azp_schc_validation.diameter,
azp_schc_validation.spatially_constrained)
Diameter
Label N Steps Ratio
0 89 22 0.247191
1 43 17 0.395349
2 15 6 0.400000
3 14 7 0.500000
4 6 3 0.500000
5 4 3 0.750000
6 4 2 0.500000
7 3 2 0.666667
8 2 1 0.500000
9 1 0 0.000000
10 1 0 0.000000
11 1 0 0.000000
12 1 0 0.000000
14.3.5 Overall Comparison of Internal Validation Measures
We conclude our discussion of internal validation measures with an overview of the main non-spatial measures for all the cluster solutions considered above.
clusters = [
agg_clusters_fit, kmeans_clusters_fit, schc_clusters, skater_clusters,
redcap_clusters, azp_sa_clusters, azp_schc_clusters, maxp_sa_clusters
]
labels = [
agg_labels, kmeans_labels, schc_labels, skater_labels,
redcap_labels, azp_sa_labels, azp_schc_labels, maxp_sa_labels
]
label_names = [
'Hierarchical', 'K-Means', 'SCHC', 'SKATER',
'REDCAP', 'AZP', 'AZP_Initial', 'Max-p'
]
results = []
# Run pygeoda.spatial_validation for each label set
for cluster, label, name in zip(clusters, labels, label_names):
    result = pygeoda.spatial_validation(ceara_g, label, queen_w)
    # pygeoda solutions and the cluster_fit data frames use different keys
    try:
        wss = np.round(cluster['Total within-cluster sum of squares'], 4)
        bss_tss = np.round(cluster['The ratio of between to total sum of squares'], 4)
    except (KeyError, TypeError):
        try:
            wss = np.round(cluster["WSS"], 2)
            bss_tss = np.round(cluster["Ratio"], 2)
        except (KeyError, TypeError):
            wss = None
            bss_tss = None
    spatially_constrained = result.spatially_constrained
    all_join_count_ratio = np.round(result.all_joincount_ratio.ratio, 4)
    entropy = np.round(result.fragmentation.entropy, 4)
    simpson = np.round(result.fragmentation.simpson, 4)
    results.append({
        'Method': name,
        'Spat. Const.': spatially_constrained,
        'WSS': wss,
        'BSS/TSS': bss_tss,
        'Join Count': all_join_count_ratio,
        'Entropy': entropy,
        'Simpson': simpson
    })
validation = pd.DataFrame(results)
print(validation.to_string(index = False))
Method Spat. Const. WSS BSS/TSS Join Count Entropy Simpson
Hierarchical False 349.0000 0.6800 0.212 2.3512 0.1063
K-Means False 334.3800 0.7000 0.240 2.3213 0.1112
SCHC True 568.0173 0.4827 0.668 1.6394 0.2730
SKATER True 604.2209 0.4497 0.784 1.3661 0.3901
REDCAP True 562.7003 0.4875 0.660 1.6109 0.2769
AZP True 617.0010 0.4381 0.526 1.9521 0.2016
AZP_Initial True 538.3477 0.5097 0.588 1.5991 0.3035
Max-p True 745.2474 0.3213 0.496 2.4175 0.0959
14.4 External Validation Measures
External validation measures are designed to compare a cluster solution to a known truth, but they can also be employed to compare several cluster solutions to each other. The validation indices reveal how close the cluster solutions are. We consider two measures: the Adjusted Rand Index and the Normalized Information Distance.
14.4.1 Adjusted Rand Index (ARI)
The Adjusted Rand Index is based on counting how many pairs of observations are in the same grouping in two cluster solutions. It can be computed by sklearn.metrics.adjusted_rand_score, which takes two arguments: a numpy array with the labels of the reference solution and a numpy array with the labels of the solution to be compared.
Note that we need to convert our labels solution to a numpy array to make this work.
For example, the ARI between Ward’s agglomerative solution and K-Means is found by passing numpy arrays for agg_labels and kmeans_labels. The result, well below 0.5, suggests only a weak correspondence.
ari = adjusted_rand_score(np.array(agg_labels), np.array(kmeans_labels))
print(np.round(ari, 3))
0.392
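To make the pair-counting logic explicit, the index can be reproduced from scratch via the contingency table of the two labelings, following Hubert and Arabie (1985). The sketch below (the function name ari_sketch is ours) agrees with sklearn on a small example.

```python
import numpy as np
from math import comb
from sklearn.metrics import adjusted_rand_score

def ari_sketch(u, v):
    """Adjusted Rand Index: pair counts from the contingency table,
    corrected for the value expected under random labeling."""
    u = np.asarray(u)
    v = np.asarray(v)
    _, ui = np.unique(u, return_inverse=True)
    _, vi = np.unique(v, return_inverse=True)
    ct = np.zeros((ui.max() + 1, vi.max() + 1), dtype=int)
    np.add.at(ct, (ui, vi), 1)                        # contingency table
    sum_ij = sum(comb(int(x), 2) for x in ct.ravel()) # pairs grouped in both
    sum_a = sum(comb(int(x), 2) for x in ct.sum(axis=1))
    sum_b = sum(comb(int(x), 2) for x in ct.sum(axis=0))
    expected = sum_a * sum_b / comb(len(u), 2)        # chance correction
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

a = [0, 0, 1, 1, 2, 2]
b = [0, 0, 1, 2, 2, 2]
print(round(ari_sketch(a, b), 4), round(adjusted_rand_score(a, b), 4))
# 0.4444 0.4444
```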
We can now compute all pairwise indices with a simple loop. We first recreate the labels and label_names lists from above (so this can be run without the internal validation measures).
labels = [
agg_labels, kmeans_labels, schc_labels, skater_labels,
redcap_labels, azp_sa_labels, azp_schc_labels, maxp_sa_labels
]
label_names = [
'Hierarchical', 'K-Means', 'SCHC', 'SKATER',
'REDCAP', 'AZP', 'AZP_Initial', 'Max-p'
]
In a simple loop, we compute all pairwise indices and populate a matrix. This is then turned into a data frame and printed. Note that the matrix is symmetric and the diagonal values of 1.0 can be ignored.
h = len(labels)
allari = np.zeros((h, h))
for i in range(h):
    labi = np.array(labels[i])
    for j in range(h):
        labj = np.array(labels[j])
        allari[i, j] = adjusted_rand_score(labi, labj)
dfari = pd.DataFrame(allari, columns = label_names, index = label_names)
print(np.round(dfari, 3))
Hierarchical K-Means SCHC SKATER REDCAP AZP \
Hierarchical 1.000 0.392 0.131 0.079 0.145 0.160
K-Means 0.392 1.000 0.176 0.095 0.184 0.205
SCHC 0.131 0.176 1.000 0.424 0.918 0.504
SKATER 0.079 0.095 0.424 1.000 0.384 0.258
REDCAP 0.145 0.184 0.918 0.384 1.000 0.516
AZP 0.160 0.205 0.504 0.258 0.516 1.000
AZP_Initial 0.155 0.165 0.735 0.452 0.727 0.577
Max-p 0.089 0.091 0.194 0.173 0.185 0.218
AZP_Initial Max-p
Hierarchical 0.155 0.089
K-Means 0.165 0.091
SCHC 0.735 0.194
SKATER 0.452 0.173
REDCAP 0.727 0.185
AZP 0.577 0.218
AZP_Initial 1.000 0.163
Max-p 0.163 1.000
As in Chapter 12 of the GeoDa Cluster Book, the matrix reveals a much closer correspondence among the non-spatial solutions on the one hand, and among the spatial solutions on the other. The greatest correspondence is between SCHC and REDCAP, with an ARI of 0.918.
14.4.2 Normalized Information Distance (NID)
The second external validation measure is based on information-theoretic concepts, such as entropy. In Chapter 12 of the GeoDa Cluster Book, the normalized information distance (NID) is introduced. A close counterpart can be computed by means of sklearn.metrics.adjusted_mutual_info_score. The arguments are the same as for ARI. However, in contrast to NID as presented in the GeoDa Cluster Book, a higher value for the adjusted mutual information score indicates closer similarity.
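For reference, the NID itself can be sketched from the contingency table, following the definition in Vinh, Epps, and Bailey (2010) as 1 − I(U;V)/max(H(U), H(V)). This is a hypothetical illustration (the function name nid_sketch is ours), and note that it is a distance: zero means identical partitions.

```python
import numpy as np

def nid_sketch(u, v):
    """Normalized information distance between two labelings:
    1 - I(U;V) / max(H(U), H(V)); 0 = identical partitions."""
    u = np.asarray(u)
    v = np.asarray(v)
    _, ui = np.unique(u, return_inverse=True)
    _, vi = np.unique(v, return_inverse=True)
    ct = np.zeros((ui.max() + 1, vi.max() + 1))
    np.add.at(ct, (ui, vi), 1)
    p = ct / len(u)                         # joint distribution
    pu, pv = p.sum(axis=1), p.sum(axis=0)   # marginals

    def entropy(q):
        q = q[q > 0]
        return -np.sum(q * np.log(q))

    mask = p > 0                            # mutual information
    mi = np.sum(p[mask] * np.log(p[mask] / np.outer(pu, pv)[mask]))
    return 1.0 - mi / max(entropy(pu), entropy(pv))

# same partition up to relabeling: distance 0
print(round(nid_sketch([0, 0, 1, 1], [1, 1, 0, 0]), 3))  # 0.0
```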
We first illustrate this for Ward’s agglomerative clustering and K-Means, passing numpy arrays of agg_labels and kmeans_labels.
nid = adjusted_mutual_info_score(np.array(agg_labels), np.array(kmeans_labels))
print(np.round(nid, 3))
0.614
Finally, we run the same loop as for ARI to compute all pairwise NID scores.
allnid = np.zeros((h, h))
for i in range(h):
    labi = np.array(labels[i])
    for j in range(h):
        labj = np.array(labels[j])
        allnid[i, j] = adjusted_mutual_info_score(labi, labj)
dfnid = pd.DataFrame(allnid, columns = label_names, index = label_names)
print(np.round(dfnid, 3))
Hierarchical K-Means SCHC SKATER REDCAP AZP \
Hierarchical 1.000 0.614 0.257 0.219 0.267 0.306
K-Means 0.614 1.000 0.300 0.244 0.308 0.328
SCHC 0.257 0.300 1.000 0.548 0.897 0.581
SKATER 0.219 0.244 0.548 1.000 0.496 0.425
REDCAP 0.267 0.308 0.897 0.496 1.000 0.595
AZP 0.306 0.328 0.581 0.425 0.595 1.000
AZP_Initial 0.317 0.322 0.727 0.545 0.727 0.646
Max-p 0.177 0.191 0.388 0.423 0.372 0.445
AZP_Initial Max-p
Hierarchical 0.317 0.177
K-Means 0.322 0.191
SCHC 0.727 0.388
SKATER 0.545 0.423
REDCAP 0.727 0.372
AZP 0.646 0.445
AZP_Initial 1.000 0.367
Max-p 0.367 1.000
As for ARI, we find the closest correspondence between SCHC and REDCAP.
14.5 Practice
We now have all the tools needed to compare the various cluster solutions obtained in earlier Chapters. In addition, when an administrative regionalization is available (e.g., subdistricts in a city), the various cluster solutions can be compared to that regionalization as the assumed truth.