13 Spatially Constrained Clustering - Partitioning Methods

In this Chapter, we continue with methods that impose hard spatial constraints in a clustering procedure. We consider two methods that use a partitioning clustering logic: AZP (automatic zoning procedure), Openshaw (1977) ,Openshaw and Rao (1995), and max-p, Duque, Anselin, and Rey (2012). Both are covered in Chapter 11 of the GeoDa Cluster Book.

As in Chapter 12, we need to rely on the functionality from GeoDa that is included in the pygeoda package. We again use the helper functions ensure_datasets, cluster_stats, cluster_center, cluster_fit and cluster_map from the spatial-cluster-helper package. We rely on geopandas and numpy for the basics. As in the previous chapter, we import time to get insight into the performance of the different algorithms.

We continue with the ceara sample data set for the empirical illustration.

Required Packages

geopandas, numpy, pygeoda, spatial-cluster-helper

Required Data Sets

ceara

13.1 Preliminaries

13.1.1 Import Required Modules

import geopandas as gpd
import numpy as np
import time

from spatial_cluster_helper import ensure_datasets, cluster_stats, \
            cluster_fit, cluster_center, cluster_map
import pygeoda

13.1.2 Load Data

We load the ceara.shp shape file and do a quick check of its contents. For this sample data, we use the argument encoding = 'utf-8' in the read_file function to account for the special characters in Brazilian Portuguese.

# Setting working folder:
#path = "/your/path/to/data/"
path = "./datasets/"
# Select the Ceará data:
shpfile = "ceara/ceara.shp"
# Load the data:
ensure_datasets(shpfile, folder_path = path)
dfs = gpd.read_file(path + shpfile, encoding = 'utf-8')
print(dfs.shape)
dfs.head(3)

(184, 36)

	code7	mun_name	state_init	area_km2	state_code	micro_code	micro_name	inc_mic_4q	inc_zik_2q	...	gdp	pop	gdpcap	popdens	ziq_2q	zika_d	mic_d	geometry
0	2300101.0	Abaiara	CE	180.833	23	23019	19ª Região Brejo Santo	0.000000	0.00	...	35974.0	10496.0	3.427	58.043	0.0	0.0	0.0	POLYGON ((5433729.65 9186242.97, 5433688.546 9...
1	2300150.0	Acarape	CE	130.002	23	23003	3ª Região Maracanaú	6.380399	0.00	...	68314.0	15338.0	4.454	117.983	0.0	0.0	1.0	POLYGON ((5476916.288 9533405.667, 5476798.561...
2	2300200.0	Acaraú	CE	842.471	23	23012	12ª Região Acaraú	0.000000	1.63	...	309490.0	57551.0	5.378	68.312	1.0	1.0	0.0	POLYGON ((5294389.783 9689469.144, 5294494.499...

3 rows × 36 columns

# the full set of variables
print(list(dfs.columns))

['code7', 'mun_name', 'state_init', 'area_km2', 'state_code', 'micro_code', 'micro_name', 'inc_mic_4q', 'inc_zik_3q', 'inc_zik_2q', 'inc_zik_1q', 'prim_care', 'ln_gdp', 'ln_pop', 'mobility', 'environ', 'housing', 'sanitation', 'infra', 'acu_zik_1q', 'acu_zik_2q', 'acu_zik_3q', 'pop_zikv', 'acu_mic_4q', 'pop_micro', 'lngdpcap', 'gdp', 'pop', 'gdpcap', 'popdens', 'zik_1q', 'ziq_2q', 'ziq_3q', 'zika_d', 'mic_d', 'geometry']

13.1.3 Variables

We continue to use the same set of variables to replicate the empirical illustration in Chapter 11 of the GeoDa Cluster Book, as listed in Table 12.1. As before, we specify the variables in the varlist list for later use.

varlist = ['mobility', 'environ', 'housing', 'sanitation', 'infra', 'gdpcap']

13.1.4 Pygeoda Data Preparation

We follow the same steps as in Chapter 12 to set up a data set in the pygeoda internal format as ceara_g and to create queen contiguity spatial weights (queen_w).

ceara_g = pygeoda.open(dfs)
queen_w = pygeoda.queen_weights(ceara_g)
queen_w

Weights Meta-data:
 number of observations:                  184
           is symmetric:                 True
               sparsity:  0.02953686200378072
        # min neighbors:                    1
        # max neighbors:                   13
       # mean neighbors:    5.434782608695652
     # median neighbors:                  5.0
           has isolates:                False

As in the previous Chapter, we create subsets with the relevant variables in both the pygeoda format (data_g) and as a GeoDataFrame (data).

data_g = ceara_g[varlist]
data = dfs[varlist]

We set the number of clusters to 12.

n_clusters = 12

13.2 AZP

The first spatially constrained partitioning method we consider is AZP. It is contained in pygeoda, which uses the same underlying code as GeoDa, except for some technical implementations related to memory management. Even though these differences are very low-level, they can influence the way the heuristic proceeds, since it moves sequentially through alternative configurations. The order in which these sequences are considered matters, and slight differences in the memory allocation can produce different results.

As a consequence, it is not possible to completely replicate the results from Chapter 11 in the GeoDa Cluster Book, given the differences between the low-level implementations in C++ and in Python. This is a general feature of any heuristic, where not only the algorithms matter, but also seemingly unrelated issues such as starting points, random numbers and the particular ordering of moves.

AZP is implemented as three different functions, corresponding to the greedy, tabu and simulated annealing optimization approaches (see Chapter 11 in the GeoDa Cluster Book for technical details). The matching functions are azp_greedy, azp_tabu, and azp_sa. Required arguments are the number of clusters, the spatial weights and the pygeoda data object.

13.2.1 AZP-Greedy

The first example uses greedy search, implemented in azp_greedy. We use the same approach as in the previous Chapter and employ the helper functions cluster_stats, cluster_fit, cluster_center and cluster_map to list the properties of the clusters (see Chapter 12 for technical details).

t0 = time.time()
ceara_clusters1 = pygeoda.azp_greedy(n_clusters, queen_w, data_g)
t1 = time.time()
tazp1 = t1 - t0
print("AZP Greedy Clustering Time: ", tazp1)
cluster_labels1 = np.array(ceara_clusters1['Clusters'])
c_stats = cluster_stats(cluster_labels1)
fit = cluster_fit(data, cluster_labels1, n_clusters,
                  correct = True, printopt = True)
clust_means, clust_medians = cluster_center(data, cluster_labels1)
print("Cluster Means:\n", np.round(clust_means, 3))
print("\nCluster Medians:\n", np.round(clust_medians, 3))

AZP Greedy Clustering Time:  0.03095412254333496
 Labels  Cardinality
      1           46
      2           37
      3           29
      4           25
      5           21
      6            7
      7            5
      8            5
      9            4
     10            2
     11            2
     12            1

Total Sum of Squares (TSS): 1098.0
Within-cluster Sum of Squares (WSS) for each cluster: [124.036 115.162 139.302  71.455 105.922  16.194  16.279  26.967  53.98
   2.921  16.99    0.   ]
Total Within-cluster Sum of Squares (WSS): 689.208
Between-cluster Sum of Squares (BSS): 408.792
Ratio of BSS to TSS: 0.372
Cluster Means:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.965    0.883    0.852       0.686  0.552   4.524
2           0.959    0.864    0.800       0.603  0.513   4.490
3           0.946    0.814    0.798       0.592  0.469   6.265
4           0.960    0.883    0.852       0.629  0.528   4.591
5           0.960    0.797    0.826       0.646  0.512   4.679
6           0.942    0.924    0.846       0.573  0.391   5.208
7           0.982    0.939    0.839       0.833  0.612   5.691
8           0.833    0.750    0.800       0.743  0.537  11.102
9           0.944    0.897    0.841       0.593  0.405  16.482
10          0.960    0.930    0.900       0.641  0.424   4.170
11          0.974    0.657    0.781       0.823  0.693   5.432
12          0.936    0.789    0.794       0.545  0.425  27.625

Cluster Medians:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.968    0.888    0.849       0.677  0.560   4.220
2           0.959    0.885    0.801       0.581  0.505   4.140
3           0.948    0.824    0.807       0.573  0.470   4.742
4           0.968    0.896    0.851       0.602  0.534   4.395
5           0.963    0.805    0.831       0.636  0.514   4.140
6           0.935    0.943    0.842       0.575  0.377   5.114
7           0.983    0.932    0.829       0.798  0.606   4.099
8           0.834    0.741    0.823       0.784  0.561   7.982
9           0.957    0.882    0.846       0.590  0.419   9.262
10          0.960    0.930    0.900       0.641  0.424   4.170
11          0.974    0.657    0.781       0.823  0.693   5.432
12          0.936    0.789    0.794       0.545  0.425  27.625

# Plot the clusters
cluster_map(dfs, cluster_labels1, figsize = (4, 4), 
            title = "", cmap = 'tab20',legend_fontsize=8)

13.2.2 AZP-Tabu

The Tabu algorithm for AZP is invoked with azp_tabu. In addition to the same arguments as for the basic AZP function, a tabu_length is required (the size of the tabu list of observations for which a swap is not possible), as well as the maximum number of non-improving moves, conv_tabu. In our illustration, these are set to respectively 50 and 25, as in Chapter 11 of the GeoDa Cluster Book. We use the same set of commands as before.

t0 = time.time()
ceara_clusters2 = pygeoda.azp_tabu(n_clusters, queen_w, data_g,
                                   tabu_length = 50, conv_tabu = 25)
t1 = time.time()
tazp2 = t1 - t0
print("AZP Tabu Clustering Time: ", tazp2)
cluster_labels2 = np.array(ceara_clusters2['Clusters'])
c_stats = cluster_stats(cluster_labels2)
fit = cluster_fit(data, cluster_labels2, n_clusters,
                  correct = True, printopt = True)
clust_means, clust_medians = cluster_center(data, cluster_labels2)
print("Cluster Means:\n", np.round(clust_means, 3))
print("\nCluster Medians:\n", np.round(clust_medians, 3))

AZP Tabu Clustering Time:  0.36464381217956543
 Labels  Cardinality
      1           50
      2           41
      3           26
      4           25
      5           14
      6           10
      7            5
      8            4
      9            4
     10            2
     11            2
     12            1

Total Sum of Squares (TSS): 1098.0
Within-cluster Sum of Squares (WSS) for each cluster: [155.14  119.919  97.683  71.455  78.79   34.08   16.279  53.98   11.088
   2.921  16.99    0.   ]
Total Within-cluster Sum of Squares (WSS): 658.326
Between-cluster Sum of Squares (BSS): 439.674
Ratio of BSS to TSS: 0.4
Cluster Means:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.959    0.860    0.803       0.608  0.502   4.575
2           0.968    0.904    0.849       0.693  0.538   4.533
3           0.956    0.780    0.835       0.644  0.542   4.636
4           0.960    0.883    0.852       0.629  0.528   4.591
5           0.922    0.759    0.779       0.580  0.465   6.315
6           0.947    0.909    0.834       0.550  0.409   7.222
7           0.982    0.939    0.839       0.833  0.612   5.691
8           0.944    0.897    0.841       0.593  0.405  16.482
9           0.833    0.766    0.818       0.792  0.574  12.600
10          0.960    0.930    0.900       0.641  0.424   4.170
11          0.974    0.657    0.781       0.823  0.693   5.432
12          0.936    0.789    0.794       0.545  0.425  27.625

Cluster Medians:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.957    0.878    0.808       0.581  0.495   4.322
2           0.970    0.902    0.844       0.680  0.542   4.242
3           0.960    0.774    0.838       0.630  0.564   4.134
4           0.968    0.896    0.851       0.602  0.534   4.395
5           0.933    0.763    0.785       0.581  0.458   4.802
6           0.949    0.911    0.836       0.535  0.382   5.281
7           0.983    0.932    0.829       0.798  0.606   4.099
8           0.957    0.882    0.846       0.590  0.419   9.262
9           0.835    0.766    0.824       0.799  0.583  11.557
10          0.960    0.930    0.900       0.641  0.424   4.170
11          0.974    0.657    0.781       0.823  0.693   5.432
12          0.936    0.789    0.794       0.545  0.425  27.625

# Plot the clusters
cluster_map(dfs, cluster_labels2, figsize=(4, 4), 
            title = "", cmap = 'tab20',legend_fontsize=8)

13.2.3 AZP-Simulated Annealing

The simulated annealing algorithm for AZP is invoked with azp_sa. In addition to the same arguments as for the basic AZP function, a cooling_rate is required (the rate at which the simulated annealing temperature is allowed to decrease), as well as the maximum number of iterations allowed for each swap, sa_maxit. In our illustration, these are set to, respectively, 0.8 and 5. We use the same set of functions as before.

Note that the results differ slightly from the ones reported in Chapter 11 of the GeoDa Cluster Book even though the parameters are the same.

t0 = time.time()
ceara_clusters3 = pygeoda.azp_sa(n_clusters, queen_w, data_g,
                                   cooling_rate = 0.8, sa_maxit = 5)
t1 = time.time()
tazp3 = t1 - t0
print("AZP SA Clustering Time: ", tazp3)
cluster_labels3 = np.array(ceara_clusters3['Clusters'])
c_stats = cluster_stats(cluster_labels3)
fit = cluster_fit(data, cluster_labels3, n_clusters, 
                  correct = True, printopt = True)
clust_means, clust_medians = cluster_center(data, cluster_labels3)
print("Cluster Means:\n", np.round(clust_means, 3))
print("\nCluster Medians:\n", np.round(clust_medians, 3))

AZP SA Clustering Time:  3.248257637023926
 Labels  Cardinality
      1           71
      2           33
      3           28
      4           19
      5           12
      6            5
      7            5
      8            4
      9            2
     10            2
     11            2
     12            1

Total Sum of Squares (TSS): 1098.0
Within-cluster Sum of Squares (WSS) for each cluster: [204.08  101.845 129.825  50.521  41.189   7.637  26.967  53.98    2.921
   1.109  16.99    0.   ]
Total Within-cluster Sum of Squares (WSS): 637.065
Between-cluster Sum of Squares (BSS): 460.935
Ratio of BSS to TSS: 0.42
Cluster Means:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.965    0.903    0.847       0.674  0.533   4.621
2           0.958    0.860    0.795       0.586  0.505   4.473
3           0.938    0.816    0.802       0.565  0.453   6.141
4           0.953    0.732    0.834       0.615  0.545   4.734
5           0.981    0.879    0.848       0.748  0.594   5.154
6           0.974    0.896    0.831       0.746  0.452   5.363
7           0.833    0.750    0.800       0.743  0.537  11.102
8           0.944    0.897    0.841       0.593  0.405  16.482
9           0.960    0.930    0.900       0.641  0.424   4.170
10          0.917    0.921    0.829       0.562  0.369   3.810
11          0.974    0.657    0.781       0.823  0.693   5.432
12          0.936    0.789    0.794       0.545  0.425  27.625

Cluster Medians:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.968    0.905    0.846       0.671  0.526   4.319
2           0.959    0.883    0.800       0.571  0.498   4.140
3           0.940    0.811    0.808       0.567  0.464   4.738
4           0.958    0.720    0.833       0.608  0.563   4.140
5           0.984    0.909    0.851       0.752  0.584   4.234
6           0.983    0.906    0.836       0.775  0.409   4.985
7           0.834    0.741    0.823       0.784  0.561   7.982
8           0.957    0.882    0.846       0.590  0.419   9.262
9           0.960    0.930    0.900       0.641  0.424   4.170
10          0.917    0.921    0.829       0.562  0.369   3.810
11          0.974    0.657    0.781       0.823  0.693   5.432
12          0.936    0.789    0.794       0.545  0.425  27.625

# Plot the clusters
cluster_map(dfs, cluster_labels3, figsize = (4, 4), 
            title = "", cmap = 'tab20',legend_fontsize=8)

Figure 13.3: AZP Simulated Annealing cluster map

13.2.4 ARiSel

As in K-means, an alternative to a random initial feasible solution is to use a k-means++ like logic. This is implemented in the ARiSel approach, through the additional inits argument to any azp function. In the example below, we set inits = 50. We use azp_sa with a cooling_rate of 0.8 and sa_maxit as 5.

t0 = time.time()
ceara_clusters4 = pygeoda.azp_sa(n_clusters, queen_w, data_g, 
                            cooling_rate = 0.8, sa_maxit = 5, inits = 50)
t1 = time.time()
tazp4 = t1 - t0
print("AZP ARiSel Clustering Time: ", tazp4)
cluster_labels4 = np.array(ceara_clusters4['Clusters'])
c_stats = cluster_stats(cluster_labels4)
fit = cluster_fit(data, cluster_labels4, n_clusters,
                  correct = True, printopt = True)
clust_means, clust_medians = cluster_center(data, cluster_labels4)
print("Cluster Means:\n", np.round(clust_means, 3))
print("\nCluster Medians:\n", np.round(clust_medians, 3))

AZP ARiSel Clustering Time:  5.351151943206787
 Labels  Cardinality
      1           67
      2           29
      3           19
      4           19
      5           14
      6           13
      7            6
      8            6
      9            5
     10            4
     11            1
     12            1

Total Sum of Squares (TSS): 1098.0
Within-cluster Sum of Squares (WSS) for each cluster: [224.399  88.688  50.521  51.732  48.429  44.554  39.958   8.893  16.279
  11.088   0.      0.   ]
Total Within-cluster Sum of Squares (WSS): 584.54
Between-cluster Sum of Squares (BSS): 513.46
Ratio of BSS to TSS: 0.468
Cluster Means:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.968    0.888    0.844       0.692  0.552   4.663
2           0.954    0.858    0.795       0.586  0.510   4.513
3           0.953    0.732    0.834       0.615  0.545   4.734
4           0.956    0.897    0.855       0.604  0.480   4.272
5           0.960    0.821    0.806       0.644  0.464   4.450
6           0.944    0.883    0.842       0.574  0.416   7.300
7           0.902    0.772    0.748       0.615  0.447   8.392
8           0.939    0.834    0.783       0.486  0.454   5.613
9           0.982    0.939    0.839       0.833  0.612   5.691
10          0.833    0.766    0.818       0.792  0.574  12.600
11          0.957    0.966    0.857       0.593  0.357  40.018
12          0.936    0.789    0.794       0.545  0.425  27.625

Cluster Medians:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.970    0.898    0.845       0.678  0.551   4.390
2           0.952    0.883    0.792       0.564  0.498   4.183
3           0.958    0.720    0.833       0.608  0.563   4.140
4           0.964    0.896    0.851       0.585  0.491   4.097
5           0.959    0.835    0.812       0.618  0.469   4.355
6           0.951    0.896    0.836       0.579  0.424   6.171
7           0.905    0.779    0.750       0.605  0.444   5.140
8           0.939    0.840    0.790       0.494  0.468   4.802
9           0.983    0.932    0.829       0.798  0.606   4.099
10          0.835    0.766    0.824       0.799  0.583  11.557
11          0.957    0.966    0.857       0.593  0.357  40.018
12          0.936    0.789    0.794       0.545  0.425  27.625

cluster_map(dfs, cluster_labels4, figsize=(4, 4), 
            title = "", cmap = 'tab20',legend_fontsize=8)

13.2.5 AZP with Initial Clustering Result

As mentioned in the discussion of hierarchical spatially constrained solutions in Chapter 12, one of the drawbacks of those methods is that observations tend to be trapped in a branch of the dendrogram or minimum spanning tree. Since AZP implements swapping of observations, we can use the endpoint of one of the hierarchical methods as the feasible initial solution for AZP. In many instances (but not all), this improves on the hierarchical solution and sometimes also obtains better solutions than straight AZP.

In all these illustrations, it is important to keep in mind the sensitivity of the results to the various tuning parameters (which are under the analyst’s control) as well as hardware implementations (which are not).

We illustrate this approach by using the end solution of SCHC as the initial solution for AZP, set by means of the init_regions argument.

ceara_clusters = pygeoda.schc(n_clusters, queen_w, data_g, "ward")
schc_cluster_labels = ceara_clusters['Clusters']
t0 = time.time()
ceara_clusters5 = pygeoda.azp_sa(n_clusters, queen_w, data_g, 
                                cooling_rate = 0.8, sa_maxit = 5, 
                                init_regions = schc_cluster_labels)
t1 = time.time()
tazp5 = t1 - t0 
print("AZP-SCHC Clustering Time: ", tazp5)
cluster_labels5 = np.array(ceara_clusters5['Clusters'])
c_stats = cluster_stats(cluster_labels5)
fit = cluster_fit(data, cluster_labels5, n_clusters,
                  correct = True, printopt = True)
clust_means, clust_medians = cluster_center(data, cluster_labels5)
print("Cluster Means:\n", np.round(clust_means, 3))
print("\nCluster Medians:\n", np.round(clust_medians, 3))

AZP-SCHC Clustering Time:  3.6895642280578613
 Labels  Cardinality
      1           88
      2           43
      3           15
      4           14
      5            8
      6            4
      7            4
      8            3
      9            2
     10            1
     11            1
     12            1

Total Sum of Squares (TSS): 1098.0
Within-cluster Sum of Squares (WSS) for each cluster: [246.863 134.79   30.167  49.716  31.033  14.614  11.088   5.956  16.99
   0.      0.      0.   ]
Total Within-cluster Sum of Squares (WSS): 541.218
Between-cluster Sum of Squares (BSS): 556.782
Ratio of BSS to TSS: 0.507
Cluster Means:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.964    0.895    0.847       0.670  0.526   4.600
2           0.950    0.836    0.790       0.578  0.494   4.460
3           0.952    0.719    0.842       0.625  0.575   4.887
4           0.944    0.883    0.837       0.563  0.414   7.450
5           0.984    0.896    0.845       0.786  0.613   5.042
6           0.968    0.835    0.784       0.578  0.413   4.327
7           0.833    0.766    0.818       0.792  0.574  12.600
8           0.874    0.711    0.742       0.612  0.420   5.307
9           0.974    0.657    0.781       0.823  0.693   5.432
10          0.957    0.966    0.857       0.593  0.357  40.018
11          0.936    0.789    0.794       0.545  0.425  27.625
12          0.951    0.824    0.771       0.573  0.443  25.464

Cluster Medians:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.968    0.896    0.846       0.669  0.523   4.380
2           0.950    0.859    0.791       0.562  0.490   4.229
3           0.950    0.709    0.845       0.608  0.569   3.923
4           0.949    0.889    0.836       0.577  0.419   6.778
5           0.986    0.924    0.844       0.798  0.616   3.992
6           0.966    0.829    0.784       0.604  0.439   4.299
7           0.835    0.766    0.824       0.799  0.583  11.557
8           0.885    0.687    0.730       0.622  0.427   5.110
9           0.974    0.657    0.781       0.823  0.693   5.432
10          0.957    0.966    0.857       0.593  0.357  40.018
11          0.936    0.789    0.794       0.545  0.425  27.625
12          0.951    0.824    0.771       0.573  0.443  25.464

cluster_map(dfs, cluster_labels5, figsize=(4, 4), 
            title = "", cmap = 'tab20',legend_fontsize=8)

Figure 13.5: AZP with SCHC initialization

13.2.6 Comparing AZP Implementations

To close the discussion of AZP, we provide a brief summary of the spatial layout and measure of fit for the greedy algorithm, the tabu search, and two simulated annealing results, one with 50 initial re-runs and one using SCHC as the starting point. The latter solution clearly achieves the highest BSS/TSS ratio.

# Comparing the results, except the basic AZP (SA) 
titles = ["AZP (Greedy)", "AZP (Tabu Search)",
                   "AZP (SA) (50 re-runs)", "AZP (SA) (SCHC initial regions)"]

cluster_map(dfs, [cluster_labels1, cluster_labels2, cluster_labels4, 
            cluster_labels5], 
            title = titles, cmap = 'tab20', 
            grid_shape = (2, 2), figsize = (8, 8),legend_fontsize=8)

ceara_clusters = [ceara_clusters1, ceara_clusters2, 
                  ceara_clusters4, ceara_clusters5]
print("The ratio of between to total sum of squares:")
for i, cluster in enumerate(ceara_clusters):
    print(titles[i],
        f": {np.round(cluster['The ratio of between to total sum of squares'], 
                      4)}")

The ratio of between to total sum of squares:
AZP (Greedy) : 0.3723
AZP (Tabu Search) : 0.4004
AZP (SA) (50 re-runs) : 0.4676
AZP (SA) (SCHC initial regions) : 0.5071

13.3 Max-p Regions

Up to this point, we had to specify the number of clusters beforehand as the first argument to be passed to the pygeoda clustering routines. The max-p approach takes a different perspective. It attempts to find the largest number of clusters possible (the max p) within a given constraint on the cluster size. Without such a constraint, the solution would always equal the number of observations.

The minimum cluster size, or min_bound is typically a function of the total for a spatially extensive variable, like population or total number of housing units. However, it can be set differently as well. Keep in mind that the larger the minimum bound, the smaller the number of clusters that will be found, and the other way around. When the minimum size is not determined by some fixed policy criterion, it may therefore be good to do some experimentation.

The max-p algorithm is essentially AZP applied to an initial feasible solution that follows a heuristic to maximize the number of regions that satisfy the minimum bounds constraint. As before, several random starting points should be explored as the results are quite sensitive to this initial solution.

We illustrate max-p for the Ceará example using simulated annealing with the min_bound set to a percentage of the total population. As for AZP, three functions are available, maxp_greedy, maxp_tabu and maxp_sa. The AZP-related arguments are the same as for the core AZP functions. The n_clusters argument is (obviously) no longer needed. Instead, the bound_variable defines the spatially extensive variable for the minimum constraint, and min_bound sets its actual value. Here, we take 'pop' as the variable (from the geodataframe), and compute the percentage for the bound using sum() multiplied by a fraction.

In the first example, we use maxp_greedy with a population minimum of 10% of the total, or 845,238. This yields a solution with 6 clusters, achieving a miserable BSS/TSS of 0.26.

t0 = time.time()
ceara_clusters6 = pygeoda.maxp_greedy(queen_w, data_g, 
                                      bound_variable = dfs['pop'], 
                                      min_bound = dfs['pop'].sum()*0.1)
t1 = time.time()
tmaxp1 = t1 - t0
print("Max-p (10%) Clustering Time: ", tmaxp1)
cluster_labels6 = np.array(ceara_clusters6['Clusters'])
c_stats = cluster_stats(cluster_labels6)
fit = cluster_fit(data, cluster_labels6, n_clusters,
                  correct = True, printopt = True)
clust_means, clust_medians = cluster_center(data, cluster_labels6)
print("Cluster Means:\n", np.round(clust_means, 3))
print("\nCluster Medians:\n", np.round(clust_medians, 3))

Max-p (10%) Clustering Time:  0.12100005149841309
 Labels  Cardinality
      1           64
      2           45
      3           35
      4           30
      5            8
      6            2

Total Sum of Squares (TSS): 1098.0
Within-cluster Sum of Squares (WSS) for each cluster: [283.714 150.575 190.247 103.556  69.006  15.465]
Total Within-cluster Sum of Squares (WSS): 812.563
Between-cluster Sum of Squares (BSS): 285.437
Ratio of BSS to TSS: 0.26
Cluster Means:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.964    0.847    0.841       0.673  0.545   4.521
2           0.951    0.844    0.792       0.577  0.494   5.009
3           0.951    0.892    0.850       0.644  0.456   6.236
4           0.974    0.878    0.841       0.657  0.554   4.818
5           0.871    0.759    0.777       0.682  0.490  10.228
6           0.891    0.806    0.809       0.680  0.515  21.378

Cluster Medians:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.965    0.875    0.843       0.661  0.556   4.202
2           0.950    0.862    0.793       0.562  0.490   4.386
3           0.957    0.896    0.850       0.613  0.452   4.985
4           0.978    0.893    0.849       0.620  0.550   4.436
5           0.866    0.750    0.777       0.680  0.471   7.154
6           0.891    0.806    0.809       0.680  0.515  21.378

cluster_map(dfs, cluster_labels6, figsize=(4, 4), 
            title = "", cmap = 'tab20',legend_fontsize=8)

Figure 13.7: Max-p with 10% population bound

13.3.1 Sensitivity Analysis

The importance of sensitivity analysis for max-p cannot be understated. To illustrate the different outcomes that can be obtained by tuning the various parameters, we give two examples using simulated annealing with a minimum bound of 5% and 3% respectively.

We first consider 5%, i.e., a population minimum of 422,619. We set the number of iterations (iterations) to 9999, use a cooling rate of 0.9 and maxit of 5. Note that this achieves a p of 13, compared to 12 in Figure 11.28 for GeoDa, but the BSS/TSS ratio is 0.321, relative to 0.385 in GeoDa.

t0 = time.time()
ceara_clusters7 = pygeoda.maxp_sa(queen_w, data_g, 
                                      bound_variable = dfs['pop'], 
                                      min_bound = dfs['pop'].sum()*0.05,
                                      iterations = 9999,
                                      cooling_rate = 0.9,
                                      sa_maxit = 5)
t1 = time.time()    
tmaxp2 = t1 - t0
print("Max-p (5%) Clustering Time: ", tmaxp2)
cluster_labels7 = np.array(ceara_clusters7['Clusters'])
c_stats = cluster_stats(cluster_labels7)
fit = cluster_fit(data, cluster_labels7, n_clusters,
                  correct = True, printopt = True)
clust_means, clust_medians = cluster_center(data, cluster_labels7)
print("Cluster Means:\n", np.round(clust_means, 3))
print("\nCluster Medians:\n", np.round(clust_medians, 3))

Max-p (5%) Clustering Time:  7.94816780090332
 Labels  Cardinality
      1           24
      2           23
      3           20
      4           19
      5           18
      6           18
      7           17
      8           14
      9           11
     10            8
     11            7
     12            3
     13            2

Total Sum of Squares (TSS): 1098.0
Within-cluster Sum of Squares (WSS) for each cluster: [ 74.807  83.702  32.777  95.473  65.341  69.935  44.14  103.769  42.53
  36.253  68.153  26.218   2.151]
Total Within-cluster Sum of Squares (WSS): 745.247
Between-cluster Sum of Squares (BSS): 352.753
Ratio of BSS to TSS: 0.321
Cluster Means:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.959    0.875    0.845       0.618  0.484   4.593
2           0.963    0.831    0.832       0.614  0.487   4.154
3           0.961    0.907    0.847       0.723  0.548   4.552
4           0.965    0.817    0.846       0.682  0.597   4.648
5           0.941    0.826    0.793       0.616  0.483   4.490
6           0.966    0.811    0.808       0.610  0.540   4.227
7           0.958    0.908    0.809       0.593  0.527   5.050
8           0.946    0.911    0.846       0.580  0.415   9.707
9           0.973    0.908    0.837       0.738  0.570   5.209
10          0.965    0.849    0.839       0.716  0.548   5.356
11          0.906    0.763    0.776       0.568  0.439  10.761
12          0.889    0.764    0.783       0.651  0.456  13.360
13          0.835    0.807    0.826       0.838  0.619  11.404

Cluster Medians:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.964    0.870    0.845       0.586  0.495   4.431
2           0.964    0.837    0.827       0.602  0.510   4.058
3           0.965    0.900    0.845       0.708  0.542   4.444
4           0.968    0.855    0.851       0.644  0.590   4.338
5           0.940    0.825    0.793       0.588  0.479   4.342
6           0.972    0.835    0.810       0.579  0.554   3.992
7           0.954    0.900    0.807       0.594  0.505   4.649
8           0.956    0.914    0.839       0.583  0.419   7.266
9           0.976    0.928    0.845       0.756  0.574   4.394
10          0.964    0.931    0.838       0.680  0.578   4.864
11          0.931    0.773    0.788       0.545  0.425   5.110
12          0.903    0.758    0.783       0.669  0.443   7.982
13          0.835    0.807    0.826       0.838  0.619  11.404

cluster_map(dfs, cluster_labels7, figsize=(4, 4), 
            title = "", cmap = 'tab20',legend_fontsize=8)

Figure 13.8: Max-p with 5% population bound

In the final example, we set the population minimum to 3%, or 253,571, with the same SA parameters. This yields 19 clusters, with a BSS/TSS ratio of 0.436, similar to the solution obtained with GeoDa (0.439). This again highlights the importance of experimentation with the various parameters.

t0 = time.time()
ceara_clusters8 = pygeoda.maxp_sa(queen_w, data_g, 
                                      bound_variable = dfs['pop'], 
                                      min_bound = dfs['pop'].sum()*0.03,
                                      iterations = 9999,
                                      cooling_rate = 0.9,
                                      sa_maxit = 5)
t1 = time.time()
tmaxp3 = t1 - t0
print("Max-p (3%) Clustering Time: ", tmaxp3)
cluster_labels8 = np.array(ceara_clusters8['Clusters'])
c_stats = cluster_stats(cluster_labels8)
fit = cluster_fit(data, cluster_labels8, n_clusters,
                  correct = True, printopt = True)
clust_means, clust_medians = cluster_center(data, cluster_labels8)
print("Cluster Means:\n", np.round(clust_means, 3))
print("\nCluster Medians:\n", np.round(clust_medians, 3))

Max-p (3%) Clustering Time:  10.129745960235596
 Labels  Cardinality
      1           33
      2           16
      3           14
      4           13
      5           12
      6           12
      7           12
      8           10
      9           10
     10           10
     11            8
     12            7
     13            7
     14            6
     15            5
     16            4
     17            2
     18            2
     19            1

Total Sum of Squares (TSS): 1098.0
Within-cluster Sum of Squares (WSS) for each cluster: [71.542 43.964 51.354 35.693 57.527 25.585 33.225 20.233 26.853 31.334
 15.088 84.027 28.591 39.958 16.279  6.894 15.465  5.476  0.   ]
Total Within-cluster Sum of Squares (WSS): 609.089
Between-cluster Sum of Squares (BSS): 488.911
Ratio of BSS to TSS: 0.445
Cluster Means:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.963    0.904    0.853       0.703  0.543   4.539
2           0.955    0.896    0.853       0.589  0.474   4.279
3           0.972    0.909    0.840       0.675  0.528   4.158
4           0.952    0.825    0.795       0.595  0.458   4.129
5           0.972    0.830    0.826       0.738  0.537   5.244
6           0.974    0.861    0.846       0.653  0.572   4.743
7           0.954    0.887    0.805       0.612  0.550   5.068
8           0.952    0.761    0.849       0.626  0.579   4.278
9           0.938    0.876    0.825       0.526  0.418   6.738
10          0.956    0.784    0.776       0.571  0.509   4.188
11          0.956    0.914    0.810       0.582  0.482   4.655
12          0.949    0.865    0.836       0.576  0.425  12.688
13          0.967    0.783    0.807       0.611  0.471   5.202
14          0.902    0.772    0.748       0.615  0.447   8.392
15          0.982    0.939    0.839       0.833  0.612   5.691
16          0.939    0.666    0.838       0.618  0.567   4.999
17          0.891    0.806    0.809       0.680  0.515  21.378
18          0.836    0.766    0.827       0.822  0.597  13.644
19          0.814    0.709    0.795       0.710  0.497   7.982

Cluster Medians:
          mobility  environ  housing  sanitation  infra  gdpcap
cluster                                                       
1           0.966    0.905    0.856       0.698  0.543   4.390
2           0.964    0.896    0.851       0.577  0.461   4.208
3           0.974    0.918    0.835       0.635  0.518   3.993
4           0.954    0.827    0.810       0.568  0.467   4.229
5           0.971    0.861    0.838       0.777  0.525   5.160
6           0.980    0.878    0.852       0.634  0.562   4.387
7           0.950    0.887    0.804       0.598  0.552   4.635
8           0.950    0.774    0.853       0.617  0.580   3.888
9           0.940    0.870    0.827       0.510  0.419   6.024
10          0.963    0.765    0.782       0.554  0.518   3.978
11          0.956    0.928    0.810       0.566  0.483   4.659
12          0.957    0.882    0.836       0.575  0.434   7.957
13          0.965    0.736    0.805       0.638  0.491   4.313
14          0.905    0.779    0.750       0.605  0.444   5.140
15          0.983    0.932    0.829       0.798  0.606   4.099
16          0.939    0.664    0.832       0.595  0.590   4.276
17          0.891    0.806    0.809       0.680  0.515  21.378
18          0.836    0.766    0.827       0.822  0.597  13.644
19          0.814    0.709    0.795       0.710  0.497   7.982

cluster_map(dfs, cluster_labels8, figsize = (5, 5), 
            title = "", cmap = 'tab20',legend_fontsize=8)

Figure 13.9: Max-p with 3% population bound

13.4 Practice

We can now use the max-p approach to compare the previous solutions with a pre-set number of clusters to what is obtained when this becomes a parameter to be optimized. In each instance, considerable sensitivity analysis is required.