1 Introduction
The main goal of this book is to serve as a practical companion to the treatment of spatial clustering methods in the two GeoDa books by Luc Anselin, An Introduction to Spatial Data Science with GeoDa - Volume 1, Exploring Spatial Data (Anselin 2024a), and An Introduction to Spatial Data Science with GeoDa - Volume 2, Clustering Spatial Data (Anselin 2024b). In the remainder, these two volumes will be referred to as, respectively, the GeoDa Explore Book and the GeoDa Cluster Book.
The objective is to illustrate as much as possible how the functionality of GeoDa can be replicated using Python, based on the same empirical examples as in the original books. This introduces extensive functionality contained in the specialized package pygeoda as well as in the familiar scikit-learn ecosystem for machine learning.
The Companion contains only a minimal discussion of the actual methods. Instead, the focus is on replicating the examples in the GeoDa books for the same data sets. It is intended to be used in a lab environment or for self-study.
Very little computing background is assumed, beyond familiarity with basic Python commands and the essentials of pandas and geopandas data frames. All other functionality is developed from scratch. Each chapter follows a similar organization, starting with a list of the required Python packages and the data sets used in the examples. The remainder of each chapter typically follows the same outline as the matching chapter in the GeoDa books. Each chapter closes with some suggestions for further practice.
Not all the empirical illustrations in the GeoDa books can be fully replicated using Python code. Specifically, the powerful interactive linking and brushing of multiple maps and graphs cannot (yet) be implemented in a Python setup. However, counterparts of all the static illustrations have been provided. In almost all situations, the results are identical to those in the matching chapters of the GeoDa books. In a few instances, particularly in the use of pygeoda, minor differences in the architecture of the C++ and Python implementations may cause slight discrepancies.
The book is made up of thirteen chapters beyond this introduction. Chapter 2 provides a brief introduction to the mapping functionality that is available with geopandas. Chapters 3 and 4 cover local clusters, respectively multivariate local spatial clusters and density-based clustering. Chapters 5 and 6 deal with dimension reduction. The remaining chapters are devoted to so-called unsupervised machine learning in the form of clustering. Chapters 7 to 10 deal with classic (non-spatial) clustering methods, whereas Chapters 11 to 13 cover situations where a spatial contiguity constraint is imposed. The book closes with a review of cluster validation measures.
1.1 Required Python Packages
Each chapter provides a list of the Python packages required for the empirical illustrations contained in it. Below, we provide a comprehensive list of all the packages used in the book. If not present, these may need to be installed by means of pip install -U package_name (use -U to make sure the latest version is obtained).
In the order in which they are first mentioned, the packages are listed below, each with its minimum version number and a brief description of the associated functionality:
- geopandas (version 1.0.1): spatial data structures, basic maps
- matplotlib.pyplot (version 3.9.2): plotting and mapping functions
- spatial-cluster-helper (version 0.1.2): utility functions to load data and interpret cluster results
- numpy (version 2.0.2): matrix (array) manipulations
- pygeoda (version 0.0.8-1): specialized Python interface to GeoDa spatial weights, ESDA and clustering functions
- sklearn.cluster (version 1.5.1): density-based clusters, hierarchical clusters, partitioning clusters and spectral clustering from scikit-learn
- matplotlib.lines (version 3.9.2): line plotting functionality
- sklearn.preprocessing (version 1.5.1): data standardization from scikit-learn
- sklearn.metrics (version 1.5.1): pairwise distances, silhouette scores, adjusted Rand index and mutual information score from scikit-learn
- sklearn.decomposition (version 1.5.1): principal components from scikit-learn
- pandas (version 2.2.2): generic data frames
- sklearn.manifold (version 1.5.1): multidimensional scaling and t-SNE from scikit-learn
- libpysal (version 4.12.1): spatial weights manipulation
- scipy.cluster.hierarchy (version 1.14.1): dendrogram for hierarchical clustering
- kmedoids (version 0.5.3.1): specialized package to carry out K-Medoids clustering
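As a quick check before starting, the installed versions can be compared by eye against the minimum versions given above. The following is a minimal sketch that uses only the Python standard library. The dictionary name MINIMUM_VERSIONS and the function installed_versions are hypothetical names introduced for illustration, and the keys are assumed to be the pip distribution names (for example, scikit-learn for the sklearn modules). The sketch only reports what is installed and does not attempt a formal version comparison.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed pip distribution names, with the minimum versions from the list above.
# scikit-learn provides the sklearn.* modules; matplotlib provides both
# matplotlib.pyplot and matplotlib.lines; scipy provides scipy.cluster.hierarchy.
MINIMUM_VERSIONS = {
    "geopandas": "1.0.1",
    "matplotlib": "3.9.2",
    "spatial-cluster-helper": "0.1.2",
    "numpy": "2.0.2",
    "pygeoda": "0.0.8-1",
    "scikit-learn": "1.5.1",
    "pandas": "2.2.2",
    "libpysal": "4.12.1",
    "scipy": "1.14.1",
    "kmedoids": "0.5.3.1",
}


def installed_versions(requirements):
    """Return {name: installed version string, or None if not installed}."""
    found = {}
    for name in requirements:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Print one line per package: the minimum version and what was found.
    for name, installed in installed_versions(MINIMUM_VERSIONS).items():
        status = installed if installed is not None else "not installed"
        print(f"{name}: minimum {MINIMUM_VERSIONS[name]}, found {status}")
```

Any package reported as not installed, or with a version older than the minimum, can then be installed or updated with pip install -U as described above.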
1.2 Sample Data Sets
The empirical illustrations use the same example data sets as the two GeoDa books. However, whereas for GeoDa these are available directly within the software, they must be installed in a working directory to obtain results with the Python code. The book assumes that the relevant data directories are in a ./datasets/ directory.
All the data sets are available for downloading from the GeoDaCenter sample data directory at https://geodacenter.github.io/data-and-lab/. In addition, the spatial_cluster_helper module contains the ensure_datasets function which is used in each chapter to check whether the required data sets are available and in the correct directory. If not present, the files are downloaded and installed.
The respective data sets are, in the order in which they are used:
- chicago_commpop: 2000 and 2010 population data for 77 Chicago Community Areas
- liquor: 571 liquor store locations in Chicago (2015)
- Chi-SDOH: socio-economic determinants of health for 791 Chicago census tracts in 2014, from the study by Kolak et al. (2020)
- italy_banks: performance measures for 261 Italian community banks (2011-2017), from Algeri et al. (2022)
- Chi-CCA: socio-economic characteristics for 77 Chicago Community Areas
- spirals: 300 two-dimensional data points used to illustrate the spectral clustering method
- ceara: Zika and microcephaly incidence and socio-economic characteristics for 184 municipios in the Brazilian state of Ceará, from Amaral et al. (2019)
Further details on the respective data sets can be found in Chapter 1 of the GeoDa Explore Book.
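To see at a glance which data sets are already in place, a simple directory check can be used. The sketch below is not the ensure_datasets function from spatial_cluster_helper, and it does not download anything. It is a hypothetical helper (missing_datasets) that assumes each data set is stored in a subdirectory of ./datasets/ named after the data set. That naming is an assumption and may need to be adjusted to the actual directory layout.

```python
from pathlib import Path

# Data set names in the order listed above. Using these as subdirectory
# names is an assumption made for this illustration.
DATASETS = [
    "chicago_commpop",
    "liquor",
    "Chi-SDOH",
    "italy_banks",
    "Chi-CCA",
    "spirals",
    "ceara",
]


def missing_datasets(names, root="./datasets"):
    """Return the names that have no matching subdirectory under root."""
    root_path = Path(root)
    return [name for name in names if not (root_path / name).is_dir()]


if __name__ == "__main__":
    missing = missing_datasets(DATASETS)
    if missing:
        print("Missing data sets:", ", ".join(missing))
    else:
        print("All data sets found in ./datasets/")
```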