The SPM (Salford Predictive Modeler) software suite offers several tools for clustering and segmentation, including CART, Random Forests, and a classical statistical module, CLUSTER. In this article we illustrate the use of these tools with the well-known Boston Housing data set (pertaining to 1970s housing prices and neighborhood characteristics in the greater Boston area).
We open the data set and go to the Model Setup dialog to see:
Observe that we are using the CART analysis engine, that we have set the Analysis Type to Unsupervised, and that we have checked ALL variables as predictors without selecting a target (the Target Variable box is empty). What CART will do here is take the original data, make a scrambled copy of it, label the first version “Original” and the second “Copy,” and then build a CART tree that tries to distinguish between the two versions of the data. The key intuition behind this style of analysis is that the “Copy” is intended to follow a roughly uniform, structureless pattern, whereas the “Original” data presumably does have structure. Terminal nodes of the CART tree that are particularly rich in “Original” data exhibit marked structure and can be considered regions of high density. If such nodes are of reasonable size they can be considered clusters in the conventional sense.
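The scrambling idea is easy to sketch outside SPM. Below is a minimal illustration on synthetic data, using scikit-learn's decision tree as a stand-in for CART; shuffling each column independently is one common way to produce a structureless copy that preserves every variable's marginal distribution:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy stand-in for the real data: two strongly related columns, i.e. structure.
original = pd.DataFrame({"x1": rng.normal(size=500)})
original["x2"] = 2 * original["x1"] + rng.normal(scale=0.1, size=500)

# The "Copy": shuffle each column independently, destroying the joint
# structure while preserving each marginal distribution.
copy = original.apply(lambda col: rng.permutation(col.to_numpy()))

X = pd.concat([original, copy], ignore_index=True)
y = np.array([1] * len(original) + [0] * len(copy))  # 1 = Original, 0 = Copy

# Node-size limits mirror the dialog settings discussed below.
tree = DecisionTreeClassifier(min_samples_leaf=30, min_samples_split=70,
                              random_state=0).fit(X, y)
node = tree.apply(original)  # terminal node of each "Original" record
```

Leaves rich in records labeled 1 are the high-density regions the article describes.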
To control the size of the terminal nodes we use the Limits dialog to set the smallest possible terminal node to 30 records and the smallest splittable node to 70. There is nothing special about these settings, which were chosen simply to ensure that no “cluster” is smaller than 30 records. The value of 70 for the smallest parent node is more or less forced on us once we have set the minimum terminal node size to 30, since a parent gives rise to two children: if each child contains at least 30 records, the parent contains at least 60, and we give the split a little flexibility by requiring the parent to be at least a little larger than 60.
The resulting tree, grown using cross-validation as the test method, is shown below; the bright red nodes contain the results of interest.
Double-clicking any node reveals its defining “rules.” For example, terminal node 10 describes a collection of neighborhoods that have below-average crime rates (typically below the median), below-average pollution levels, and above-average distances from employment centers.
This “cluster” is made up of 200 neighborhoods, which is too large for it to be very distinctive. If we allow smaller terminal nodes, the most highly concentrated of them are likely to define more interesting segments.
What about the nodes least populated with “Original” records? What can they tell us about the data? One interpretation, when the “Original” records are truly rare, is that we have identified anomalies: patterns that are unexpected in the data being analyzed. Below we hover the mouse over one of the blue terminal nodes to reveal a node that contains only 5 “Original” but 80 “Copy” records, suggesting that the pattern is indeed unusual.
The rule indicates that few neighborhoods are both close to employment centers and low in their percentage of older buildings. This is something we might not have noticed while exploring the data in other ways.
Using CART Unsupervised Learning With Large Datasets
The method we have described can be used to find quite interesting segmentations in a two-stage process.
First, run CART in unsupervised mode allowing the tree to reach some moderate number of terminal nodes, such as 200.
Second, score the data, assigning every record to its terminal node, and compute the mean, median, or some other representative vector for each terminal node. If all variables are continuous, for example, we could represent each terminal node by its vector of mean values.
Next, take this table of means, one row per terminal node, and treat it as a new data set to be fed into a clustering procedure (hierarchical clustering would make a lot of sense given the small size of this derived data set).
Finally, once we have arrived at a satisfactory clustering of the mean vectors, we can use it to assign every record in the original data set to its appropriate cluster. How satisfactory this is will depend in part on the variability of the actual data around the mean vectors.
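The two-stage process can be sketched as follows, again with scikit-learn and SciPy standing in for the SPM engines and synthetic data standing in for the Boston set:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Original data with some structure, plus its column-shuffled copy.
original = pd.DataFrame(rng.normal(size=(1000, 4)), columns=["a", "b", "c", "d"])
original["d"] = original["a"] + rng.normal(scale=0.2, size=1000)
copy = original.apply(lambda col: rng.permutation(col.to_numpy()))
X = pd.concat([original, copy], ignore_index=True)
y = np.array([1] * len(original) + [0] * len(copy))

# Stage 1: a deliberately large unsupervised tree (up to ~200 terminal nodes).
tree = DecisionTreeClassifier(max_leaf_nodes=200, min_samples_leaf=5,
                              random_state=0).fit(X, y)
node = tree.apply(original)  # terminal node of each original record

# Stage 2: one mean vector per terminal node, then hierarchical clustering
# of those means into, say, 4 clusters.
means = original.groupby(node).mean()
labels = fcluster(linkage(means.to_numpy(), method="ward"),
                  t=4, criterion="maxclust")

# Assign every original record the cluster of its terminal node.
node_to_cluster = dict(zip(means.index, labels))
record_cluster = np.array([node_to_cluster[n] for n in node])
```

The dictionary lookup at the end is the record-to-cluster assignment step described above; its quality depends on how tightly the records in each terminal node hug their mean vector.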
Clustering With Random Forests
The only other SPM data mining engine that works with our data-scrambling technique is Random Forests. Its mechanism is quite different from CART's, even though it leverages the same contrast of the “Original” data versus a “Copy” that is intended to be more or less structureless. Random Forests clustering works via the construction of a proximity matrix. The details of this matrix are described in other Salford publications, but the essential idea is simple: take any pair of data records, drop both down every tree in the forest, and count the number of times the two records end up in the same terminal node. You might think this would occur very rarely, but it depends on the distribution of the data. If two records end up in the same terminal node in every tree they must be very similar, or even identical. If two records never end up in the same terminal node they are unlikely to have much in common (in terms of their data values), because each tree defines its terminal nodes with a different set of conditions involving different subsets of variables; records that never match differ regardless of which subset of variables we focus on. The continuum between never matching and always matching defines a similarity metric that can be used to drive hierarchical clustering.
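The proximity computation is easy to emulate with scikit-learn's random forest as a stand-in for Salford's Random Forests (one assumption to note: average linkage is used below rather than the WARD linkage mentioned later, since SciPy's Ward method expects Euclidean coordinates rather than an arbitrary precomputed dissimilarity):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n = 300

# Original data with structure, plus its column-shuffled "Copy".
original = rng.normal(size=(n, 3))
original[:, 2] = 1.5 * original[:, 0] + rng.normal(scale=0.1, size=n)
copy = np.column_stack([rng.permutation(original[:, j]) for j in range(3)])
X = np.vstack([original, copy])
y = np.array([1] * n + [0] * n)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# apply() returns, for each record, its terminal node in every tree.
leaves = forest.apply(original)                      # shape (n, 100)

# Proximity of records i and j: fraction of trees that place them in the
# same terminal node. Identical records always match; unrelated ones rarely do.
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# 1 - proximity is a dissimilarity that can drive hierarchical clustering.
dist = squareform(1.0 - prox, checks=False)
labels = fcluster(linkage(dist, method="average"), t=3, criterion="maxclust")
```

Note that the proximity matrix here is computed only over the “Original” records: it is their mutual similarity, as judged by the forest, that defines the clusters.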
The proximity matrix is the principal mechanism for discovering clusters in Random Forests. Further, a proximity matrix can optionally be created for any Random Forests classification model, meaning that the analyst can approach the challenge from various perspectives. The target variable could be “Original” versus “Copy,” but it could also be any real variable in the original data, allowing for models that involve no duplication or scrambling of data at all. Running different models can yield different proximity matrices and thus different clustering solutions.
A helpful facility in SPM Random Forests is the option to run a hierarchical clustering process immediately after the forest and proximity matrix have been constructed. This facility is available only from the command line and can be invoked with commands following this pattern:
LOPTIONS UNS = YES
KEEP CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS,
     RAD, TAX, PT, B, LSTAT, MV
RF SAVE "boston_unsup"
RF GO POST=YES JOIN=YES LINKAGE=WARD
The first command puts Random Forests into unsupervised mode (UNS=YES), the KEEP command lists the variables permitted to enter the trees, the RF SAVE command indicates the root name that Random Forests will use when saving several output files, and the final command launches the run, specifying that the clustering should proceed using WARD linkage.
There will not be any visual output, but the output file with the word “join” in its name (boston_unsup_join.csv in this example) will contain clustering solutions for 4, 6, and 10 clusters. Every record in the training data set is assigned to one of the available clusters; three columns in the output file, named CLUSTER_1_1, CLUSTER_1_2, and CLUSTER_1_3, contain the cluster assignments. For other numbers of clusters you will, at this point in time, need to run a clustering procedure outside of SPM.
The CLUSTER module mentioned at the outset is a conventional implementation of the KMEANS algorithm, included in SPM for convenience. Look for an updated version of this note after April 2014 to learn more about the advances in all of our clustering machinery in the 2014 edition of SPM.