This blog post is extracted from one of Salford Systems' video tutorial lectures offered by Dan Steinberg. Take some time out of your day to improve your knowldege of CART.
This video tutorial will help you:
- Understand how CART grows a tree
- What splitters, competitors and surrogates are
- Why they are important when building a CART model
If you are more of a reader, check out the transcript here:
Welcome to another in our series of Salford Systems online training videos. This is Dan Steinberg. Today we will be discussing CART competitors and surrogates. The topics that we'll cover today will include: how does CART grow a tree, what is a splitter, what is a competitor, why do we want to pay attention to competitors, what is a surrogate, what makes a good surrogate, why are surrogates important, surrogates in deployment, and surrogates and variable importance.
If you haven't already done so, let me urge you to visit our website, download the software that will allow you to run along with us and also download these sample datasets.
So we're going to start discussing the CART splitting process and in order to get a handle on that, we're going to open the data set here and I'm going to open the Eurotelcomini excel file. So let's just click on the modeling option here and let's choose response as our dependent variable and because we're not trying to do a real example here, that is we're not trying to do a real analysis, we just want to illustrate some features of CART let's just pick the telephone bill here, the telephone bill is our one and only predictor.
Let' go ahead and click the start button. We'll see something interesting here, only one split has been made, but that's all right. What we want to do here is understand how CART gets started. So let's go ahead and look inside the root node, this is where we begin. 704 cases of 0 and 126 cases of 1s, the 1s are the responders and what actually happened here, what CART did is it looked at the variable telephone bill.
It tried to find the value of the splitter, in this case the number that we're going to use here to make our dividing line. What is the best value for this split point? So CART comes back and tells us that it is $50 in this case, and notice what happens, we get a 10.8 over here and we get a 22 over here, both of these numbers are quite different from the 15 that we see over here. So what CART did is it looked for a variable, in this case there was only one to look at. A value, in this case there were several values and it chose the one which was best and we'll talk a little more about what best means here but intuitively, what best is trying to get at is a large separation between the two groups.
So normally, in a more complex CART and in any CART tree we typically want standard splits which are based on only one predictor, we have other splits but those will be discussed on another session and we're going to create a database rule, which looks like this. A record goes left if the splitter variable is less than or equal to some split value. So some examples, for example, we go left if age is less than 35 years old, if the credit score is less than 700, if the telephone bill is less than $50. These are typical of the CART database rules which are used to divide a block of data into two parts. So this is what we just saw when we were looking at the Salford Predictive Modeling software running live. Now what would have happened if we had split that root node at a different value? Well we can actually arrange for that and we can arrange for that in the software by going to the force tab. So if we go to the force split over here and again we'll use the telephone bill because that is the only variable we want to look at. We want to set the split value and change it. Over here we get a listing of what the split value are that are potential values in the data. So up here let's type in 25 and see what happens. And what we get now, we just look at the tree details and focus on the first split.
Notice that we have a valid database rule and the rule is such that yes some records do fall below the threshold and some others fall above the threshold so it is a technically valid split. In terms of what it accomplishes for us, it's minimal because here we see a 14.3% response rate and here we see a 15.5, sure they're different, but they are nowhere near as different as this other split over here. So what CART will do in the process of looking for a split is it will experiment with all the values that are available.
Now in this particular case we did not have that many. But if we had a 1000 different values for the telephone bill, we're actually going to split a 1000 different times and each time we're going to look at this division over here and CART will use one of several formulas to actually evaluate the goodness of split, but intuitively what CART is looking for is this kind of wide separation between the two children and we're going to be able to rank everyone of our splitters everyone of our split points on some goodness of split and we're going to end up choosing the value which is best.
On the criterion we happen to be using and in this case the default for CART, the criterion is based on the genie measure of improvement and that is something we discuss in detail in another session. Ok so let's closed these now and go back to the PowerPoint slides and here we have a screenshot of what we were speaking about before and a comparison of the two alternative splits. Normally in CART you're not going to see the alternatives, but in this particular case we have created those alternatives for ourselves.
So far we've discussed the CART process of splitting using a continuous variable, but that is not the only possibility. The splitter variables need not be numeric, they can be text so we can have words like good or bad or the names of cities where an individual might have lived. London, Madrid or Paris. If we're looking at some medical information, we might have a diagnosis code and if we treat that this might be coded as a number but we want to indicate that it is a categorical variable because there is nothing about diagnosis 111 which would suggest that it is 3 times approximately the value of 35, it's just a different number a different coding and so we need to forget about the numeric order and treat these as different labels. CART is fine with creating splits like that and what it does it creates a split that follows this kind of pattern that we see here.
Here is an example that comes from our dataset, so let's go back to our dataset and make that example happen. We'll go over here. Let's choose the city variable as our predictor forget about the telephone bill right now, but let's make sure that the city variable is declared categorical so we're not now going to try to split the way we would an ordinary continuous variable and let's have a look and see what we get here.
We have the conventional CART familiar method, which is a rule that tells us if we go left and when we go right. But there is something different here is that we won't use an inequality instead we use an explicit listing. Over here we say you go to the left if the city is coded as number 1, number 4, or number 5 and you go right if the city is coded 2 or 3. So we actually have to explicitly list the values of the variable that send the data to one of the partitions or the other. Again a screen shot of the same thing.
So the CART mechanism for splitting data is always the same. We are given a block of data, this could be all of our data and we are starting from scratch. Could be a small part of our data obtained after already doing a lot of slicing and dicing, this would be slicing and dicing done by CART. When we work with a block of data, we do not take into account how we got to that block of data, we are completely memoryless. We do not consider any information which might be available outside of the block of data, we remain totally agnostic to anything outside this particular chunk of data and the block of data to be analyzed is our entire universe and nothing else can exist for us, this means we always process every block of data in exactly the same way and that is what makes this process so efficient and such an interesting technology.
Now there are a few things that have to be the case for our data in order to do any splitting. Remember, we are now looking at a case with two value dependent variable and we are trying to make predictions for it, so for our block of data to be splitable first of all it must contain a sufficient number of data records, we get to decide what that number must be obviously it can't be less than two records, that is the smallest set of data that we can possibly dream of splitting but in a larger database we can set the minimum a little bit higher. We have experimented with numbers like 10, 20, 50, 100, 200 we haven't usually experimented with larger values but there is nothing preventing you from doing so. But if you're working with a small database such as those encountered in biomedical research, say containing 200 records sometimes even less, you will want to allow the minimum size to be small. If you set the minimum large, then you will stop the tree growing process before it barely gets started. If you are working with hundreds of thousands or millions of records there is no harm in trying a larger minimum.
CART Competitors and Surrogates
We're finally getting to the point to describe what a competitor is. Here we are talking about making the split. To split this block of data, and we'll start referring to the splitting of the block of data splitting a node we're going to search every available predictor.
We ran a couple of examples where we only ran one predictor and that of course is going to be the exception and hardly the rule. In a typical analysis we may have 10, 50, 100, 1000, maybe 100,000 possible predictors. What CART is going to do, it's going to look at every predictor and for every predictor it's going to make a trial split at every distinct value for that predictor. Recall before, we were looking at splits at the value of the telephone bill, the optimum was made at 50, we also made a trial split at 25 just to see what would happen and we saw at the value 25, it wasn't a very good split. But at the value of 50, it was an excellent split. While that process is going to be followed for every variable, for every possible value, of every variable. So for every trail split we compute a goodness of split measure, which in the CART terminology will be referred to as an improvement.
For each predictor we find the split value that yields the best improvement for that splitter if that was the only variable available to us. Once we've done that for every variable we can then rank the splitters in descending order then use the overall best splitter in order to actually make the partition of data, and that is the process that CART goes to move from the node that it is working on now to the generation of two child nodes.
So the ranked list of splitters is also known as the competitor list. CART always computes the entire list, no matter how many variables you have in your dataset, if you're using them all, then CART has to search all of them and therefore CART always knows the complete ranking of every variable it looks at. Now to save space CART normally only displays the top 5 competitors within a node, but you can change that in order to see more. But there is an exception and that the root node at the top of the tree always displays all of the competitors.
Let's continue our example now by going back to the modeling dialogue. I've already prepared this, but let me point out what I've already done here. First of all, I selected all the predictors, by doing this. Then I removed the record ID, because that is a variable that we don't want in the model, everything else is fine. Again, confirming the response is listed as the target variable and then the only other thing I had to make sure of is that the city and marital status were coded as categorical because in the data, which is quite old, therefore didn't contain any text information we used numbers to represent those particular statuses.
Once I have that set up then all I need to do is click on the start button and I've got my results and what I'm really interested now is in the root node, so let's go ahead and click on this left edge of the performance curve so that is easier for us to see where the root node is and now let's double click on that node.
You can double click on any node and get the similar display, but for right now we're looking at the root. So we see three tabs here, and we also see three areas that are being displayed under competitors and surrogates. So let's look at this list of competitors over here. We were told that the root node was split by this variable over here, a price that was charged for the handset, then we see a competitor, a telephone bill is listed as a competitor, then another pricing variable that was listed as a second ranked competitor, then city, then age, and finally this variable travel time.
The graph represents the performances of these particular variables and the graph is there to give you a quick visual feel for whether the winning variable that is the variable that was ultimately used to make the split was very much better than the second best variable. You can see over here, that in fact the scores are relatively close to each other, based on improvement over here.
Let's have a closer look, these numbers are often pretty small and I don't mean font size here but the actual value. We will frequently see numbers that look like .025, .05, etc. You just have to get use to ignoring the leading 0s if there are any. And I will just read this score here as 253 and I'll read the second score as 208. So you can see there is a reasonable drop from going to 250 to 208. By the time you get to the third variable where the score is 81 it is dramatically less. We see that the top variable does have, proportionately speaking a somewhat higher score than second best, which is quite a bit higher than third best.
The full list of splitters gives us a little bit of additional information. We get, first of all, the winning variable, that's the main splitter at the top, and we get the number one competitor, the number two competitor, etc. That list goes all the way to the ninth variable. In this data set, the improvement scores. We are also shown the number going left and the number going right, and the number missing. This is useful information for us for the following reason.
If you look at a variable, like for example as made by the pager variable down here, which does get a low ranking but look at the nature of the split only 63 records go to the right and 761 go to the left, that's a very unbalanced split. That may or may not be something you want to see. So there's room here for you to impose some judgment especially if you were to see that the winning splitter at a very unequal partition of the data and that is something you didn't want to see.
In addition, the number of missing values that are found in the data set at the root node for that variable and again it turns out that the winner has zero missing so there's nothing to complain about there, but we can see that there are other variables that have quite a few missing for example the travel time variable over here is missing 180 times out of 830 so that's on the order of one quarter and that may be too much and we may not want to allow a variable to be a splitter even if it has certain attractive characteristics if it turns out that it is more missing than we feel comfortable using that high up in the tree. There's a lot that we can say about missing values and how CART handles them and the controls you have, but right now we're talking about the judgmental dimension where you can look over these results and then either decide that you are happy with everything that CART did or you see something that you would like to do something about and perhaps modify the way in which the whole analysis goes forward.
Why Is This Important?
Why do we care about competitor splits? First of all, as we said before it is useful to know if the best splitter is far better than all the rest or only slightly better. If it is only slightly better, than we may even consider changing things and even dropping the top variable or taking other actions, which would prevent that variable from showing up in the root. It is useful to know which predictors show up near the top. If we have a group of variables that are close together in competitive scores it's useful to observe whether these are very different from each other, in terms of what they're measuring or if they are reflecting the same underlying information. For example, if I am looking at financial data about individuals and I have both a credit score and a bankruptcy score coming from different agencies, these are likely to contain very similar information. They are based on slightly different background data and information but they ought to actually have very similar information content, so it shouldn't be a surprise that they both show up near the top and there is an aspect of redundancy there if they are both available to us. But there may be other variables that may be strong competitors with similar scores but they measure completely different aspects of process we are studying.
It is useful to know if a strong but perhaps 2nd best predictor splits the data more evenly than the first. This is the second time, or third time we mentioned that topic in this particular session and we will talk about it in another one of the videos, but there are times in which you will find a tree much more acceptable if it is relatively balanced vs. unbalanced in terms of how the data is divided as you move down the tree.
If you find you like the second best predictor for whatever reason, much better, and it doesn't have to be the second, it could be the third or the fourth, but there is a variable there that is reasonably good but doesn't manage to be the absolute winner you might try forcing that second best or third best predictor into the root to see what happens. There will be times when doing this will actually give you a better overall tree. The pattern of the top splitters may reflect problems. For example, you may find that there are several very strong variables that all have very similar improvements and they're all so good that you have to label them as too good to be true, if that's the case rather than trying to remove these one at a time as you models, look at the list of competitors in the root node and see if you actually have not just one problem variable, but a collection of them. This will save you time and effort and you can remove them altogether.