Simply Salford Blog

Requested vs. Actual Tree Sizes in TreeNet Models

Posted by Dan Steinberg on Tue, Jul 2, 2013 @ 11:46 AM

One of the most important controls in TreeNet is the maximum number of terminal nodes permitted in each tree (the NODES=number parameter setting on the TreeNet command). You would think that if you ask for, say, NODES=4, none of your trees would have more than 4 terminal nodes. However, that is not exactly how things turn out unless your data contain NO missing values. If there are missing values in your data and variables with missing values are used as splitters, then the trees may actually contain more terminal nodes than you requested.

To explain, we will start by looking at 2-node trees, that is, trees that are supposed to contain only one split of the root node. Suppose X1 is the splitter and that it is sometimes missing. Then the TreeNet tree will begin with a split as follows:

Is X1 missing?
    Yes:  Terminal_Node_1
    No:   Is X1 <= value?
              Yes:  Terminal_Node_2
              No:   Terminal_Node_3


Here we see that for the tree to make any use of the splitter X1 on its good (non-missing) values, we require three terminal nodes. In TreeNet, if all of your predictors have missing values, then even if you request NODES=2 you should expect to see trees with three terminal nodes.
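To make the three-node tree above concrete, here is a minimal Python sketch of how a single record would be routed through it. This is only an illustration, not TreeNet code: the function name route and the threshold of 0.5 are hypothetical placeholders, and missing values are assumed to be represented as None or NaN.

```python
import math


def route(x1, threshold=0.5):
    """Return the terminal node one record reaches in the tree above.

    `threshold` stands in for the split "value"; 0.5 is an arbitrary
    placeholder chosen for illustration only.
    """
    # First question: "Is X1 missing?" (None or NaN assumed to mean missing).
    if x1 is None or (isinstance(x1, float) and math.isnan(x1)):
        return "Terminal_Node_1"
    # Second question: the one split on the good values of X1.
    if x1 <= threshold:
        return "Terminal_Node_2"
    return "Terminal_Node_3"


print(route(None))  # Terminal_Node_1
print(route(0.2))   # Terminal_Node_2
print(route(0.9))   # Terminal_Node_3
```

Only the second test is a split on the values of X1, yet the records can land in three different terminal nodes.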

The machinery of the TreeNet engine is actually a bit more complicated still. In the example above, we followed an "Is X1 missing?" split with a split on X1 itself, but there is no guarantee that TreeNet will elect to do this. When the search begins for the best split on the right-hand ("No") child of the root in the above tree, TreeNet might instead follow with a split on "Is X7 missing?". If this occurred, we would end up with at least 4 terminal nodes rather than 3.

The general point is that splits on missing-value status are not counted by TreeNet in the process of building the tree. And because a series of "Is Xj missing?" questions can line up one underneath the other, it is technically possible to generate fairly large trees even if NODES=2 has been set. But you can be sure that if NODES=2 has been set, each tree will contain no more than one split on an actual variable value (as opposed to a missing-value status).
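The bookkeeping described here can be sketched in a few lines of Python. This is a toy model under stated assumptions, not TreeNet's internal representation: the Leaf and Split classes, the kind field, and the two counting functions are all hypothetical names introduced for illustration. The example tree chains "Is X1 missing?" and "Is X7 missing?" ahead of a single split on the values of X1.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    name: str


@dataclass
class Split:
    variable: str
    kind: str  # "missing" for "Is Xj missing?", "value" for "Xj <= value"
    yes: Union["Split", Leaf]
    no: Union["Split", Leaf]


def terminal_nodes(node):
    """Count every terminal node, however it was reached."""
    if isinstance(node, Leaf):
        return 1
    return terminal_nodes(node.yes) + terminal_nodes(node.no)


def value_splits(node):
    """Count only splits on actual variable values, ignoring missing-status splits."""
    if isinstance(node, Leaf):
        return 0
    own = 1 if node.kind == "value" else 0
    return own + value_splits(node.yes) + value_splits(node.no)


# Two missing-status splits stacked above one split on the values of X1.
tree = Split("X1", "missing", Leaf("T1"),
             Split("X7", "missing", Leaf("T2"),
                   Split("X1", "value", Leaf("T3"), Leaf("T4"))))

print(terminal_nodes(tree))  # 4
print(value_splits(tree))    # 1
```

In this toy tree the count of value splits stays at one, which is what a NODES=2 request guarantees, while the number of terminal nodes grows with each missing-status split stacked above it.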

