# Back To The Basics

I’ve decided to compile the notes I’ve made over the course of my ML journey as a series of blog posts here on my website.

You can view other topics in this series by clicking on the **ML NOTES** category in the article header above.

## Disclaimer

I’ve read through multiple sources (articles, documentation pages, research papers and textbooks) to compile my notes. I was looking to maximise my understanding of the concepts and, at the time, never intended to share them with the world. So, I did not do a good job of documenting sources for reference later on.

I’ll leave references to source materials if I have them saved. Please note that I’m not claiming sole authorship of these blog posts; these are just my personal notes and I’m sharing them here in the hopes that they’ll be helpful to you in your own ML journey.

Take these articles as a starting point to comprehend the concepts. If you spot any mistakes or errors in these articles or have suggestions for improvement, please feel free to share your thoughts with me through my LinkedIn.

We’ll take a look at the decision tree algorithm in this post.

# Decision Tree

The decision tree is a supervised, tree-based algorithm used for classification or regression tasks. The decision tree algorithm learns a set of (basically what amounts to) **nested if-then statements** to partition the input space into homogeneous groups.

The algorithm resembles a hierarchical binary tree, with a **root node** branching out to intermediate nodes known as **decision** or **internal nodes**. These internal nodes may further branch out to other internal nodes, or the tree might end at terminal nodes known as **leaf nodes**.

The leaf nodes represent all possible outcomes that can occur for the given task. So, for classification tasks, each leaf node represents a predicted class, and for regression tasks, the leaf node may be defined by a function comprising the features.

When a learned decision tree predicts on a new sample, the sample passes through the nodes in succession, following the decision at each node. It then either gets classified into a class (for classification tasks), or, for regression, an output is calculated for that sample based on the formula at the terminal node it lands in.
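To make this traversal concrete, here’s a minimal sketch in Python. The nested-dict tree representation, the `predict` helper, and the toy thresholds are all hypothetical, chosen only to illustrate how a sample is routed from the root to a leaf:

```python
# A minimal sketch of how a sample traverses a learned tree.
# Internal nodes hold a feature index, a threshold, and two subtrees;
# leaf nodes hold the prediction directly.

def predict(node, sample):
    """Route a sample down the tree until a leaf is reached."""
    while isinstance(node, dict):  # internal (decision) node
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node  # leaf: the predicted class (or regression value)

# Toy tree: "is feature 0 <= 2.5? if yes -> class 'A', else ask about feature 1"
tree = {
    "feature": 0, "threshold": 2.5,
    "left": "A",
    "right": {"feature": 1, "threshold": 1.0, "left": "B", "right": "A"},
}

print(predict(tree, [1.0, 0.0]))  # -> A
print(predict(tree, [3.0, 0.5]))  # -> B
```

For a regression tree, the leaves would simply hold numbers (e.g., subgroup means) instead of class labels.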

A good decision tree is one which is able to separate samples into perfectly homogeneous groups.

# Building the Tree

Of the many techniques used to construct trees, the one used commonly and discussed in this post is the classification and regression tree (CART).

## Setting up the decision nodes

The decision tree algorithm builds a series of decision nodes during the training process. At each node, the algorithm selects a feature + a split point on that feature to divide the samples into different sub-groups.

The algorithm tries to find the **best split point** at each node such that it achieves a good homogeneous separation of the samples, i.e., each sub-group created as a result of that split contains a larger proportion of samples belonging to one class.

Well, we don’t know in advance which split point is going to be the best, so we try **all possible** split points, use a metric to quantitatively evaluate the quality of each split, and then choose the split point with the lowest score.

Here are the steps involved in building the tree from scratch:

1. At the root node, the algorithm starts by considering all samples in the training dataset, and lines up every feature + each unique value of each feature as possible split points.
2. The quality of each split point is evaluated using a metric; the split point with the lowest score is chosen as the best split. This makes the algorithm a greedy search, since it looks for the best split among all candidate splits at every node.
3. The training samples are divided into 2 sub-groups using this best split point.
4. The process is repeated for each sub-group of data: once again, the (new) best split point for each sub-group is determined by considering all features + unique values of each feature occurring in the subset of samples at that group.
5. This process of setting up internal nodes + split points continues until some stopping criterion is met.
6. The node at which the tree terminates is a leaf node, and the output of the tree is derived from the leaf nodes.
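The greedy search at a single node can be sketched in a few lines of Python. This is an illustrative sketch, not a library implementation: `best_split` and `score_fn` are names I’ve made up, and `score_fn` stands in for whichever split-quality metric is in use (lower is better):

```python
# A sketch of the exhaustive, greedy split search performed at one node:
# try every feature and every unique value of that feature as a threshold,
# score the resulting two sub-groups, and keep the lowest-scoring split.

def best_split(X, y, score_fn):
    """Return (feature_index, threshold, score) of the best split found."""
    best = (None, None, float("inf"))
    n_features = len(X[0])
    for f in range(n_features):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue  # degenerate split: all samples fell on one side
            s = score_fn(left, right)
            if s < best[2]:
                best = (f, t, s)
    return best

# Toy check with a trivial score function that prefers balanced groups:
X = [[1], [2], [3], [4]]
y = [0, 0, 1, 1]
balance = lambda l, r: abs(len(l) - len(r))
print(best_split(X, y, balance))  # -> (0, 2, 0)
```

In a real tree builder, `score_fn` would be the SSE (regression) or the weighted Gini impurity (classification) described in the next section.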

## Metric to evaluate splits

In the process of building the decision tree, we need a method to evaluate the quality of a split, so that we can compare two or more split points and find the best one.

Based on the task, classification or regression, different functions are used to measure the quality of a split.

### 1. Regression

For regression tasks, the sum of squared errors (SSE) is used to measure the quality of a split.

The model lines up all possible features and each distinct value for each feature as possible split points.

Starting at the root node, the algorithm splits the entire training dataset into two subgroups, \(\mathbf{S_1}\) and \(\mathbf{S_2}\).

\[ SSE = \sum_{i \in S_1} (y_i - \bar{y}_1)^2 + \sum_{i \in S_2} (y_i - \bar{y}_2)^2 \]

where \(y_i\) is the observed outcome of sample \(i\), and \(\bar{y}_1\) & \(\bar{y}_2\) are the average ground-truth (\(y_{true}\)) values within each subset.

The squared error between each sample’s outcome and the mean outcome of its group is calculated for each group, and the squared errors of both groups are added to give the SSE.

Every split is evaluated using this SSE metric and the one with the lowest SSE is chosen as the best split point for that node.
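The SSE computation above is easy to transcribe directly. A self-contained sketch (the helper names `sse` and `split_sse` are mine, not from any library):

```python
# Evaluate one regression split with SSE, following the formula above:
# squared deviations from each subgroup's own mean, summed over both groups.

def sse(values):
    """Sum of squared deviations from the group mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_sse(s1, s2):
    """SSE of a candidate split = SSE of each subgroup, added together."""
    return sse(s1) + sse(s2)

# Two candidate splits of the same six outcome values:
good = split_sse([1.0, 1.2, 0.8], [5.0, 5.2, 4.8])  # tight, homogeneous groups
bad = split_sse([1.0, 5.0, 1.2], [0.8, 5.2, 4.8])   # mixed groups
print(good, bad)  # the "good" split has a much lower SSE
```

The split that groups similar outcomes together scores far lower, which is exactly why it would be chosen.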

### 2. Classification

For classification tasks, the Gini impurity score is used to evaluate the quality of splits. I found this excellent definition of Gini impurity:

“Gini Impurity is a measurement of the likelihood of an incorrect classification of a new instance of a random variable, if that new instance were randomly classified according to the distribution of class labels from the data set.”

— Brian Ambielli, Gini Impurity (With Examples)

In simple words, Gini impurity score is a measure of the **probability of misclassifying a data point picked at random**, if that data point was labelled according to the *distribution of the class labels* (distribution of the class labels here depends on the subset of datapoints at that particular node).

*In more direct terms*, if the data points at a node are a mixed bag of classes (equal or near equal proportion of each class), then you’re more likely to misclassify a randomly picked datapoint due to the equal distribution of the class labels; the data point you picked could belong to either class with equal (or near equal) probability.

Now if you’re like I was, and the definition still doesn’t make *intuitive sense*, keep reading below.

What does it mean?

The Gini impurity score quantifies the “impurity” of a group based on the class distribution of the samples assigned to that group. The Gini impurity score will be lower if the majority of the samples assigned to a group belong to a single class. Likewise, if a group contains samples from several classes, the Gini impurity score will be higher.

Gini impurity score is calculated using this formula

\[ G = \sum_{i=1}^{C} \space p_i \space (1 - p_i) \]

where \(\mathbf{p_i}\) is the probability of a datapoint belonging to class \(\mathbf{i}\).

Gini impurity scores fall in the following range:

- **0** indicates perfect purity, i.e., all samples in the group belong to one class (**maximum purity**).
- **0.5** is the maximum for binary classification; this means that the group has an equal mix of samples from both classes (**maximum impurity**).
- Scores approaching **1** can occur in multiclass classification tasks, indicating that the group comprises an even mix of samples belonging to many classes.
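The formula and these ranges can be checked directly with a few lines of Python. This is an illustrative sketch in which the class probabilities \(p_i\) are estimated from label counts:

```python
# A direct transcription of the Gini impurity formula:
# G = sum over classes of p_i * (1 - p_i),
# with each p_i estimated as the fraction of samples in that class.

from collections import Counter

def gini(labels):
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    return sum(p * (1 - p) for p in probs)

print(gini(["A", "A", "A", "A"]))  # 0.0  -> perfectly pure group
print(gini(["A", "A", "B", "B"]))  # 0.5  -> maximum binary impurity
print(gini(["A", "B", "C", "D"]))  # 0.75 -> multiclass, approaching 1
```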

### Gini impurity intuition

The Gini impurity score is not as intuitive to pick up as the sum of squared errors for regression trees, so I thought I’d include a toy example we could work through.

In simple terms, the Gini impurity score tells us how pure the samples at a node are. Purity at a node is determined by the distribution of classes within that node. The more varied the distribution of classes within that node, the higher the Gini impurity score.

Victor Zhou’s example clearly lays down an intuitive explanation. Check his blog post here.

A simple example is provided below:

Consider a toy dataset which has 10 samples in total.

- 5 of these samples belong to class A, and,
- the remaining 5 samples belong to class B.

So the probability of picking each class: \(p_A = p_B = 0.5\)

The Gini impurity score before the split (let’s denote it by \(\mathbf{GI_{before}}\)) will be;

\[\begin{align*} \mathbf{GI_{before}} &= p_A * (1 - p_A) + p_B * (1 - p_B) \\ &= 0.5 * (1 - 0.5) + 0.5 * (1 - 0.5) \\ &= 0.5 \\ \therefore \mathbf{GI_{before}} &= 0.5 \end{align*}\]

the highest possible impurity score for a binary problem, indicating that our dataset has an equal mix of both classes.

Now consider a split point dividing this dataset into two groups. There are a couple of divisions this split point can achieve based on the class distribution of the divided groups:

I. Imperfect division: In this scenario, the split point divides the classes imperfectly, such that the groups obtained thereafter contain a mixed proportion of samples from both classes.

Let’s say group 1 has 5 data points; 4 belonging to class A and 1 belonging to class B.

Similarly, group 2 has 5 data points; 1 belonging to class A and 4 belonging to class B.

**Case I**

*Group 1*

Probability of picking each class then becomes;

- \(p_A = 0.8\)
- \(p_B = 0.2\)

Calculating Gini index for group 1;

\[\begin{align*} \mathbf{GI_1} &= p_A * (1 - p_A) + p_B * (1 - p_B) \\ &= 0.8 * (1 - 0.8) + 0.2 * (1 - 0.2) \\ &= 0.32 \\ \\ \therefore \mathbf{GI_1} &= 0.32 \end{align*}\]

*Group 2*

Probability of picking each class then becomes;

- \(p_A = 0.2\)
- \(p_B = 0.8\)

Calculating Gini index for group 2;

\[\begin{align*} \mathbf{GI_2} &= p_A * (1 - p_A) + p_B * (1 - p_B) \\ &= 0.2 * (1 - 0.2) + 0.8 * (1 - 0.8) \\ &= 0.32 \\ \\ \therefore \mathbf{GI_2} &= 0.32 \end{align*}\]

The weighted Gini impurity of this split is calculated by multiplying the proportion of samples at each group (\(\mathbf{w_i}\)) and the Gini impurity score at that group/node.

\[\begin{align*} \mathbf{w_i} &= \frac{\text{Number of samples at node } i}{\text{Total samples in the dataset}} \\ \\ \mathbf{GI} &= \mathbf{w_1} \cdot \mathbf{GI_1} + \mathbf{w_2} \cdot \mathbf{GI_2} \\ &= 0.5 \cdot \mathbf{GI_1} + 0.5 \cdot \mathbf{GI_2} \\ &= (0.5 \cdot 0.32) + (0.5 \cdot 0.32) \\ &= 0.32 \\ \\ \therefore \mathbf{GI_{impure}} &= 0.32 \end{align*}\]

II. Perfect division: In this case, the split point achieves homogeneous separation of classes; each group contains only one class.

Group 1 contains all 5 samples from class B and group 2 contains all samples belonging to class A.

**Case II**

Probability of picking each class then becomes \(p_A = 0\) & \(p_B = 1\) for group 1, and vice versa for group 2.

\[\begin{align*} \mathbf{GI_1} &= p_A * (1 - p_A) + p_B * (1 - p_B) \\ &= 0 * (1 - 0) + 1 * (1 - 1) \\ &= 0 \\ \\ \therefore \mathbf{GI_1} &= 0 \\ \\ \mathbf{GI_2} &= 0 \\ \\ \therefore \mathbf{GI_{pure}} &= 0 \\ \end{align*}\]

Therefore, for the second case the Gini impurity score is the lowest (and equal to zero) since the split is perfectly homogeneous.

After calculating the Gini impurity score, we calculate the Gini gain which is equivalent to the amount of “impurity” reduced due to the split.

\[ \mathbf{Gini \ gain} = \mathbf{GI_{before}} - \mathbf{GI_{split}} \]

The Gini gain is the highest for Case II;

\[ \mathbf{Gini \ gain} = 0.5 - 0 = 0.5 \]

vs the Gini gain from Case I;

\[ \mathbf{Gini \ gain} = 0.5 - 0.32 = 0.18 \]

The higher the Gini gain, the more impurity is reduced, and the better the split.
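The whole worked example can be verified in code. This is a self-contained sketch; `weighted_gini` is a helper name I’ve used for illustration:

```python
# Reproducing the toy example: 10 samples, 5 of class A and 5 of class B,
# compared under the imperfect (Case I) and perfect (Case II) splits.

def gini(labels):
    n = len(labels)
    return sum((labels.count(c) / n) * (1 - labels.count(c) / n)
               for c in set(labels))

def weighted_gini(groups):
    """Weight each group's Gini score by its share of the total samples."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)

before = gini(["A"] * 5 + ["B"] * 5)                        # 0.5

# Case I: imperfect division (4A + 1B vs 1A + 4B)
impure = weighted_gini([["A"] * 4 + ["B"], ["A"] + ["B"] * 4])  # 0.32

# Case II: perfect division (5B vs 5A)
pure = weighted_gini([["B"] * 5, ["A"] * 5])                # 0.0

print("Gini gain, Case I: ", before - impure)  # 0.18
print("Gini gain, Case II:", before - pure)    # 0.5
```

The numbers match the hand calculation: the perfect split earns the full 0.5 of gain, the imperfect one only 0.18.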

## Stopping Criteria

Before growing the tree, it’s important to set conditions for when the tree should *stop growing*.

If no stopping criterion is defined, the tree can become quite complex, with all sorts of feature and split point combinations to achieve homogeneity at every node. This is how overfitting manifests in decision trees: such a complex tree, having fit “too well” to the training data, will perform poorly on the test dataset.

Some common stopping criteria, as defined in the scikit-learn documentation, include:

**1. Minimum samples to split a node** - This defines the minimum number of samples that must be present at a node for it to be split further. A node with fewer samples than this number will not undergo further splitting, and the tree stops growing at this node.

**2. Max depth** - Maximum level of splitting that is allowed in the tree; each split is considered as adding on a new level, and depth is the number of levels the tree extends from the root node.

**3. Minimum samples at leaf node** - The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least **min_samples_leaf** training samples at each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
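The three criteria above can be sketched as a single gating check. The parameter names mirror scikit-learn’s hyperparameters (`min_samples_split`, `max_depth`, `min_samples_leaf`), but the function itself is my own illustrative sketch, not part of any library:

```python
# A sketch of how the three stopping criteria might gate a candidate split.
# Returns True only if the split is allowed to proceed.

def may_split(n_samples, depth, n_left, n_right,
              min_samples_split=2, max_depth=None, min_samples_leaf=1):
    if n_samples < min_samples_split:
        return False  # too few samples at this node to split it
    if max_depth is not None and depth >= max_depth:
        return False  # tree has already reached its maximum depth
    if n_left < min_samples_leaf or n_right < min_samples_leaf:
        return False  # a child would hold too few samples
    return True

print(may_split(10, 1, 6, 4, min_samples_split=5))  # True: all criteria pass
print(may_split(4, 1, 3, 1, min_samples_split=5))   # False: node too small
print(may_split(10, 3, 6, 4, max_depth=3))          # False: depth limit hit
print(may_split(10, 1, 9, 1, min_samples_leaf=2))   # False: leaf too small
```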

# Pruning

The tree building process may conclude with a complex tree (more depth or more terminal nodes). Complex trees have an increased chance of overfitting and therefore need to be pruned back to an “optimal size”.

Post-hoc pruning involves traversing each node of an already-built tree and deciding whether or not to retain it based on a validation set. One technique to help with this is *cost-complexity tuning*, which penalises the tree for (greater) complexity: the more terminal nodes, the greater the penalty the tree receives.

It uses a modified splitting criterion, where the criterion is penalised by a factor (the complexity parameter) scaled by the total number of terminal nodes in the tree or sub-tree.

For regression trees, the error is calculated using;

\[ SSE_{c_p} = SSE + c_p \times \text{Number of terminal nodes} \]
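As a quick illustration of the trade-off this formula encodes (the function name and the numbers below are made up for the example):

```python
# The penalised error as a one-liner: a larger tree may fit better
# (lower SSE) but pays a penalty proportional to its leaf count.

def penalized_sse(sse, n_terminal_nodes, cp):
    return sse + cp * n_terminal_nodes

big_tree = penalized_sse(sse=10.0, n_terminal_nodes=12, cp=1.0)   # 22.0
small_tree = penalized_sse(sse=15.0, n_terminal_nodes=4, cp=1.0)  # 19.0
print(small_tree < big_tree)  # at this cp, the smaller tree wins
```

With \(c_p = 0\) the bigger tree would win outright; raising \(c_p\) shifts the preference toward smaller trees.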

A similar process is followed while pruning classification trees, except that the purity criterion is penalised instead of the SSE.

\(c_p\) is a regularization parameter the user has to set, and for a given \(c_p\) value you want to find the smallest tree that gives you the smallest penalised error.

Similar to other regularization terms, a smaller value of \(c_p\) allows complex trees which might result in overfitting, while an overly large value of \(c_p\) might constrict the growth of trees leading to underfitting. Once again, finding \(c_p\) is a delicate balancing task between overfitting and underfitting.

You set up a sequence of \(c_p\) values and perform cross-validation using a sample of observations. This process generates one (mean cross-validated) \(SSE_{c_p}\) score for each \(c_p\), so you choose the \(c_p\) with the lowest \(SSE_{c_p}\).

Once you find the optimal value of the complexity parameter, you then find the smallest pruned tree that has the lowest penalized error rate.

You compare the cost complexity measure of the tree without a node against the cost complexity measure of the tree *including* the node.

If the \(SSE_{c_p}\) value after removing a node is *lesser than* the \(SSE_{c_p}\) value of the tree with the node, then you can remove the node, otherwise you keep the node in.

Repeat this process with other nodes until further simplification of the tree leads to a decreased performance on the validation data. You halt pruning at this stage.

# Guidelines for working with Decision Trees

### Pros

Decision trees can handle missing data. While building the tree missing data are ignored. For each split, a variety of alternatives or surrogate splits are evaluated, and if the surrogate split approximates the original split, then it can be used when the feature data associated with the original split are not available.

DTs inherently assign feature importance when building the tree. These feature importance scores drive feature selection: features with low importance scores will be left out of the decision tree and play no part in predictions. Intuitively, predictors that appear higher in the tree (i.e., in earlier splits) or that appear multiple times in the tree will be more important than predictors that occur lower in the tree, or not at all.

Due to this inherent feature selection, decision trees are easy to interpret. Consequently, it becomes easy to analyse the prediction decisions, leading to increased transparency and interpretability. So, if interpretability is a key requirement for your project, you may want to start by experimenting with decision trees.

Building the trees is a faster process than training algorithms that use optimization.

### Cons

A single regression tree might have sub-optimal performance compared to other, optimization-based algorithms, as it is quite a simple model.

One other limitation of simple regression trees is that each terminal node uses the average of the training set outcomes in that node for prediction. As a consequence, these models may not do a good job predicting samples whose true outcomes are extremely high or low.

An additional disadvantage is that an individual tree tends to be unstable; if the training dataset is slightly altered, we may end up with an entirely different set of splits in the new trees. Decision trees have high variance.

Decision trees can exhibit selection bias; they favour those features that have many distinct values over features with fewer distinct values.

There are a couple of techniques that can be used to overcome the limitations of the individual decision tree. Stay tuned for those articles(:

Thanks for reading! Hope you’ve found this post helpful, and as always, pop in occasionally to this space to read more on machine learning.

# References

- Applied Predictive Modeling by Max Kuhn and Kjell Johnson
- An intuitive explanation of the decision tree algorithm by Skyler Dale
- A Simple Explanation of Gini Impurity by Victor Zhou