Decision Tree

Introduction

The concept behind the decision tree is very simple to understand:

Each node of the tree is a decision point that leads to one branch and then to another decision point. A “question” is asked at each decision point to determine which branch to follow (a small hand-written sketch is given after the list below).

  • The first node is called the “root node” or “parent node”
  • Nodes that split into further branches are called “decision nodes” or “child nodes”
  • The last node of a branch is called a “leaf node”
  • The act of dividing a decision node into more nodes is called “splitting”
  • The act of simplifying a tree by removing nodes and branches (compressing the decision tree) is called “pruning”
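
To make this vocabulary concrete, here is a minimal hand-written “tree” expressed as nested if/else statements. It is only an illustration: the features and thresholds are made up, not taken from a trained model.

# A hand-written decision tree, for illustration only (hypothetical thresholds).
def classify_iris(petal_length, petal_width):
    # Root node: the first question asked.
    if petal_length < 2.5:
        return 'setosa'        # leaf node: a final answer
    # Decision node: a further question on another feature.
    if petal_width < 1.8:
        return 'versicolor'    # leaf node
    return 'virginica'         # leaf node

print(classify_iris(1.4, 0.2))  # 'setosa'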

How can decision trees be used?

  • Regression (predict a number) → predict rainfall, revenue, marks…
  • Classification (predict a category) → predict whether a team will win or lose a match, a person’s health status, the species of an iris…

Along with logistic regression, decision trees are quite popular models, and each has its advantages and drawbacks:

Criteria                       | Logistic Regression              | Decision Tree Classification
Interpretability               | Moderate                         | High (easy to visualize and explain)
Decision boundaries            | Linear and single                | Divides the space into smaller regions
Ease of decision making        | A threshold must be set manually | Automatic
Overfitting sensitivity        | Low                              | High
Noise sensitivity              | Low                              | High
Required size of training set  | Quite large                      | Small

Logistic Regression vs Decision tree (Overfit)

In the picture above, the decision tree is overfit: it has too many specific conditions (branches and nodes) and loses a lot of accuracy on new data. Logistic regression handled this situation better. In the picture below, the opposite happened.

Logistic Regression vs Decision tree (Perfect fit)
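
The behaviour shown in these pictures can be reproduced with a small experiment. The sketch below (not part of the original article) fits both models on a noisy two-class dataset; an unconstrained decision tree typically reaches perfect training accuracy but a lower test accuracy, which is the overfitting illustrated above.

# Sketch: compare logistic regression and a decision tree on noisy data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_noisy, y_noisy = make_moons(n_samples=300, noise=0.35, random_state=0)
Xn_train, Xn_test, yn_train, yn_test = train_test_split(X_noisy, y_noisy, random_state=0)

for name, model in [('Logistic regression', LogisticRegression()),
                    ('Decision tree', DecisionTreeClassifier(random_state=0))]:
    model.fit(Xn_train, yn_train)
    print(name,
          '- train:', accuracy_score(yn_train, model.predict(Xn_train)),
          '- test:', accuracy_score(yn_test, model.predict(Xn_test)))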

Splitting criteria

The splitting criteria are what help the algorithm choose the best question to ask at each node. There are many, but the 3 main ones are:

Gini impurity measures how often a sample would be wrongly classified if it were labelled randomly according to the class distribution of the node, in other words how impure a node is.

\[I_G(n)=1-\sum_{i=1}^{J}(p_i)^2\]

Entropy measures the disorder of the classes in a dataset and helps choose the split that brings the most information gain (for instance, the decision made at the root node).

\[Entropy=-\sum_{i=1}^{C}p_i\log_2(p_i)\]

Variance is used in regression trees: the chosen split is the one that keeps the variance of the target values within each resulting node as low as possible.

\[Variance=\frac{\sum(X-\overline{X})^2}{n}\]
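
As a quick illustration of these three formulas, here is a small sketch (not from the original article) that computes them with NumPy for a toy node:

import numpy as np

def gini(labels):
    # I_G = 1 - sum(p_i^2) over the classes present in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

node = ['setosa', 'setosa', 'versicolor', 'virginica']
print(gini(node))               # 0.625
print(entropy(node))            # 1.5
print(np.var([1.0, 2.0, 3.0]))  # variance of a regression target: 0.666...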

Implementation

With Python and scikit-learn it is easy to create a machine learning model using a decision tree. There are many datasets available online to test it out.

Start by importing pandas and numpy. The dataset can be imported from scikit-learn directly. The iris dataset is made of 150 rows and 4 columns that contain the sizes (length and width) of the sepals and petals, which help determine the class of the iris, as follows:

Iris classes description

Import packages and load data

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
data = load_iris()

The data is now loaded in Colab. We need to separate the features the model will use to make its guesses from the labels (the answers) needed for the training part:

X = data.data      # features: sepal and petal measurements
y = data.target    # labels: the iris class (0, 1 or 2)

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=50,test_size=0.25)

from sklearn import tree
clf = tree.DecisionTreeClassifier(random_state=50)

This is where we give the parameters used to build the model. Because of the nature of decision trees, a certain randomness is involved when building the branches; fixing random_state makes a run reproducible, and any number works, but different values at each training will give varying results (a small experiment illustrating this is sketched below). The test sample size is 25% of the total dataset, so 75% of the dataset will be used for training.
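
A minimal sketch of that experiment (not part of the original article), training the same tree with a few different seeds:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_iris, y_iris = load_iris(return_X_y=True)
for seed in (0, 25, 50, 75):
    # Both the split and the tree depend on the seed, so accuracy varies slightly.
    X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=seed, test_size=0.25)
    model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    print('random_state =', seed, '-> test accuracy:', accuracy_score(y_te, model.predict(X_te)))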

Have a look at the tree parameters

The get_params() method gives a glimpse of the parameters set for this tree. Here we have:

parameters = pd.Series(clf.get_params()).to_frame('Value')
print('Tree parameters:\n',parameters)
Result:

Tree parameters:
                          Value
ccp_alpha                     0
class_weight               None
criterion                  gini
max_depth                  None
max_features               None
max_leaf_nodes             None
min_impurity_decrease         0
min_samples_leaf              1
min_samples_split             2
min_weight_fraction_leaf      0
random_state                 50
splitter                   best

The splitting criterion is Gini by default.

Test the model and see its accuracy

clf.fit(X_train,y_train)        # train the tree on the training split
y_pred = clf.predict(X_test)    # predict the classes of the unseen test samples

from sklearn.metrics import accuracy_score

print('Accuracy of the training data (Gini):', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))
print('Accuracy of the testing data (Gini):',accuracy_score(y_true=y_test,y_pred=y_pred))
print('\n')

Results:

Accuracy of the training data (Gini): 1.0
Accuracy of the testing data (Gini): 0.9473684210526315

So the next time we feed it data about the length and width of an iris, this model can tell with 94.74% accuracy whether it is Setosa, Versicolor or Virginica (a small prediction example is sketched below).
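
A minimal sketch of such a prediction, assuming clf and data are the objects defined above; the four measurements are a made-up example flower:

import numpy as np

# Hypothetical flower: sepal length/width and petal length/width in cm.
sample = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = clf.predict(sample)
print(data.target_names[prediction[0]])  # e.g. 'setosa'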

To test the model with the entropy splitting criterion, or others, you can change the line tree.DecisionTreeClassifier(random_state=50) to tree.DecisionTreeClassifier(random_state=50, criterion='entropy'), as in the sketch below.
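
For example, a small sketch (assuming X_train, X_test, y_train, y_test from the split above) that retrains the tree with entropy and reports the test accuracy:

from sklearn import tree
from sklearn.metrics import accuracy_score

# Same tree, but splits are chosen by information gain instead of Gini impurity.
clf_entropy = tree.DecisionTreeClassifier(random_state=50, criterion='entropy')
clf_entropy.fit(X_train, y_train)
print('Accuracy of the testing data (Entropy):',
      accuracy_score(y_true=y_test, y_pred=clf_entropy.predict(X_test)))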

Visualization

We can also visualize the tree with a little more code:

!pip install pydotplus
from sklearn.tree import export_graphviz
from six import StringIO 
from IPython.display import Image  
import pydotplus

dot_data = StringIO()    # in-memory buffer that will hold the DOT description of the tree
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=data.feature_names, class_names=data.target_names)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())   # build a graph object from the DOT text
Image(graph.create_png())                                    # render the graph as a PNG in the notebook

Result:

Iris Decision Tree
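
If installing pydotplus and Graphviz is not an option, a similar figure can be obtained with scikit-learn's built-in plot_tree function and matplotlib. A minimal sketch, assuming clf and data are the objects trained above:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 8))
plot_tree(clf, filled=True, rounded=True,
          feature_names=data.feature_names, class_names=data.target_names)
plt.show()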
