Decision Tree

Introduction

The concept behind the decision tree is very simple to understand:

Each node of the tree is a decision point that leads to one branch and then to another decision point. A “question” is asked at each decision point to determine which branch to follow (a small hand-written sketch is given after the list below).

  • The first node is called the “root node” or “parent node”
  • Nodes that split into further branches are called “decision nodes” or “child nodes”
  • The last node of a branch is called a “leaf node”
  • The act of dividing a decision node into more nodes is called “splitting”
  • The act of simplifying a tree by removing nodes and branches (compressing the decision tree) is called “pruning”
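
To make this vocabulary concrete, here is a minimal hand-written “tree” expressed as nested if/else statements. It is only an illustration: the features and thresholds are made up, not taken from a trained model.

# A hand-written decision tree, for illustration only (hypothetical thresholds).
def classify_iris(petal_length, petal_width):
    # Root node: the first question asked.
    if petal_length < 2.5:
        return 'setosa'        # leaf node: a final answer
    # Decision node: a further question on another feature.
    if petal_width < 1.8:
        return 'versicolor'    # leaf node
    return 'virginica'         # leaf node

print(classify_iris(1.4, 0.2))  # 'setosa'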

How can decision trees be used?

  • Regression (predict a number) → predict rainfall, revenue, marks…
  • Classification (predict a category) → predict whether a team will win or lose a match, a person’s health status, the species of an iris…

Along with logistic regression, decision trees are quite popular models, and each has its advantages and drawbacks:

Criteria                       | Logistic Regression              | Decision Tree Classification
Interpretability               | Moderate                         | High (easy to visualize and explain)
Decision boundaries            | Linear and single                | Divides the space into smaller regions
Ease of decision making        | A threshold must be set manually | Automatic
Overfitting sensitivity        | Low                              | High
Noise sensitivity              | Low                              | High
Required size of training set  | Quite large                      | Small

Logistic Regression vs Decision tree (Overfit)

In the picture above, the decision tree is overfit: it has too many specific conditions (branches and nodes) and loses a lot of accuracy on new data. Logistic regression handled this situation better. In the picture below, the opposite happened.

Logistic Regression vs Decision tree (Perfect fit)
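
The behaviour shown in these pictures can be reproduced with a small experiment. The sketch below (not part of the original article) fits both models on a noisy two-class dataset; an unconstrained decision tree typically reaches perfect training accuracy but a lower test accuracy, which is the overfitting illustrated above.

# Sketch: compare logistic regression and a decision tree on noisy data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_noisy, y_noisy = make_moons(n_samples=300, noise=0.35, random_state=0)
Xn_train, Xn_test, yn_train, yn_test = train_test_split(X_noisy, y_noisy, random_state=0)

for name, model in [('Logistic regression', LogisticRegression()),
                    ('Decision tree', DecisionTreeClassifier(random_state=0))]:
    model.fit(Xn_train, yn_train)
    print(name,
          '- train:', accuracy_score(yn_train, model.predict(Xn_train)),
          '- test:', accuracy_score(yn_test, model.predict(Xn_test)))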

Splitting criteria

The splitting criteria are what help the algorithm choose the best question to ask at each node. There are many, but the 3 main ones are:

Gini impurity measures how often a sample would be wrongly classified if it were labelled randomly according to the class distribution of the node, in other words how impure a node is.

\[I_G(n)=1-\sum_{i=1}^{J}(p_i)^2\]

Entropy measures the disorder of the classes in a dataset and helps choose the split that brings the most information gain (for instance, the decision made at the root node).

\[Entropy=-\sum_{i=1}^{C}p_i\log_2(p_i)\]

Variance is used in regression trees: the chosen split is the one that keeps the variance of the target values within each resulting node as low as possible.

\[Variance=\frac{\sum(X-\overline{X})^2}{n}\]
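
As a quick illustration of these three formulas, here is a small sketch (not from the original article) that computes them with NumPy for a toy node:

import numpy as np

def gini(labels):
    # I_G = 1 - sum(p_i^2) over the classes present in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

node = ['setosa', 'setosa', 'versicolor', 'virginica']
print(gini(node))               # 0.625
print(entropy(node))            # 1.5
print(np.var([1.0, 2.0, 3.0]))  # variance of a regression target: 0.666...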

Implementation

With Python and scikit-learn it is easy to create a machine learning model using a decision tree. There are many datasets available online to test it out.

Start by importing pandas and numpy. The dataset can be imported from scikit-learn directly. The iris dataset is made of 150 rows and 4 columns that contain the sizes (length and width) of the sepals and petals, which help determine the class of the iris, as follows:

Iris classes description

Import packages and load data

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
data = load_iris()

The data is now loaded in Colab. We need to separate the features the model will use to make its guesses from the labels (the answers) needed for the training part:

X = data.data      # features: sepal and petal measurements
y = data.target    # labels: the iris class (0, 1 or 2)

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=50,test_size=0.25)

from sklearn import tree
clf = tree.DecisionTreeClassifier(random_state=50)

This is where we give the parameters used to build the model. Because of the nature of decision trees, a certain randomness is involved when building the branches; fixing random_state makes a run reproducible, and any number works, but different values at each training will give varying results (a small experiment illustrating this is sketched below). The test sample size is 25% of the total dataset, so 75% of the dataset will be used for training.
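
A minimal sketch of that experiment (not part of the original article), training the same tree with a few different seeds:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_iris, y_iris = load_iris(return_X_y=True)
for seed in (0, 25, 50, 75):
    # Both the split and the tree depend on the seed, so accuracy varies slightly.
    X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=seed, test_size=0.25)
    model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    print('random_state =', seed, '-> test accuracy:', accuracy_score(y_te, model.predict(X_te)))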

Have a look at the tree parameters

The get_params() method gives a glimpse of the parameters set for this tree. Here we have:

parameters = pd.Series(clf.get_params()).to_frame('Value')
print('Tree parameters:\n',parameters)
Result:

Tree parameters:
                          Value
ccp_alpha                     0
class_weight               None
criterion                  gini
max_depth                  None
max_features               None
max_leaf_nodes             None
min_impurity_decrease         0
min_samples_leaf              1
min_samples_split             2
min_weight_fraction_leaf      0
random_state                 50
splitter                   best

The splitting criterion is Gini by default.

Test the model and see its accuracy

clf.fit(X_train,y_train)        # train the tree on the training split
y_pred = clf.predict(X_test)    # predict the classes of the unseen test samples

from sklearn.metrics import accuracy_score

print('Accuracy of the training data (Gini):', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))
print('Accuracy of the testing data (Gini):',accuracy_score(y_true=y_test,y_pred=y_pred))
print('\n')

Results:

Accuracy of the training data (Gini): 1.0
Accuracy of the testing data (Gini): 0.9473684210526315

So the next time we feed it data about the length and width of an iris, this model can tell with 94.74% accuracy whether it is Setosa, Versicolor or Virginica (a small prediction example is sketched below).
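
A minimal sketch of such a prediction, assuming clf and data are the objects defined above; the four measurements are a made-up example flower:

import numpy as np

# Hypothetical flower: sepal length/width and petal length/width in cm.
sample = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = clf.predict(sample)
print(data.target_names[prediction[0]])  # e.g. 'setosa'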

To test the model with the entropy splitting criterion, or others, you can change the line tree.DecisionTreeClassifier(random_state=50) to tree.DecisionTreeClassifier(random_state=50, criterion='entropy'), as in the sketch below.
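
For example, a small sketch (assuming X_train, X_test, y_train, y_test from the split above) that retrains the tree with entropy and reports the test accuracy:

from sklearn import tree
from sklearn.metrics import accuracy_score

# Same tree, but splits are chosen by information gain instead of Gini impurity.
clf_entropy = tree.DecisionTreeClassifier(random_state=50, criterion='entropy')
clf_entropy.fit(X_train, y_train)
print('Accuracy of the testing data (Entropy):',
      accuracy_score(y_true=y_test, y_pred=clf_entropy.predict(X_test)))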

Visualization

We can also visualize the tree with a little more code:

!pip install pydotplus
from sklearn.tree import export_graphviz
from six import StringIO 
from IPython.display import Image  
import pydotplus

dot_data = StringIO()    # in-memory buffer that will hold the DOT description of the tree
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=data.feature_names, class_names=data.target_names)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())   # build a graph object from the DOT text
Image(graph.create_png())                                    # render the graph as a PNG in the notebook

Result:

Iris Decision Tree
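
If installing pydotplus and Graphviz is not an option, a similar figure can be obtained with scikit-learn's built-in plot_tree function and matplotlib. A minimal sketch, assuming clf and data are the objects trained above:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 8))
plot_tree(clf, filled=True, rounded=True,
          feature_names=data.feature_names, class_names=data.target_names)
plt.show()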
