Understanding Decision Tree Algorithm in Machine Learning

Understanding Decision Tree Algorithm in Machine Learning

20 mins read701 Views Comment
Updated on Mar 13, 2023 18:05 IST

Decision tree algorithms are a type of supervised learning method used for both classification and regression problems. These algorithms create a tree-like model of decisions and their possible consequences, allowing for the prediction of a target variable based on several input features.

2023_03_MicrosoftTeams-image-334.jpg

Decision tree algorithm in machine learning are popular due to their ability to handle both numerical and categorical data, their transparency and interpretability, and their relatively high accuracy. Applications of decision tree algorithms can be found in various industries, such as finance, healthcare, and marketing, among others. In this article, we will delve deeper into the concept of decision tree algorithms and explore the key elements that make them a powerful tool in machine learning.

You can also explore: Machine Learning for Fraud Detection

The basic structure and terminology of decision trees

A decision tree is a flowchart-like tree structure, where each internal node represents a feature(or attribute), each branch represents a decision rule, and each leaf node represents the outcome. The topmost node in a decision tree is known as the root node, and the nodes that do not have any child nodes are called leaf nodes. The decision tree algorithm starts at the root node and traverses through the tree by making a decision based on the input feature values, until it reaches a leaf node. The value at the leaf node represents the predicted output value.

The split of the root node to the leaf node in a decision tree algorithm can be explained as follows:

  • Root Node: The root node represents the entire dataset and is used to initiate the tree. It is the starting point for the tree and splits the data based on the feature that provides the maximum information gain or minimum Gini Impurity.
  • Internal Node: Each internal node represents a feature that splits the data into two or more subsets. The split is performed based on the value of the feature and determines the path that each observation will follow. The internal node is then split into multiple child nodes.
  • Leaf Node: A leaf node represents a subset of the data that cannot be split further. It contains the final prediction for the observations that reach it. The prediction is based on the majority class in the subset or on the average value of the target variable.

Consider the following example:

Suppose we have a dataset of 1000 observations with two features X1 and X2, and a target variable Y. The root node of the decision tree is initiated using all 1000 observations. The algorithm then calculates the information gain or gini impurity for each feature and selects the feature that provides the maximum information gain or minimum gini impurity as the split feature.

Let’s assume that X1 provides the maximum information gain. The root node is split into two child nodes based on the value of X1. The first child node contains 700 observations with values less than or equal to a certain threshold, while the second child node contains 300 observations with values greater than the threshold. The process is repeated for each child node until a stopping criterion is reached (e.g., all observations belong to the same class or a minimum number of observations is reached).

The tree continues to grow until it reaches the leaf nodes. Each leaf node represents a subset of the data that cannot be split further. The final prediction for an observation is based on the majority class in the subset or on the average value of the target variable.

You can also explore: Understanding Polynomial Regression in Machine Learning (2023 Edition)

This is a simple example to explain the splitting of the root node to the leaf node in a decision tree algorithm. The actual process would be more complex in real-world scenarios, with multiple features and complex relationships between them. However, the basic idea remains the same: the algorithm splits the data into smaller subsets until it reaches the final predictions.

There are several types of decision tree algorithms, but the most commonly used are ID3, C4.5, and C5.0, and the most popular one is CART (Classification and Regression Trees).

In decision tree algorithms, the features and the threshold for splitting the data into different branches are chosen on the basis of certain criteria, such as Information gain, Gini Index, and Gain Ratio. These criteria are used to find the attribute that returns the most homogeneous sub-nodes.

The concept of entropy and information gain and how they are used in building decision trees

Entropy is a measure of impurity or disorder in a set of data. It is used to determine the homogeneity of the different subsets of data created by the decision tree algorithm. The entropy of a subset is calculated using the probability of the presence of each class. The lower the entropy, the higher the homogeneity of the subset. Mathematically, entropy is defined as:

Entropy(S) = -∑(p(i)*log2(p(i)))

Where S is the set of instances, i is the class, and p(i) is the probability of the class i in the set S.

Information gain is a measure of the decrease in entropy after a dataset is split on an attribute. It is used to determine which feature or attribute should be selected as the root or internal node of the decision tree. The attribute with the highest information gain is selected as the root or internal node of the decision tree. Mathematically, information gain is defined as:

Information Gain = Entropy(S) – ∑(|Sv||S|)*Entropy(Sv)

Where S is the set of instances, Sv is the subset of instances after the split on attribute A, and |Sv| is the number of instances in the subset Sv.

In ID3 and C4.5 algorithms, the information gain is used as the splitting criteria. The attribute with the highest information gain is chosen as the root or internal node of the decision tree. In C5.0, the gain ratio is used instead of information gain. The gain ratio is the information gain divided by the intrinsic information of the attribute, which is used to reduce the bias towards the attribute with the large number of values. mathematically it is defined as:

Gain Ratio = Information Gain / Intrinsic Information

where Intrinsic Information is -∑|Sv||S|*log2(|Sv||S||)

In ID3 and C4.5 algorithms, the information gain is used as the splitting criteria. The attribute with the highest information gain is chosen as the root or internal node of the decision tree. In C5.0, the gain ratio is used instead of information gain. The gain ratio is the information gain divided by the intrinsic information of the attribute, which is used to reduce the bias towards the attribute with the large number of values. 

You can also explore: A Simple Explanation of the Bag of Words (BoW) Model

Visualization of splitting of a Decision Tree on an example Dataset

Consider a dataset containing information about customers and their purchasing habits. The target variable is “Will the customer buy the product?”, and the predictor variables are “Age”, “Income”, and “Location”. The dataset is shown below:

S.No Target variable Predictor variable (Age) Predictor variable (Income) Predictor variable (Location)
1 Yes 35 High Urban
2 No 25 Low Rural
3 No 45 High Urban
4 Yes 55 Low Suburban
5 No 40 Middle Urban
6 No 35 High Rural
7 Yes 60 Low Suburban
8 Yes 30 Middle Urban
9 No 50 High Suburban
10 Yes 35 Low Rural
11 Yes 45 Middle Urban
12 No 65 High Suburban
13 No 30 Middle Rural
14 No 50 Low Urban
15 Yes 60 High Rural

The decision tree algorithm would use information gain or Gini impurity to determine the best feature to split the data, split the data into subsets based on the values of the selected feature, repeat the process for each subset, and create leaf nodes when a stopping criterion is met. The prediction for the target variable is then made based on the samples in each leaf node.

You can also explore: 5 Types of Clustering Algorithm [SCENARIO] You Must Know as a Data Scientist

Entropy Calculation of the given Dataset

Let’s consider the entropy of the target variable “Will the customer buy the product?” as the base measure of disorder or randomness in the data. We calculate entropy using the formula:

Entropy = -p(Yes) * log2(p(Yes)) – p(No) * log2(p(No))

where p(Yes) is the proportion of “Yes” in the target variable, and p(No) is the proportion of “No” in the target variable.

For this dataset, p(Yes) = 8/15, and p(No) = 7/15, so the entropy of the target variable is:

Entropy = -(8/15) * log2(8/15) – (7/15) * log2(7/15) = 0.971

So, the entropy of the target variable with the current split is 0.971, which indicates a relatively high level of disorder in the data.

In the next step, the algorithm would use the entropy calculation to determine the information gain, which is the reduction in entropy after splitting the data based on the values of the predictor variables. The predictor variable with the highest information gain is chosen as the next splitting criterion, and the process is repeated until a stopping criterion is met.

Calculation of Information Gain

Information gain is calculated as the difference between the entropy of the target variable before and after the split. Information gain is used to measure the reduction in entropy achieved by splitting the data based on the values of the predictor variables. The predictor variable with the highest information gain is chosen as the next splitting criterion.

Let’s consider the information gain achieved by splitting the data based on the predictor variable “Gender”.

Before the split, the entropy of the target variable is 0.971, which we calculated in the previous step. After the split, we calculate the weighted average of the entropies of the sub-groups formed by the split, as follows:

Information Gain = Entropy (before split) – Weighted Average (Entropy after split)

For the sub-group “Male”, the entropy is 0.811, and for the sub-group “Female”, the entropy is 0.918. The weighted average of these entropies is calculated as follows:

Weighted Average = (7/15) * 0.811 + (8/15) * 0.918 = 0.869

So, the information gain achieved by splitting the data based on the “Gender” predictor variable is:

Information Gain = 0.971 – 0.869 = 0.102

This information gain of 0.102 indicates that the split has reduced the entropy by 0.102, but it is not the highest information gain achieved by any of the predictor variables. The algorithm would repeat the information gain calculation for the other predictor variables and choose the predictor variable with the highest information gain as the next splitting criterion.

Calculation of Gini Index

The Gini index is used to measure the impurity of the target variable in a binary split, where two sub-groups are formed based on the values of the predictor variable. The Gini index is calculated as the sum of the probabilities of the target variable being misclassified in each of the two sub-groups. The smaller the Gini index, the higher the quality of the split. The predictor variable with the lowest Gini index is chosen as the next splitting criterion.

Let’s consider the Gini index achieved by splitting the data based on the predictor variable “Gender”.

For the sub-group “Male”, the probability of “Pass” is 7/7 and the probability of “Fail” is 0/7. For the sub-group “Female”, the probability of “Pass” is 5/8 and the probability of “Fail” is 3/8.

The Gini index is calculated as follows:

Gini Index = 1 – (Probability of “Pass” in Male sub-group)^2 – (Probability of “Fail” in Male sub-group)^2 – (Probability of “Pass” in Female sub-group)^2 – (Probability of “Fail” in Female sub-group)^2

Gini Index = 1 – (7/7)^2 – (0/7)^2 – (5/8)^2 – (3/8)^2 = 1 – 1 – 0 – 0.375 – 0.125 = 0.5

So, the Gini index achieved by splitting the data based on the “Gender” predictor variable is 0.5. This indicates that the split has resulted in a moderate level of impurity in the target variable. The algorithm would repeat the Gini index calculation for the other predictor variables and choose the predictor variable with the lowest Gini index as the next splitting criterion.

Hence, the Decision Tree created from the previous shopping dataset, including the count of the outcomes at each leaf node:

In this example, the root node of the tree is the “Gender” predictor variable, with a count of 15 instances. The tree splits into two branches based on the values of the “Gender” variable: “Male” (count: 10) and “Female” (count: 5). The “Male” branch results in only “Pass” outcomes (count: 10), while the “Female” branch results in both “Pass” and “Fail” outcomes (count: 3 Pass, 2 Fail). The algorithm would then repeat the process of splitting based on the predictor variables until either all target variables in a sub-group are the same or the impurity of the target variable can no longer be reduced. The result is a tree of decisions and outcomes that can be used to make predictions for new data.

The process of training a decision tree model, including splitting criteria and stopping conditions

The process of training a decision tree model begins with selecting the root node, which is the feature or attribute that results in the highest information gain or gain ratio. The tree is then grown by recursively repeating the process of selecting the feature or attribute that results in the highest information gain or gain ratio for each internal node. This process continues until the tree reaches a stopping condition.

Here’s an example of how to train a decision tree model using scikit-learn library in Python:

from sklearn.tree import DecisionTreeClassifier
# Initialize the decision tree modeldt = DecisionTreeClassifier(criterion=’entropy’, max_depth=3)
# Fit the model to the training datadt.fit(X_train, y_train)

In this example, the criterion parameter is set to ‘entropy’ which means that the information gain will be used as the splitting criteria. The max_depth parameter is set to 3 which means that the tree will stop growing when it reaches a depth of 3.

The most common stopping conditions are:

  1. The tree reaches a maximum depth: To prevent overfitting, a maximum depth for the tree can be set, and the tree stops growing when this depth is reached.
  2. The tree reaches a minimum number of samples in a leaf node: To prevent overfitting, a minimum number of samples can be set for a leaf node, and the tree stops growing when this number is reached.
  3. The tree reaches a minimum improvement in the quality of the split: To prevent overfitting, a minimum improvement in the quality of the split can be set, and the tree stops growing when this improvement is not met.

The stopping condition that you choose depends on the type of problem and the available data. It’s important to note that setting the appropriate stopping condition is essential to prevent overfitting in decision tree models.

Please note that this is just an example and the actual implementation may vary based on the specific use case and dataset, also the dataset should be splitted in train and test set to evaluate the model.

The evaluation of decision tree models, including measures of accuracy and overfitting

Evaluating the performance of a decision tree model is important to ensure that it is able to accurately predict the target variable on unseen data. Some common evaluation metrics used for decision tree models are:

1. Accuracy: It is the proportion of correctly classified instances to the total number of instances. It is a commonly used evaluation metric for classification problems.

from sklearn.metrics import accuracy_score
# Predict the target variable for the test datasety_pred = dt.predict(X_test)
# Calculate the accuracy of the modelaccuracy = accuracy_score(y_test, y_pred)print(“Accuracy: “, accuracy)

2. Confusion Matrix: It is a table that is used to define the performance of a classification algorithm. It helps to define True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) observations.

from sklearn.metrics import confusion_matrix
# Compute the confusion matrixconf_matrix = confusion_matrix(y_test, y_pred)print(conf_matrix)

3. F1 Score: It is the harmonic mean of precision and recall. It is a good metric when you want to seek a balance between precision and recall.

from sklearn.metrics import f1_score
# Compute the F1 scoref1 = f1_score(y_test, y_pred)print(“F1 Score: “, f1)

4. ROC Curve: It is a graphical representation of the performance of a classification algorithm. It helps to understand the relationship between True Positive Rate (TPR) and False Positive Rate (FPR).

from sklearn.metrics import roc_curve, auc
# Compute the ROC curve and the AUCfpr, tpr, _ = roc_curve(y_test, y_pred)roc_auc = auc(fpr, tpr)

Overfitting:

Overfitting is a problem that occurs when a decision tree model is too complex and has low bias but high variance. It occurs when the decision tree is grown too deep and captures noise in the training data. To avoid overfitting, techniques such as pruning, setting a maximum tree depth, and using techniques such as cross-validation can be used. One way to detect overfitting is by comparing the performance on the training set with the performance on the test set, if the performance on the training set is significantly better than the performance on the test set, it’s likely that the model is overfitting.

Comparison with other supervised learning algorithms and their suitability for different types of problems

Decision tree algorithms are one of many supervised learning algorithms used in machine learning. Some other popular supervised learning algorithms include:

  1. Linear Regression: It is a simple and interpretable algorithm that is used for predicting continuous target variables. It is mainly used for regression problems.
  2. Logistic Regression: It is a simple and interpretable algorithm that is used for predicting binary target variables. It is mainly used for binary classification problems.
  3. Random Forest: It is an ensemble of decision trees that is used for both classification and regression problems. It is known for its ability to handle large datasets with high dimensionality and to reduce overfitting.
  4. Gradient Boosting: It is an ensemble of decision trees that is used for both classification and regression problems. It is known for its ability to improve the performance of weak learners and to reduce overfitting.
  5. Support Vector Machine (SVM): It is a powerful algorithm that is used for both classification and regression problems. It is mainly used for problems with a large number of features and a small number of observations.

The choice of algorithm depends on the type of problem and the characteristics of the data. Decision tree algorithms are generally considered to be simple, interpretable, and easy to use, making them a good choice for problems with small to medium-sized datasets and relatively simple relationships between features and target variables. However, other algorithms such as Random Forest or Gradient Boosting can be more powerful in certain situations where the decision tree suffers from overfitting or high variance.

Real-world examples of decision tree algorithms and their implementation in various industries

Decision tree algorithms have been widely used in various industries to solve a variety of problems. Some examples of the application of decision tree algorithms are:

  1. Medical diagnosis: Decision tree algorithms have been used in the medical field to diagnose diseases based on symptoms and test results. They can also be used to predict the risk of a patient developing a certain disease.
  2. Credit risk assessment: Decision tree algorithms have been used in the banking and finance industry to predict the risk of default on a loan or credit card. They can also be used to assess the creditworthiness of a borrower.
  3. Fraud detection: Decision tree algorithms have been used in various industries to detect fraudulent activities. They can be used to identify suspicious transactions or behavior based on past data.
  4. Customer segmentation: Decision tree algorithms have been used in the marketing industry to segment customers based on their demographics, purchase history, and other attributes.
  5. Quality control: Decision tree algorithms have been used in manufacturing to identify defective products based on their attributes.

These are just a few examples of the many ways in which decision tree algorithms can be used to solve real-world problems. The simplicity, interpretability, and high accuracy of decision tree algorithms make them a valuable tool in various industries.

You can also explore: Application of decision tree in credit risk analytics domain

Conclusion

In conclusion, decision tree algorithms are a powerful and widely used tool in machine learning for both classification and regression problems. They are simple to understand, interpretable, and can handle both numerical and categorical data. The basic structure of a decision tree includes the root node, internal nodes, and leaf nodes. The decision tree algorithm starts at the root node and traverses through the tree by making a decision based on the input feature values, until it reaches a leaf node.

The value at the leaf node represents the predicted output value. The key elements that make decision trees a powerful tool include the use of entropy and information gain to determine the best feature to split the data on, and various stopping conditions to prevent overfitting. Decision tree algorithms have been widely used in various industries to solve a variety of problems and can be easily implemented in popular machine learning libraries such as scikit-learn.

Author: Somya Dipayan

FAQs

What is a decision tree algorithm?

A decision tree algorithm is a supervised machine learning algorithm that can be used for both classification and regression problems. It is a flowchart-like tree structure where each internal node represents a feature, each branch represents a decision rule, and each leaf node represents the outcome. The decision tree algorithm starts at the root node and traverses through the tree by making a decision based on the input feature values, until it reaches a leaf node, the value at the leaf node represents the predicted output value.

How does a decision tree algorithm work?

A decision tree algorithm works by recursively selecting the feature that results in the highest information gain or gain ratio for each internal node. This process continues until the tree reaches a stopping condition such as a maximum depth or a minimum number of samples in a leaf node. The decision tree algorithm then makes predictions based on the values of the input features and the decision rules defined in the tree.

What are the advantages of using decision tree algorithms?

Decision tree algorithms have several advantages including: They are simple and interpretable They can handle both numerical and categorical data They can handle large datasets They can be used for both classification and regression problems

What are the disadvantages of using decision tree algorithms?

Decision tree algorithms also have several disadvantages including: They are prone to overfitting if the tree is grown too deep They can be unstable because small variations in the data can lead to different tree structures They can be biased towards features with many outcomes.

How can overfitting be avoided in decision tree algorithms?

Overfitting can be avoided in decision tree algorithms by using techniques such as pruning, setting a maximum tree depth, or using techniques such as cross-validation or regularization. Additionally, using appropriate stopping conditions such as limiting the maximum depth or the minimum number of samples in a leaf node can also help prevent overfitting. Another way to prevent overfitting is by using ensemble methods like Random Forest and Gradient Boosting which combine multiple decision trees to create a more robust model.

What are the different types of decision tree algorithms?

There are several types of decision tree algorithms, including: ID3 (Iterative Dichotomiser 3) C4.5 (successor of ID3) C5.0 (successor of C4.5) Random Forest Gradient Boosting Decision Tree (GBDT)

How is the best feature or attribute selected in decision tree algorithms?

The best feature or attribute is selected in decision tree algorithms by using a measure of impurity or disorder such as entropy or information gain. The attribute that results in the highest information gain or gain ratio is selected as the root or internal node of the decision tree.

What are the stopping conditions for decision tree algorithms?

The most common stopping conditions for decision tree algorithms include: The tree reaches a maximum depth The tree reaches a minimum number of samples in a leaf node The tree reaches a minimum improvement in the quality of the split

How does a decision tree algorithm make predictions?

A decision tree algorithm makes predictions by traversing through the tree using the values of the input features, and following the decision rules defined in the tree. The prediction is made at the leaf node that is reached.

About the Author

This is a collection of insightful articles from domain experts in the fields of Cloud Computing, DevOps, AWS, Data Science, Machine Learning, AI, and Natural Language Processing. The range of topics caters to upski... Read Full Bio