Predicting Categorical Data Using Classification Algorithms

Updated on Jan 27, 2023 17:12 IST

This article will demonstrate how you can build classification models using ML’s favorite programming language – Python.

Classification Algorithms are Supervised Machine Learning Algorithms that use labeled data (aka training datasets) to train classifier models. These models then predict outcomes with the best possible accuracy when new data (aka testing datasets) is fed to them.

The outcome predicted by a classification algorithm is categorical in nature. These algorithms assign each input to one of a specific set of classes, such as an SMS filter on your phone that classifies a text message as a transaction or a promotion.

Overview of Classification Algorithms

Classification techniques predict discrete class label output(s) to which the data elements belong. For example, weather prediction is a type of classification problem – ‘hot’ and ‘cold’ being the class labels. This is called binary classification since there are only two classes.

A few more examples of classification problems:

  • Speech recognition
  • Face detection
  • Spam texts/e-mails classification
  • Stock market prediction
  • Breast cancer detection
  • Employee Attrition prediction

How do Classification Algorithms work?

A classifier uses known (training) data to learn how the input (independent) variables relate to the target (dependent) variable.

In the weather example above, we would take the outside temperatures of previous days, together with their ‘hot’ or ‘cold’ labels, as the training data. This data would be fed into the classifier; if it is trained well, it would be able to predict future weather conditions.

We use Binary Classifiers in case there are only two classes and Multi-class Classifiers for more than two class divisions.
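
To make the weather example concrete, here is a minimal sketch of a binary classifier that learns a single temperature cut-off from labeled data. The temperatures, labels, and the helper names `fit_threshold` and `predict` are made up for illustration and are not part of any library:

```python
import numpy as np

# Hypothetical training data: past temperatures (in °C) with known labels
temps = np.array([5, 8, 12, 15, 28, 30, 33, 36], dtype=float)
labels = np.array(["cold", "cold", "cold", "cold", "hot", "hot", "hot", "hot"])

def fit_threshold(x, y):
    """Learn a cut-off halfway between the mean 'cold' and mean 'hot' temperature."""
    return (x[y == "cold"].mean() + x[y == "hot"].mean()) / 2

def predict(x, threshold):
    """Label each temperature by comparing it with the learned cut-off."""
    return np.where(np.asarray(x, dtype=float) > threshold, "hot", "cold")

threshold = fit_threshold(temps, labels)   # "training"
predictions = predict([10, 31], threshold) # "testing" on unseen temperatures
print(threshold, predictions)
```

Real classifiers learn far richer decision rules, but the train-then-predict flow is the same.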

Types of Classification Algorithms

Which algorithm to use depends on the application and the nature of the data. The most common classification algorithms include:

Logistic Regression 

Note that, despite the “Regression” in its name, Logistic Regression is actually a linear classification algorithm. It is typically used when the classes are binary and approximately linearly separable, like true (1) or false (0), win (1) or lose (0), etc.

Logistic regression uses a sigmoid function to map a weighted sum of the input features to a probability between 0 and 1. The resulting curve is called a sigmoid curve or an S-curve. The object is then assigned a label by comparing this probability with a pre-defined threshold (commonly 0.5).
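
The sigmoid-plus-threshold step can be sketched in a few lines of NumPy. The input scores and the 0.5 threshold are assumptions chosen for illustration; in a trained model the scores would come from the learned weights:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(z, threshold=0.5):
    """Assign label 1 when the probability reaches the threshold, else 0."""
    return (sigmoid(z) >= threshold).astype(int)

# Hypothetical linear scores (weighted sums of features) for three samples
scores = np.array([-2.0, 0.0, 3.0])
probabilities = sigmoid(scores)
labels = predict_label(scores)
print(probabilities, labels)
```

Raising the threshold makes the classifier more conservative about predicting the positive class.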

K-Nearest Neighbors (KNN)

If your dataset has n features, KNN represents each data point in an n-dimensional space. It then calculates the distance between an unlabeled data point and the labeled data points. The unlabeled point is assigned the most common label among its k nearest labeled neighbors. KNN is commonly used for recommender systems, credit scoring, etc.
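
A minimal from-scratch sketch of this idea, assuming Euclidean distance and a simple majority vote. The function `knn_predict` and the toy points are hypothetical; scikit-learn's `KNeighborsClassifier` is what you would normally use:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Return the majority label among the k training points closest to x_new."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k nearest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-feature dataset with two well-separated classes
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 7.5], [6.5, 8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.2, 1.4])))
print(knn_predict(X_train, y_train, np.array([6.8, 7.0])))
```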

Support Vector Machine (SVM)

A support vector classifier finds a hyperplane, called the decision boundary, that separates the data points into classes. The data points closest to the decision boundary are called support vectors. The margin is the shortest perpendicular distance between the support vectors and the decision boundary, and the optimum decision boundary is the one that maximizes this margin.
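
As a rough illustration, the sketch below fits a linear `SVC` on four made-up, linearly separable points and reads off the support vectors and the margin. For a linear kernel the boundary is w·x + b = 0, and the distance from it to the nearest points is 1/||w||. The data and the large `C` value (to approximate a hard margin) are assumptions for this example:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: class 0 on the left, class 1 on the right
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A linear kernel with a large C approximates a hard-margin classifier
svc = SVC(kernel="linear", C=1000).fit(X, y)

w = svc.coef_[0]                  # normal vector of the decision boundary
b = svc.intercept_[0]             # offset of the decision boundary
margin = 1.0 / np.linalg.norm(w)  # distance from the boundary to the nearest points

print("Support vectors:\n", svc.support_vectors_)
print("Margin:", margin)
```

Here the boundary sits midway between the two columns of points, so the margin comes out close to 1.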

Decision Tree

As the name suggests, this algorithm builds “branches” in a hierarchical manner where each branch can be considered as an if-else statement. The branches divide the dataset into subsets based on the most important features. The “leaves” of the decision tree are where the final classifications happen.
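
To see the if-else structure directly, you can print a fitted tree's rules with scikit-learn's `export_text`. The one-feature temperature dataset below is made up for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: one feature (temperature) and two classes
X = [[5], [10], [15], [28], [32], [36]]
y = ["cold", "cold", "cold", "hot", "hot", "hot"]

tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Each printed branch reads like an if-else test; each leaf ends in a class
rules = export_text(tree, feature_names=["temperature"])
print(rules)
print(tree.predict([[8], [30]]))
```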

Random Forest

Just as a forest is made of trees, a random forest is a collection of decision trees. This classifier aggregates the results from multiple predictors. It uses the bagging technique, in which each tree is trained on a random sample of the original dataset, and takes the majority vote of the trees. A random forest classifier generalizes better but is less interpretable than a single decision tree classifier, because its prediction combines many trees rather than following one readable set of rules.
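
The bagging-and-voting idea can be sketched by hand with a few decision trees. This is a simplified illustration on synthetic data (the dataset, the 25 trees, and the seeds are all assumptions), not how `RandomForestClassifier` is implemented internally:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, well-separated two-class data (labels are 0 and 1)
X, y = make_blobs(n_samples=300, centers=[[-3, -3], [3, 3]],
                  cluster_std=1.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
trees = []
for _ in range(25):  # odd number of trees, so a binary vote cannot tie
    # Bagging: draw a bootstrap sample (with replacement) of the training rows
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# Majority vote: each row of `votes` holds one tree's predictions
votes = np.array([t.predict(X_te) for t in trees])
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
accuracy = (forest_pred == y_te).mean()
print("Ensemble accuracy:", accuracy)
```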

Naïve Bayes

This classifier assigns data to different classes according to Bayes’ Theorem. However, it assumes that all input features are independent of each other within a class, which is why the model is called naïve. This algorithm works relatively well even when the training dataset is small. Naïve Bayes is commonly used for text classification, sentiment analysis, etc.

You can understand the working of the Naïve Bayes algorithm in-depth here.
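
As a small worked example of the naïve independence assumption, the sketch below scores a message by multiplying a class prior with per-word likelihoods and then normalizing. All probabilities are invented for illustration, and for simplicity only the words present in the message are considered:

```python
# Hypothetical class priors and per-word likelihoods P(word | class)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"offer": 0.7, "meeting": 0.1},
    "ham": {"offer": 0.1, "meeting": 0.5},
}

def naive_bayes_posterior(words):
    """Return P(class | words), treating the words as independent given the class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]  # naive step: multiply likelihoods
        scores[label] = score
    total = sum(scores.values())               # normalize so posteriors sum to 1
    return {label: score / total for label, score in scores.items()}

posterior = naive_bayes_posterior(["offer"])
prediction = max(posterior, key=posterior.get)
print(posterior, prediction)
```

For a message containing only "offer", the unnormalized scores are 0.4 × 0.7 = 0.28 for spam and 0.6 × 0.1 = 0.06 for ham, so spam wins.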

Classification algorithms are either Lazy learners or Eager learners:

  • Lazy learners simply store the training data and wait for the testing data. They perform classification after that based on the most related data in the stored training set. Lazy learners take less time to train but more time to predict. KNN is a lazy learner.
  • Eager learners build a classification model from the given training data before receiving the testing data; they do not wait for the testing data. Eager learners take a long time to train due to model construction and less time to predict. Decision Trees and Naïve Bayes are examples of eager learners.

Predicting Categorical Values Using Classification Algorithms

For demonstration, we are going to build a model that can predict whether a patient has heart disease or not based on the features provided in the dataset given here.

We will use the six classification algorithms we have discussed above. Based on their accuracy scores, we will select the best algorithm.

Let’s get started!

Step 1 – Import the required libraries

We use Python’s scikit-learn package when working with machine learning models:

 
#Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

Step 2 – Load the dataset 

 
#Read the dataset
data=pd.read_csv('heart.csv')
#Display the first five rows
print(data.head())

Step 3 – Prepare the data

The column HeartDisease is our target variable. Let's check how many classes it has:

 
data.groupby('HeartDisease').count()

Two classes: 0 (False – no disease) and 1 (True – heart disease).

 
#Check for null values
print(data.isnull().sum())

Step 4 – Transform the data

Check data types of all columns:

 
#Check data types
data.dtypes

So, there are object data types and a float type as well. We have to convert these labels to numeric (int64) form so that the models can work with them. This is done through label encoding:

 
def label_encoder(y):
    #Replace each distinct value in column y with an integer code
    le = LabelEncoder()
    data[y] = le.fit_transform(data[y])

#Note: Oldpeak is already numeric (float); label-encoding it replaces each value
#with its position among the sorted unique values
label_list = ["Sex","ChestPainType", "RestingECG","ExerciseAngina","Oldpeak", "ST_Slope"]
for l in label_list:
    label_encoder(l)
#Display transformed data
data.head()
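
To see what label encoding does to a single column, here is a small standalone sketch. The category values are hypothetical stand-ins for a column such as ChestPainType:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column
chest_pain = pd.Series(["ATA", "NAP", "ASY", "NAP", "TA"])

le = LabelEncoder()
encoded = le.fit_transform(chest_pain)

# classes_ holds the distinct values in sorted order; each value's index is its code
mapping = dict(zip(le.classes_, range(len(le.classes_))))
decoded = le.inverse_transform(encoded)  # codes can be mapped back to labels

print(mapping)
print(encoded)
```

Keep in mind that the integer codes imply an ordering that the original categories may not have, which matters for distance-based and linear models.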

Step 5 – Split the data

Split the data into training and testing sets:

 
#Divide the dataset into independent and dependent variables
X = data.drop(["HeartDisease"],axis=1)
y = data['HeartDisease']
#Split the data into training and testing sets (80% train, 20% test)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,
                                               random_state=42, shuffle=True)
#Convert the targets to 1-D arrays, the shape scikit-learn estimators expect for y
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()
print("X_train shape:",X_train.shape)
print("X_test shape:",X_test.shape)
print("y_train shape:",y_train.shape)
print("y_test shape:",y_test.shape)

Step 6 – Standardize the data

We will perform feature scaling to rescale data to have a mean of 0 and standard deviation of 1 (unit variance):

 
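#Feature Scaling
sc = StandardScaler()
#Learn the mean and standard deviation from the training set only
X_train = sc.fit_transform(X_train)
#Reuse the training statistics on the test set to avoid data leakage
X_test = sc.transform(X_test)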
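
One detail of feature scaling is worth a quick illustration: the scaler's mean and standard deviation are normally learned from the training set and then reused on the test set, so the test data does not influence preprocessing. The tiny arrays below are made up to show the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[3.0], [6.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean and std, then scales
X_test_scaled = sc.transform(X_test)        # reuses the training mean and std

print("Train mean/std:", X_train_scaled.mean(), X_train_scaled.std())
print("Scaled test values:", X_test_scaled.ravel())
```

The scaled training data has mean 0 and standard deviation 1, while the test values are simply expressed in training-set units and need not have those statistics themselves.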

Step 7 – Implement Classification Models

We will build all six models and compare their accuracy scores.

 
#To store results of models, we create two dictionaries
result_dict_train = {}
result_dict_test = {}

  1. Logistic Regression
 
reg = LogisticRegression(random_state = 42)
#Mean 5-fold cross-validation accuracy on the training set
accuracies = cross_val_score(reg, X_train, y_train, cv=5)
reg.fit(X_train,y_train)
y_pred = reg.predict(X_test)
#Obtain accuracy
print("Train Score:",np.mean(accuracies))
print("Test Score:",reg.score(X_test,y_test))
#Store results in the dictionaries
result_dict_train["Logistic Train Score"] = np.mean(accuracies)
result_dict_test["Logistic Test Score"] = reg.score(X_test,y_test)

  2. KNN Classifier
 
knn = KNeighborsClassifier()
#Mean 5-fold cross-validation accuracy on the training set
accuracies = cross_val_score(knn, X_train, y_train, cv=5)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)
#Obtain accuracy
print("Train Score:",np.mean(accuracies))
print("Test Score:",knn.score(X_test,y_test))
#Store results in the dictionaries
result_dict_train["KNN Train Score"] = np.mean(accuracies)
result_dict_test["KNN Test Score"] = knn.score(X_test,y_test)

  3. Support Vector Classifier
 
svc = SVC(random_state = 42)
#Mean 5-fold cross-validation accuracy on the training set
accuracies = cross_val_score(svc, X_train, y_train, cv=5)
svc.fit(X_train,y_train)
y_pred = svc.predict(X_test)
#Obtain accuracy
print("Train Score:",np.mean(accuracies))
print("Test Score:",svc.score(X_test,y_test))
#Store results in the dictionaries
result_dict_train["SVM Train Score"] = np.mean(accuracies)
result_dict_test["SVM Test Score"] = svc.score(X_test,y_test)

  4. Decision Tree Classifier
 
dtc = DecisionTreeClassifier(random_state = 42)
#Mean 5-fold cross-validation accuracy on the training set
accuracies = cross_val_score(dtc, X_train, y_train, cv=5)
dtc.fit(X_train,y_train)
y_pred = dtc.predict(X_test)
#Obtain accuracy
print("Train Score:",np.mean(accuracies))
print("Test Score:",dtc.score(X_test,y_test))
#Store results in the dictionaries
result_dict_train["Decision Tree Train Score"] = np.mean(accuracies)
result_dict_test["Decision Tree Test Score"] = dtc.score(X_test,y_test)

  5. Random Forest Classifier
 
rfc = RandomForestClassifier(random_state = 42)
#Mean 5-fold cross-validation accuracy on the training set
accuracies = cross_val_score(rfc, X_train, y_train, cv=5)
rfc.fit(X_train,y_train)
y_pred = rfc.predict(X_test)
#Obtain accuracy
print("Train Score:",np.mean(accuracies))
print("Test Score:",rfc.score(X_test,y_test))
#Store results in the dictionaries
result_dict_train["Random Forest Train Score"] = np.mean(accuracies)
result_dict_test["Random Forest Test Score"] = rfc.score(X_test,y_test)

  6. Naïve Bayes Classifier
 
gnb = GaussianNB()
#Mean 5-fold cross-validation accuracy on the training set
accuracies = cross_val_score(gnb, X_train, y_train, cv=5)
gnb.fit(X_train,y_train)
y_pred = gnb.predict(X_test)
#Obtain accuracy
print("Train Score:",np.mean(accuracies))
print("Test Score:",gnb.score(X_test,y_test))
#Store results in the dictionaries
result_dict_train["Gaussian NB Train Score"] = np.mean(accuracies)
result_dict_test["Gaussian NB Test Score"] = gnb.score(X_test,y_test)
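
Since the six blocks follow the same pattern, the same comparison can also be written as a loop over a dictionary of models. The sketch below is one possible compact variant; it uses synthetic data from `make_classification` rather than the heart dataset, so its scores are not comparable to the ones in this article:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: any numeric binary-classification dataset works)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Logistic": LogisticRegression(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gaussian NB": GaussianNB(),
}

results = {}
for name, model in models.items():
    # Mean 5-fold cross-validation accuracy on the training set
    cv_mean = cross_val_score(model, X_train, y_train, cv=5).mean()
    model.fit(X_train, y_train)
    results[name] = {"cv_mean": cv_mean, "test": model.score(X_test, y_test)}

for name, scores in results.items():
    print(f"{name:15s} CV: {scores['cv_mean']:.3f}  Test: {scores['test']:.3f}")
```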

Step 8 – Compare Accuracy Scores

 
df_result_train = pd.DataFrame.from_dict(result_dict_train,orient = "index", columns=["Score"])
df_result_train

Let’s display the test accuracy scores of all models:

 
df_result_test = pd.DataFrame.from_dict(result_dict_test,orient = "index",columns=["Score"])
df_result_test

Let’s visualize the scores, shall we?

 
import seaborn as sns
fig,ax = plt.subplots(1,2,figsize=(20,5))
sns.barplot(x = df_result_train.index,y = df_result_train.Score,ax = ax[0])
sns.barplot(x = df_result_test.index,y = df_result_test.Score,ax = ax[1])
ax[0].set_xticklabels(df_result_train.index,rotation = 75)
ax[1].set_xticklabels(df_result_test.index,rotation = 75)
plt.show()

From the above graphs, we can conclude the following:

  • The Random Forest classifier has the highest test score
  • The Decision Tree classifier has the lowest score among all classifiers.

Once you have trained your model, the next important step is to evaluate and optimize the classifier to verify its applicability. Learn how to perform model evaluation here. 
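
As a rough sketch of what evaluation and optimization can look like, the example below tunes the number of neighbors of a KNN classifier with `GridSearchCV` and then inspects a confusion matrix on the held-out test set. The synthetic data and the parameter grid are assumptions for illustration, not results from the heart dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Try several values of k, scoring each with 5-fold cross-validation
param_grid = {"n_neighbors": [3, 5, 7, 9]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# The fitted search object predicts with the best model it found
y_pred = grid.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

print("Best parameters:", grid.best_params_)
print("Confusion matrix:\n", cm)
```

The confusion matrix shows where the errors fall (false positives versus false negatives), which a single accuracy number hides.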

Endnotes

Having a clear understanding of how to choose the right classification model is instrumental in solving supervised Machine Learning problems. Artificial Intelligence & Machine Learning is a rapidly growing domain that has hugely impacted big businesses worldwide. Interested in being a part of this frenzy? Explore related articles here.

