Predicting Categorical Data Using Classification Algorithms

Updated on Jan 27, 2023 17:12 IST

This article will demonstrate how you can build classification models using ML’s favorite programming language – Python.

Classification Algorithms are Supervised Machine Learning Algorithms that use labeled data (aka training datasets) to train classifier models. These models then predict outcomes with the best possible accuracy when new data (aka testing datasets) is fed to them.

The outcome predicted by a classification algorithm is categorical in nature. These algorithms assign inputs to a specific set of classes, such as an SMS filter on your iPhone classifying a text message as a transaction or a promotion.

We are going to cover the following sections:

Overview of Classification Algorithms
How do Classification Algorithms work?
Types of Classification Algorithms
Predicting Categorical Values Using Classification Algorithms
Endnotes

Overview of Classification Algorithms

Classification techniques predict discrete class label output(s) to which the data elements belong. For example, weather prediction is a type of classification problem – ‘hot’ and ‘cold’ being the class labels. This is called binary classification since there are only two classes.

A few more examples of classification problems:

Speech recognition
Face detection
Spam texts/e-mails classification
Stock market prediction
Breast cancer detection
Employee Attrition prediction

How do Classification Algorithms work?

A classifier utilizes known (training) data to understand how the given input (independent) variables relate to the target (dependent) variable.

In the weather example above, we would take the outside temperatures of previous days, along with their ‘hot’ or ‘cold’ labels, as the training data. This data would be fed into the classifier; if it is trained well, it would be able to predict future weather conditions.

We use Binary Classifiers in case there are only two classes and Multi-class Classifiers for more than two class divisions.
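To make the idea concrete, here is a minimal sketch of a binary classifier for the weather example. The temperatures, the labels, and the nearest-mean rule are all assumptions chosen for illustration; they are not the method used later in this article.

```python
# Hypothetical training data: (temperature in degrees Celsius, label)
training_data = [(35, "hot"), (32, "hot"), (30, "hot"),
                 (12, "cold"), (8, "cold"), (15, "cold")]

def train(data):
    """Learn the mean temperature of each class from the labeled examples."""
    sums, counts = {}, {}
    for temp, label in data:
        sums[label] = sums.get(label, 0) + temp
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(class_means, temp):
    """Assign the label whose mean temperature is closest to the input."""
    return min(class_means, key=lambda label: abs(class_means[label] - temp))

model = train(training_data)
print(predict(model, 28))  # hot
print(predict(model, 10))  # cold
```

The "training" step summarizes the labeled data, and the "prediction" step applies that summary to unseen inputs. With more than two labels in the training data, the same code would act as a multi-class classifier.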

Types of Classification Algorithms

When to use which algorithm would depend on the application and the nature of the data. The most common classification algorithms include:

Logistic Regression
K-Nearest Neighbors (KNN)
Support Vector Machine (SVM)
Decision Tree
Random Forest
Naïve Bayes

Logistic Regression

Note that although the name says “Regression”, logistic regression is actually a linear classification algorithm. In its basic form it handles binary classes, such as true (1) or false (0), win (1) or lose (0), and it works best when the classes are roughly linearly separable.

Logistic regression uses a sigmoid function to return the probability of a label; the resulting curve is called a sigmoid curve or an S-curve. The probability is then compared with a pre-defined threshold (commonly 0.5), and the object is assigned to a label accordingly.
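The sigmoid-and-threshold step can be sketched in a few lines of plain Python. The weight and bias values below are made up for illustration; in practice they are learned from the training data.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(x, weight, bias, threshold=0.5):
    """Return (label, probability) for a single numeric feature x."""
    probability = sigmoid(weight * x + bias)
    label = 1 if probability >= threshold else 0
    return label, probability

# Hypothetical model parameters, not learned from any real data
label, prob = predict_label(x=2.0, weight=1.5, bias=-1.0)
print(label, round(prob, 3))  # 1 0.881
```

Raising the threshold makes the model more reluctant to predict class 1, which is one way to trade false positives against false negatives.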

K-Nearest Neighbors (KNN)

If your dataset has n features, KNN represents each data point in an n-dimensional space and calculates the distances between the data points. An unobserved data point is then assigned the most common label among its k nearest observed data points. KNN is commonly used for recommender systems, credit scoring, etc.
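A minimal sketch of this idea, assuming Euclidean distance and a simple majority vote, looks like this. The points and labels are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query point by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature training data
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (2.0, 2.0)))  # A
print(knn_predict(points, labels, (6.0, 6.5)))  # B
```

Note that there is no training step at all: every prediction scans the stored training points.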

Support Vector Machine (SVM)

A support vector classifier finds a hyperplane, called the decision boundary, that separates the data points into classes. The data points closest to the decision boundary are called support vectors, and the margin is the shortest perpendicular distance between the support vectors and the decision boundary. The optimum decision boundary is the one with the maximum margin.
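The geometry can be illustrated with a small sketch that measures distances to a given decision boundary. The hyperplane and the points are assumptions picked by hand; the code does not search for the optimum boundary the way an SVM solver does.

```python
import math

def signed_distance(point, w, b):
    """Perpendicular distance from a point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm

# Hypothetical decision boundary: x1 + x2 - 5 = 0
w, b = (1.0, 1.0), -5.0
points = [(1.0, 2.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]

distances = [signed_distance(p, w, b) for p in points]
predicted = [1 if d >= 0 else -1 for d in distances]   # side of the boundary
margin = min(abs(d) for d in distances)                # distance to closest points
support_vectors = [p for p, d in zip(points, distances)
                   if abs(abs(d) - margin) < 1e-9]

print(predicted)        # [-1, -1, 1, 1]
print(support_vectors)  # [(2.0, 2.0), (3.0, 3.0)]
```

Only the points closest to the boundary determine the margin; moving the other points farther away would not change it.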

Decision Tree

As the name suggests, this algorithm builds “branches” in a hierarchical manner where each branch can be considered as an if-else statement. The branches divide the dataset into subsets based on the most important features. The “leaves” of the decision tree are where the final classifications happen.
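One common way to decide where a branch should split is to minimize Gini impurity. The sketch below assumes that criterion and a single numeric feature; the ages and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for an even two-class mix."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Try each observed value as a split point and keep the purest split."""
    best = (None, float("inf"))
    for threshold in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical data: patient ages and whether they have the disease
ages = [29, 35, 41, 52, 58, 63]
disease = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(ages, disease)
print(threshold, impurity)  # 41 0.0

# The learned branch is just an if-else rule
def predict(age):
    return 1 if age > threshold else 0
```

A full decision tree repeats this search on each resulting subset until the leaves are pure enough or a depth limit is reached.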

Random Forest

Just as a forest is made up of trees, a random forest is a collection of decision trees. This classifier aggregates the results from multiple predictors. It additionally utilizes the bagging technique, in which each tree is trained on a random sample of the original dataset, and takes the majority vote of the trees. A random forest classifier usually generalizes better than a single decision tree but is less interpretable, because its prediction combines many trees rather than following a single set of if-else rules.
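The two ingredients named above, bagging and majority voting, can be sketched as follows. The row indices and the per-tree votes are placeholders; a real random forest would also train a decision tree on each sample.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Bagging step: draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Aggregation step: return the label predicted by most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)            # fixed seed so the sketch is repeatable
rows = list(range(10))             # stand-in for ten training rows
sample = bootstrap_sample(rows, rng)

tree_votes = [1, 0, 1, 1, 0]       # hypothetical predictions from five trees
forest_prediction = majority_vote(tree_votes)
print(forest_prediction)  # 1
```

Because sampling is done with replacement, some rows usually appear more than once in a sample and others not at all, which is what makes the individual trees differ from each other.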

Naïve Bayes

This classifier assigns data to different classes according to Bayes’ Theorem, under the assumption that all input features are independent of one another given the class. This assumption rarely holds exactly, which is why the model is called naïve. The algorithm works relatively well even when the size of the training dataset is small. Naïve Bayes is commonly used for text classification, sentiment analysis, etc.

You can understand the working of the Naïve Bayes algorithm in-depth here.
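As a rough illustration of the independence assumption, here is a small word-count Naïve Bayes text classifier with add-one (Laplace) smoothing. The messages and labels are invented, and this multinomial variant is a swapped-in example; it is not the GaussianNB model used later in this article.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(messages):
    """Count how often each class and each word-per-class occurs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in messages:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def classify(model, text):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Each word contributes independently (the "naive" assumption);
            # adding 1 avoids zero probabilities for unseen words
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled SMS messages
messages = [
    ("your account was debited", "transaction"),
    ("payment received in your account", "transaction"),
    ("flat discount on all orders", "promotion"),
    ("exclusive offer discount today", "promotion"),
]
model = train_naive_bayes(messages)
print(classify(model, "discount offer on orders"))  # promotion
print(classify(model, "account debited"))           # transaction
```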

Classification algorithms are either Lazy learners or Eager learners:

Lazy learners simply store the training data and wait for the testing data. They then perform classification based on the most similar data in the stored training set. Lazy learners take less time to train but more time to predict. KNN is a lazy learner.

Eager learners are algorithms that build a classification model from the given training data before receiving the testing data, and then use that model to perform classification. Eager learners take a long time to train due to model construction but less time to predict. Decision Trees and Naïve Bayes are examples of eager learners.

Predicting Categorical Values Using Classification Algorithms

For demonstration, we are going to build a model that can predict whether a patient has heart disease or not based on the features provided in the dataset given here.

We will use the six classification algorithms we have discussed above. Based on their accuracy scores, we will select the best algorithm.

Let’s get started!

Step 1 – Import the required libraries

We use Python’s scikit-learn package when working with machine learning models:

#Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
 

Step 2 – Load the dataset

#Read the dataset
data=pd.read_csv('heart.csv')
 
#Display the first five rows
print(data.head())

Step 3 – Prepare the data

The column HeartDisease is our target variable. Let’s check how many classes the target variable has and how many rows belong to each:

#Count the rows in each class of the target variable
print(data['HeartDisease'].value_counts())

Two classes: 0 (False – no disease) and 1 (True – heart disease).

#Check for null values
print(data.isnull().sum())

Step 4 – Transform the data

Check data types of all columns:

#Check data types
data.dtypes

So, there are object data types and a float type as well. The float column (Oldpeak) is already numeric, so only the object columns need to be converted to numeric (integer) form to become machine-readable. This is done through label encoding:

def label_encoder(column):
    #Replace the string categories in the given column with integer codes
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])

#Only the categorical (object) columns are encoded
label_list = ["Sex","ChestPainType", "RestingECG","ExerciseAngina", "ST_Slope"]

for column in label_list:
    label_encoder(column)

#Display transformed data
data.head()
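What label encoding does can be sketched without any library: each distinct category is mapped to an integer, with the categories taken in sorted order, which mirrors how LabelEncoder assigns its codes. The category values below are illustrative and the helper name is hypothetical.

```python
def fit_label_mapping(values):
    """Map each distinct category to an integer, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

# Illustrative category values for a chest pain column
chest_pain = ["ATA", "NAP", "ATA", "ASY", "TA"]

mapping = fit_label_mapping(chest_pain)
encoded = [mapping[value] for value in chest_pain]

print(mapping)  # {'ASY': 0, 'ATA': 1, 'NAP': 2, 'TA': 3}
print(encoded)  # [1, 2, 1, 0, 3]
```

Keep in mind that the integer codes imply an order (0 < 1 < 2 < 3) that the original categories do not have, which some models may pick up on.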

Step 5 – Split the data

Split the data into training and testing sets:

#Divide the dataset into independent and dependent variables
X = data.drop(["HeartDisease"],axis=1)
y = data['HeartDisease']
 
#Split the data into training and testing set
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,
                                               random_state=42, shuffle=True) 
 
#The data is split into 80% training data and 20% test data.

#Keep the targets as 1-D arrays, which is the shape scikit-learn estimators expect
y_train = y_train.values
y_test = y_test.values
 
print("X_train shape:",X_train.shape)
print("X_test shape:",X_test.shape)
print("y_train shape:",y_train.shape)
print("y_test shape:",y_test.shape)
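Conceptually, a shuffled 80/20 split can be sketched in plain Python as below. This is a simplified stand-in with a hypothetical helper name; train_test_split uses its own shuffling, so the rows it selects will differ even with the same seed.

```python
import random

def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle a copy of the rows, then cut off the first part as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))            # stand-in for 100 dataset rows
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 80 20
```

Fixing the seed plays the same role as random_state: the same rows end up in the same set every time the code runs.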

Step 6 – Standardize the data

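We will perform feature scaling to rescale the data to a mean of 0 and a standard deviation of 1 (unit variance). The scaler is fitted on the training set only and then applied to the test set, so that no information from the test set leaks into training:

#Feature Scaling
sc = StandardScaler()
#Learn the mean and standard deviation from the training data and scale it
X_train = sc.fit_transform(X_train)
#Reuse the training statistics to scale the test data
X_test = sc.transform(X_test)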
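A minimal sketch of what standardization computes for a single column is shown below, with the mean and standard deviation estimated from the training values and then reused for the test values. The numbers and helper names are made up for illustration.

```python
import statistics

def fit_scaler(column):
    """Estimate the mean and (population) standard deviation of a column."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Convert each value to a z-score using previously fitted statistics."""
    return [(value - mean) / std for value in column]

# Hypothetical ages in the training and test sets
train_ages = [40, 50, 60]
test_ages = [45, 70]

mean, std = fit_scaler(train_ages)              # fitted on training data only
scaled_train = transform(train_ages, mean, std)
scaled_test = transform(test_ages, mean, std)   # same mean and std reused

print([round(v, 3) for v in scaled_train])  # [-1.225, 0.0, 1.225]
print([round(v, 3) for v in scaled_test])   # [-0.612, 2.449]
```

The scaled training column has mean 0 and unit variance; the scaled test column generally does not, and that is expected, because it is measured on the training set's scale.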

Step 7 – Implement Classification Models

We will build all six models and compare their accuracy scores.

#To store results of models, we create two dictionaries
result_dict_train = {}
result_dict_test = {}

Logistic Regression

reg = LogisticRegression(random_state = 42)
#Estimate accuracy with 5-fold cross-validation on the training set
accuracies = cross_val_score(reg, X_train, y_train, cv=5)
#Fit on the full training set and predict the test set
reg.fit(X_train,y_train)
y_pred = reg.predict(X_test)

#Obtain accuracy ("Train Score" is the mean cross-validation accuracy)
print("Train Score:",np.mean(accuracies))
print("Test Score:",reg.score(X_test,y_test))

#Store results in the dictionaries
result_dict_train["Logistic Train Score"] = np.mean(accuracies)
result_dict_test["Logistic Test Score"] = reg.score(X_test,y_test)
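The "Train Score" reported for each model is the mean accuracy over 5 cross-validation folds (cv=5) of the training set. The way folds partition the data can be sketched as follows; this uses plain contiguous folds under a hypothetical helper name, whereas scikit-learn uses stratified folds for classifiers, so treat it as a simplification.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(10, k=5))
for train_idx, val_idx in folds:
    print(val_idx, "held out;", len(train_idx), "rows used for fitting")
```

Each row is held out exactly once, the model is fitted k times on the remaining rows, and the k validation accuracies are averaged.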

KNN Classifier

knn = KNeighborsClassifier()
#5-fold cross-validation on the training set
accuracies = cross_val_score(knn, X_train, y_train, cv=5)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)

#Obtain accuracy ("Train Score" is the mean cross-validation accuracy)
print("Train Score:",np.mean(accuracies))
print("Test Score:",knn.score(X_test,y_test))

#Store results in the dictionaries
result_dict_train["KNN Train Score"] = np.mean(accuracies)
result_dict_test["KNN Test Score"] = knn.score(X_test,y_test)

Support Vector Classifier

svc = SVC(random_state = 42)
#5-fold cross-validation on the training set
accuracies = cross_val_score(svc, X_train, y_train, cv=5)
svc.fit(X_train,y_train)
y_pred = svc.predict(X_test)

#Obtain accuracy ("Train Score" is the mean cross-validation accuracy)
print("Train Score:",np.mean(accuracies))
print("Test Score:",svc.score(X_test,y_test))

#Store results in the dictionaries
result_dict_train["SVM Train Score"] = np.mean(accuracies)
result_dict_test["SVM Test Score"] = svc.score(X_test,y_test)

Decision Tree Classifier

dtc = DecisionTreeClassifier(random_state = 42)
#5-fold cross-validation on the training set
accuracies = cross_val_score(dtc, X_train, y_train, cv=5)
dtc.fit(X_train,y_train)
y_pred = dtc.predict(X_test)

#Obtain accuracy ("Train Score" is the mean cross-validation accuracy)
print("Train Score:",np.mean(accuracies))
print("Test Score:",dtc.score(X_test,y_test))

#Store results in the dictionaries
result_dict_train["Decision Tree Train Score"] = np.mean(accuracies)
result_dict_test["Decision Tree Test Score"] = dtc.score(X_test,y_test)

Random Forest Classifier

rfc = RandomForestClassifier(random_state = 42)
#5-fold cross-validation on the training set
accuracies = cross_val_score(rfc, X_train, y_train, cv=5)
rfc.fit(X_train,y_train)
y_pred = rfc.predict(X_test)

#Obtain accuracy ("Train Score" is the mean cross-validation accuracy)
print("Train Score:",np.mean(accuracies))
print("Test Score:",rfc.score(X_test,y_test))

#Store results in the dictionaries
result_dict_train["Random Forest Train Score"] = np.mean(accuracies)
result_dict_test["Random Forest Test Score"] = rfc.score(X_test,y_test)

Naïve Bayes Classifier

gnb = GaussianNB()
#5-fold cross-validation on the training set
accuracies = cross_val_score(gnb, X_train, y_train, cv=5)
gnb.fit(X_train,y_train)
y_pred = gnb.predict(X_test)

#Obtain accuracy ("Train Score" is the mean cross-validation accuracy)
print("Train Score:",np.mean(accuracies))
print("Test Score:",gnb.score(X_test,y_test))

#Store results in the dictionaries
result_dict_train["Gaussian NB Train Score"] = np.mean(accuracies)
result_dict_test["Gaussian NB Test Score"] = gnb.score(X_test,y_test)

Step 8 – Compare Accuracy Scores

#Mean cross-validation (train) scores of all models
df_result_train = pd.DataFrame.from_dict(result_dict_train,orient = "index", columns=["Score"])
df_result_train

Let’s display the test accuracy scores of all models:

df_result_test = pd.DataFrame.from_dict(result_dict_test,orient = "index",columns=["Score"])
df_result_test

Let’s visualize the scores, shall we?

import seaborn as sns

#Plot the train (cross-validation) and test scores side by side
fig,ax = plt.subplots(1,2,figsize=(20,5))
sns.barplot(x = df_result_train.index,y = df_result_train.Score,ax = ax[0])
sns.barplot(x = df_result_test.index,y = df_result_test.Score,ax = ax[1])
#Rotate the model names so that they do not overlap
ax[0].tick_params(axis='x', labelrotation=75)
ax[1].tick_params(axis='x', labelrotation=75)
plt.show()

From the above graphs, we can conclude the following (exact scores can vary with the preprocessing choices and library versions used):

The Random Forest classifier has the highest test score.
The Decision Tree classifier has the lowest score among all classifiers.

Once you have trained your model, the next important step is to evaluate and optimize the classifier to verify its applicability. Learn how to perform model evaluation here.
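As a small taste of what evaluation beyond accuracy involves, the sketch below counts the four confusion-matrix outcomes for a binary classifier and derives precision and recall from them. The true and predicted labels are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(tp, tn, fp, fn)               # 3 3 1 1
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```

For a medical task such as heart disease prediction, recall is often watched closely, since a false negative means a missed patient.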

Endnotes

A clear understanding of how to choose the right classification model is instrumental in solving supervised Machine Learning problems. Artificial Intelligence and Machine Learning form a rapidly growing domain that has hugely impacted big businesses worldwide. Interested in being a part of it? Explore related articles here.

About the Author

Shiksha Online

This is a collection of insightful articles from domain experts in the fields of Cloud Computing, DevOps, AWS, Data Science, Machine Learning, AI, and Natural Language Processing. The range of topics caters to upski...
