Evaluation Metrics in Machine Learning

Vikram Singh
Assistant Manager - Content
Updated on Oct 12, 2023 10:32 IST

Evaluation metrics are the compass guiding machine learning models towards accuracy and efficiency. Dive into this article to unravel the significance of these metrics, from the classic AUC-ROC to the nuanced F1-Score. Discover how the right metric can transform a model’s performance and why one size doesn’t fit all.

Machine learning models are used to analyze and interpret data, but how do we measure how good or bad these models are? The answer is evaluation metrics. These metrics provide a clear benchmark for assessing a model’s performance and help confirm that the algorithm works and is well suited to the task.
In this article, we will discuss evaluation metrics, their importance, and how to choose the best ones. Later in the article, we will also discuss the different types of evaluation metrics.

So, without further delay, let’s get started.

What are Evaluation Metrics?

Evaluation metrics are quantitative measures used to assess the performance of a statistical or machine learning model. These metrics provide insights into how well the model is performing and help in comparing different models or algorithms. When evaluating a machine learning model, it is crucial to assess its predictive ability, generalization capability, and overall quality.

Different evaluation metrics are available, depending on the specific machine learning task. Common ones include precision, recall, F1-score, Mean Absolute Error, Mean Squared Error, R-squared, and adjusted R-squared.

Why are They Important?

Evaluation metrics are important as they help:

  • To assess the performance of a model: Evaluation metrics provide a quantitative measure of how well a model performs on a given task. 
    • This is essential for understanding a model’s strengths and weaknesses and deciding whether to deploy it to production.
  • To compare different models: Evaluation metrics can be used to compare machine learning models trained on the same dataset to solve the same problem. 
    • For example, if two models have similar accuracy scores but one has a higher precision score, that model may be preferred when false positives are costly.
  • To tune hyperparameters: Evaluation metrics are often used to tune the hyperparameters of a machine learning model. Hyperparameters control a model’s training process, such as the learning rate and the number of epochs. 
    • By adjusting the hyperparameters, data scientists can improve the performance of their models.
  • To monitor the performance of a model over time: Evaluation metrics can be used to monitor the performance of a machine learning model over time. This is important because a model’s performance can degrade over time due to changes in the data distribution and concept drift. 
    • By monitoring the performance of a model, data scientists can identify any problems early and take corrective action.
  • To identify overfitting: Overfitting occurs when a model learns the training data too well and cannot generalize to new data. 
    • Evaluation metrics can identify overfitting by comparing the model’s performance on the training data to its performance on a held-out test set.
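
The train-versus-test comparison from the last point can be sketched in a few lines of Python. This is a toy illustration only: the data, the two deliberately noisy labels, and the helper names `predict_1nn` and `accuracy` are all invented for this sketch. A 1-nearest-neighbour classifier is used because it memorizes the training set, which makes the gap easy to see.

```python
# Toy 1-D dataset (hypothetical): the true rule is "label 1 if x > 5",
# but two training labels are deliberately noisy (x=3 and x=8).
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 0), (9, 1)]
test = [(1.5, 0), (2.6, 0), (4.5, 0), (6.5, 1), (7.6, 1), (9.5, 1)]


def predict_1nn(x, memory):
    """Return the label of the stored point closest to x."""
    nearest = min(memory, key=lambda point: abs(point[0] - x))
    return nearest[1]


def accuracy(data, memory):
    """Fraction of points in `data` whose label is predicted correctly."""
    correct = sum(1 for x, y in data if predict_1nn(x, memory) == y)
    return correct / len(data)


train_acc = accuracy(train, train)  # the model has memorized these points
test_acc = accuracy(test, train)    # unseen points expose the noisy labels

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
print(f"gap:            {train_acc - test_acc:.2f}")
```

A perfect training score next to a clearly lower test score is the pattern to look for. How large a gap is acceptable depends on the task.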

How to Choose the Best Evaluation Metrics?

To evaluate machine learning models, you can follow these steps:

  • Choose the right evaluation metric: The choice of the evaluation metric will depend on the specific machine learning task and the desired outcome.
    • For a classification model, you can choose the accuracy, precision, recall, and F1 score as evaluation metrics.
    • For a Regression model, you may use mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE) as evaluation metrics.
  • Split data into training and test sets: The training set is used to train the model, and the test set is used to evaluate the model’s performance on unseen data. This ensures that the reported score reflects how well the model generalizes rather than how well it fits the training data.
    • To get a more reliable estimate that does not depend on a single split, use a cross-validation technique.
  • Train and evaluate multiple models: Try different machine learning algorithms and hyperparameters to see which models perform best on held-out data (a validation set or cross-validation folds), not on the training data.
  • Select the best model: Once you have evaluated all of your models, you can select the best model based on the evaluation metrics. 
    • For example, if the model is going to be used to make high-stakes decisions, then it is important to select a model with high accuracy and precision.
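
The splitting step above can be sketched with a small k-fold index generator. This is a minimal sketch under stated assumptions, not a library API: the function name `k_fold_indices` is hypothetical, and the folds are contiguous and unshuffled to keep the code short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Folds are contiguous and unshuffled to keep the sketch short;
    in practice the data is usually shuffled first.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample lands in the test fold exactly once, so averaging the chosen metric over the folds uses every data point for evaluation while never scoring a model on data it was trained on.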

By now, we have a clear understanding of what evaluation metrics are, why they are important, and how to choose the best ones. Now, it’s time to explore the different types of evaluation metrics available.

Types of Evaluation Metrics

At a broad level, evaluation metrics are classified into:

  • Regression Metrics
  • Classification Metrics

Regression Metrics

Mean Absolute Error

The Mean Absolute Error (or MAE) is the average of the absolute differences between predicted and actual values. Calculating MAE gives us an idea of how far off the model’s predictions are, on average.

[Figure: Mean Absolute Error, salary vs. years of experience]

The above graph shows the salary of an employee vs. experience in years. The actual values lie on the line, and the predicted values are shown with an X. The absolute distance between each pair is the absolute error for that point, and the average of these distances is the mean absolute error.
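
As a minimal sketch of the calculation, the function below averages the absolute differences. The salary figures are hypothetical and are not taken from the graph.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all data points."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical salaries (in thousands) and model predictions.
actual = [30.0, 50.0, 25.0, 70.0]
predicted = [25.0, 50.0, 40.0, 80.0]

mae = mean_absolute_error(actual, predicted)
print(mae)  # (5 + 0 + 15 + 10) / 4 = 7.5
```

The result is in the same units as the target, so here the predictions are off by 7.5 (thousand) on average.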

Mean Square Error

The Mean Squared Error (or MSE) is similar to the mean absolute error. Both summarize the average difference between predicted and actual values, but MSE squares each difference before averaging, so large errors are penalized more heavily than small ones.

Note: The lower the value, the more accurate the model’s predictions are.

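$$\mathrm{MAE} = \frac{1}{N} \sum_{j=1}^{N} \left| Y_j - \hat{Y}_j \right|$$

$$\mathrm{MSE} = \frac{1}{N} \sum_{j=1}^{N} \left( Y_j - \hat{Y}_j \right)^2$$

Where:

$Y_j$: actual value

$\hat{Y}_j$: predicted value from the regression model

$N$: number of data points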
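
A short sketch of MSE, using hypothetical salary figures, shows how squaring lets the largest error dominate the result.

```python
def mean_squared_error(actual, predicted):
    """Average of (actual - predicted)^2 over all data points."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical salaries (in thousands) and model predictions.
actual = [30.0, 50.0, 25.0, 70.0]
predicted = [25.0, 50.0, 40.0, 80.0]

mse = mean_squared_error(actual, predicted)
print(mse)  # (25 + 0 + 225 + 100) / 4 = 87.5
```

The single error of 15 contributes 225 of the 350 total squared error, which is why MSE is more sensitive to outliers than MAE.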

Root Mean Squared Error (RMSE)

The Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors. It measures the difference between observed (known) values and the values predicted by a model, and it is expressed in the same units as the target variable.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( S_i - O_i \right)^2}$$

Where

$O_i$ = observations 

$S_i$ = predicted values of a variable 

$n$ = number of observations 
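
In code, RMSE is the square root of the mean squared error. The sketch below uses hypothetical values; the argument names `observed` and `predicted` are chosen to mirror the observations and predicted values defined above.

```python
import math


def root_mean_squared_error(observed, predicted):
    """Square root of the mean of the squared errors."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(s - o) ** 2 for o, s in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))


# Hypothetical salaries (in thousands) and model predictions.
observed = [30.0, 50.0, 25.0, 70.0]
predicted = [25.0, 50.0, 40.0, 80.0]

rmse = root_mean_squared_error(observed, predicted)
print(f"{rmse:.3f}")  # sqrt(87.5), about 9.354
```

Taking the square root brings the error back to the units of the target, which makes RMSE easier to interpret than MSE.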

R-Squared

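R-squared compares the residual sum of squares (SSres) with the total sum of squares (SStot). It is used to check the goodness of fit of a regression line. The closer the value of R-squared is to 1, the better the model fit.

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}, \quad SS_{res} = \sum_{j=1}^{N} \left( Y_j - \hat{Y}_j \right)^2, \quad SS_{tot} = \sum_{j=1}^{N} \left( Y_j - \bar{Y} \right)^2$$

where $\bar{Y}$ is the mean of the actual values.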
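
The sketch below computes R-squared as 1 minus the ratio of the residual sum of squares to the total sum of squares. The data is hypothetical, and raising an error for constant targets is a choice made for this sketch.

```python
def r_squared(actual, predicted):
    """1 - SS_res / SS_tot for paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        # All actual values are identical, so the ratio is undefined.
        raise ValueError("R-squared is undefined for constant actual values")
    return 1 - ss_res / ss_tot


# Hypothetical salaries (in thousands) and model predictions.
actual = [30.0, 50.0, 25.0, 70.0]
predicted = [25.0, 50.0, 40.0, 80.0]

r2 = r_squared(actual, predicted)
print(f"{r2:.3f}")  # 1 - 350 / 1268.75, about 0.724
```

A model that always predicts the mean of the actual values scores exactly 0, and a model that is worse than that baseline scores below 0.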

Classification Metrics

For a classification model, a confusion matrix is commonly used to check performance on a given set of test data.

Confusion Matrix

A confusion matrix is a summary of correct and incorrect predictions and helps visualize the outcomes.

A confusion matrix looks something like this:

              Actual 0              Actual 1
Predicted 0   True Negative (TN)    False Negative (FN)
Predicted 1   False Positive (FP)   True Positive (TP)

where,

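True Positive (TP): The model predicted positive, and the actual class is positive.
True Negative (TN): The model predicted negative, and the actual class is negative.
False Positive (FP): The model predicted positive, but the actual class is negative.
False Negative (FN): The model predicted negative, but the actual class is positive.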
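
The four counts can be tallied directly from paired labels. The sketch below assumes binary labels with 1 as the positive class; the function name `confusion_counts` and the sample labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


# Hypothetical labels: five actual positives followed by five actual negatives.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=4 FP=1 FN=2
```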

Now, here are some evaluation metrics that are based on the confusion matrix.

Accuracy

Accuracy is one of the most commonly used evaluation metrics in classification problems. It measures the proportion of correct predictions out of the total predictions made. It is defined as:

Accuracy = Number of Correct Predictions/Total Number of Predictions

Mathematically, it is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
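
A minimal sketch of accuracy computed from confusion-matrix counts. The counts are hypothetical, and the second call illustrates why accuracy alone can mislead on imbalanced data.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical counts from a 10-sample test set.
balanced = accuracy(tp=3, tn=4, fp=1, fn=2)
print(balanced)  # 7 / 10 = 0.7

# A model that always predicts "negative" on 95 negatives and 5 positives.
always_negative = accuracy(tp=0, tn=95, fp=0, fn=5)
print(always_negative)  # 95 / 100 = 0.95
```

The second model never finds a single positive instance yet still scores 0.95, which is why precision and recall are usually reported alongside accuracy.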

Precision

Precision evaluates the accuracy of the positive predictions made by the classifier. In simple terms, precision answers the question: “Of all the instances that the model predicted as positive, how many were actually positive?”

Mathematically it is defined as:

Precision = True Positive (TP) / (True Positive (TP) + False Positive (FP))
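
A minimal sketch of precision from counts. Returning 0.0 when the model makes no positive predictions is a convention assumed for this sketch, since the ratio is undefined in that case.

```python
def precision(tp, fp):
    """True positives divided by all positive predictions."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        # No positive predictions were made; 0.0 is an assumed convention.
        return 0.0
    return tp / predicted_positive


# Hypothetical counts: 4 positive predictions, 3 of them correct.
p = precision(tp=3, fp=1)
print(p)  # 3 / 4 = 0.75
```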

Recall

Recall is also known as sensitivity or true positive rate. It is the ratio of the number of true positive predictions to the total number of actual positive instances in the dataset. Recall measures the ability of a model to identify all relevant instances.

Mathematically, recall is defined as:

Recall = True Positive (TP) / (True Positive (TP) + False Negative (FN))
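
A minimal sketch of recall from counts. As with precision, returning 0.0 when there are no actual positives is a convention assumed here, not a universal rule.

```python
def recall(tp, fn):
    """True positives divided by all actual positives."""
    actual_positive = tp + fn
    if actual_positive == 0:
        # The data contains no positives; 0.0 is an assumed convention.
        return 0.0
    return tp / actual_positive


# Hypothetical counts: 5 actual positives, 3 of them found by the model.
r = recall(tp=3, fn=2)
print(r)  # 3 / 5 = 0.6
```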

F1-Score

F1 score is the harmonic mean of precision and recall. It provides a single metric that balances the trade-off between precision and recall. It is especially useful when the class distribution is imbalanced.

Mathematically, it is given by:

F1 Score = 2 x [(Precision x Recall)/ (Precision + Recall)]

The F1-score ranges between 0 and 1.
1: indicates perfect precision and recall
0: indicates that the model has no true positives, so precision or recall is zero
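
The sketch below computes F1 from precision and recall and compares it with the plain average. The input values are hypothetical, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Both are zero; 0.0 is an assumed convention for the undefined ratio.
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


balanced = f1_score(0.75, 0.6)
lopsided = f1_score(1.0, 0.1)

print(f"{balanced:.3f}")  # 0.667
print(f"{lopsided:.3f}")  # 0.182, while the arithmetic mean would be 0.55
```

Because it is a harmonic mean, F1 stays low unless both precision and recall are reasonably high, so a model cannot score well by maximizing only one of them.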

AUC-ROC Curve

AUC-ROC stands for the Area Under the Receiver Operating Characteristic Curve. The ROC curve is a graphical representation of a classification model’s performance at different thresholds. It is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR). AUC is the area under the ROC curve. It provides a single scalar value that summarizes the overall performance of a classifier across all possible threshold values: an AUC of 1 indicates a perfect ranking of positives above negatives, while 0.5 corresponds to random guessing.

The formulas for TPR and FPR:

True Positive Rate (TPR/Sensitivity/Recall) = True Positive / (True Positive + False Negative)
False Positive Rate (FPR) = False Positive / (False Positive + True Negative)

A typical AUC-ROC curve looks like:

[Figure: a typical AUC-ROC curve, with TPR plotted against FPR]
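
To make the construction concrete, the sketch below builds ROC points by sweeping a threshold over the predicted scores and then approximates the area with the trapezoidal rule. The function names `roc_points` and `auc`, the labels, and the scores are all hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct score used as a threshold."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and predicted scores for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
area = auc(points)
print(points)
print(area)  # 0.75
```

The value 0.75 matches the ranking interpretation of AUC: of the four positive-negative pairs in this sample, the positive instance receives the higher score in three.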

FAQs

What are Evaluation Metrics?

Evaluation metrics are quantitative measures used to assess the performance of a statistical or machine learning model. These metrics provide insights into how well the model is performing and help in comparing different models or algorithms. When evaluating a machine learning model, it is crucial to assess its predictive ability, generalization capability, and overall quality.

What is Mean Absolute Error?

The Mean Absolute Error (or MAE) is the average of the absolute differences between predicted and actual values. Calculating MAE gives us an idea of how far off the model’s predictions are, on average.

What is Mean Squared Error?

The Mean Squared Error (or MSE) is similar to the Mean Absolute Error. Both summarize the average difference between predicted and actual values, but MSE squares each difference before averaging, so large errors are penalized more heavily than small ones.

What is Root Mean Squared Error?

The Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors. It measures the difference between observed (known) values and the values predicted by a model, and it is expressed in the same units as the target variable.

What is R squared?

R-squared compares the residual sum of squares (SSres) with the total sum of squares (SStot). It is used to check the goodness of fit of a regression line. The closer the value of R-squared is to 1, the better the model fit.

What is Accuracy?

Accuracy is one of the most commonly used evaluation metrics in classification problems. It measures the proportion of correct predictions out of the total predictions made. It is defined as: Accuracy = Number of Correct Predictions / Total Number of Predictions

What is Precision?

Precision evaluates the accuracy of the positive predictions made by the classifier. In simple terms, precision answers the question: “Of all the instances that the model predicted as positive, how many were actually positive?”

What is Recall?

Recall is also known as sensitivity or true positive rate. It is the ratio of the number of true positive predictions to the total number of actual positive instances in the dataset. Recall measures the ability of a model to identify all relevant instances.

What is F1 Score?

F1 score is the harmonic mean of precision and recall. It provides a single metric that balances the trade-off between precision and recall. It is especially useful when the class distribution is imbalanced.

About the Author
Vikram Singh
Assistant Manager - Content

Vikram has a Postgraduate degree in Applied Mathematics, with a keen interest in Data Science and Machine Learning. He has experience of 2+ years in content creation in Mathematics, Statistics, Data Science, and Mac...