How to Improve the Accuracy of Regression Model?

How to Improve the Accuracy of Regression Model?

11 mins read8.2K Views Comment
Updated on Nov 28, 2022 12:51 IST

In this article, we will discuss how to improve accuracy of regression model with the help of simple examples to understand the concepts.

2022_05_feature-images_REGRESSION-MODEL.jpg

Introduction

Companies use various machine learning models as a basis to make major business decisions, and therefore the accuracy of a model is important. More effective, robust, and reliable the ML model is, the more profitable it is for businesses. It can produce more accurate predictions and insights, which deliver more business value and less errors. In this article, let’s go through various ways to improve the accuracy of a regression model.

A few effective ways to improve the accuracy of your regression models are:

  1. Regularization
  2. Handling Missing & Null Values
    1. Deleting Missing Values
    2. Imputing Missing Values
    3. Imputing by Model-based Prediction
  3. Categorical Feature Encoding
    1. Label Encoding
    2. One-Hot Encoding
  4. Feature Engineering
  5. Conclusion

Regularization

One of the most common problems that you will come across when training your model is overfitting. A regression model is said to be overfitting when your algorithm is too complex, in a way that it fits the training data exceptionally well but fails to generalize the data that contains a lot of irrelevant data points. As in, the algorithm fits a limited set of data points very closely, thereby learning from noise in the data. This impacts the accuracy of the model. One of the ways of avoiding overfitting is using regularization.

2022_05_regularization.jpg

In general, regularization simply means making something regular or acceptable. In machine learning regularization is a form of regression that shrinks (regularizes) coefficient estimates to zero. You can simply think of it as a damper that will repress the extra weights on special features in your algorithm and redistribute them evenly. This way regularization prevents overfitting by discouraging the learning of a model of both high complexity and flexibility. Mainly, there are two types of regularization techniques: Lasso Regression, Ridge Regression.

Must Check: What is Machine Learning?

What is the Future of Machine Learning?
What is the Future of Machine Learning?
There are many fields in which we use machine learning.This blog will tell you some futuristic applications of machine learning.
Difference between Regression and Classification Algorithms
Difference between Regression and Classification Algorithms
Classification and regression are the very basic and important topics in machine learning. The article covers the major differences between Regression and Classification algorithms in machine learning.

It is a useful technique to improve the accuracy of your models. A popular Python library for implementing these algorithms is Scikit-Learn. Let’s try to learn this with the help of an example. First let’s import the libraries that we would require.

import numpy as np
from sklearn.datasets import make_regression
import pandas as pd
import matplotlib.pyplot as plt

Usually samples are made from a smaller portion of the population. In some cases this can be heavily skewed leading to overfitting of a linear model.

X_sample,y_sample = make_regression(n_samples = 100,n_features = 1,noise=0.5)
 
#defines the sample taken from the population which is screwed
plt.scatter(X_sample,y_sample)
plt.show()
2022_05_regularization_plot.show_.jpg
X_population, y_population = make_regression(n_samples=500,n_features=1,noise=40)
 
#this represents the entire population of true dataset
plt.scatter(X_population,y_population)
plt.show()
2022_05_regularization_represent-the-entire-population.jpg

Now, let’s try to compare the accuracy of the model before and after regularization.

from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_sample,y_sample)
 
#overfitting
training_accuracy = lr.score(X_sample,y_sample)
print("training accuracy is: ",training_accuracy)

Output

2022_05_image-18.jpg
#poor generalization
accuracy = lr.score(X_population,y_population)
print("accuracy on population is: ",accuracy)

Output

2022_05_image-19.jpg

Let’s try to check the accuracy using the ridge and lasso regression techniques.

#now using ridge instead of normal linear regression 
from sklearn import linear_model
ridge = linear_model.Ridge(alpha=10)
ridge.fit(X_sample,y_sample)
2022_05_image-20.jpg
training_ridge = ridge.score(X_sample,y_sample)
print("training accuracy with ridge ",training_ridge )

Output

2022_05_image-21.jpg
accuracy_lasso = ridge.score(X_population,y_population)
print("accuracy with lasso: ",accuracy_lasso)

Output

2022_05_image-22.jpg

As you can notice, regularization has increased the accuracy of generalization as it adds a penalty to the error. That said, it is important to correctly figure out which regularization technique works better for your model to get the accurate results.

Bagging Technique in Ensemble Learning
Bagging Technique in Ensemble Learning
In this article, we will discuss the concept how to solve machine learning problems using the ensemble learning bagging.
Ridge Regression vs Lasso Regression
Ridge Regression vs Lasso Regression
Regularization techniques (like Lasso and ridge regression) are used to enhance the performance and interpretability of the machine-learning model, especially in situations where the traditional linear regression method fails. Lasso...read more
Difference between Linear and Nonlinear Regression – Shiksha Online
Difference between Linear and Nonlinear Regression – Shiksha Online
The article covers the difference between Linear and Nonlinear regression algorithms

Handling Missing and Null Values

Treating missing and null values in datasets is a crucial issue especially when you are dealing with complete available data. Properly handling missing data improves the inferences and the predictions made. A particular data element can be missing due to various reasons such as failure to load the information, corrupt data, incomplete load operations, etc.

Let us explore how to deal with missing & null values to improve the accuracy of the model.

To demonstrate this, we are going to use House_Price.csv, one of the popular Kaggle dataset used to predict sales prices and many other details. Let’s load the dataset as Pandas DataFrame.

The first step is to check and mark missing values as shown below.

# load the dataset and check for missing values
import pandas as pd
# load the dataset
dataset = pd.read_csv('house_price.csv')
#check for missing values from each column
dataset.isnull().sum().sort_values(ascending=False)
2022_05_missing-value.jpg

As you can see the dataset has plenty of columns with a huge amount of null values. Therefore to get the clear picture let us try to get the top 7 columns with higher percentage of null values.

totals = dataset.isnull().sum().sort_values(ascending=False)
percentage = totals/len(dataset)*100
pd.concat([totals,percentage], axis=1, keys=['Total','Percent']).head(7)
2022_05_missing-value-percentage.jpg

We clearly have 4 columns with Null values higher than 80%. So, how do you handle this missing data?

Deleting the Missing Values

One of the simple strategies for handling missing data is to delete those records from a dataset that contain missing values. While this method is not really recommended, you can use it when there are lots of missing values in a particular column and when you are dealing with a large amount of data.

Let us go ahead and delete columns with null values greater than 80%.

#Method 1: Removing columns with missing values > 80%
dataset = dataset[dataset.columns[dataset.isnull().mean() < 0.80]]
dataset.isnull().sum().sort_values(ascending=False)
2022_05_removing-column-with-missing-value-greater-than-80-percent.jpg

When you completely get rid of data with missing values, the model is robust and accurate when the columns that you deleted do not have high weightage. However, if you remove those columns that have high correlation with output, then this method leads to less accuracy and loss of useful information.

Impute Missing Values

You can apply this strategy when you are dealing with features that have numeric data like age, salary, etc. There are various options that you can consider to replace the missing values.

  • A mean, median, or mode for the column
  • A random value from other randomly selected record
  • Value from previous record or next record
  • Most frequent value of that column

And many others. From the above dataset, we have a column called MSSubClass of integer type. In this column, let us replace the missing values with mean.

# Replacing missing values by the mean value 
dataset['MSSubClass'] = dataset['MSSubClass'].fillna(dataset['MSSubClass'].mean())
dataset['MSSubClass'].isnull().sum()

Imputing Missing Values for Categorical Features

In case of a non-numeric column, you cannot use mean or median. In such a case, you can replace the categorical features using the most frequent value or just treat it as a separate category. For example, let us replace the missing values of column FireplaceQu (FirePlace Quality) with the most frequent value.

# Before replacing - missing values count in FireplaceQu
dataset['FireplaceQu'].isnull().sum()
2022_05_image-23.jpg
# Imputing missing values for categorical variable
dataset['FireplaceQu'] = dataset['FireplaceQu'].fillna(dataset['FireplaceQu'].mode()[0])
dataset['FireplaceQu'].isnull().sum()
2022_05_image-24.jpg

Impute Missing Values with Model-based Prediction

In this model-based strategy, we use features which do not have null or missing values and train it to predict the missing values. This method gives us an advantage over replacing with median, mode, etc. This way of imputing missing data is more flexible.

What you are basically doing here is that, you are training your ML model to predict the missing values for a feature based on other features. The records (rows) without missing values are used as a training set and model also used values in other columns. You can use a classification or regression model to train based on the datatype of the feature. Post training, the model is then applied to samples with missing values to predict its most likely value.

Regression Analysis in Machine Learning
Regression Analysis in Machine Learning
In this article, we will discuss Regression analysis in Machine Learning which is one of the  important concepts used in building machine learning models.
Outliers: Definition and Examples in Python
Outliers: Definition and Examples in Python
In this article we will learn about one of the most important techniques used in building machine learning projects, handling outliers in the dataset. This article is all about outliers:...read more
Difference Between Linear Regression and Logistic Regression
Difference Between Linear Regression and Logistic Regression
Linear Regression and Logistic Regression are both Supervised Machine Learning models that make use of labelled datasets to make predictions. However, there’s a fundamental difference in their usage – Linear...read more

Categorical Feature Encoding

Typically, a dataset has a combination of numerical and categorical values. However, most machine learning models work with numerical values only (or some other different values that are understandable to the ML model) and cannot operate on categorical data directly. This means that the categorical data has to be transformed to numerical form to be utilized by the algorithm. Categorical encoding is a data preparation technique that makes certain types of data readable by an algorithm and much easier to work with.

The three common categorical types of data are:

  • Binary: a set of two values. Example: Pass or Fail
  • Ordinal: a set of values in either ascending or descending order. Example: Rating from 1 to 10
  • Nominal: a set of values with no particular order. Example: A list of cities

Based on the type of categorical type that you are handling, you can use Label Encoding or One-hot Encoding.

Label Encoding

Label encoding or integer encoding can be used when you are handling ordinal variables.

For example, “high” is 1, “medium” is 2, and “low” is 3.

Often the integer values start at 0 and by default the integers are assigned to labels in the order that they appear in the data. The integer values here have a natural ordered relationship between each other and hence the ML algorithm can easily understand and use this relationship. The integer encoding or ordinal encoding technique is easily reversible. Let’s learn more with the help of an example.

Our first step is to download the dataset and import required Python packages as shown below.

# load the dataset
import pandas as pd
dataset = pd.read_csv('healthcare-dataset-stroke-data.csv')
# print shape
print('data shape :', dataset.shape)
2022_05_image-25.jpg

Let’s load the dataset and observe its shape. The above output indicates that we have 12 variables and 5110 rows.

Next, let us inspect a set of sample data using head().

#inspect data
dataset.head(5)
2022_05_inspect-dataset.jpg

We can try to understand the dataset more by gathering some useful information using the info() method from pandas.

#Get information about the dataset
print(dataset.info())
2022_05_dataset_info.jpg

The output above shows a list of features, size excluding null values, and data type for each variable. In the above example, we have around 5 features of object data type (most of them being categorical).

Let us now implement label encoding.

# Import the label encoder 
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
# Transform column 'gender' (object type) to the numerical data type
dataset['gender']= label_encoder.fit_transform(dataset['gender']) 
dataset.head(5)
2022_05_implement-label-encoding.jpg

As you can see, the gender feature is now encoded to numerical data types, where 1 represents male and 0 represents female. Let us now try to encode. One problem is that label encoding might not be clear enough when you use it on variables where no ordinal relationship exists.

Understanding Ridge Regression Using Python
Understanding Ridge Regression Using Python
Ridge regression is a regularization technique that penalizes the size of the regression coefficient based on the l2 norm. It is mainly used to eliminate multicollinearity in the model. This...read more
Top 10 Machine Learning Tools Used By Data Scientists
Top 10 Machine Learning Tools Used By Data Scientists
In this article we will discuss the Top 10 Machine Learning Tools used by Data Scientists.
Basics of Machine Learning – Definition and Concepts
Basics of Machine Learning – Definition and Concepts
This post will help you understand the emerging technology of today’s time- Machine Learning. Here we have covered basic concept of Machine Learning.

One-Hot Encoding

For categorical variables with no natural ordering or relationship between them, forcing an ordinal relationship via label encoding may result in poor performance or unexpected results. In such scenarios, one-hot encoding is useful. One-hot encoding converts each categorical value into a new categorical column and matches with binary values 0 & 1.

This technique is suitable when dealing with nominal variables.

One-Hot Encoding with Sklearn

Let us now learn how to implement one-hot encoding in Python with a simple example.

The first step is to define the data.

import pandas as pd
# create and define the data
# create DataFrame
dataf = pd.DataFrame({'Fruits': ['Apple', 'Mango', 'Banana', 'Apple', 'Orange', 'Orange'],
                   'Price': [85, 100, 50, 70, 60, 55]})
print("The original data")
print(dataf)
2022_05_dataframe_onehot-encoding.jpg

Next step is to perform one-hot encoding. To implement the one-hot encoding, you need to import the OneHotEncoder() function from sklearn library.

# importing OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
# create an instance if one-hot encoder
encoder = OneHotEncoder(handle_unknown='ignore')
# encode the Fruits column
encoder_dataf = pd.DataFrame(encoder.fit_transform(dataf[['Fruits']]).toarray())
# merge encoded columns with the original column
final_dataf = dataf.join(encoder_dataf)
# rename the new columns if required
final_dataf.columns = ['Fruits', 'Price', 'Apple', 'Mango', 'Banana', 'Orange']
print("The final dataframe")
print(final_dataf)
2022_05_perform_one-hot-encoder.jpg

One-Hot Encoding Using Dummies Values Approach

Pandas, a popular Python library provides a function called get_dummies, using which you can enable one-hot encoding. This approach is more flexible as it allows you to encode as many category columns as required. You can also choose how to label the columns using a prefix.

import pandas as pd
import numpy as np
# create and define the data
# create DataFrame
dataf = pd.DataFrame({'Fruits': ['Apple', 'Mango', 'Banana', 'Apple', 'Orange', 'Orange'],
                   'Price': [85, 100, 50, 70, 60, 55]})
print("The original data")
print(dataf)
print("*" * 30)
# generate binary values using get_dummies function
encoder_dataf = pd.get_dummies(dataf, columns=["Fruits"])
print("The transformed data with get.dummies")
print(encoder_dataf)
2022_05_pd_get_dummies.jpg

Well, in the end it is important to know which encoding approach to use by being aware of its pros and cons.

Feature Engineering

Feature engineering is the process of using existing raw data & feature knowledge to formulate new useful features that better represent the underlying problem to predictive ML models. This inturn increases the accuracy of your model. As you might already know the features in your data will directly influence the model you use and the output that you achieve. Therefore, by creating good features you can augment the value of your existing data and thus improve the accuracy of your model.

Process that are involved in feature engineering are:

  1. Feature Importance: Estimate usefulness of features
  2. Feature Construction: Manually creating new features from raw data
  3. Feature Extraction: Automated formulation of new features from raw data
  4. Feature Transformations: Adjusting the features to improve the accuracy and performance of the model
  5. Feature Selection: Select a subset of features that are important to model

Obviously, when you are working with real world data, it is too rare to get it accurate in the first iteration. Feature engineering is an iterative process, which might look as follows:

2022_05_image-26.jpg

In the end, you need a well defined problem and domain knowledge to know when to stop this process and move on to trying other model configurations.

Conclusion

In this article we have discussed a few important ways to improve the accuracy of the regression model with the help of simple examples to understand the concept. Apart from what is discussed here, there are plenty other ways to improve the accuracy of your machine learning models. However, one important thing is that improving the accuracy of a ML model is a skill that you can master only with a lot of experimenting and practice.

Movie Recommendation System using Machine Learning
Movie Recommendation System using Machine Learning
Every time you open up YouTube just to figure out the solution to your problem or just get the latest news, you end up spending more time. A similar thing...read more
Must Known Data Science Trends and Technologies
Must Known Data Science Trends and Technologies
Before discussing Must Known Data Science Trends and Technologies, let’s discuss the elephant in the room:
How tech giants like Google, Facebook, Instagram are using your data?
How tech giants like Google, Facebook, Instagram are using your data?
If you have landed upon this blog, then depending on the source, it was all because of the robust machine learning model used by Google, or LinkedIn to show you...read more
About the Author

This is a collection of insightful articles from domain experts in the fields of Cloud Computing, DevOps, AWS, Data Science, Machine Learning, AI, and Natural Language Processing. The range of topics caters to upski... Read Full Bio