Assumptions of Linear Regression

Vikram Singh
Assistant Manager - Content
Updated on Nov 21, 2022 11:30 IST

The linear regression algorithm rests on certain assumptions that must be satisfied before it is applied to a dataset; otherwise, the results may be unreliable.


Linear regression is a supervised machine-learning algorithm that models a linear relationship between a continuous dependent variable and one or more independent variables. The algorithm finds the best-fit line, that is, the line that minimizes the (squared) difference between the actual and predicted (estimated) values across all the data points. The line does not need to pass through every point.

Equation of Linear Regression:

Y = mX + C 

where

Y: dependent variable

X: independent variable

C: y-intercept

m: slope or regression coefficient
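
To make the equation concrete, here is a minimal sketch of how m and C can be estimated by least squares for a single independent variable. The function name fit_line and the data are hypothetical, chosen only for illustration; in practice a library such as NumPy or scikit-learn would normally be used.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (m, c) that minimize the sum of squared residuals for y = m*x + c."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = co-variation of x and y divided by the variation of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point (x_bar, y_bar)
    c = y_bar - m * x_bar
    return m, c

# Hypothetical data lying exactly on Y = 2X + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, c = fit_line(xs, ys)
print(m, c)  # 2.0 1.0
```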

The linear regression algorithm has a set of assumptions that must be satisfied for the model to produce a reliable best-fit line (regression line) for a given dataset.


Linear regression is a parametric algorithm: it learns a fixed set of parameters from the dataset, and this parametric form imposes some restrictions. If the data does not satisfy the assumptions, the algorithm will fail to find the best-fit line.

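The linear regression algorithm has the following assumptions:

  • Linearity
  • No hidden or missing values
  • No multicollinearity
  • No autocorrelation in the residuals
  • Normality of the residuals
  • Homoscedasticity

Now, let’s discuss them one by one: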

1. Linearity

As the name suggests, in linear regression, the relationship between the dependent and independent variables must be linear.
The general linear equation can be given as

Y = C0 + C1X1 + C2X2 + C3X3 + … + CnXn

where,
Ci: constant (coefficient)
Xi: independent variable
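
As a small illustration of the general equation, the sketch below computes Y from a list of coefficients and a list of feature values. The function name predict and the numbers are assumptions made for this example.

```python
def predict(coefficients, features):
    """Evaluate Y = C0 + C1*X1 + ... + Cn*Xn.

    coefficients: [C0, C1, ..., Cn] (C0 is the intercept)
    features:     [X1, ..., Xn]
    """
    intercept, *slopes = coefficients
    if len(slopes) != len(features):
        raise ValueError("need exactly one coefficient per feature plus an intercept")
    return intercept + sum(c * x for c, x in zip(slopes, features))

# Hypothetical model: Y = 1 + 2*X1 - 0.5*X2
y = predict([1.0, 2.0, -0.5], [3.0, 4.0])
print(y)  # 5.0
```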

How to check whether a given function f is linear. It must satisfy the following for all a, b, x, and y:

f(ax + by) = af(x) + bf(y)
where,
a, b: constants
x, y: independent variables
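
The property above can be spot-checked numerically. The sketch below is only a rough check under stated assumptions: the helper name is_linear and the trial values are hypothetical, and passing a handful of trials cannot prove linearity, although a single failure disproves it. Note that a function with a non-zero intercept, such as 3x + 2, fails this strict test even though its graph is a straight line.

```python
def is_linear(f, trials=((1.0, 2.0, 3.0, 4.0), (-2.0, 0.5, 1.5, -3.0))):
    """Test f(a*x + b*y) == a*f(x) + b*f(y) for a few (a, b, x, y) tuples."""
    return all(
        abs(f(a * x + b * y) - (a * f(x) + b * f(y))) < 1e-9
        for a, b, x, y in trials
    )

print(is_linear(lambda x: 3 * x))      # True
print(is_linear(lambda x: x ** 2))     # False
print(is_linear(lambda x: 3 * x + 2))  # False (the intercept breaks the property)
```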

If the data violates the linearity assumption, the model will fail to capture the trend, which leads to poor predictions.


2. No Hidden or Missing Values

All the variables used in the linear regression algorithm must be relevant and must not contain missing or hidden values. If any variable contains missing values, the model may produce wrong or insignificant predictions. Missing values can be handled by:

  • Deleting rows with missing values
  • Replacing them with a substitute value (for example, the mean or median)
  • Interpolation
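
The three options above can be sketched on a single column of values. This is a minimal illustration, assuming missing entries are represented by None; the data and the helper name interpolate are hypothetical, and a library such as pandas would normally be used for real datasets.

```python
from statistics import mean

values = [10.0, None, 14.0, 16.0, None, 20.0]  # hypothetical column with gaps

# 1. Delete the entries with missing values
dropped = [v for v in values if v is not None]

# 2. Replace missing values with a substitute (here: the mean of the observed values)
fill_value = mean(dropped)
filled = [fill_value if v is None else v for v in values]

# 3. Linear interpolation between the nearest observed neighbours
def interpolate(seq):
    out = list(seq)
    known = [i for i, v in enumerate(seq) if v is not None]
    for lo, hi in zip(known, known[1:]):
        step = (seq[hi] - seq[lo]) / (hi - lo)
        for i in range(lo + 1, hi):
            out[i] = seq[lo] + step * (i - lo)
    return out  # gaps before the first or after the last observed value stay None

interpolated = interpolate(values)

print(dropped)       # [10.0, 14.0, 16.0, 20.0]
print(filled)        # [10.0, 15.0, 14.0, 16.0, 15.0, 20.0]
print(interpolated)  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```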

3. Multicollinearity

There should not be any strong correlation between the independent variables. Collinearity among the independent variables makes the model harder to estimate and interpret. Because linear regression estimates the effect of each independent variable separately, it becomes difficult to isolate the impact of an individual variable on the dependent variable when the independent variables are correlated with one another.
In simple terms, if multicollinearity exists between the variables, it is difficult to determine which independent variable has a significant impact on the dependent variable (i.e., which feature contributes most to the prediction).

Multicollinearity can be tested using the correlation matrix and tolerance (tolerance = 1 - R², where R² is obtained by regressing one independent variable on the others). A tolerance close to 0 indicates strong multicollinearity.
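
The following sketch illustrates the tolerance check for the simplest case of two independent variables, where the R² of regressing one variable on the other equals the squared Pearson correlation. With more than two variables, R² would have to come from a multiple regression instead. The function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def tolerance(xs, other):
    """Tolerance = 1 - R^2; for a single other predictor, R^2 = r^2."""
    return 1 - pearson_r(xs, other) ** 2

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]  # roughly 2 * x1: strongly collinear
x3 = [5, 1, 4, 2, 6, 3]                # almost unrelated to x1

print(tolerance(x1, x2))  # close to 0: strong multicollinearity
print(tolerance(x1, x3))  # close to 1: little multicollinearity
```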


4. Autocorrelation in the Residuals (Error Terms)

Autocorrelation occurs if the errors depend on each other. The error terms in linear regression should be independent and identically distributed.

  • Autocorrelation mostly occurs in time series models, where the next instant depends on the previous instant.
  • Due to autocorrelation, the estimated standard error tends to underestimate the true standard error.
  • Due to autocorrelation, the confidence interval and prediction interval become narrower than they should be.
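
One commonly used diagnostic for this assumption, not mentioned in the text above, is the Durbin-Watson statistic. It is roughly 2 when consecutive residuals are uncorrelated, moves toward 0 under positive autocorrelation, and toward 4 under negative autocorrelation. The sketch below is a minimal illustration with hypothetical residuals.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals ordered in time
trending = [1.0, 0.8, 0.6, 0.3, -0.2, -0.5, -0.9, -1.1]     # neighbours are similar
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbours flip sign

print(durbin_watson(trending))     # well below 2: positive autocorrelation
print(durbin_watson(alternating))  # 3.5: negative autocorrelation
```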

5. Normality (Gauss Distribution)

The residuals (error terms) must follow a normal distribution. The normality condition can be relaxed when there are many observations. With a small number of observations, however, the standard errors of the model will be unreliable if the residuals are not normal.

A histogram or QQ plot of the residuals can be used to check normality. In the histogram, most of the residual values will lie near zero.
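
A QQ plot pairs the sorted residuals with the quantiles expected from a normal distribution; if the residuals are normal, the points fall close to a straight line. The sketch below builds those pairs without plotting and summarizes how straight the line is with a correlation coefficient. The function names, the plotting positions (i + 0.5) / n, and the data are assumptions made for this illustration, not a formal normality test.

```python
from math import sqrt
from statistics import NormalDist, mean

def qq_points(residuals):
    """Return (theoretical normal quantiles, sorted residuals) of equal length."""
    n = len(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, sorted(residuals)

def qq_correlation(residuals):
    """Correlation between the two axes of the QQ plot (near 1 suggests normality)."""
    tx, ty = qq_points(residuals)
    tx_bar, ty_bar = mean(tx), mean(ty)
    sxy = sum((a - tx_bar) * (b - ty_bar) for a, b in zip(tx, ty))
    sxx = sum((a - tx_bar) ** 2 for a in tx)
    syy = sum((b - ty_bar) ** 2 for b in ty)
    return sxy / sqrt(sxx * syy)

# Hypothetical residuals
bell_shaped = [0.1, -1.0, 0.7, -0.4, 1.6, -0.1, 0.4, -1.6, 1.0, -0.7]
skewed = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.8, 2.0, 9.0]

print(qq_correlation(bell_shaped))  # very close to 1
print(qq_correlation(skewed))       # noticeably lower
```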

A violation of normality can be addressed as follows:

  • If there is an outlier, remove it when that is justified, or use a robust regression method that is less sensitive to outliers than ordinary least squares.
  • If possible, add more observations.


6. Homoscedasticity

Homoscedasticity means the variance of the error term should be constant across the values of the independent variables. It can be checked with a scatter plot of the residuals against the predicted values: the spread of the points should look roughly the same everywhere. A violation (heteroscedasticity) may be due to the presence of outliers or an incorrectly specified model.
If the data does not satisfy the assumption of homoscedasticity, it can often be fixed by applying a logarithmic or square root transformation to the dependent variable.
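
As a rough numeric counterpart to the scatter plot, the sketch below compares the variance of the residuals in the lower and upper halves of the predicted values. A ratio near 1 is consistent with homoscedasticity, while a ratio far from 1 suggests the spread changes. This is a simple heuristic under stated assumptions, not a formal test; the function name variance_ratio and the data are hypothetical.

```python
from statistics import pvariance

def variance_ratio(predicted, residuals):
    """Residual variance in the upper half of predicted values over the lower half."""
    # Order the residuals by their predicted value
    ordered = [r for _, r in sorted(zip(predicted, residuals))]
    half = len(ordered) // 2
    return pvariance(ordered[-half:]) / pvariance(ordered[:half])

predicted = [1, 2, 3, 4, 5, 6, 7, 8]
steady = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]   # same spread everywhere
fanning = [0.1, -0.1, 0.2, -0.2, 1.0, -1.0, 2.0, -2.0]  # spread grows with prediction

print(variance_ratio(predicted, steady))   # 1.0
print(variance_ratio(predicted, fanning))  # about 100
```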

Conclusion

In this article, we have discussed some of the important assumptions of the linear regression algorithm that must be satisfied before it is applied to a dataset; otherwise, the predicted values may be unreliable.
Hope this article is useful to you.


About the Author
Vikram Singh
Assistant Manager - Content

Vikram has a postgraduate degree in Applied Mathematics, with a keen interest in Data Science and Machine Learning. He has 2+ years of experience in content creation in Mathematics, Statistics, Data Science, and Mac...