
# Assumptions of Linear Regression

Vikram Singh
Assistant Manager - Content
Updated on Nov 21, 2022 11:30 IST

Certain assumptions of the linear regression algorithm must be satisfied before fitting any model; otherwise, the results may be insignificant.

Linear regression is a supervised machine-learning algorithm that models a linear relationship between two or more continuous variables (dependent and independent). The algorithm finds the best-fit line, that is, the line that minimizes the difference between the actual values and the predicted (estimated) values.

#### Equation of Linear Regression:

Y = mX + C

where

Y: dependent variable

X: independent variable

C: y-intercept

m: slope or regression coefficient
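
To make the equation concrete, here is a minimal sketch that fits a simple linear regression with scikit-learn; the data and variable names are illustrative only, not taken from any particular dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: X is the independent variable, Y the dependent variable
X = np.array([[1], [2], [3], [4], [5]])   # shape (n_samples, 1)
Y = np.array([2.1, 4.2, 5.9, 8.1, 10.2])

model = LinearRegression()
model.fit(X, Y)

print("slope m:", model.coef_[0])          # regression coefficient
print("y-intercept C:", model.intercept_)
```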

The linear regression algorithm has a set of assumptions that must be satisfied while building any model so that it produces the best-fit line (regression line) for a given dataset.

Linear regression is a parametric algorithm: it learns a fixed set of parameters from the dataset, and this parametric nature imposes some restrictions. If the assumptions are not satisfied, the algorithm will fail to find the best-fit line.

The linear regression algorithm has the following assumptions:

1. Linearity
2. No hidden or missing values
3. No multicollinearity
4. No autocorrelation in the residuals (error terms)
5. Normality (Gaussian distribution) of the residuals
6. Homoscedasticity

Now, let's discuss them one by one:

## 1. Linearity

As the name suggests, in linear regression the relationship between the dependent and independent variables must be linear.
The general linear equation can be given as

Y = C0 + C1X1 + C2X2 + C3X3 + … + CnXn

where,
Ci: constant (regression coefficient)
Xi: independent variable

How to check whether a given function is linear: a function f is linear if it satisfies

f(ax + by) = a f(x) + b f(y)

where,
a, b: constants
x, y: independent variables

If the linearity assumption fails, the linear regression algorithm cannot capture the trend in the data, which leads to false predictions.
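
In practice, linearity is usually checked visually. The sketch below, using hypothetical arrays X and Y, plots the raw data with a fitted line and the residuals of that fit; a clear curve in either plot suggests the linearity assumption is violated.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical one-dimensional data
X = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
Y = np.array([2.0, 4.1, 6.2, 7.9, 10.1, 12.2, 13.8, 16.1])

# Fit a straight line and compute residuals
m, c = np.polyfit(X, Y, deg=1)
residuals = Y - (m * X + c)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X, Y)
ax1.plot(X, m * X + c, color="red")
ax1.set_title("Data with fitted line")

ax2.scatter(X, residuals)
ax2.axhline(0, color="red")
ax2.set_title("Residuals (should show no pattern)")
plt.show()
```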

Also read: How to Calculate R squared in Linear Regression

## 2. No Hidden or Missing Values

All the variables used in the linear regression algorithm must be relevant and must not contain missing or hidden values. If any variable contains missing values, the model will produce false or insignificant predictions. Missing values can be handled in several ways (each sketched in the example below):

• Deleting rows with missing values
• Replacing them with an arbitrary or statistical value (such as the mean or median)
• Interpolation
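
These three strategies map directly onto pandas one-liners, as in this minimal sketch; the column names and data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with a missing value in the 'price' column
df = pd.DataFrame({"area": [500, 750, 1000, 1250],
                   "price": [50.0, np.nan, 100.0, 125.0]})

dropped = df.dropna()                                # delete rows with missing values
filled = df.fillna({"price": df["price"].median()})  # replace with a statistic
interpolated = df.interpolate()                      # linear interpolation

print(interpolated)
```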

## 3. No Multicollinearity

There should not be any correlation between the independent variables. Collinearity between independent variables increases the complexity of the model: since the linear regression algorithm estimates the effect of each independent variable separately, it becomes difficult to isolate the impact of an individual variable on the dependent variable when the variables are correlated.
In simple terms, if multicollinearity exists between variables, it is difficult to determine which independent variable has a significant impact on the dependent variable (i.e., which feature contributes more to the prediction).

Multicollinearity can be tested using the correlation matrix and tolerance (tolerance = 1 − R²); the variance inflation factor (VIF) is its reciprocal, VIF = 1 / (1 − R²).
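
Both checks are easy to run in Python. The sketch below computes the correlation matrix with pandas and the VIF with statsmodels; the feature names and the synthetic data are hypothetical, chosen so that two features are deliberately correlated.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical feature matrix; 'rooms' is deliberately correlated with 'area'
rng = np.random.default_rng(0)
area = rng.uniform(400, 2000, 100)
X = pd.DataFrame({
    "area": area,
    "rooms": area / 300 + rng.normal(0, 0.3, 100),
    "age": rng.uniform(0, 50, 100),
})

print(X.corr())  # pairwise correlations between the independent variables

# VIF per feature (computed with an intercept term added);
# values above roughly 5-10 usually signal multicollinearity
X_const = sm.add_constant(X)
for i, col in enumerate(X_const.columns):
    if col != "const":
        print(col, variance_inflation_factor(X_const.values, i))
```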

## 4. No Autocorrelation in the Residuals (Error Terms)

Autocorrelation occurs when the errors depend on each other. The error terms in linear regression should be independent and identically distributed.

• Autocorrelation occurs mostly in time-series models, where the next instant depends on the previous one.
• Due to autocorrelation, the estimated standard error tends to underestimate the true standard error.
• Due to autocorrelation, the confidence interval and prediction interval become narrower than they should be.
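
Autocorrelation in the residuals is commonly quantified with the Durbin-Watson statistic, which is close to 2 when the residuals are uncorrelated. A minimal sketch with hypothetical residuals:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Hypothetical residuals from a fitted regression model
rng = np.random.default_rng(1)
residuals = rng.normal(0, 1, 200)  # independent errors by construction

# ~2 means little autocorrelation; values toward 0 indicate positive,
# values toward 4 indicate negative autocorrelation
print("Durbin-Watson statistic:", durbin_watson(residuals))
```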

## 5. Normality (Gaussian Distribution)

The residuals (error terms) must follow a normal distribution. The normality condition can be relaxed when there are many observations, but with a small number of observations, the standard errors of the model will be unreliable.

Normality can be checked with a histogram or a Q-Q plot of the residuals; most of the residual values should lie near zero.
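
A quick way to produce such a plot is scipy's probplot, sketched here with hypothetical residuals; points hugging the diagonal indicate approximately normal errors.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical residuals from a fitted regression model
rng = np.random.default_rng(2)
residuals = rng.normal(0, 1, 150)

stats.probplot(residuals, dist="norm", plot=plt)  # Q-Q plot against a normal distribution
plt.title("Q-Q plot of residuals")
plt.show()
```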

Normality issues in linear regression can be fixed by:

• Removing outliers, or using a robust fitting method that is less sensitive to them.
• Adding more observations, if possible.

## 6. Homoscedasticity

Homoscedasticity means the variance of the error term should be constant across the values of the independent variables (equivalently, across the fitted values). It can be checked with a scatter plot of the residuals against the fitted values. Heteroscedasticity may be caused by the presence of outliers or by an incorrectly specified model.
If the homoscedasticity assumption is not satisfied, it can often be fixed by applying a log or square-root transformation to the dependent variable.
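
A residuals-versus-fitted-values plot is the standard visual check; a funnel shape suggests heteroscedasticity. A minimal sketch with hypothetical data constructed so that the noise grows with X:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data where the error variance grows with X (heteroscedastic)
rng = np.random.default_rng(3)
X = rng.uniform(1, 10, 200)
Y = 3 * X + rng.normal(0, X)        # noise scale increases with X

m, c = np.polyfit(X, Y, deg=1)
fitted = m * X + c
residuals = Y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("A funnel shape indicates heteroscedasticity")
plt.show()
```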

## Conclusion

In this article, we discussed the important assumptions of the linear regression algorithm that must be satisfied before fitting any model; otherwise, the predicted values may be insignificant.
