Multicollinearity in machine learning
Multicollinearity appears when two or more independent variables in a regression model are correlated with each other. This article discusses multicollinearity in machine learning, including its types, how to detect it, and how to remove it.
It is worth remembering that the standard error of a partial regression coefficient depends, among other things, on how strongly that independent variable is correlated with the other independent variables in the regression equation. All other things being equal, an independent variable that is highly correlated with one or more other independent variables has a relatively large standard error. That is, its partial regression coefficient is unstable and fluctuates widely from one sample to the next. This situation is known as multicollinearity.
Table of contents
- What is Multicollinearity?
- Multicollinearity real-life example
- Problems caused by multicollinearity
- Types of Multicollinearity
- Removing Multicollinearity
- Conclusion
What is Multicollinearity?
Multicollinearity occurs when two or more predictors in a regression model are highly correlated. Some degree of multicollinearity is almost always present in real data. The question is whether it is mild enough to ignore, and what to do when it is too severe to ignore.
Extreme multicollinearity occurs when an independent variable is very highly correlated with one or more other independent variables. In that case, the standard error of the partial regression coefficient for that variable is relatively large. With two such correlated predictors, neither coefficient may be statistically significant, even though both predictors are highly correlated with the dependent variable. The model cannot cleanly separate their effects, so each coefficient is estimated only relative to the other, and collinearity has a significant impact on the precision of the coefficients.
Consequences of Multicollinearity
- It widens the variances and covariances of the coefficient estimates, which makes hypothesis tests on individual coefficients unreliable.
- Standard errors increase and t-statistics decrease, so we may fail to reject a null hypothesis that should be rejected.
- R-squared can remain high even though few individual coefficients are statistically significant, so goodness of fit alone gives a misleading picture of the model.
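The effect on standard errors can be made precise with a standard result for ordinary least squares. The symbols are introduced here for illustration and do not appear elsewhere in this article: $\sigma^2$ is the error variance, $x_{ij}$ is the value of predictor $j$ for observation $i$, and $R_j^2$ is the R-squared obtained by regressing predictor $j$ on all the other predictors.

```latex
$$
\operatorname{Var}(\hat{\beta}_j)
  = \frac{\sigma^2}{\sum_{i}(x_{ij} - \bar{x}_j)^2} \cdot \frac{1}{1 - R_j^2}
$$
```

As $R_j^2$ approaches 1, the second factor grows without bound. For example, $R_j^2 = 0.9$ multiplies the variance of the coefficient by 10 compared with an uncorrelated predictor. This second factor is the variance inflation factor discussed later in the article.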
Multicollinearity real-life example
Consider a dataset in which we want to predict LPA from IQ and CGPA. IQ and CGPA are the independent features, and LPA is the dependent feature.
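Suppose LPA depends on IQ and CGPA. In that case, we can use multiple linear regression, whose equation is
$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$
Where
$\beta_1$ and $\beta_2$ are the coefficients ($\beta_0$ is the intercept)
$X_1$ and $X_2$ are the variables
For this example, we can write it as
$$\text{LPA} = \beta_0 + \beta_1\,\text{IQ} + \beta_2\,\text{CGPA}$$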
So we want to know: if we increase IQ while keeping CGPA constant, by how much does LPA increase? But suppose there is collinearity, meaning IQ and CGPA are correlated with each other. Then an increase in IQ tends to come with an increase in CGPA as well, so we cannot hold CGPA constant, and the individual coefficients can no longer be interpreted reliably.
In short, when features are multicollinear, it is difficult to determine the contribution of each related feature to the output.
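A small simulation can illustrate this instability. The sketch below uses entirely synthetic data: the variable roles (IQ, CGPA, LPA), the true coefficients of 1.0, the noise level, and the helper name `coefficient_spread` are all assumptions made for illustration. It refits the regression on many fresh samples and reports how much the fitted coefficients vary.

```python
import numpy as np

def coefficient_spread(rho, n=200, trials=500, seed=0):
    """Std. dev. of the fitted IQ and CGPA coefficients over repeated samples.

    rho is the correlation between the two (standardized) predictors.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    coefs = []
    for _ in range(trials):
        # Columns play the role of standardized IQ and CGPA (synthetic).
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        lpa = 5.0 + 1.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0.0, 1.0, size=n)
        design = np.column_stack([np.ones(n), X])  # add intercept column
        beta, *_ = np.linalg.lstsq(design, lpa, rcond=None)
        coefs.append(beta[1:])  # keep only the two slope coefficients
    return np.std(coefs, axis=0)

spread_independent = coefficient_spread(rho=0.0)
spread_collinear = coefficient_spread(rho=0.95)
print("rho = 0.00:", spread_independent)
print("rho = 0.95:", spread_collinear)
```

Under these assumptions, the coefficients fitted on strongly correlated predictors should scatter several times more widely than those fitted on uncorrelated predictors, even though the true relationship is identical in both cases.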
Problems caused by multicollinearity
The idea behind interpreting a regression coefficient is that you can change the value of one independent variable without affecting the others. However, when the independent variables are correlated, changes in one variable are associated with changes in another. The stronger the correlation, the more difficult it is to change one variable without changing another. Because the independent variables vary together, it is difficult for the model to estimate the relationship between each independent variable and the dependent variable separately. This raises an important question.
Is multicollinearity always a problem? The answer is no. If you are using linear regression purely for prediction, multicollinearity will not affect the results much, even if it is present. If, on the other hand, you are using linear regression to assess feature importance, multicollinearity makes the coefficient estimates unreliable, and you should deal with it first. In many cases, linear regression is used precisely to assess feature importance.
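The claim about prediction can be checked with a rough sketch on synthetic data. Everything here is an assumption for illustration: the data-generating process, the sample sizes, and the helper name `holdout_rmse`. The sketch fits ordinary least squares on a training sample and measures the error on fresh data drawn with the same correlation between predictors.

```python
import numpy as np

def holdout_rmse(rho, n_train=100, n_test=5000, seed=1):
    """RMSE on fresh data for an OLS fit with predictor correlation rho."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])

    def sample(n):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = 5.0 + X[:, 0] + X[:, 1] + rng.normal(0.0, 1.0, size=n)
        return np.column_stack([np.ones(n), X]), y  # design matrix, target

    X_train, y_train = sample(n_train)
    X_test, y_test = sample(n_test)
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return float(np.sqrt(np.mean((y_test - X_test @ beta) ** 2)))

rmse_independent = holdout_rmse(rho=0.0)
rmse_collinear = holdout_rmse(rho=0.99)
print("RMSE, rho = 0.00:", round(rmse_independent, 3))
print("RMSE, rho = 0.99:", round(rmse_collinear, 3))
```

In this setup the two errors should come out close to each other and close to the noise level of 1.0. Note the caveat built into the sketch: the test data has the same correlation structure as the training data. If new data breaks that pattern, predictions from a collinear model can degrade.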
Types of Multicollinearity
1. Structural Multicollinearity
It is a mathematical artifact caused by creating new predictors from other predictors (for example, creating the predictor x² from the predictor x). Another example is when a data scientist adds new features to the dataset through one-hot encoding. Suppose you have a feature City with the values A, B, and C, which one-hot encoding turns into the columns City A, City B, and City C.
In this case, we can remove one of the columns City A, City B, and City C, because its value is fully determined by the other two. If we don't, the result is structural multicollinearity.
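As a minimal sketch of the one-hot encoding case, the example below uses a made-up five-row `City` column. `pandas.get_dummies` creates one column per city, and its `drop_first=True` option removes the first of them.

```python
import pandas as pd

# Hypothetical data: a single categorical feature with three cities.
cities = pd.DataFrame({"City": ["A", "B", "C", "A", "C"]})

# Full one-hot encoding: one column per city.
full = pd.get_dummies(cities["City"], prefix="City", dtype=int)

# Same encoding with the first category dropped.
reduced = pd.get_dummies(cities["City"], prefix="City", drop_first=True, dtype=int)

print(full.sum(axis=1).tolist())  # [1, 1, 1, 1, 1]
print(list(reduced.columns))      # ['City_B', 'City_C']
```

Every row of the full encoding sums to 1, so any one column equals 1 minus the sum of the others. That exact linear dependence is the structural multicollinearity described above. In the reduced encoding, City A is represented implicitly by zeros in both remaining columns.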
2. Data multicollinearity
This type of multicollinearity exists in the data itself and is not an artifact of the model. Observational studies tend to show this type of multicollinearity. In other words, the multicollinearity was already present when the data was collected.
Removing Multicollinearity
1. Domain knowledge
Use domain knowledge to anticipate which independent variables are likely to move together, then confirm those highly correlated pairs by examining the correlation matrix of the independent variables.
2. Draw a Scatter plot
import pandas as pd
import matplotlib.pyplot as plt

# Load the Iris dataset and preview the first rows
df = pd.read_csv('Iris.csv')
print(df.head())

# Plot sepal length against petal width
plt.scatter(df['SepalLengthCm'], df['PetalWidthCm'])
plt.show()
In this plot, you can see a roughly linear relationship between sepal length and petal width, so we can say that multicollinearity is present here.
3. Correlation matrix
Build a correlation matrix and plot it as a heatmap. Here we use the Iris dataset with the following code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

iris_df = pd.read_csv('Iris.csv')
plt.figure(figsize=(8, 6))
iris_corr = iris_df.corr(numeric_only=True)  # skip non-numeric columns
sns.heatmap(iris_corr, annot=True)
plt.show()
From the graph, we can see a strong correlation between petal length and petal width and a good correlation between petal width and sepal length.
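With many features, reading the heatmap by eye becomes tedious. One possible approach, sketched below, is to list every pair whose absolute correlation exceeds a threshold. The helper name `correlated_pairs`, the threshold of 0.8, and the synthetic `demo` data are assumptions for illustration, not part of the Iris example.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for every column pair with |r| >= threshold."""
    corr = df.corr(numeric_only=True)
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, no duplicates
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Synthetic demo data: b is built from a, c is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.3, size=300),
    "c": rng.normal(size=300),
})
pairs = correlated_pairs(demo)
print(pairs)
```

The same function could be called on the Iris DataFrame. The threshold is a judgment call and would normally be chosen with the domain in mind.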
4. Variance inflation factor(VIF)
The variance inflation factor (VIF) is a simple test for assessing multicollinearity in a regression model. It measures how strongly each independent variable is correlated with the other independent variables.
Statistical software calculates a VIF for each independent variable. VIFs start at 1 and have no upper limit. A value of 1 indicates no correlation between this independent variable and the others. A VIF between 1 and 5 indicates moderate correlation that is usually not strong enough to warrant corrective action. A VIF above 5 indicates a critical level of multicollinearity, where the coefficients are poorly estimated and their p-values are questionable. So if variables show a VIF above 5, you can delete one of the correlated independent features.
| VIF value | Conclusion |
| --- | --- |
| 1 | No correlation |
| 1-5 | Moderate correlation |
| >5 | Multicollinearity (can delete that feature) |
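The VIF of a predictor can be computed as 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below implements this definition directly with NumPy. The function name `vif` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (2-D array, one column per predictor)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress predictor j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r_squared)
    return out

# Synthetic demo: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.2, size=500)
x3 = rng.normal(size=500)
vif_values = vif(np.column_stack([x1, x2, x3]))
print(vif_values)
```

Here the first two VIFs should land well above 5 and the third close to 1. In practice a library routine is usually preferable, for example `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.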
5. Principal Component analysis
Perform a principal component analysis (PCA). PCA reduces the number of variables in your data, but it does so by building new variables rather than by choosing which original ones to remove. The transformation combines the existing predictors into components so that only the most informative parts are kept. The resulting components are uncorrelated with each other.
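To show that the components really are uncorrelated, here is a minimal PCA sketch built on NumPy's singular value decomposition. The helper name `pca_scores` and the synthetic predictors are assumptions for illustration. In practice, predictors are often standardized first, and a library implementation would typically be used.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project centered data onto its first n_components principal axes."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each column
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T            # new, uncorrelated predictors
    explained = (S ** 2) / np.sum(S ** 2)        # share of variance per axis
    return scores, explained[:n_components]

# Synthetic demo: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.2, size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

scores, explained = pca_scores(X, n_components=2)
component_corr = np.corrcoef(scores, rowvar=False)[0, 1]
print("correlation between components:", component_corr)
print("variance explained:", explained)
```

The correlation between the two components is zero up to floating-point error. The trade-off is interpretability: each component is a mixture of the original predictors, so its regression coefficient no longer refers to a single feature.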
6. LASSO and Ridge regression
These are regularized regression methods that can handle multicollinearity. If you know how to do least-squares linear regression, you can run these analyses with a little extra study.
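Ridge regression has a closed-form solution, which makes it easy to sketch. The example below is an illustration under stated assumptions: the helper name `ridge_fit`, the penalty value of 10, and the synthetic data are all made up. Setting the penalty `alpha` to 0 gives ordinary least squares.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients on centered data (no penalized intercept)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    # Solve (X'X + alpha * I) beta = X'y
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)

# Synthetic demo: two nearly identical predictors, true coefficients 1 and 1.
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

ols = ridge_fit(X, y, alpha=0.0)     # ordinary least squares
ridge = ridge_fit(X, y, alpha=10.0)  # penalized fit
print("OLS coefficients:  ", ols)
print("ridge coefficients:", ridge)
```

The penalty shrinks the coefficient vector, which damps the wild swings that collinear predictors produce in plain least squares. LASSO has no closed form and needs an iterative solver. Libraries such as scikit-learn provide both as `Ridge` and `Lasso` in `sklearn.linear_model`.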
How to check for multicollinearity?
As the sections above show, we can detect multicollinearity with a
- Scatter graph
- variance inflation factor
- Correlation matrix
Conclusion
Multicollinearity is considered one of the main problems in linear regression analysis, because strongly correlated independent variables change together, which distorts their estimated coefficients. This affects the entire model that researchers are preparing for analysis. Therefore, it is recommended to check for possible collinearity before performing regression analysis.