Difference Between Linear and Multiple Regression

Vikram Singh
Assistant Manager - Content
Updated on Oct 6, 2023 16:06 IST

Simple linear regression models the relationship between one predictor and an outcome, while multiple regression models how several predictors jointly influence that outcome. Both are essential tools in predictive analytics, and knowing how they differ helps you choose the right model for your data. This article covers the core distinctions and when to use each approach.

Linear Regression vs Multiple Regression

| Parameter | Linear (Simple) Regression | Multiple Regression |
|---|---|---|
| Definition | Models the relationship between one dependent and one independent variable. | Models the relationship between one dependent and two or more independent variables. |
| Equation | Y = C0 + C1X + e | Y = C0 + C1X1 + C2X2 + C3X3 + ... + CnXn + e |
| Complexity | Simpler, as it deals with one relationship. | More complex due to multiple relationships. |
| Use Cases | Suitable when there is one clear predictor. | Suitable when multiple factors affect the outcome. |
| Assumptions | Linearity, Independence, Homoscedasticity, Normality | Same as simple linear regression, with the added concern of multicollinearity. |
| Visualization | Typically visualized with a 2D scatter plot and a line of best fit. | Needs a 3D plot for two predictors; with more predictors, often represented using partial regression plots. |
| Risk of Overfitting | Lower, as it deals with only one predictor. | Higher, especially if too many predictors are used without adequate data. |
| Multicollinearity Concern | Not applicable, as there’s only one predictor. | A primary concern; correlated predictors can affect the model’s accuracy and interpretation. |
| Applications | Basic research, simple predictions, understanding a singular relationship. | Complex research, multifactorial predictions, studying interrelated systems. |

What is Linear Regression?

Linear regression is a statistical method used to model the relationship between a dependent variable and one independent variable. It aims to establish a linear relationship between these variables and can be used for both prediction and understanding the nature of the relationship.

Mathematical Equation

The mathematical representation of simple linear regression is:

Y = C0 + C1X + e

where,

  • Y: Dependent Variable (target variable)
  • X: Independent Variable (input variable)
  • C0: Intercept (expected value of Y when X = 0)
  • C1: Slope of the line (change in Y for a one-unit change in X)
  • e: Error term
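
As a minimal sketch of how C0 and C1 can be estimated, the snippet below applies the ordinary least-squares formulas in plain Python. The function name fit_simple_regression and the toy data are hypothetical and chosen only for illustration; the data lie exactly on Y = 2 + 3X, so the error term is zero.

```python
def fit_simple_regression(x, y):
    """Return (c0, c1) that minimise the sum of squared errors for Y = C0 + C1*X."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of X and Y divided by the variation of X
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    c1 = sxy / sxx
    # Intercept: forces the line through the point (x_mean, y_mean)
    c0 = y_mean - c1 * x_mean
    return c0, c1

# Hypothetical toy data lying exactly on Y = 2 + 3X
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 8.0, 11.0, 14.0]
c0, c1 = fit_simple_regression(x, y)
print(c0, c1)  # 2.0 3.0
```

With real data the points will not fall exactly on a line, and the same formulas return the line that minimises the squared errors.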

Assumptions of Linear Regression

Here are the assumptions that must be satisfied for the linear regression model to be valid.

  • Linearity: The relationship between the independent and dependent variables should be linear.
  • Independence: Observations should be independent of each other.
  • Homoscedasticity: The variance of the errors should be the same across all values of the independent variable.
  • Normality: The errors (equivalently, the dependent variable at any fixed value of the independent variable) should be normally distributed.
  • No Multicollinearity: This applies only to multiple regression, where the independent variables should not be highly correlated with each other.

Limitations of Linear Regression

  • Outliers: These can significantly impact the slope and intercept of the regression line.
  • Non-linearity: Linear regression assumes a linear relationship, but this assumption may not hold in some cases.
  • Correlation ≠ Causation: Just because two variables have a linear relationship doesn’t mean changes in one cause changes in the other.
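
To make the outlier limitation concrete, here is a small sketch using NumPy's np.polyfit on hypothetical toy data. The data and variable names are assumptions for illustration only; the point is how a single extreme observation shifts the fitted line.

```python
import numpy as np

# Hypothetical toy data lying exactly on Y = 1 + 2X
x = np.arange(10, dtype=float)
y_clean = 1.0 + 2.0 * x

# Same data, but the last observation is replaced by an extreme value
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, 1)

print(f"Without outlier: slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"With outlier:    slope={slope_outlier:.2f}, intercept={intercept_outlier:.2f}")
```

In this toy case one altered point out of ten more than triples the slope and pushes the intercept below zero, which is why inspecting the data for outliers before fitting is usually worthwhile.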

What is Multiple Regression?

Multiple regression is an extension of simple linear regression. It’s used to model the relationship between one dependent variable and two or more independent variables. The primary purpose is to understand how the dependent variable changes as the independent variables change.

Mathematical Equation

The mathematical representation of multiple regression is:

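Y = C0 + C1X1 + C2X2 + C3X3 + ... + CnXn + e

where,

  • Y: Dependent Variable (target variable)
  • X1, X2, X3, ..., Xn: Independent Variables (input variables)
  • C0: Intercept (expected value of Y when X1, X2, ..., Xn are all 0)
  • C1, C2, C3, ..., Cn: Coefficients (Ci is the change in Y for a one-unit change in Xi, with the other independent variables held constant)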
  • e: Error term
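
The coefficients in this equation can be estimated by least squares. The sketch below uses NumPy's np.linalg.lstsq on hypothetical toy data generated exactly from Y = 1 + 2X1 + 3X2 (no error term), so the recovered coefficients can be checked by eye. The data and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical toy data generated exactly from Y = 1 + 2*X1 + 3*X2
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient plays the role of C0
design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of design @ coeffs ≈ y
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
c0, c1, c2 = coeffs
print(np.round(coeffs, 6))
```

The column of ones is what turns the intercept into just another coefficient; libraries such as scikit-learn add it for you by default.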

Assumptions of Multiple Regression

  • Linearity: A linear relationship exists between the dependent and independent variables.
  • Independence: Observations are independent of each other.
  • No multicollinearity: Independent variables aren’t too highly correlated with each other.
  • Homoscedasticity: Constant variance of the errors.
  • No Autocorrelation: The residuals (errors) are independent.
  • Normality: The dependent variable is normally distributed for any fixed value of the independent variables.
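
One common way to check the multicollinearity assumption is the variance inflation factor (VIF), which regresses each independent variable on the others and reports 1 / (1 - R²). The following is a minimal NumPy sketch, not a reference implementation; the function name vif and the simulated data are hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (with an intercept)
        design = np.column_stack([np.ones(n), others])
        coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coeffs
        r_squared = 1.0 - residuals.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

# Hypothetical simulated predictors
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)               # generated independently of x1
x3 = x1 + 0.1 * rng.normal(size=200)    # nearly a copy of x1

vif_low = vif(np.column_stack([x1, x2]))
vif_high = vif(np.column_stack([x1, x2, x3]))
print(vif_low)    # values close to 1
print(vif_high)   # x1 and x3 are strongly inflated
```

A VIF near 1 means a predictor is barely explained by the others. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, though the exact cut-off is a convention rather than a strict rule.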

Limitations of Multiple Regression

  • Overfitting: Including too many independent variables can lead to a model that fits the training data too closely.
  • Omitted Variable Bias: Leaving out a significant independent variable can bias the coefficients of other variables.
  • Endogeneity: Occurs when an independent variable is correlated with the error term, leading to biased coefficient estimates.
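
Omitted variable bias can be seen in a small simulation. In the sketch below, the data-generating process (Y = 1 + 2X1 + 3X2 + noise, with X2 correlated with X1) and the helper fit are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def fit(columns, target):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(target))] + list(columns))
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs

full = fit([x1, x2], y)    # both predictors included
short = fit([x1], y)       # x2 omitted

print("x1 coefficient with x2 included:", round(full[1], 2))
print("x1 coefficient with x2 omitted: ", round(short[1], 2))
```

With X2 included, the coefficient on X1 lands near its true value of 2. With X2 omitted, X1 absorbs part of X2's effect and its coefficient moves towards 2 + 3 × 0.8 = 4.4, even though nothing about X1's real effect has changed.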

By now, you should have a clear understanding of what linear and multiple regression are, their mathematical equations, their assumptions, and their limitations, as well as how the two differ from each other. Now it’s time for an example that fits both models in Python and compares their prediction errors.

Example of Linear and Multiple Regression

Problem Statement: Suppose we have data for a retail company. The company wants to understand how its advertising expenses across various channels (e.g., TV, Radio) impact sales.

  1. Linear Regression: Predict sales using only TV advertising expenses.
  2. Multiple Regression: Predict sales using both TV and Radio advertising expenses.

Step-1: Generate a random dataset


 
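import numpy as np
import pandas as pd

# Generate a reproducible synthetic dataset of 100 observations
np.random.seed(0)
tv = 100 + 50 * np.random.rand(100)      # TV spend, uniform on [100, 150)
radio = 50 + 25 * np.random.rand(100)    # Radio spend, uniform on [50, 75)
# Sales depend linearly on both channels, plus Gaussian noise (standard deviation 30)
sales = 200 + 3*tv + 1.5*radio + 30*np.random.randn(100)
data = pd.DataFrame({'TV': tv, 'Radio': radio, 'Sales': sales})

# Show the first five rows
data.head()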

Output

[Image: the first five rows of the generated DataFrame, with columns TV, Radio, and Sales]

Step-2: Split the dataset into training and test sets


 
# Split the data into training (80%) and test (20%) sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2, random_state=0)

Step-3: Evaluating Root Mean Squared Error for Linear Regression


 
# Linear Regression
# Using only TV expenses for prediction
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Double brackets keep the predictors as a 2D DataFrame, as scikit-learn expects
X_train_tv = train[['TV']]
y_train = train['Sales']
X_test_tv = test[['TV']]
y_test = test['Sales']

# Fit on the training set, then predict on the held-out test set
linear_model = LinearRegression().fit(X_train_tv, y_train)
linear_pred = linear_model.predict(X_test_tv)

# Evaluation: RMSE is the square root of the mean squared error
linear_rmse = np.sqrt(mean_squared_error(y_test, linear_pred))

Step-4: Evaluating Root Mean Squared Error for Multiple Regression


 
# Multiple Regression
# Using both TV and Radio expenses for prediction
X_train_multi = train[['TV', 'Radio']]
X_test_multi = test[['TV', 'Radio']]

# Same targets (y_train, y_test) as in Step-3; only the predictors change
multiple_model = LinearRegression().fit(X_train_multi, y_train)
multiple_pred = multiple_model.predict(X_test_multi)

# Evaluation: RMSE on the same test set, so the two models are directly comparable
multiple_rmse = np.sqrt(mean_squared_error(y_test, multiple_pred))

Step-5: Print the results


 
# Error Metrics
print(f"Linear Regression RMSE: {linear_rmse:.2f}")
print(f"Multiple Regression RMSE: {multiple_rmse:.2f}")

Output

Linear Regression RMSE: 27.18

Multiple Regression RMSE: 25.27

Explanation

The RMSE for linear regression (27.18) is greater than the RMSE for multiple regression (25.27). Because both values were computed on the held-out test set, this suggests that the multiple regression model predicts sales more accurately on this dataset.
Adding relevant predictors (features) typically improves a model’s performance, but you must be cautious about overfitting. Also, if the added features are correlated with each other, they can introduce multicollinearity.
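
For readers who want to see what the RMSE comparison measures, the sketch below computes RMSE by hand: the square root of the average squared difference between actual and predicted values. The function name rmse and the four data points are hypothetical and unrelated to the advertising dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values: the errors are 3, -4, 0 and 0
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [13.0, 8.0, 14.0, 16.0]
result = rmse(actual, predicted)
print(result)  # 2.5
```

RMSE is expressed in the same units as the target variable, so a lower value means predictions are, on average, closer to the actual values.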

Now, let’s see what the plots of linear and multiple regression look like:

Linear Regression


 
# For Linear Regression
import matplotlib.pyplot as plt

# Actual test-set sales (blue) against the model's predictions (red)
plt.scatter(X_test_tv, y_test, color='blue', label='True values')
plt.scatter(X_test_tv, linear_pred, color='red', label='Predicted values')
plt.xlabel('TV Expenses')
plt.ylabel('Sales')
plt.title('Linear Regression: TV vs Sales')
plt.legend()
plt.show()

Output

[Image: scatter plot of TV expenses against sales, showing true values in blue and predicted values in red]

Multiple Regression


 
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # only needed to register the '3d' projection on older Matplotlib versions

# Setting up the 3D plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

# Scatter plot of the actual training data
ax.scatter(train['TV'], train['Radio'], train['Sales'], color='blue', marker='o', alpha=0.5, label='True values')

# Creating a meshgrid over the observed range of both predictors
x_surf = np.linspace(train['TV'].min(), train['TV'].max(), 100)
y_surf = np.linspace(train['Radio'].min(), train['Radio'].max(), 100)
x_surf, y_surf = np.meshgrid(x_surf, y_surf)

# Predicting sales at every grid point to obtain the fitted regression plane
vals = pd.DataFrame({'TV': x_surf.ravel(), 'Radio': y_surf.ravel()})
predicted_sales = multiple_model.predict(vals)

# Drawing the fitted plane as a translucent surface
ax.plot_surface(x_surf, y_surf, predicted_sales.reshape(x_surf.shape), color='red', alpha=0.3)

# Labeling the axes
ax.set_xlabel('TV Expenses')
ax.set_ylabel('Radio Expenses')
ax.set_zlabel('Sales')
ax.set_title('Multiple Regression: Sales predicted by TV and Radio Expenses')
ax.legend()
plt.show()

Output

[Image: 3D plot of sales against TV and Radio expenses, showing the training data points and the fitted regression plane]

About the Author
Vikram Singh
Assistant Manager - Content

Vikram has a Postgraduate degree in Applied Mathematics, with a keen interest in Data Science and Machine Learning. He has experience of 2+ years in content creation in Mathematics, Statistics, Data Science, and Mac...