# Difference Between Linear and Multiple Regression

Vikram Singh
Assistant Manager - Content
Updated on Oct 6, 2023 16:06 IST

Linear regression examines the relationship between one predictor and an outcome, while multiple regression delves into how several predictors influence that outcome. Both are essential tools in predictive analytics, but knowing their differences ensures effective and accurate modelling. Dive in to discover the core distinctions and when to use each approach.

## What is Linear Regression?

Linear regression (more precisely, simple linear regression) is a statistical method used to model the relationship between a dependent variable and one independent variable. It fits a straight line to the data and can be used both for prediction and for understanding the nature of the relationship.

### Mathematical Equation

The mathematical representation of simple linear regression is:

Y = C0 + C1X + e

where,

• Y: Dependent Variable (target variable)
• X: Independent Variable (input variable)
• C0: Intercept (value of Y when X=0)
• C1: Slope of line
• e: Error term
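To make the equation concrete, here is a minimal sketch of how C0 and C1 can be estimated with the standard closed-form least-squares formulas. The function name `fit_simple_regression` and the five data points are made up for illustration; they are not part of the example later in this article.

```python
# Closed-form least-squares estimates for Y = C0 + C1*X + e.
# The data below are invented purely for illustration.

def fit_simple_regression(x, y):
    """Return (c0, c1) that minimise the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # C1 = sum of cross-deviations / sum of squared deviations of X
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    c1 = sxy / sxx
    # C0 makes the fitted line pass through the point of means
    c0 = mean_y - c1 * mean_x
    return c0, c1

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]  # exactly y = 1 + 2x, so every error term is 0

c0, c1 = fit_simple_regression(x, y)
print(c0, c1)  # prints 1.0 2.0
```

Because the sample points lie exactly on a line, the estimates recover the intercept and slope with no error. With noisy data the same formulas give the best-fitting line in the least-squares sense.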

### Assumptions of Linear Regression

Here are some assumptions that must be satisfied for the linear regression model to be valid.

• Linearity: The relationship between the independent and dependent variables should be linear.
• Independence: Observations should be independent of each other.
• Homoscedasticity: The variance of the errors should be the same across all levels of the independent variables.
• Normality: The dependent variable is normally distributed for a fixed value of the independent variable.
• No Multicollinearity: This is more pertinent for multiple regression, where the independent variables should not be highly correlated with one another.
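Some of these assumptions can be checked informally by looking at the residuals of a fitted line. The sketch below is one rough way to do that, assuming synthetic data with constant-variance noise; the variable names and the idea of comparing the residual spread in the lower and upper halves of X are illustrative choices, not a formal statistical test.

```python
import numpy as np

# Synthetic data (an assumption for illustration): linear trend plus
# noise whose variance does not depend on x
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 5 + 2 * x + rng.normal(0, 1, size=x.size)

# np.polyfit returns coefficients from the highest degree down
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Rough homoscedasticity check: residual spread for low vs. high x
low_spread = residuals[x < 5].std()
high_spread = residuals[x >= 5].std()
spread_ratio = high_spread / low_spread

print(f"Mean residual: {residuals.mean():.4f}")
print(f"Spread ratio (high x / low x): {spread_ratio:.2f}")
```

A spread ratio far from 1 would hint that the error variance changes with X. In practice a residual plot is usually more informative than a single number.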

### Limitations of Linear Regression

• Outliers: Extreme observations can significantly change the slope and intercept of the regression line.
• Non-linearity: Linear regression assumes a linear relationship, but this assumption may not hold in some cases.
• Correlation ≠ Causation: Just because two variables have a linear relationship doesn’t mean changes in one cause changes in the other.
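The effect of outliers is easy to see on a small example. The following sketch uses made-up data that lie exactly on the line y = 1 + 2x, then replaces a single point with an extreme value and refits. The numbers are hypothetical and chosen only to make the shift obvious.

```python
import numpy as np

x = np.arange(10, dtype=float)
y_clean = 1 + 2 * x          # points exactly on the line y = 1 + 2x

y_outlier = y_clean.copy()
y_outlier[-1] = 100.0        # the last point would be 19 without the outlier

# Fit a straight line to each version of the data
slope_clean, intercept_clean = np.polyfit(x, y_clean, deg=1)
slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)

print(f"Without outlier: slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"With outlier:    slope={slope_out:.2f}, intercept={intercept_out:.2f}")
```

One changed observation out of ten is enough to pull the fitted slope well away from 2, which is why it is worth inspecting the data before trusting the coefficients.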

## What is Multiple Regression?

Multiple regression is an extension of simple linear regression. It’s used to model the relationship between one dependent variable and two or more independent variables. The primary purpose is to understand how the dependent variable changes as the independent variables change.

### Mathematical Equation

The mathematical representation of multiple regression is:

Y = C0 + C1X1 + C2X2 + C3X3 + … + CnXn + e

where,

• Y: Dependent Variable (target variable)
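• X1, X2, X3, …, Xn: Independent Variables (input variables)
• C0: Intercept (value of Y when all of X1, X2, …, Xn are 0)
• C1, C2, C3, …, Cn: Coefficients (Ci is the change in Y for a one-unit change in Xi, holding the other independent variables constant)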
• e: Error term
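As a minimal sketch of how the coefficients in this equation can be estimated, the snippet below builds a design matrix with a column of ones for the intercept and solves the least-squares problem with NumPy. The five data rows are invented for illustration and follow y = 1 + 2·X1 + 3·X2 exactly, so the error term is zero.

```python
import numpy as np

# Hypothetical data: two independent variables, five observations
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]   # no noise, so e = 0 everywhere

# Prepend a column of ones so the first coefficient acts as C0
design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of design @ coef ≈ y
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
c0, c1, c2 = coef

print(f"C0={c0:.2f}, C1={c1:.2f}, C2={c2:.2f}")
```

This is essentially the same calculation that a library model such as scikit-learn's `LinearRegression` performs when it is fitted on several features.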

### Assumptions of Multiple Regression

• Linearity: A linear relationship exists between the dependent and independent variables.
• Independence: Observations are independent of each other.
• No multicollinearity: Independent variables aren’t too highly correlated with each other.
• Homoscedasticity: Constant variance of the errors.
• No Autocorrelation: The residuals (errors) are independent.
• Normality: The dependent variable is normally distributed for any fixed value of the independent variables.
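One common way to screen for multicollinearity is the variance inflation factor (VIF), which regresses each independent variable on the others and reports 1 / (1 − R²). The sketch below implements it directly with NumPy. The helper name `vif` and the synthetic `online` channel are hypothetical, and thresholds such as "VIF above 5 or 10 is a concern" are rules of thumb rather than strict limits.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (with an intercept)
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return factors

# Synthetic predictors (assumptions for illustration)
rng = np.random.default_rng(0)
tv = rng.normal(size=500)
radio = rng.normal(size=500)                 # unrelated to tv
online = tv + 0.1 * rng.normal(size=500)     # nearly a copy of tv

vif_independent = vif(np.column_stack([tv, radio]))
vif_collinear = vif(np.column_stack([tv, radio, online]))

print("Independent predictors:", np.round(vif_independent, 2))
print("With a near-duplicate: ", np.round(vif_collinear, 2))
```

Values close to 1 indicate that a predictor is barely explained by the others, while the near-duplicate column drives the VIF of both `tv` and `online` far higher.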

### Limitations of Multiple Regression

• Overfitting: Including too many independent variables can lead to a model that fits the training data too closely.
• Omitted Variable Bias: Leaving out a significant independent variable can bias the coefficients of other variables.
• Endogeneity: Occurs when an independent variable is correlated with the error term, leading to biased coefficient estimates.
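Omitted variable bias can be illustrated with a small simulation. In the sketch below, which uses synthetic data and a hypothetical helper `ols`, the outcome depends on two correlated predictors. Fitting the model with only the first predictor pushes part of the second predictor's effect into the first coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two correlated predictors (synthetic, for illustration only)
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)

# True model: y = 1 + 2*x1 + 3*x2 + noise
y = 1 + 2 * x1 + 3 * x2 + rng.normal(scale=0.5, size=n)

def ols(predictors, y):
    """Least-squares coefficients; the first entry is the intercept."""
    design = np.column_stack([np.ones(len(y))] + predictors)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

full = ols([x1, x2], y)   # both predictors included
short = ols([x1], y)      # x2 omitted

print(f"x1 coefficient with x2 included: {full[1]:.2f}")
print(f"x1 coefficient with x2 omitted:  {short[1]:.2f}")
```

With both predictors in the model the x1 coefficient is close to its true value of 2. With x2 left out it drifts toward 2 + 3 × 0.8 = 4.4, because x1 absorbs the effect of the variable it is correlated with.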

So far, we have covered what linear and multiple regression are, along with their mathematical equations, assumptions, and limitations, which should make the differences between the two clearer. Next, we will work through an example that fits both models in Python and compares their prediction errors.

## Example of Linear and Multiple Regression

Problem Statement: Suppose we have data for a retail company. The company wants to understand how their advertising expenses in various channels (e.g., TV, Radio) impact sales.

1. Linear Regression: Predict sales using only TV advertising expenses.
2. Multiple Regression: Predict sales using both TV and Radio advertising expenses.

Step-1: Generate a random dataset

```python
import numpy as np
import pandas as pd

# Sample data generation
np.random.seed(0)
tv = 100 + 50 * np.random.rand(100)
radio = 50 + 25 * np.random.rand(100)
sales = 200 + 3*tv + 1.5*radio + 30*np.random.randn(100)

data = pd.DataFrame({'TV': tv, 'Radio': radio, 'Sales': sales})

# Show the first five rows
data.head()
```

Output: the first five rows of the dataset, with columns TV, Radio, and Sales.

Step-2: Split the dataset into training and test dataset

```python
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2, random_state=0)
```

Step-3: Evaluating Root Mean Squared Error (RMSE) for Linear Regression

```python
# Linear Regression
# Using only TV expenses for prediction
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X_train_tv = train[['TV']]
y_train = train['Sales']
X_test_tv = test[['TV']]
y_test = test['Sales']

linear_model = LinearRegression().fit(X_train_tv, y_train)
linear_pred = linear_model.predict(X_test_tv)

# Evaluation
linear_rmse = np.sqrt(mean_squared_error(y_test, linear_pred))
```
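The evaluation metric used here, RMSE, is the square root of the average squared difference between actual and predicted values. As a minimal sketch of what that calculation does, here it is written out in plain Python on three made-up values. The function name `rmse` and the numbers are illustrative only.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4;
# their mean is 5/3 and the RMSE is its square root
value = rmse(actual, predicted)
print(f"RMSE: {value:.3f}")
```

RMSE is expressed in the same units as the target variable (sales, in this example), and lower values mean predictions that are closer to the actual values on average.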

Step-4: Evaluating Root Mean Squared Error (RMSE) for Multiple Regression

```python
# Multiple Regression
# Using both TV and Radio expenses for prediction
X_train_multi = train[['TV', 'Radio']]
X_test_multi = test[['TV', 'Radio']]

multiple_model = LinearRegression().fit(X_train_multi, y_train)
multiple_pred = multiple_model.predict(X_test_multi)

# Evaluation
multiple_rmse = np.sqrt(mean_squared_error(y_test, multiple_pred))
```

Step-5: Print the results

```python
# Error Metrics
print(f"Linear Regression RMSE: {linear_rmse:.2f}")
print(f"Multiple Regression RMSE: {multiple_rmse:.2f}")
```

Output

Linear Regression RMSE: 27.18

Multiple Regression RMSE: 25.27

Explanation

The RMSE for linear regression (27.18) is greater than the RMSE for multiple regression (25.27), which means the multiple regression model predicts the test data more accurately.
Adding more relevant predictors (features) typically improves model performance, but you must be cautious about overfitting. Also, if the added features are correlated with each other, they can introduce multicollinearity.
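Because the dataset is synthetic, the fitted coefficients can also be compared with the values used to generate the data (3 for TV and 1.5 for Radio). The sketch below recreates the data from Step-1 and, as a simplifying assumption, fits all 100 rows with NumPy least squares rather than reusing the scikit-learn model trained on the 80-row split, so the numbers will differ slightly from that model's coefficients.

```python
import numpy as np

# Recreate the synthetic data from Step-1 (same seed and formulas)
np.random.seed(0)
tv = 100 + 50 * np.random.rand(100)
radio = 50 + 25 * np.random.rand(100)
sales = 200 + 3*tv + 1.5*radio + 30*np.random.randn(100)

# Assumption: fit on all 100 rows with plain least squares instead of
# the scikit-learn model trained on the training split
design = np.column_stack([np.ones(len(tv)), tv, radio])
coef, *_ = np.linalg.lstsq(design, sales, rcond=None)
intercept, coef_tv, coef_radio = coef

print(f"TV coefficient:    {coef_tv:.2f} (value used to generate the data: 3)")
print(f"Radio coefficient: {coef_radio:.2f} (value used to generate the data: 1.5)")
```

The estimates will not match the generating values exactly because of the random noise added to sales, but they should land in the same neighbourhood. The intercept is usually the least precisely estimated term here, since none of the observations lie near TV = 0 or Radio = 0.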

Now, let us see what the plots of linear and multiple regression look like:

#### Linear Regression

```python
import matplotlib.pyplot as plt

# For Linear Regression
plt.scatter(X_test_tv, y_test, color='blue', label='True values')
plt.scatter(X_test_tv, linear_pred, color='red', label='Predicted values')
plt.xlabel('TV Expenses')
plt.ylabel('Sales')
plt.title('Linear Regression: TV vs Sales')
plt.legend()
plt.show()
```

Output: a scatter plot of true and predicted sales against TV expenses.

#### Multiple Regression

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib versions

# Setting up the 3D plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

# Scatter plot of actual data
ax.scatter(train['TV'], train['Radio'], train['Sales'],
           color='blue', marker='o', alpha=0.5, label='True values')

# Creating a meshgrid for the plane
x_surf = np.linspace(train['TV'].min(), train['TV'].max(), 100)
y_surf = np.linspace(train['Radio'].min(), train['Radio'].max(), 100)
x_surf, y_surf = np.meshgrid(x_surf, y_surf)

# Predicting the values from the meshed grid
vals = pd.DataFrame({'TV': x_surf.ravel(), 'Radio': y_surf.ravel()})
predicted_sales = multiple_model.predict(vals)
ax.plot_surface(x_surf, y_surf, predicted_sales.reshape(x_surf.shape),
                color='None', alpha=0.3)

# Labeling the axes
ax.set_xlabel('TV Expenses')
ax.set_ylabel('Radio Expenses')
ax.set_zlabel('Sales')
ax.set_title('Multiple Regression: Sales predicted by TV and Radio Expenses')
ax.legend()

plt.show()
```

Output: a 3D plot of the training data with the fitted regression plane.