How to Calculate Adjusted R-Squared

Vikram Singh
Assistant Manager - Content
Updated on Nov 16, 2023 15:26 IST

Ever wondered how well your regression model truly fits your data, especially when multiple variables come into play? Adjusted R-squared is a metric that goes beyond traditional R-squared by accounting for the number of predictors in the model. But what makes it different from R-squared? This article covers its formula, its interpretation, and a worked Python example.


In the previous article, we discussed how to calculate the R-squared value for a machine learning model. In this article, we will discuss another evaluation metric, adjusted R-squared, and work through examples that show why we need it.
But before that, let’s have a quick introduction to R-squared.

What is R-Squared?

R-squared, also known as the coefficient of determination, describes the proportion of the variance in a dependent variable that is explained by the independent variable or variables in a linear regression model.

It is calculated by dividing the explained variation by the total variation, which is equivalent to 1 - (Unexplained Variation / Total Variation).

Mathematical Formula of R-Squared

R-Squared = 1- (SSR/SST)

where, 

SSR: Sum of Squared Residual (The sum of Squared Error)

SST: Total sum of squares (sum of squared deviation from the mean)
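
The formula above can be sketched directly in NumPy. This is a minimal illustration, not part of the original walkthrough: the function name r_squared and the four observations and predictions are hypothetical values chosen so the arithmetic is easy to follow (SSR = 1 and SST = 20, so R-squared is about 0.95).

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SSR/SST for observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ssr = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
    sst = np.sum((y_true - y_true.mean()) ** 2)  # squared deviations from the mean
    return 1 - ssr / sst

# Hypothetical observations and model predictions, for illustration only
y_obs = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 6.5, 9.5]
print(f"R-squared: {r_squared(y_obs, y_hat):.4f}")
```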

Note:

  • For a least-squares model with an intercept, evaluated on the data it was fitted to, the value of R-squared ranges between 0 and 1.
  • 0 means that the model doesn’t explain any variation in the dependent variable.
  • 1 means that the model explains all the variations.

Limitations of R-Squared

  • The value of R-squared increases (or at best stays the same) as independent variables are added, regardless of whether they are relevant or not. This can lead to overfitting.
  • It is not the best metric for comparing models, especially when the models have a different number of predictors.
  • A high value of r-squared doesn’t necessarily mean the model is adequate.
  • R-squared is highly sensitive to outliers. A few outliers can significantly change its value.
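
The first limitation can be checked with a small experiment. The sketch below is an illustration under stated assumptions: the data are synthetic, the helper fit_r_squared is a hypothetical name, and the fit uses NumPy's least-squares solver rather than the scikit-learn model used later in this article. The target depends only on x, yet R-squared does not fall as pure-noise predictors are appended.

```python
import numpy as np

def fit_r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R-squared."""
    A = np.column_stack([np.ones(len(y)), X])   # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ssr = np.sum(residuals ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - ssr / sst

rng = np.random.default_rng(42)
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 + 1.5 * x + rng.normal(0, 3, n)   # y depends on x only

X = x.reshape(-1, 1)
r2_values = [fit_r_squared(X, y)]
for _ in range(5):
    noise = rng.normal(size=(n, 1))        # predictor unrelated to y
    X = np.hstack([X, noise])
    r2_values.append(fit_r_squared(X, y))

for k, r2 in enumerate(r2_values, start=1):
    print(f"{k} predictor(s): R-squared = {r2:.4f}")
```

Each printed value should be at least as large as the one before it, even though only the first predictor carries any signal.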

Why Do We Need Adjusted R-Squared?

As we mentioned earlier, the value of R-squared increases (or at best stays the same) when new variables are added, whether or not the added variable is related to the target. To overcome this, the adjusted R-squared metric was introduced; it provides a more reliable measure of the model’s goodness of fit.

As the name suggests, adjusted R-squared adjusts for the number of predictors in the model, so its value rises only when a new predictor improves the fit by more than would be expected by chance.

It penalizes the model for the inclusion of irrelevant predictors. This makes it a more robust metric, especially when evaluating the model with various predictors.

Adjusted R-Squared Formula

Adjusted R-Squared = 1 - [(1 - R²) (n - 1) / (n - k - 1)]

where,

n: number of data points

k: number of independent variables

R²: R-squared value
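
As a quick worked example of the formula, here is a minimal sketch. The function name adjusted_r_squared and the input numbers (R-squared of 0.80, 100 data points, 2 independent variables) are illustrative assumptions, not values taken from the dataset used later in this article.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n data points and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("the formula needs n > k + 1")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Illustrative numbers: 1 - (0.20 * 99 / 97)
value = adjusted_r_squared(r2=0.80, n=100, k=2)
print(f"Adjusted R-squared: {value:.4f}")  # prints 0.7959
```

The adjusted value is slightly below the raw R-squared of 0.80 because the factor (n - 1) / (n - k - 1) is greater than 1 whenever k is at least 1.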

Interpretation of Adjusted R-Squared Formula

  • If R-squared doesn’t increase significantly with the addition of a new independent variable, then adjusted R-squared will decrease.
  • If R-squared increases significantly with the addition of a new independent variable, adjusted R-squared will also increase.

Note: It is recommended to use adjusted r-squared when multiple variables exist in the regression model. This would allow us to compare models with different numbers of independent variables.
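
The two interpretation rules above can be seen with plain arithmetic. The numbers in this sketch are hypothetical: a model with 2 predictors and R-squared of 0.800 on 50 data points gains a third predictor that lifts R-squared either slightly (to 0.801) or substantially (to 0.850). The last line uses another made-up case to show that the adjusted value can drop below zero when many predictors explain almost nothing.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n data points and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 50
baseline = adjusted_r_squared(0.800, n, k=2)     # model with two predictors
weak_gain = adjusted_r_squared(0.801, n, k=3)    # third predictor barely helps
strong_gain = adjusted_r_squared(0.850, n, k=3)  # third predictor helps a lot

print(f"Baseline (k=2):        {baseline:.4f}")
print(f"Weak third predictor:   {weak_gain:.4f}")
print(f"Strong third predictor: {strong_gain:.4f}")

# Five predictors that explain only 2% of the variance in 20 data points
negative_case = adjusted_r_squared(0.02, n=20, k=5)
print(f"Negative example:       {negative_case:.4f}")
```

In this sketch the weak predictor lowers the adjusted value below the baseline, while the strong predictor raises it.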

By now, you should have a clear understanding of what adjusted R-squared is, its formula, and why it is preferred over R-squared for evaluating the performance of a machine learning model.

How to Calculate the Adjusted R-Squared?

Problem Statement: Create a dataset, build two linear regression models (a simple linear regression model and a multiple regression model), and then calculate the value of R² and adjusted R² in both cases.

Solution

Step-1: Create a Sample dataset


 
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Create a synthetic dataset
np.random.seed(0)
n_samples = 100
StudyHours = np.random.uniform(1, 10, n_samples)
Extracurricular = np.random.randint(0, 5, n_samples)
FinalExamScores = 50 + 3 * StudyHours + 2 * Extracurricular + np.random.normal(0, 5, n_samples)
# Create a DataFrame from the data
data = pd.DataFrame({'StudyHours': StudyHours, 'Extracurricular': Extracurricular, 'FinalExamScores': FinalExamScores})
data.head()

Output

[Image: first rows of the DataFrame returned by data.head()]

Step-2: Split the data into predictors (X) and target (y)


 
# Split the data into predictors (X) and target (y)
X = data[['StudyHours', 'Extracurricular']]
y = data['FinalExamScores']

Step-3: Create a Linear Regression Model with one Predictor


 
# Create and fit a simple linear regression model with one predictor (StudyHours)
model_simple = LinearRegression()
model_simple.fit(X[['StudyHours']], y)
y_pred_simple = model_simple.predict(X[['StudyHours']])
# Calculate R-squared for the simple model
mse_simple = mean_squared_error(y, y_pred_simple)
r_squared_simple = 1 - (mse_simple / np.var(y))
# Calculate Adjusted R-squared for the simple model
n = len(y)
p_simple = 1 # Number of predictors in the simple model
adjusted_r_squared_simple = 1 - (1 - r_squared_simple) * (n - 1) / (n - p_simple - 1)
# Print R-squared and Adjusted R-squared values for the simple model
print("Simple Model:")
print(f"R-squared: {r_squared_simple:.4f}")
print(f"Adjusted R-squared: {adjusted_r_squared_simple:.4f}\n")

Output

[Image: printed R-squared and adjusted R-squared values for the simple model]

Step-4: Create a Linear Regression Model with two Predictors


 
# Create and fit a more complex linear regression model with two predictors (StudyHours and Extracurricular)
model_complex = LinearRegression()
model_complex.fit(X, y)
y_pred_complex = model_complex.predict(X)
# Calculate R-squared for the complex model
mse_complex = mean_squared_error(y, y_pred_complex)
r_squared_complex = 1 - (mse_complex / np.var(y))
# Calculate Adjusted R-squared for the complex model
p_complex = 2 # Number of predictors in the complex model
adjusted_r_squared_complex = 1 - (1 - r_squared_complex) * (n - 1) / (n - p_complex - 1)
print("Complex Model:")
print(f"R-squared: {r_squared_complex:.4f}")
print(f"Adjusted R-squared: {adjusted_r_squared_complex:.4f}")

Output

[Image: printed R-squared and adjusted R-squared values for the multiple regression model]

Explanation

From the above outputs, both R-squared and adjusted R-squared increase significantly with the addition of one more variable (“Extracurricular”). This implies that the added variable is related to the target variable and improves the model.
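
As an optional cross-check of the steps above, the manual R-squared computation (1 minus MSE divided by the variance of y) can be compared with scikit-learn's r2_score. The sketch below assumes scikit-learn is installed and recreates the same synthetic data with plain NumPy arrays instead of a DataFrame; the helper adjusted_r_squared and the results dictionary are names introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Recreate the synthetic dataset from Step-1 (same seed and formula)
np.random.seed(0)
n_samples = 100
StudyHours = np.random.uniform(1, 10, n_samples)
Extracurricular = np.random.randint(0, 5, n_samples)
y = 50 + 3 * StudyHours + 2 * Extracurricular + np.random.normal(0, 5, n_samples)

def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n data points and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

feature_sets = {
    "Simple": np.column_stack([StudyHours]),
    "Complex": np.column_stack([StudyHours, Extracurricular]),
}

results = {}
for name, X in feature_sets.items():
    model = LinearRegression().fit(X, y)
    y_pred = model.predict(X)
    r2_manual = 1 - np.mean((y - y_pred) ** 2) / np.var(y)  # MSE / population variance
    r2_library = r2_score(y, y_pred)
    assert np.isclose(r2_manual, r2_library)  # both routes give the same value
    k = X.shape[1]
    results[name] = (r2_library, adjusted_r_squared(r2_library, len(y), k))
    print(f"{name} Model: R-squared = {results[name][0]:.4f}, "
          f"Adjusted R-squared = {results[name][1]:.4f}")
```

The two routes agree because np.var uses the population variance by default, which matches the SST / n that pairs with the mean squared error.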

Difference Between R-Squared and Adjusted R-Squared

| Parameter | R-Squared | Adjusted R-Squared |
| --- | --- | --- |
| Definition | Proportion of variance in the dependent variable explained by the independent variable(s). | R-Squared adjusted for the number of predictors in the model. |
| Value Range | Between 0 and 1. | Can be negative, but typically between 0 and 1. |
| Response to Adding Predictors | Always increases or remains the same. | Can increase or decrease based on the usefulness of the added predictor. |
| Purpose | Measures overall goodness of fit. | Measures goodness of fit while accounting for model complexity. |
| Calculation | R-Squared = 1 - (SSR/SST) | Adjusted R-Squared = 1 - [(1 - R²) (n - 1) / (n - k - 1)] |
| Best for | Simple linear regression with one predictor. | Multiple regression models with several predictors. |
| Interpretation | Higher value indicates more variance explained by the model. | Higher value indicates a better fit, especially when comparing models with different numbers of predictors. |

 

About the Author
Vikram Singh
Assistant Manager - Content

Vikram has a Postgraduate degree in Applied Mathematics, with a keen interest in Data Science and Machine Learning. He has experience of 2+ years in content creation in Mathematics, Statistics, Data Science, and Mac...
