Least Square Regression in Machine Learning

Updated on Oct 31, 2023 18:40 IST

This article introduces linear regression and explains how the least squares method is used to fit a regression line in machine learning. Let’s begin!


Least Square Regression is a statistical method commonly used in machine learning for analyzing and modelling data. It finds the line of best fit by minimizing the sum of the squared residuals, where a residual is the difference between an actual value of the dependent variable and the value predicted from the independent variable(s).

We can use Least Square Regression for simple linear regression, where there is only one independent variable, and for multiple linear regression, where there are several. The method is widely used in fields such as economics, engineering, and finance to model and predict relationships between variables. Before learning least square regression, let’s understand linear regression.

Linear Regression

Linear regression is one of the basic statistical techniques in regression analysis. It is used to investigate and model the relationship between a dependent variable and one or more independent variables.

Before being adopted into machine learning and data science, linear models were used as basic statistical tools for prediction analysis and data mining. If the model involves only one regressor variable (independent variable), it is called simple linear regression; if it has more than one regressor variable, it is called multiple linear regression.
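To make the distinction concrete, here is a minimal sketch of a multiple linear regression fit using NumPy's `np.linalg.lstsq`. The data is hypothetical: it is generated from y = 1 + 2*x1 + 3*x2 with no noise, purely so that the recovered coefficients are easy to recognise. The variable names are illustrative assumptions, not part of the article's dataset.

```python
import numpy as np

# Hypothetical data: two regressors (x1, x2) and one response y.
# y is generated from y = 1 + 2*x1 + 3*x2 with no noise.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept.
design = np.column_stack([np.ones(len(X)), X])

# lstsq returns the coefficients that minimize the sum of squared residuals.
coeffs, residuals, rank, singular_values = np.linalg.lstsq(design, y, rcond=None)
intercept, slope_x1, slope_x2 = coeffs
print(intercept, slope_x1, slope_x2)
```

With a single regressor column, the same call reduces to simple linear regression.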

Equation of Straight Line

Let’s consider a simple example of an engineer who wants to analyze the product delivery and service operations for vending machines. The engineer wants to determine the relationship between the time a deliveryman needs to load a machine and the volume of products delivered. To do so, the engineer collects the delivery time (in minutes) and the product volume (in number of cases) for 25 randomly selected retail outlets with vending machines. Plotting these observations on a graph gives the scatter diagram below.

[Figure: scatter diagram of delivery time against product volume delivered]

Now, if we take Y as the delivery time (dependent variable) and X as the product volume delivered (independent variable), we can represent the linear relationship between these two variables as

$$Y = mX + c$$

Okay! Now that looks familiar. It is the equation of a straight line, where m is the slope and c is the y-intercept. Our objective is to estimate these unknown parameters in the regression model such that they give minimal error for the given dataset. This process is commonly referred to as parameter estimation or model fitting. In machine learning, the most common method of estimation is the Least Squares method.

What is the Least Square Regression Method?

Least squares is a commonly used method in regression analysis for estimating the unknown parameters by creating a model which will minimize the sum of squared errors between the observed data and the predicted data.

Basically, it is one of the most widely used methods of curve fitting, and it works by making the sum of squared errors as small as possible. It helps you draw a line of best fit through your data points.
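As a rough illustration of what "sum of squared errors" means, the sketch below scores two candidate lines against a small set of hypothetical points. The data and the helper name `sum_squared_errors` are assumptions made for this example only. The second candidate happens to be the least squares line for this data, so its score is the smaller of the two.

```python
import numpy as np

def sum_squared_errors(x, y, m, c):
    """Sum of squared differences between observed y and the line m*x + c."""
    predicted = m * x + c
    return float(np.sum((y - predicted) ** 2))

# Hypothetical data points, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

sse_guess = sum_squared_errors(x, y, m=1.0, c=1.0)   # an arbitrary candidate line
sse_better = sum_squared_errors(x, y, m=0.6, c=2.2)  # the least squares line for this data
print(sse_guess, sse_better)  # approximately 4.0 and 2.4
```

Any other choice of m and c for these points should give a sum of squared errors no smaller than the second value.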

Finding the Line of Best Fit Using Least Square Regression

Given any collection of pairs of numbers and the corresponding scatter graph, the line of best fit is the straight line through the scatter points that best represents the relationship between them. So, back to our equation of the straight line, we have:

$$Y = mX + c$$

Where,

Y: Dependent Variable

m: Slope

X: Independent Variable

c: y-intercept

Our aim here is to calculate the values of the slope and the y-intercept, and then substitute them into the equation along with the values of the independent variable X to determine the values of the dependent variable Y. Let’s assume that we have ‘n’ data points (x_i, y_i), with means \bar{x} and \bar{y}. We can then calculate the slope using the formula below:

$$m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Then, the y-intercept is calculated using the formula:

$$c = \bar{y} - m\bar{x}$$

Lastly, we substitute these values in the final equation Y = mX + c. Simple enough, right? Now let’s take a real-life example and implement these formulas to find the line of best fit.
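As a quick check before the worked example, the slope and intercept formulas can be sketched in plain Python. The function name `fit_line` and the five data points are hypothetical and are used only to show the arithmetic.

```python
def fit_line(xs, ys):
    """Return (m, c) for the least squares line through the points (xs, ys)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations of x from its mean.
    den = sum((x - x_mean) ** 2 for x in xs)
    m = num / den
    c = y_mean - m * x_mean
    return m, c

# Hypothetical data points, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, c = fit_line(xs, ys)
print(round(m, 2), round(c, 2))  # 0.6 2.2
```

Here the means are 3 and 4, the numerator is 6 and the denominator is 10, so m = 0.6 and c = 4 - 0.6 * 3 = 2.2.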

Least Squares Regression Example

Let us take a simple dataset to demonstrate the least squares regression method.

[Figure: sample dataset of X and Y values]

Step 1: The first step is to calculate the slope ‘m’ using the formula

$$m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

After substituting the respective values in the formula, we get m ≈ 4.70.

Step 2: Next, calculate the y-intercept ‘c’ using the formula

$$c = \bar{y} - m\bar{x}$$

This gives c ≈ 6.67.

Step 3: Now we have all the information needed for the equation, and by substituting the respective values in Y = mX + c, we get the following table. Using this information, you can now plot the graph.

[Figure: table of predicted Y values and the plotted line of best fit]

This way, the least squares regression method provides the closest relationship between the dependent and independent variables by minimizing the residuals (or errors), that is, the vertical distances between the data points and the trend line (or line of best fit). More precisely, the sum of the squares of the residuals is minimal under this approach.
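For readers who want to see why the slope and intercept formulas give the minimal sum of squares, here is a short derivation sketch. It uses the same symbols as above (m, c, and the means \bar{x} and \bar{y}); the name S for the sum of squared residuals is introduced here for convenience.

```latex
S(m, c) = \sum_{i=1}^{n} (y_i - m x_i - c)^2

% Setting the partial derivative with respect to c to zero:
\frac{\partial S}{\partial c} = -2 \sum_{i=1}^{n} (y_i - m x_i - c) = 0
\quad\Longrightarrow\quad c = \bar{y} - m\bar{x}

% Setting the partial derivative with respect to m to zero and substituting c:
\frac{\partial S}{\partial m} = -2 \sum_{i=1}^{n} x_i \,(y_i - m x_i - c) = 0
\quad\Longrightarrow\quad
\sum_{i=1}^{n} x_i \bigl[(y_i - \bar{y}) - m (x_i - \bar{x})\bigr] = 0

% Because \sum (y_i - \bar{y}) = 0 and \sum (x_i - \bar{x}) = 0, x_i can be replaced by (x_i - \bar{x}):
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

Since S is a sum of squares and is convex in m and c, this stationary point is a minimum.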

Now let us see how the least squares method is implemented using Python.

Least Squares Regression in Python

Scenario

A rocket motor is manufactured by combining an igniter propellant and a sustainer propellant inside a strong metal housing. It was noticed that the shear strength of the bond between the two propellants is strongly dependent on the age of the sustainer propellant.

Problem Statement

Implement a simple linear regression algorithm using Python to build a machine-learning model that studies the relationship between the shear strength of the bond between the two propellants and the age of the sustainer propellant.

Let’s begin!

Steps

Step 1: Import the required Python libraries.


 
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2: Next step is to read and load the dataset that we are working on.


 
# Loading dataset
data = pd.read_csv('PropallantAge.csv')
print(data.head())  # preview the first five rows
data.info()         # column types and non-null counts

This gives you a preview of your data and other related information that’s good to know. Our aim now is to find the relationship between the age of the sustainer propellant and the shear strength of the bond between the two propellants.

Step 3 (optional): You can create a scatter plot just to check the relationship between these two variables.


 
# Plotting the data
plt.scatter(data['Age of Propellant'], data['Shear Strength'])
plt.show()

Step 4: The next step is to assign X and Y as independent and dependent variables, respectively.


 
# Computing X and Y
X = data['Age of Propellant'].values
Y = data['Shear Strength'].values

Step 5: As we calculated manually earlier, we need to compute the mean of variables X and Y to determine the values of slope (m) and y-intercept. Also, let n be the total number of data points.


 
# Mean of variables X and Y
mean_x = np.mean(X)
mean_y = np.mean(Y)
# Total number of data values
n = len(X)

Step 6: In the next step, we will calculate the slope and the y-intercept using the formulas we discussed above.


 
# Calculating 'm' and 'c'
num = 0    # running sum of (x - mean_x) * (y - mean_y)
denom = 0  # running sum of (x - mean_x) squared
for i in range(n):
    num += (X[i] - mean_x) * (Y[i] - mean_y)
    denom += (X[i] - mean_x) ** 2
m = num / denom
c = mean_y - (m * mean_x)
# Printing coefficients
print("Coefficients")
print(m, c)
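As an aside, the same calculation can be written without an explicit loop and cross-checked against NumPy's `np.polyfit`. The sketch below is one possible way to do this; because the CSV file is not reproduced here, the `X` and `Y` arrays are hypothetical stand-ins and will not give the article's coefficients.

```python
import numpy as np

# Hypothetical stand-in arrays; in the article X and Y come from the CSV file.
X = np.array([2.0, 5.5, 8.0, 11.0, 15.5, 19.0, 23.5])
Y = np.array([2550.0, 2420.0, 2340.0, 2210.0, 2050.0, 1930.0, 1760.0])

# Vectorized form of the slope and intercept formulas.
mean_x, mean_y = X.mean(), Y.mean()
m = np.sum((X - mean_x) * (Y - mean_y)) / np.sum((X - mean_x) ** 2)
c = mean_y - m * mean_x

# np.polyfit with degree 1 returns the coefficients as [slope, intercept].
m_check, c_check = np.polyfit(X, Y, 1)

print(m, c)
print(m_check, c_check)
```

The two pairs of values should agree up to floating-point rounding, which is a handy sanity check on a hand-written implementation.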

The above step has given us the values of m and c. Substituting them, we get,

Shear Strength = 2627.822359001296 + (-37.15359094490524) * Age of Propellant

Step 7: The above equation represents our linear regression model. Now, let’s plot this graphically. 


 
# Plotting Values and Regression Line
maxx_x = np.max(X) + 10
minn_x = np.min(X) - 10
# line values for x and y
x = np.linspace(minn_x, maxx_x, 1000)
y = c + m * x
# Plotting Regression Line
plt.plot(x, y, color='#58b970', label='Regression Line')
# Plotting Scatter Points
plt.scatter(X, Y, c='#ef5423', label='Scatter Plot')
plt.xlabel('Age of Propellant (in years)')
plt.ylabel('Shear Strength')
plt.legend()
plt.show()

Output:

[Figure: scatter plot of the data points with the fitted regression line]

Well! That’s it! We successfully found the line of best fit and fitted it to the data points using the least square regression method in machine learning. The fitted line, with its negative slope, suggests a strong linear relationship in which the shear strength decreases as the propellant age increases.
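Once the coefficients are known, the fitted equation can be used to estimate shear strength for a given propellant age. The sketch below simply hard-codes the coefficients reported in Step 6; the function name `predict_shear_strength` is an illustrative assumption, and the age must be in the same units as the dataset.

```python
# Coefficients copied from the fitted equation reported in Step 6.
INTERCEPT = 2627.822359001296
SLOPE = -37.15359094490524

def predict_shear_strength(age):
    """Estimate shear strength from propellant age using the fitted line."""
    return INTERCEPT + SLOPE * age

# Estimate for an age of 10, in the dataset's own units.
print(predict_shear_strength(10))  # roughly 2256.29
```

Estimates for ages far outside the range of the observed data should be treated with caution, since the line was fitted only to the ages present in the dataset.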


Conclusion

So, you found the line of best fit? Now what? After obtaining the least-squares fit, a number of intriguing questions pop up, such as:

  1. How well does the equation that you found fit the data?
  2. Will this linear model be useful as a predictor in practice?
  3. Are any of the model assumptions, such as uncorrelated errors and constant variance, violated? If so, how serious is the effect?
  4. And many more.

We should consider all of these issues before adopting the model.
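One common way to approach the first question in the list above is the coefficient of determination, R², which compares the residual sum of squares with the total variation in Y. The article does not cover this metric, so the sketch below is only a suggested starting point; the data points and the helper name `r_squared` are hypothetical.

```python
import numpy as np

def r_squared(x, y, m, c):
    """Fraction of the variance in y explained by the line m*x + c."""
    predicted = m * x + c
    ss_res = np.sum((y - predicted) ** 2)   # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Hypothetical data and its least squares line (m = 0.6, c = 2.2).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
score = r_squared(x, y, m=0.6, c=2.2)
print(score)  # about 0.6
```

A value close to 1 means the line accounts for most of the variation in Y, while a value near 0 means it explains very little. R² alone does not reveal violated assumptions, so residual plots are usually examined as well.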

So, we hope this article on least square regression in machine learning helped you to understand the concepts.

Contributed By: Varun Dwarki
