All About Train Test Split

  All About Train Test Split

7 mins read7.5K Views Comment
Updated on Aug 25, 2023 10:13 IST

Train test split technique is used to estimate the performance of machine learning algorithms which are used to make predictions on data not used to train the model. In this article you will learn the importance of Train test split technique and its implementation in python.

2022_12_MicrosoftTeams-image-7.jpg

One of the most important aspects of machine learning is training. The act of providing data to the algorithm training is vital in creating algorithms that perform tasks efficiently and effectively. The training can be time consuming and difficult to perform if using your own computer that’s where the train test split comes in handy.

A train test is the way of structuring your machine learning project so that you can test your hypothesis quickly and inexpensively. Basically it’s a way to divide the training data so that you can try your algorithm to one half and evaluate the result on the other half.

This article will tell you, how to use the train_test_split Sklearn function to split machine learning data into training and test sets.

Table of contents

Train Test Split

When it comes to data analysis, you can split your data into training and testing sets. 

A train test split is when you split your data into a training set and a testing set. The training set is used for training the model, and the testing set is used to test your model. This allows you to train your models on the training set, and then test their accuracy on the unseen testing set. There are a few different ways to do a train test split, but the most common is to simply split your data into two sets. For example 80% for training and 20% for testing. This ensures that both sets are representative of the entire dataset, and gives you a good way to measure the accuracy of your models.

2022_12_MicrosoftTeams-image-6.jpg

Syntax of Train Test Split

Before continuing, please note that in order to use this feature, you must first import it. 

 
from sklearn.model_selection import train_test_split
Copy code

After importing the function as above, call it as train_test_split() .

2022_12_MicrosoftTeams-image-4.jpg

Within the brackets, specify the name of the ‘X’ input data as the first argument. This data should contain characteristic data. Optionally, you can also specify the name of the ‘y’ record containing the label or target record.

Difference Between Precision and Recall
Difference Between Precision and Recall
Discover the key differences between Precision and Recall in our latest article. Dive into examples and Python programming to understand how these metrics, based on relevance, measure the percentage of...read more
K-fold Cross-validation
K-fold Cross-validation
Cross-validation is a resampling technique used to validate machine learning models against a limited sample of data. In this article we will talk about K-fold Cross-validation and its advantages and...read more
Ensemble learning: Beginners tutorial
Ensemble learning: Beginners tutorial
The ensemble methods in machine learning combine the insights obtained from multiple learning models to facilitate accurate and improved decisions. In this article we will focus on ensemble learning and...read more

Parameters of Train Test Split

  • X
  • y
  • test_size
  • random_state
  • shuffle
  • Stratify
  1. X (required)

The X argument is the input array containing the feature data (i.e. the variables/columns used to build the model).

This object must be two-dimensional. So for a one dimensional numpy array you may need to reshape the data. This is usually done with code .reshape(-1,1) .

1. Y

The y argument usually contains a vector of target values ​​(that is, data targets or labels).

2. TEST_SIZE

You can specify the size of the output test set using the test_size parameter. The arguments for this parameter can be integers or floating point numbers.

If the argument is an integer, the test set size will be that number.

If the argument is a float, it must be between 0 and 1 and the number represents the percentage of observations in the test set.

3. RANDOM_STATE

If you specify an integer as the argument for this parameter, train_test_split will shuffle the data in the same order before splitting. That means different information is sent to “X_train”, “X_test”, “y_train” and “y_test”.

4. SHUFFLE

By default this is set to shuffle=True. That is, by default the data is randomized before splitting, so observations are randomly assigned to the training and test data. Setting shuffle = False disables random sorting and splits the data in the order it already exists.

Note: If you set shuffle = False, you must set stratify = None. layer

5. STRATIFY

This stratification parameter performs a split so that the ratio of values ​​in the generated samples is the same as the ratio of values ​​provided to the parameter stratification.

By default this is set to Stratify = None. 

When to Use Train Test Split

This method is not suitable when the amount of data available is small. This is because there is not enough data in the training dataset for the model to learn an effective mapping from inputs to outputs after the dataset is split into training and testing datasets. Also, the test set does not contain enough data to effectively evaluate the model’s performance. The estimated performance may be too optimistic (good) or too pessimistic (bad).

  1. In the absence of sufficient data, a suitable alternative model evaluation procedure is the k-fold cross-validation procedure. 
  2. Projects can contain efficient models and large datasets, but sometimes you need a quick estimate of your model’s performance. The train-test-split method is also used in this situation.
  3. If the dataset is imbalanced then also train test split is not a good option.In that case first you need to balance the data first.

Also read: Difference between Regression and Classification Algorithms

How to Evaluate Train Test Split

There are a few things to consider when evaluating a train test split. The first is the size of the training set. The larger the training set, the more accurate the model will be. However, if the training set is too large, it can take longer to train the model and may overfit the data. The second thing for consideration is the size of the test set. The larger the test set, the more reliable the results will be. However, if the test set is too large, it can take longer to run and may not be representative of all data. Finally, consider the balance of each class in both sets. If one class is much larger than another, it can skew the results.

Also read: Evaluating a machine learning algorithm

Evaluation Metric for Train Test Split

1. Mean Square Error

The evaluation metric for this is mean squared error (or MSE for short). This is a metric that takes the actual value and produces a squared value. This squared value is the mean squared error between the model’s prediction and the actual value.

2. R Squared Error

The best metric for the train test split is the one that balances generalization with accuracy on the testing set. This metric is called R-sq loss. R-sq loss takes both the model’s accuracy and its generalization into account, and produces a single number for judging a model’s performance.

3. Accuracy

Accuracy refers to how close a predicted result is to an actual value. That is, how close the actual measured value is to the standard value. The closer the measurement, the higher the accuracy.

Also read:Difference between Accuracy and Precision

Explore: How to Calculate R squared in Linear Regression

Explore: R-squared vs. adjusted R-squared

Train Test Split Python

We have to predict the species of the fish.

About Data Set

This dataset contains seven species of fish data for market sale. We will perform it using python. The dataset is freely available on Kaggle.

1. Speciesspecies name of fish

2. Weight- the weight of fish in Gram

3. Length1- vertical length in cm

4. Length2- diagonal length in cm

5. Length3- cross length in cm

6. Height- height in cm

7. Width- diagonal width in cm

Importing Libraries and Reading Dataset

 
import pandas as pd
import numpy as np
df=pd.read_csv('Fish.csv')
df
Copy code
2022_12_image-8.jpg

Dependent and Independent Variables

 
#Get Target data
y = dataset[‘Species’]
#Load X Variables into a Pandas Dataframe with columns
X = dataset.drop(['Species'], axis = 1)
Copy code

Splitting Dataset Into Training and Testing Set

 
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Copy code

Next, split both x and y into training and testing sets with the help of the train_test_split() function. In this training data set is 0.8, which means 80%. Random _state =0 which means there is no randomness in the data.

Now after this we can implement any algorithm.Like in this we are implementing Logistic regression.

Importing and Fitting Logistic Regression to the Training Set

 
# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LogisticRegression
regressor = LogisticRegression()
model=regressor.fit(X_train,y_train)
Copy code

Also read: Linear Regression vs Logistic Regression

Predicting the Test Set Results

 
# Predicting the test set results
y_pred = model.predict(X_test)
print(y_pred)
Copy code

Checking the Accuracy

 
from sklearn.metrics import accuracy_score,confusion_matrix
score= accuracy_score(y_test,y_pred)
score
Copy code

Output 0.83333

Conclusion

As you can see, the train test split is an important technique in machine learning. It allows you to train a model on the training set and then test its accuracy on the testing set. This allows you to get a general sense of how well your model is performing, and also tells you whether or not your model is performing as expected. Now that you know how to do a train test split in Python, you can apply this technique to any machine learning problem you might encounter.

About the Author

This is a collection of insightful articles from domain experts in the fields of Cloud Computing, DevOps, AWS, Data Science, Machine Learning, AI, and Natural Language Processing. The range of topics caters to upski... Read Full Bio