# How to choose the Value of k in K-fold Cross-Validation

Updated on Jun 14, 2022 11:16 IST

You may have wondered how to pick the value of k in K-fold cross-validation. Let’s unravel the mystery.

In the previous article, we talked about cross-validation and its different techniques. In this article, we will see how to set the value of k in K-fold cross-validation by working on a cancer dataset and comparing the accuracy scores obtained for different values of k.

Cross-validation (CV) is a technique for evaluating a machine learning model and testing its performance. It is commonly used in applied ML tasks, and it helps in comparing and selecting an appropriate model. CV tends to give a less biased performance estimate than other methods such as a single train/test split. To know more about cross-validation and its different techniques, explore: Cross-validation techniques.

## Here’s how to set the value of k in K-fold cross-validation

Choose the value of k such that the performance estimate suffers from neither high bias nor high variance. In most cases k is 5 or 10, but there is no formal rule. The right value also depends on the size of the dataset, and larger values of k increase the runtime and computational cost of cross-validation, because the model is trained once per fold. Let’s understand this with Python code by evaluating a LogisticRegression classifier over a range of k values.
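To make the trade-off between k, fold size, and cost more concrete, here is a minimal sketch of how the fold sizes change with k for a dataset of 569 rows (the size of the cancer dataset used in this article). The helper `kfold_sizes` is a hypothetical name introduced for illustration; it follows the fold-size rule described in the scikit-learn KFold documentation, where the first `n_samples % k` folds receive one extra sample.

```python
def kfold_sizes(n_samples, k):
    """Return the size of each test fold when n_samples rows are cut into k folds."""
    base, extra = divmod(n_samples, k)
    # The first `extra` folds get one additional sample each.
    return [base + 1 if i < extra else base for i in range(k)]


n_samples = 569  # number of rows in the cancer dataset used in this article

for k in (2, 5, 10, 30):
    sizes = kfold_sizes(n_samples, k)
    largest_test = max(sizes)
    smallest_train = n_samples - largest_test
    # k folds means the model is trained k times.
    print(f"k={k:>2}: {k} model fits, test fold up to {largest_test} rows, "
          f"training set at least {smallest_train} rows")
```

As k grows, each model is trained on more of the data, but each test fold shrinks and the number of model fits increases.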

You can read these blogs for more understanding:

• Cross-validation techniques
• Bias and Variance with Real-Life Examples: this blog revolves around bias and variance and their tradeoff. The concepts are explained with respect to overfitting and underfitting, with proper examples.

Suppose we want to classify a tumor as benign or malignant.

### 1. Import Libraries

```python
from numpy import mean
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
```

All the libraries are imported. We are going to use LogisticRegression in this example.

### 2. Read the dataset

```python
df = pd.read_csv('/content/cancer_dataset.csv')
```

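### 3. Independent and dependent features

```python
# Independent and dependent features
X = df.iloc[:, 2:]     # feature columns (third column onward)
y = df.iloc[:, 1]      # target column: diagnosis (M or B)
X = X.dropna(axis=1)   # drop columns that contain missing values
```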

y

```
0      M
1      M
2      M
3      M
4      M
..
564    M
565    M
566    M
567    M
568    B
Name: diagnosis, Length: 569, dtype: object
```

X holds the independent features and y holds the dependent feature (the diagnosis).

### 4. Splitting the dataset into train and test

```python
# Splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=4)
```

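The dataset is split into training and test data using train_test_split.

test_size=0.30 means that 70% of the data is used for training and 30% for testing. Note that evaluate_model() in step 6 runs cross-validation on the full X and y rather than on X_train.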
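As a quick check of what the 70/30 split means in rows, the sketch below works out the sizes for the 569 samples in this dataset. It assumes, as I understand scikit-learn's behaviour for a fractional test_size, that the test share is rounded up and the remainder goes to training; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Approximate the train/test row counts for a fractional test_size."""
    # Assumption: the test share is rounded up, the rest is used for training.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(569, 0.30)
print("train rows:", n_train, "test rows:", n_test)
```

You can compare these numbers with len(X_train) and len(X_test) after running the split above.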

### 5. Define folds to test the values of k in the given range

```python
# define folds to test the values of k in the given range
folds = range(2, 31)
```

We want to check the accuracy for every k from 2 up to 30, so the range is defined as range(2, 31) (the upper bound is exclusive).

### 6. Evaluating the model using a given test condition

```python
# evaluate the model using a given test condition
def evaluate_model(cv):
    # get the dataset: independent and dependent features
    X = df.iloc[:, 2:]
    y = df.iloc[:, 1]
    X = X.dropna(axis=1)
    # get the model
    model = LogisticRegression()
    # evaluate the model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # return the mean, minimum, and maximum fold accuracy
    return mean(scores), scores.min(), scores.max()
```

Next, evaluate_model(cv) evaluates the model on the dataset. It separates the data into independent and dependent features and creates a LogisticRegression() model.

cross_val_score() calculates the accuracy for each fold. evaluate_model() then returns the mean classification accuracy as well as the min and max accuracy scores from the folds.

• n_jobs=-1 sets the number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the cross-validation splits. None means 1 job, and -1 means using all available processors.
• cv determines the cross-validation splitting strategy. Here we pass a KFold object.
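To illustrate how an n_jobs value maps to a number of workers, here is a rough sketch of the convention described in the scikit-learn glossary: None means one job, -1 means all processors, and other negative values leave some processors free. The helper `effective_jobs` is hypothetical and only mimics that rule; the real behaviour is decided by joblib and its backend.

```python
import os


def effective_jobs(n_jobs, n_cpus=None):
    """Approximate how many parallel workers a given n_jobs value requests."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs is None:
        return 1  # None means a single job
    if n_jobs == 0:
        raise ValueError("n_jobs=0 has no meaning")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all CPUs but one, and so on
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs


for value in (None, 1, -1, -2):
    print(f"n_jobs={value}: about {effective_jobs(value)} worker(s) on this machine")
```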

Note: Your results may vary according to the evaluation procedure and stochastic nature of the algorithm or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
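One way to follow the advice in this note is to repeat the evaluation with several shuffling seeds and average the results. The sketch below does that on synthetic data generated with make_classification, which is a stand-in and not the cancer dataset from this article. The function name `mean_accuracy_over_seeds`, the seeds, and max_iter=1000 are assumptions made for illustration.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the article's cancer dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)


def mean_accuracy_over_seeds(X, y, k, seeds):
    """Run k-fold CV once per seed and return the mean accuracy of each run."""
    per_seed = []
    for seed in seeds:
        cv = KFold(n_splits=k, shuffle=True, random_state=seed)
        scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                                 scoring='accuracy', cv=cv)
        per_seed.append(mean(scores))
    return per_seed


per_seed = mean_accuracy_over_seeds(X_demo, y_demo, k=5, seeds=range(5))
print('per-seed mean accuracy:', [round(float(s), 3) for s in per_seed])
print('average over seeds: %.3f' % mean(per_seed))
```

scikit-learn also provides RepeatedKFold, which packages a similar repeat-and-reshuffle idea into a single splitter.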

### 7. Evaluating each k value

```python
# evaluate each k value
for k in folds:
    # define the test condition
    cv = KFold(n_splits=k, shuffle=True, random_state=10)
    # record mean and min/max of each set of results
    k_mean, k_min, k_max = evaluate_model(cv)
    # report performance
    print('-> folds=%d, accuracy=%.3f (%.3f,%.3f)' % (k, k_mean, k_min, k_max))
```

Here we apply KFold cross-validation and report the mean, min, and max accuracy for each k value that was evaluated. random_state is used as the seed for the random number generator, so the data is shuffled in the same order on every run.
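The effect of a fixed seed can be seen in a small standalone sketch. It uses Python's built-in random module as a stand-in, so the actual orderings will differ from what KFold produces with NumPy's generator; only the reproducibility idea carries over. `shuffled_indices` is a hypothetical helper.

```python
import random


def shuffled_indices(n_samples, seed):
    """Return the row indices 0..n_samples-1 shuffled with a fixed seed."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices


first = shuffled_indices(10, seed=10)
second = shuffled_indices(10, seed=10)
other = shuffled_indices(10, seed=11)

print("seed 10:      ", first)
print("seed 10 again:", second)
print("seed 11:      ", other)
print("same seed gives the same order:", first == second)
```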

Output:

```
-> folds=2, accuracy=0.935 (0.905,0.965)
-> folds=3, accuracy=0.944 (0.932,0.963)
-> folds=4, accuracy=0.935 (0.887,0.965)
-> folds=5, accuracy=0.939 (0.895,0.974)
-> folds=6, accuracy=0.942 (0.895,0.968)
-> folds=7, accuracy=0.947 (0.915,0.975)
-> folds=8, accuracy=0.937 (0.859,0.986)
-> folds=9, accuracy=0.951 (0.906,0.984)
-> folds=10, accuracy=0.946 (0.860,1.000)
-> folds=11, accuracy=0.939 (0.846,1.000)
-> folds=12, accuracy=0.954 (0.872,1.000)
-> folds=13, accuracy=0.944 (0.864,1.000)
-> folds=14, accuracy=0.942 (0.854,1.000)
-> folds=15, accuracy=0.947 (0.868,1.000)
-> folds=16, accuracy=0.941 (0.861,1.000)
-> folds=17, accuracy=0.941 (0.853,1.000)
-> folds=18, accuracy=0.946 (0.875,1.000)
-> folds=19, accuracy=0.947 (0.833,1.000)
-> folds=20, accuracy=0.940 (0.828,1.000)
-> folds=21, accuracy=0.947 (0.815,1.000)
-> folds=22, accuracy=0.942 (0.846,1.000)
-> folds=23, accuracy=0.949 (0.840,1.000)
-> folds=24, accuracy=0.946 (0.833,1.000)
-> folds=25, accuracy=0.941 (0.783,1.000)
-> folds=26, accuracy=0.940 (0.818,1.000)
-> folds=27, accuracy=0.947 (0.857,1.000)
-> folds=28, accuracy=0.944 (0.750,1.000)
-> folds=29, accuracy=0.942 (0.750,1.000)
-> folds=30, accuracy=0.942 (0.789,1.000)
```

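Here we got different accuracies for different values of k. Now which one to choose? The highest mean accuracy is 0.954 (95.4%), which we got at k=12, so we will choose k=12 in this case. Keep in mind that the mean accuracies differ only slightly (from 0.935 to 0.954), while the minimum fold accuracy tends to drop as k grows, because each test fold gets smaller.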
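The selection step can also be done in code instead of by eye. The sketch below copies the mean accuracies from the output above into a dictionary and picks the k with the highest value; the variable names are illustrative only.

```python
# Mean accuracy per k, copied from the output reported above
results = {
    2: 0.935, 3: 0.944, 4: 0.935, 5: 0.939, 6: 0.942, 7: 0.947, 8: 0.937,
    9: 0.951, 10: 0.946, 11: 0.939, 12: 0.954, 13: 0.944, 14: 0.942,
    15: 0.947, 16: 0.941, 17: 0.941, 18: 0.946, 19: 0.947, 20: 0.940,
    21: 0.947, 22: 0.942, 23: 0.949, 24: 0.946, 25: 0.941, 26: 0.940,
    27: 0.947, 28: 0.944, 29: 0.942, 30: 0.942,
}

# k with the highest mean accuracy
best_k = max(results, key=results.get)
# difference between the best and the worst mean accuracy
spread = max(results.values()) - min(results.values())

print('best k=%d with accuracy=%.3f' % (best_k, results[best_k]))
print('spread between best and worst mean accuracy: %.3f' % spread)
```

A small spread like this suggests that the choice of k matters less here than it might on a smaller or noisier dataset.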

Note: Accuracy score and k value will vary with different classifiers and different cross-validation techniques.

## Endnotes

I hope this blog answered your questions about the value of k in KFold cross-validation.
