# ROC Curve in Machine Learning

Vikram Singh
Assistant Manager - Content
Updated on Oct 6, 2023 10:45 IST

In machine learning, the ROC curve is a graphical representation of the True Positive Rate against the False Positive Rate. It is a powerful tool for comparing classification models, choosing a threshold for a given model, and assessing a model's overall quality.

## What is the ROC Curve in Machine Learning?

An ROC curve (receiver operating characteristic curve) is a graph that shows the performance of a classification model at all classification thresholds. It is a popular and important tool for evaluating classification models, especially those used in medical diagnosis and fraud detection.

• The ROC curve is plotted with the True Positive Rate (TPR) on the y-axis and the False Positive Rate (FPR) on the x-axis.
• The TPR, also known as recall or sensitivity, is the proportion of positive cases that the model correctly identifies.
• TPR = TP / (TP + FN), where TP: True Positive and FN: False Negative
• The FPR is the proportion of negative cases that the model incorrectly identifies as positive.
• FPR = FP / (FP + TN), where FP: False Positive and TN: True Negative
• [Figure: a typical ROC curve]
• In simple terms, the ROC curve shows the tradeoff between the True Positive Rate and the False Positive Rate at different threshold settings.
• The ROC curve for a perfect classifier rises straight up from the bottom-left corner (0, 0) to the top-left corner (0, 1) and then runs along the top to the top-right corner (1, 1).
• This means that, at some threshold, the classifier correctly identifies all positive cases (TPR = 1) without flagging any negative case as positive (FPR = 0).
• The ROC curve for a random classifier is the diagonal line from the bottom-left corner (0, 0) to the top-right corner (1, 1), where TPR equals FPR.
• This means that the classifier does no better than guessing which cases are positive and which are negative.
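To make the TPR and FPR formulas concrete, here is a minimal sketch that computes both rates by hand at a few thresholds. The helper name `tpr_fpr` and the toy labels and scores are made up for illustration; each printed (FPR, TPR) pair is one point on an ROC curve.

```python
import numpy as np

def tpr_fpr(y_true, scores, threshold):
    """Return (TPR, FPR) when every score >= threshold is predicted positive."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(scores) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical toy data: true labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Lowering the threshold moves the point from (0, 0) towards (1, 1)
for t in [0.9, 0.5, 0.3, 0.0]:
    tpr, fpr = tpr_fpr(y_true, scores, t)
    print(f"threshold={t:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Sweeping the threshold over every distinct score and joining the resulting points is, in essence, how an ROC curve is built.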

Note: The full form of the ROC curve is the Receiver Operating Characteristic curve.

Now, let’s take a dataset to see how to implement and interpret the ROC curve.

#### To generate a random dataset, we will use sklearn.datasets.make_classification.

make_classification has many parameters, but here, we will use the following:

• n_samples: number of observations you want to generate.
• n_features: total number of numerical features, i.e., the number of columns.
• n_informative: number of useful features. The remaining n_features - n_informative columns are redundant, repeated, or random-noise features, as controlled by the n_redundant and n_repeated parameters.
• n_classes: number of unique classes for the target variable.
• random_state: it fixes the seed of the random number generator, so the same dataset is generated each time the code is executed.

#### Step-1: Generate the Dataset

```python
# Generate a random dataset using sklearn.datasets.make_classification
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=3000, n_features=7, n_informative=5,
                           n_classes=2, random_state=28)
```

#### Step-2: Create the DataFrame and Name the Columns

```python
# Create the DataFrame and name the columns
import pandas as pd

# Create DataFrame with features as columns
df = pd.DataFrame(X)

# Give custom names to the features
df.columns = ['var1', 'var2', 'var3', 'var4', 'var5', 'var6', 'var7']

# Now add the label as a column
df['y'] = y

df.info()
```

Output

```python
# Show the first 5 observations from the dataset
df.head()
```

Output

```python
# Check the count of each unique value in the target variable
df['y'].value_counts()
```

Output

#### Step-3: Split the Data, Fit the Classifiers, and Predict Probabilities

```python
# Split the data, fit the classifiers, and predict probabilities
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Split the dataset into a 70% training set and a 30% testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=28)

# Instantiate logistic regression and KNeighborsClassifier
lr = LogisticRegression(max_iter=500)
knn = KNeighborsClassifier(n_neighbors=4)

# Fit the models
lr.fit(X_train, y_train)
knn.fit(X_train, y_train)

# Predict probabilities
pred1 = lr.predict_proba(X_test)
pred2 = knn.predict_proba(X_test)
```

#### Step-4: Compute ROC

```python
# Compute the ROC
from sklearn.metrics import roc_curve

# Column 1 of predict_proba holds the probability of the positive class
fpr1, tpr1, thresh1 = roc_curve(y_test, pred1[:, 1], pos_label=1)
fpr2, tpr2, thresh2 = roc_curve(y_test, pred2[:, 1], pos_label=1)

# Baseline where TPR = FPR: a constant score for every sample
random_probs = [0 for i in range(len(y_test))]
p_fpr, p_tpr, _ = roc_curve(y_test, random_probs, pos_label=1)
```
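The introduction notes that the ROC curve can help choose a threshold for a given model. One common heuristic, which is not part of the original walkthrough, is Youden's J statistic: pick the threshold that maximises TPR - FPR. The sketch below rebuilds the same dataset and logistic regression model so that it runs on its own; the names `scores`, `j`, and `best` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Same data and split as in the steps above
X, y = make_classification(n_samples=3000, n_features=7, n_informative=5,
                           n_classes=2, random_state=28)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=28)

lr = LogisticRegression(max_iter=500).fit(X_train, y_train)
scores = lr.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores, pos_label=1)

# Youden's J = TPR - FPR; the largest value is the point farthest above the diagonal
j = tpr - fpr
best = int(np.argmax(j))
print(f"threshold={thresholds[best]:.3f}  TPR={tpr[best]:.3f}  FPR={fpr[best]:.3f}")
```

This treats false positives and false negatives as equally costly. If one kind of error matters more in your application, a different point on the curve may be a better choice.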

#### Step-5: Draw the ROC Curve

```python
# Draw the ROC curve
import matplotlib.pyplot as plt
# %matplotlib inline  (uncomment when running in a Jupyter notebook)

plt.plot(fpr1, tpr1, linestyle='--', color='orange', label='Logistic Regression')
plt.plot(fpr2, tpr2, linestyle='--', color='green', label='KNN')
plt.plot(p_fpr, p_tpr, linestyle='--', color='blue')  # TPR = FPR baseline

# Title
plt.title('ROC curve')
# x-label
plt.xlabel('False Positive Rate')
# y-label
plt.ylabel('True Positive Rate')

plt.legend(loc='best')
plt.savefig('ROC', dpi=300)
plt.show()
```

Output

Explanation

From the ROC curve above, we can see that the Area Under the Curve (AUC) for KNN is greater than that for Logistic Regression. In simple terms, we can conclude that, on this dataset, KNN is better than logistic regression at separating the positive class from the negative class.
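Reading the area off a plot can be imprecise when two curves lie close together, so a numeric check may help. The sketch below is an addition to the original steps: it uses scikit-learn's `roc_auc_score` and rebuilds the same data and models so that it is self-contained. The variable names `auc_lr` and `auc_knn` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data, split, and models as in the steps above
X, y = make_classification(n_samples=3000, n_features=7, n_informative=5,
                           n_classes=2, random_state=28)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=28)

lr = LogisticRegression(max_iter=500).fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=4).fit(X_train, y_train)

# AUC is computed from the positive-class probabilities, not from hard labels
auc_lr = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])
auc_knn = roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1])

print(f"Logistic Regression AUC: {auc_lr:.3f}")
print(f"KNN AUC: {auc_knn:.3f}")
```

An AUC of 0.5 corresponds to the diagonal baseline and 1.0 to a perfect classifier, so the model with the higher value ranks positive cases above negative cases more reliably on this test set.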
