# Tutorial – Box Plot in Matplotlib

Updated on Mar 4, 2022 16:12 IST

Matplotlib, Python’s most popular visualization library, supports many useful graphical visualizations. This article focuses on box plots, a graphical technique used to examine the distribution of your data.

Exploratory Data Analysis (EDA) is an important step in data analysis when working on a machine learning or data science project. EDA helps summarize the main characteristics of your data, mostly employing data visualization methods. Let’s see how to perform EDA with a box plot in Matplotlib.

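We will be covering the following sections:

• Quick Intro to Box Plots
• Installing and Importing Matplotlib
• Creating a Box Plot in Matplotlib
• Optional Parameters of the Box Plot in Matplotlib
• Grouping Box Plots in a Single Figure
• Outlier Detection using Box Plots
• Endnotes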

## Quick Intro to Box Plots

A box plot visually represents the statistical summary of a variable (attribute) in a dataset. Each box plot displays the following:

• Minimum
• First Quartile (Q1)
• Median
• Third Quartile (Q3)
• Maximum
• Interquartile Range (IQR)
• Outliers, if any

Just as a median breaks the dataset in half, quartiles are used to tell us about the spread and skewness of data by breaking the dataset into quarters.

Additionally, you can choose to display the mean and standard deviation of your data. Box plots are also called box-and-whisker plots and are particularly useful for comparing distributions across groups.
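To make the quantities listed above concrete, here is a minimal sketch that computes them with NumPy. The sample values are hypothetical and chosen only for illustration, and the 1.5 × IQR rule used for the outlier fences is the common convention (it is also Matplotlib's default whisker setting).

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = np.array([2, 4, 4, 5, 7, 8, 9, 11, 30])

# Quartiles split the sorted data into quarters
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# Points beyond 1.5 * IQR from the box are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1:", q1, "Median:", median, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

In a box plot of these values, the box would span Q1 to Q3, the line inside it would mark the median, and any value outside the fences would be drawn as a separate marker.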

Now, let’s see how to create a box plot to analyze the distribution of variables in a dataset. The dataset used in this blog can be found here. It contains information on breast cancer: each patient’s ID number and diagnosis, along with ten real-valued features for each cell nucleus.

We are going to analyze the relationship between a categorical feature (diagnosis: malignant or benign tumor) and a continuous numerical feature (area_mean).

Let’s get started!

## Installing and Importing Matplotlib

First, let’s install the Matplotlib library in your working environment. Execute the following command in your terminal:

```shell
pip install matplotlib
```

Now let’s import the libraries we’re going to need today:

```python
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
```

In Matplotlib, the pyplot module is used to create figures and change their characteristics.

The %matplotlib inline magic command makes plots appear inline when you work in a Jupyter Notebook.

## Creating a Box Plot in Matplotlib

Prior to creating our graphs, let’s check out the dataset:

```python
# Read the dataset
df = pd.read_csv('data.csv')
df.head()
```

```python
# Check out the number of rows and columns
df.shape
```

There are 33 columns in this dataset. Let’s print them all:

```python
# List all column names
print(df.columns)
```

Based on our requirement, our focus is going to be on the diagnosis and area_mean columns from the dataset.

### Plotting the data

Now, let’s plot a box plot using the plt.boxplot() function:

```python
plt.boxplot(df['area_mean'])
```

Without context, this plot would not be easy for someone else to understand, so let’s add a few elements to make it more readable:

• plt.title() for setting a plot title
• plt.xlabel() and plt.ylabel() for labeling x and y-axis respectively
• plt.show() for displaying the plot

```python
plt.boxplot(df['area_mean'])
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
```

The plot above shows the spread of the data distribution and reveals that values above roughly 1500 are drawn as outliers.
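Reading outlier values off the chart is approximate. As a hedged sketch, the dictionary returned by `boxplot()` can be inspected to get the plotted outlier values exactly. The array below is a hypothetical stand-in for `df['area_mean']`, and the `Agg` backend line is only there so the example runs outside a notebook (drop it in Jupyter).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values standing in for df['area_mean']
values = np.array([420, 480, 510, 550, 600, 650, 700, 780, 2500])

fig, ax = plt.subplots()
result = ax.boxplot(values)  # returns a dict of the artists that were drawn

# 'fliers' holds the outlier markers, 'medians' the median line
outlier_values = result["fliers"][0].get_ydata()
median_value = result["medians"][0].get_ydata()[0]

print("Outliers:", outlier_values)
print("Median:", median_value)
```

The same dictionary also exposes the `boxes`, `whiskers`, and `caps` artists, which is useful when you want to restyle a plot after drawing it.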

## Optional Parameters of the Box Plot in Matplotlib

• notch: A boolean. False (the default) draws a simple rectangular box, and True draws a notched box. The notches represent the confidence interval (CI) around the median:

```python
# Parameter notch
plt.boxplot(df['area_mean'], notch=True)
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
```
• vert: A boolean. True draws a vertical plot and False draws a horizontal plot:

```python
# Parameter vert
plt.boxplot(df['area_mean'], vert=False)
plt.xlabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
```

• patch_artist: A boolean, False by default, in which case the boxes are drawn with Line2D artists. If set to True, the boxes are drawn with Patch artists, which can be filled with color:

```python
# Parameter patch_artist
plt.boxplot(df['area_mean'], patch_artist=True)
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
```
• manage_ticks: A boolean, True by default, in which case the tick locations and labels are adjusted to match the box plot positions. If set to False, the ticks and labels are left unadjusted:

```python
# Parameter manage_ticks
plt.boxplot(df['area_mean'], manage_ticks=False)
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
```
• showmeans: A boolean. If set to True, the mean is marked on the plot:

```python
# Parameter showmeans
plt.boxplot(df['area_mean'], vert=False, showmeans=True)
plt.xlabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
```
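The statistics that `notch` and `showmeans` draw can also be inspected as numbers. The sketch below uses `matplotlib.cbook.boxplot_stats`, the helper that computes box plot statistics, on hypothetical values. To the best of my knowledge, the notch bounds (`cilo`, `cihi`) are computed as the median ± 1.57 × IQR / √n when no bootstrapping is requested; treat that formula as an assumption to verify against your Matplotlib version.

```python
import numpy as np
from matplotlib import cbook

# Hypothetical values standing in for df['area_mean']
values = np.array([420, 480, 510, 550, 600, 650, 700, 780, 2500])

# boxplot_stats returns one dict of statistics per dataset
stats = cbook.boxplot_stats(values)[0]

print("Median:", stats["med"])
print("Notch (CI) bounds:", stats["cilo"], stats["cihi"])
print("Mean (drawn by showmeans=True):", stats["mean"])
print("Q1, Q3, IQR:", stats["q1"], stats["q3"], stats["iqr"])
```

Note how the single large value pulls the mean well above the median, which is exactly the kind of skew that `showmeans=True` makes visible on the plot.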

### Styling the box plot – Changing outlier marker colors

• flierprops: A dictionary that specifies the style of the fliers (the outlier markers):

```python
# Change the outlier markers using the flierprops parameter
dots = dict(markerfacecolor='red', marker='o')
plt.boxplot(df['area_mean'], vert=False, flierprops=dots)
plt.xlabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
```

### Styling the box plot – Changing mean marker colors

• meanprops: A dictionary that specifies the style of the mean marker:

```python
# Style the mean marker using the meanprops parameter
mean_shape = dict(markerfacecolor='yellow', marker='D', markeredgecolor='green')
plt.boxplot(df['area_mean'], vert=False, showmeans=True, meanprops=mean_shape)
plt.xlabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
```
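The options above can be combined in a single call. The following sketch is one possible combination, not a recommended style: it fills the box (which requires `patch_artist=True`) and styles the median, outliers, and mean together. The values and colors are hypothetical, and `boxprops` and `medianprops` are additional style dictionaries of `boxplot()` that the text above does not cover.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values standing in for df['area_mean']
values = np.array([420, 480, 510, 550, 600, 650, 700, 780, 2500])

fig, ax = plt.subplots()
result = ax.boxplot(
    values,
    patch_artist=True,  # needed so the box can be filled
    showmeans=True,
    boxprops=dict(facecolor="lightblue", edgecolor="navy"),
    medianprops=dict(color="black", linewidth=2),
    flierprops=dict(marker="o", markerfacecolor="red"),
    meanprops=dict(marker="D", markerfacecolor="yellow",
                   markeredgecolor="green"),
)
ax.set_ylabel("Tumor Area Mean Values")
ax.set_title("Styled Boxplot of Area Mean")

# The fill color can be read back from the box artist
box_color = result["boxes"][0].get_facecolor()
print(box_color)
```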

## Grouping Box Plots in a Single Figure

Coming back to our objective, we have to analyze the relationship between a categorical feature (diagnosis: malignant or benign tumor) and a continuous numerical feature (area_mean).

We can do this in two ways:

1. Using Pandas

```python
# Plot the box plot using pandas
df.boxplot(column='area_mean', by='diagnosis')
plt.ylabel('Tumor Area Mean Values')
plt.title('')
```

2. Using Matplotlib

```python
# Create one Series per diagnosis
malignant = df[df['diagnosis'] == 'M']['area_mean']
benign = df[df['diagnosis'] == 'B']['area_mean']

# Plot the box plot using Matplotlib
# (Matplotlib 3.9+ renames the labels parameter to tick_labels)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.boxplot([malignant, benign], labels=['M', 'B'])
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot grouped by Diagnosis')
```

As you can see, malignant tumors show a wider spread of area_mean values and also tend to have a higher area mean than benign tumors.
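When a categorical column has more than two levels, filtering each group by hand becomes tedious. As a sketch of an alternative, `groupby` can build one array per category. The small DataFrame here is hypothetical and only mimics the `diagnosis` and `area_mean` columns, and the tick labels are set with `set_xticklabels` so the example does not depend on the name of the label parameter in your Matplotlib version.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows mimicking the diagnosis and area_mean columns
df_demo = pd.DataFrame({
    "diagnosis": ["M", "B", "M", "B", "B", "M", "B", "M"],
    "area_mean": [1001.0, 386.1, 1203.0, 520.0, 477.1, 1297.0, 566.3, 1040.0],
})

# One array of values per category (groupby sorts the keys: B, then M)
groups = {name: group["area_mean"].to_numpy()
          for name, group in df_demo.groupby("diagnosis")}

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()))
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("Tumor Area Mean Values")
ax.set_title("Boxplot grouped by Diagnosis")

# A numeric check of what the plot shows
medians = df_demo.groupby("diagnosis")["area_mean"].median()
print(medians)
```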

You can save your plot as an image using the savefig() function. Plots can be saved as .png, .jpeg, .pdf, and many other supported formats.

Let’s try saving the ‘Boxplot grouped by Diagnosis’ plot we have created above:

```python
fig = plt.figure()
ax = fig.add_subplot(111)
ax.boxplot([malignant, benign], labels=['M', 'B'])
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot grouped by Diagnosis')

fig.savefig('boxplot.png')
```

The image is saved in the current working directory with the filename ‘boxplot.png’.
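`savefig()` also accepts options that control the output. The sketch below, on hypothetical values, saves the same figure as a PNG with a chosen resolution and trimmed margins, and as a PDF. The temporary directory and file names are assumptions made so the example is self-contained.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in Jupyter
import matplotlib.pyplot as plt

# Hypothetical values, only for illustration
values = [420, 480, 510, 550, 600, 650, 700, 780, 2500]

fig, ax = plt.subplots()
ax.boxplot(values)
ax.set_ylabel("Tumor Area Mean Values")

# Hypothetical output location: a fresh temporary directory
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "boxplot.png")
pdf_path = os.path.join(out_dir, "boxplot.pdf")

# dpi sets the resolution; bbox_inches="tight" trims surrounding whitespace
fig.savefig(png_path, dpi=150, bbox_inches="tight")
# The output format is inferred from the file extension
fig.savefig(pdf_path)
plt.close(fig)

print(os.path.exists(png_path), os.path.exists(pdf_path))
```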

To view the saved image, we’ll use the matplotlib.image module, as shown below:

```python
# Display the saved image
import matplotlib.image as mpimg

image = mpimg.imread("boxplot.png")
plt.imshow(image)
plt.show()
```

## Outlier Detection using Box Plots

An outlier is an observation that deviates markedly from the other observations in the data. Outliers can be genuine anomalies or errors. So, to decide whether to ignore the outliers, we need to identify them first. Box plots are an excellent statistical tool to visualize outliers.

Let’s take a separate example. We have some data on testosterone levels in males, which we load into a DataFrame:

```python
import pandas as pd

data = {'Patients': ['Male 1', 'Male 2', 'Male 3', 'Male 4', 'Male 5', 'Male 6',
                     'Male 7', 'Male 8', 'Male 9'],
        'Testosterone': [683, 540, 938, 67, 712, 594, 429, 491, 803]}

df = pd.DataFrame(data)
df
```

Let’s create a box plot for this data:

```python
import matplotlib.pyplot as plt

plt.boxplot(df['Testosterone'])
```

As you can see, one outlier is clearly visible in this box plot. The plot does not show its exact value, but we can see that it is lower than 200. So, let’s filter for the outlier value:

```python
# Select the rows whose value falls below the threshold read off the plot
df[df['Testosterone'] < 200]
```

There! The box plot detects the value 67 as an outlier in the dataset. Whether this outlier is an anomaly or not is a different question that has to be answered separately using additional techniques and domain knowledge.
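Instead of reading a threshold off the plot, the outlier can be located with the same 1.5 × IQR rule that Matplotlib applies by default when drawing whiskers. This is a minimal sketch using the testosterone values from this example; the variable names are my own.

```python
import pandas as pd

levels = pd.DataFrame({
    "Patients": [f"Male {i}" for i in range(1, 10)],
    "Testosterone": [683, 540, 938, 67, 712, 594, 429, 491, 803],
})

# Quartiles and interquartile range of the column
q1 = levels["Testosterone"].quantile(0.25)
q3 = levels["Testosterone"].quantile(0.75)
iqr = q3 - q1

# Fences beyond which a point is flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (levels["Testosterone"] < lower) | (levels["Testosterone"] > upper)
outliers = levels[is_outlier]
cleaned = levels[~is_outlier]

print("Fences:", lower, upper)
print(outliers)
```

Keeping `outliers` and `cleaned` as separate frames makes it easy to review the flagged rows before deciding whether to drop them.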

## Endnotes

The box plot is an underrated tool that can summarize a lot of information about your data through a single visualization. When performing exploratory data analysis (EDA), box plots can be a great complement to histograms. Matplotlib is one of the oldest Python visualization libraries and provides a wide variety of charts and plots for better analysis.