# Tutorial – Box Plot in Matplotlib

Matplotlib, Python’s most popular visualization library, supports many useful graphical visualizations. This article focuses on box plots, a graphical technique used to examine the distribution of your data.

Exploratory Data Analysis (EDA) is an important step in data analysis when working on a machine learning or data science project. EDA helps summarize the main characteristics of your data, mostly employing data visualization methods. Let’s see how to perform EDA with a *box plot in Matplotlib.*

We will be covering the following sections:

- Quick Intro to Box Plots
- Installing and Importing Matplotlib
- Creating a Matplotlib Box Plot
- Adding Elements to the Box Plot
- Optional Parameters of the Box Plot
- Grouping Box Plots in a Single Figure
- Saving your Box Plot
- Outlier Detection using Box Plots

**Quick Intro to Box Plots**

A box plot visually represents the statistical summary of an attribute (a single variable) in a dataset. Each box plot displays the following:

- Minimum
- First Quartile (Q1)
- Median
- Third Quartile (Q3)
- Maximum
- Interquartile Range (IQR)
- Outliers, if any

Just as a median breaks the dataset in half, quartiles are used to tell us about the spread and skewness of data by breaking the dataset into quarters.
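The quantities listed above can be computed directly, which may help when reading a box plot. The following is a minimal sketch using NumPy; the sample values are hypothetical, and the 1.5 × IQR cut-off is the common convention for marking outliers rather than something specific to this article's dataset.

```python
import numpy as np

# Hypothetical sample values, used only for illustration
values = np.array([2, 4, 4, 5, 6, 7, 8, 9, 30])

# Quartiles split the sorted data into four parts
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are conventionally treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}], outliers: {outliers}")
```

Here the single large value lies above the upper fence, so it would be drawn as an individual point rather than being covered by the whisker.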

Additionally, you can choose to display the mean and standard deviation of your data. Box plots are also called box-and-whisker plots and are particularly useful for comparing distributions across groups.

Now, let’s see how to create a box plot to analyze the distribution of variables in a dataset. The dataset used in this blog can be found here. It contains information on breast cancer: each record provides the patient’s ID number and diagnosis, along with ten real-valued features for each cell nucleus.

We are going to analyze the relationship between a **categorical feature** (diagnosis: malignant or benign tumor) and a **continuous numerical feature** (area_mean).

Let’s get started!

**Installing and Importing Matplotlib**

First, let’s install the Matplotlib library in your working environment. Execute the following command in your terminal:

    pip install matplotlib

Now let’s import the libraries we’re going to need today:

    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline

In Matplotlib, **pyplot** is used to create figures and change their characteristics.

**%matplotlib inline** is a Jupyter magic command (not a Python function) that makes plots render directly in the notebook.

**Creating a Box Plot in Matplotlib**

*Load the dataset*

Prior to creating our graphs, let’s check out the dataset:

    # Read the dataset
    df = pd.read_csv('data.csv')
    df.head()

    # Check the number of rows and columns
    df.shape

There are 33 columns in this dataset. Let’s print them all:

    # List all column names
    print(df.columns)

For our purposes, we will focus on the *diagnosis* and *area_mean* columns of the dataset.
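Narrowing the DataFrame to those two columns might look like the sketch below. The small DataFrame is a hypothetical stand-in for `data.csv` (its values are made up for illustration) so that the snippet runs on its own; with the real dataset you would apply the same calls to `df`.

```python
import pandas as pd

# Hypothetical stand-in for data.csv; the values are made up for illustration
df_demo = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "diagnosis": ["M", "B", "B", "M", "B", "B"],
    "area_mean": [1001.0, 566.3, 520.0, 1203.0, 386.1, 477.1],
})

# Keep only the columns needed for the analysis
subset = df_demo[["diagnosis", "area_mean"]]

# Count the observations in each diagnosis category
print(subset["diagnosis"].value_counts())

# Numeric summary of area_mean: count, mean, std, min, quartiles, max
print(subset["area_mean"].describe())
```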

*Plotting the data*

Now, let’s plot a box plot using the **plt.boxplot()** function:

    plt.boxplot(df['area_mean'])

Let’s add a few elements here to help us interpret the visualization in a better way.

**Adding Elements to the Box Plot in Matplotlib**

Without context, the plot we have created would not be easy for a fresh pair of eyes to understand, so let’s add a few elements to make it more readable:

- **plt.title()** for setting a plot title
- **plt.xlabel()** and **plt.ylabel()** for labeling the x-axis and y-axis, respectively
- **plt.show()** for displaying the plot

    plt.boxplot(df['area_mean'])
    plt.ylabel('Tumor Area Mean Values')
    plt.title('Boxplot of Area Mean')
    plt.show()

The plot above shows the spread of the data. By default, points lying more than 1.5 × IQR beyond the box are drawn individually as outliers, which here means values above roughly 1500.
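Which points count as outliers depends on where the whiskers end, and `plt.boxplot()` exposes this through its `whis` parameter. The sketch below compares the default setting with whiskers that span the full data range. The sample is synthetic (generated here only for illustration), so the counts it prints say nothing about the breast cancer data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic right-skewed sample, used only for illustration
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=6.0, sigma=0.5, size=500)

fig, (ax_default, ax_full) = plt.subplots(1, 2, sharey=True)

# Default: whiskers reach the last data point within 1.5 * IQR of the box
default_bp = ax_default.boxplot(sample)
ax_default.set_title("whis=1.5 (default)")

# Whiskers at the 0th and 100th percentiles cover the full data range
full_bp = ax_full.boxplot(sample, whis=(0, 100))
ax_full.set_title("whis=(0, 100)")

# The outlier points of each box are stored in the "fliers" entry
n_default = len(default_bp["fliers"][0].get_ydata())
n_full = len(full_bp["fliers"][0].get_ydata())
print(f"Outliers drawn: default={n_default}, full range={n_full}")
plt.show()
```

With full-range whiskers no point is drawn as an outlier, which is a reminder that "outlier" in a box plot is a plotting convention and not a verdict on the data.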

**Optional Parameters of the Box Plot in Matplotlib**

*notch*: Set this boolean parameter to True for a notched box plot, or leave it at False (the default) for a simple rectangular one. The notches represent the confidence interval (CI) around the median:

    # Parameter notch
    plt.boxplot(df['area_mean'], notch=True)
    plt.ylabel('Tumor Area Mean Values')
    plt.title('Boxplot of Area Mean')
    plt.show()
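The notch bounds can also be inspected numerically. As far as I understand Matplotlib's default (non-bootstrap) behavior, the notch extends 1.57 × IQR / √n on either side of the median; the sketch below checks that against `matplotlib.cbook.boxplot_stats`, the helper that computes box plot statistics. The sample is synthetic and only serves as an illustration.

```python
import numpy as np
from matplotlib import cbook

# Synthetic sample, used only for illustration
rng = np.random.default_rng(42)
sample = rng.normal(loc=650, scale=350, size=200)

# boxplot_stats returns one dict of statistics per dataset
stats = cbook.boxplot_stats(sample)[0]

# Without bootstrapping, the notch is median +/- 1.57 * IQR / sqrt(n)
half_width = 1.57 * stats["iqr"] / np.sqrt(len(sample))
manual_low = stats["med"] - half_width
manual_high = stats["med"] + half_width

print(f"median = {stats['med']:.1f}")
print(f"notch from boxplot_stats: ({stats['cilo']:.1f}, {stats['cihi']:.1f})")
print(f"notch computed by hand:   ({manual_low:.1f}, {manual_high:.1f})")
```

A larger sample narrows the notch, since the half-width shrinks with the square root of the number of observations.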

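*vert*: Set this parameter to False to draw the box plot horizontally (the default, True, draws it vertically):

    # Parameter vert
    plt.boxplot(df['area_mean'], vert=False)
    plt.xlabel('Tumor Area Mean Values')
    plt.title('Boxplot of Area Mean')
    plt.show()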

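*patch_artist*: When set to True, the boxes are drawn as filled patches instead of plain lines, so they can be given a fill color:

    # Parameter patch_artist
    plt.boxplot(df['area_mean'], patch_artist=True)
    plt.ylabel('Tumor Area Mean Values')
    plt.title('Boxplot of Area Mean')
    plt.show()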
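Once `patch_artist=True` is set, the boxes returned by `boxplot()` can be recolored individually. The following is a small sketch of that idea; the two samples and the color names are arbitrary choices made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic samples, used only for illustration
rng = np.random.default_rng(1)
samples = [rng.normal(loc=center, scale=100, size=100) for center in (500, 900)]

fig, ax = plt.subplots()
bp = ax.boxplot(samples, patch_artist=True)

# With patch_artist=True, bp["boxes"] holds patch objects that accept a fill color
for box, color in zip(bp["boxes"], ["lightblue", "lightgreen"]):
    box.set_facecolor(color)

ax.set_ylabel("Value")
ax.set_title("Filled boxes with patch_artist=True")
plt.show()
```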

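*manage_ticks*: When True (the default), the tick locations and labels are adjusted to match the box plot positions. Set it to False to leave the axis ticks untouched:

    # Parameter manage_ticks
    plt.boxplot(df['area_mean'], manage_ticks=False)
    plt.ylabel('Tumor Area Mean Values')
    plt.title('Boxplot of Area Mean')
    plt.show()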

*showmeans*: When set to True, this boolean parameter displays the mean of the data on the plot:

    # Parameter showmeans
    plt.boxplot(df['area_mean'], vert=False, showmeans=True)
    plt.xlabel('Tumor Area Mean Values')
    plt.title('Boxplot of Area Mean')
    plt.show()

*Styling the box plot – Changing outlier marker colors*

*flierprops*: This parameter is a dictionary that specifies the style of the fliers (the outlier markers):

    # Change the outlier markers using parameter flierprops
    dots = dict(markerfacecolor='red', marker='o')
    plt.boxplot(df['area_mean'], vert=False, flierprops=dots)
    plt.xlabel('Tumor Area Mean Values')
    plt.title('Boxplot of Area Mean')
    plt.show()

*Styling the box plot – Changing mean marker colors*

*meanprops*: This parameter is a dictionary that specifies the style of the mean marker:

    # Style the mean marker using parameter meanprops
    mean_shape = dict(markerfacecolor='yellow', marker='D', markeredgecolor='green')
    plt.boxplot(df['area_mean'], vert=False, showmeans=True, meanprops=mean_shape)
    plt.xlabel('Tumor Area Mean Values')
    plt.title('Boxplot of Area Mean')
    plt.show()

**Grouping Box Plots in a Single Figure**

Coming back to our objective, we have to analyze the relationship between a **categorical feature** (*diagnosis: malignant or benign tumor*) and a **continuous numerical feature** (*area_mean*).

We can do this in two ways:

- Using Pandas

      # Plot the box plot using pandas, one box per diagnosis value
      df.boxplot(column='area_mean', by='diagnosis')
      plt.ylabel('Tumor Area Mean Values')
      plt.title('')  # clear the automatic subplot title

- Using Matplotlib

      # Create one Series per diagnosis
      malignant = df[df['diagnosis'] == 'M']['area_mean']
      benign = df[df['diagnosis'] == 'B']['area_mean']

      # Plot the box plot using matplotlib
      # (Matplotlib 3.9+ names this parameter tick_labels; labels is deprecated there)
      fig = plt.figure()
      ax = fig.add_subplot(111)
      ax.boxplot([malignant, benign], labels=['M', 'B'])
      plt.ylabel('Tumor Area Mean Values')
      plt.title('Boxplot grouped by Diagnosis')

As you can see, malignant tumors show a wider spread of area mean values than benign tumors, and their area mean is also higher.
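When a categorical column has more than two values, building each Series by hand becomes tedious. One possible alternative is to let `groupby` split the data, as sketched below. The DataFrame is a hypothetical stand-in for `data.csv` with made-up values, and the tick labels are set through `set_xticklabels` so the snippet does not depend on the `labels`/`tick_labels` naming.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for data.csv; the values are made up for illustration
df_demo = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M", "B", "M", "B", "B"],
    "area_mean": [1001.0, 566.3, 520.0, 1203.0, 386.1, 1130.0, 477.1, 461.9],
})

# Collect one Series of area_mean values per diagnosis category
names, series = [], []
for name, group_values in df_demo.groupby("diagnosis")["area_mean"]:
    names.append(name)
    series.append(group_values)

fig, ax = plt.subplots()
ax.boxplot(series)
ax.set_xticks(range(1, len(names) + 1))  # boxes sit at positions 1..n
ax.set_xticklabels(names)
ax.set_ylabel("Tumor Area Mean Values")
ax.set_title("Boxplot grouped by Diagnosis")
plt.show()

# Numeric counterpart of the plot: summary statistics per category
print(df_demo.groupby("diagnosis")["area_mean"].describe())
```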

**Saving your Box Plot**

You can save your plot as an image using the **savefig()** function. Plots can be saved as .png, .jpeg, .pdf, and many other supported formats.

Let’s try saving the ‘Boxplot grouped by Diagnosis’ plot we have created above:

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.boxplot([malignant, benign], labels=['M', 'B'])
    plt.ylabel('Tumor Area Mean Values')
    plt.title('Boxplot grouped by Diagnosis')

    # Save the figure to the working directory
    fig.savefig('boxplot.png')

The image is saved in the working directory with the filename ‘boxplot.png’.

To view the saved image, we’ll use the **matplotlib.image** module, as shown below:

    # Display the saved image
    import matplotlib.image as mpimg

    image = mpimg.imread("boxplot.png")
    plt.imshow(image)
    plt.show()
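`savefig()` also accepts options that affect the output quality. The sketch below saves the same figure as a high-resolution PNG and as a PDF. The sample is synthetic, and the files go to a temporary directory (an arbitrary choice for this example) rather than the working directory.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

# Synthetic sample, used only for illustration
rng = np.random.default_rng(7)
sample = rng.normal(loc=650, scale=350, size=200)

fig, ax = plt.subplots()
ax.boxplot(sample)
ax.set_title("Boxplot of a synthetic sample")

# Write to a temporary directory so the working directory stays untouched
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "boxplot.png")
pdf_path = os.path.join(out_dir, "boxplot.pdf")

# dpi sets the raster resolution; bbox_inches="tight" trims surrounding whitespace
fig.savefig(png_path, dpi=300, bbox_inches="tight")

# The file extension selects the output format
fig.savefig(pdf_path, bbox_inches="tight")

print("Saved:", png_path, pdf_path)
```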

**Outlier Detection using Box Plots**

An outlier is an observation that deviates markedly from the other observations in the data. Outliers can be genuine anomalies or errors. So, to decide whether to ignore the outliers, we need to identify them first. Box plots are an excellent statistical tool for visualizing outliers.

Let’s take a separate example. We have some data on testosterone levels in males, as given below:

| Patients | Testosterone Levels in ng/dL |
|---|---|
| Male 1 | 683 |
| Male 2 | 540 |
| Male 3 | 938 |
| Male 4 | 67 |
| Male 5 | 712 |
| Male 6 | 594 |
| Male 7 | 429 |
| Male 8 | 491 |
| Male 9 | 803 |

Let’s create a box plot for this data:

    import pandas as pd

    data = {'Patients': ['Male 1', 'Male 2', 'Male 3', 'Male 4', 'Male 5', 'Male 6',
                         'Male 7', 'Male 8', 'Male 9'],
            'Testosterone': [683, 540, 938, 67, 712, 594, 429, 491, 803]}
    df = pd.DataFrame(data)
    df

    import matplotlib.pyplot as plt

    plt.boxplot(df['Testosterone'])

As you can see, one outlier is clearly visible in this box plot, and it can easily be removed. Its exact value cannot be read from the plot, but we know that it is lower than 200. So, let’s filter on that threshold to find the outlier value:

    # Select the rows whose value falls below the threshold read from the plot
    df[df['Testosterone'] < 200]

There! The box plot flags the value 67 as an outlier in the dataset. Whether this outlier is a genuine anomaly is a different question, one that has to be answered separately using additional techniques and domain knowledge.
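Instead of estimating a threshold by eye, the outlier values can be read back from the plot itself: `boxplot()` returns a dictionary whose `fliers` entry holds the plotted outlier points. The sketch below applies this to the testosterone values from the table above. The variable names are my own, and dropping the flagged row is shown only as one possible follow-up step.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Testosterone levels (ng/dL) from the table above
levels = pd.Series([683, 540, 938, 67, 712, 594, 429, 491, 803], name="Testosterone")

fig, ax = plt.subplots()
bp = ax.boxplot(levels)

# Each entry in bp["fliers"] is a line object holding the outlier points of one box
outlier_values = bp["fliers"][0].get_ydata()
print("Outliers:", outlier_values)

# Keep only the observations that were not flagged as outliers
cleaned = levels[~levels.isin(outlier_values)]
print("Remaining values:", cleaned.tolist())
```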

**Endnotes**

The box plot is an underrated tool that can summarize a lot of information about your data in a single visualization. When performing exploratory data analysis (EDA), box plots can be a great complement to histograms. Matplotlib is one of the oldest Python visualization libraries and provides a wide variety of charts and plots for better analysis.
