Tutorial – Matplotlib Histogram – Shiksha Online
When working on a machine learning project, you almost always perform data analysis, and in doing so you are bound to come across the good old histogram. A histogram represents the frequency (count) or proportion (count/total count) of cases for continuous data. Matplotlib, Python’s data visualization library, supports many useful chart types for creating visualizations. In this article, we are going to look at Matplotlib histograms.
Quick Intro to Histograms
A histogram is used to visualize the distribution of one-dimensional numerical data. It plots the frequencies of the data rather than the values themselves. It does this by dividing the entire range of values into a series of intervals called bins, counting the number of values that fall in each bin, and drawing one bar per bin whose height is that count.
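To make the binning step concrete, here is a minimal sketch using NumPy's np.histogram on a handful of made-up values (the numbers are illustrative and not taken from the dataset used in this article). It splits the data range into equal-width bins and counts how many values land in each one.

```python
import numpy as np

# Hypothetical sample values, chosen only to illustrate binning
values = np.array([1, 2, 2, 3, 5, 6, 6, 6, 9, 10])

# Split the range [min, max] into 3 equal-width bins and count values per bin
counts, edges = np.histogram(values, bins=3)

# Each bin includes its left edge and excludes its right edge,
# except the last bin, which also includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} values")
```

A histogram plot is simply these counts drawn as bars over the corresponding intervals.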
We will understand how this is done through a fun example. The dataset used in this blog contains information on employees working for a company. We need to find out the percentage distribution of employees’ monthly income in this company.
Let’s get started!
Installing and Importing Matplotlib
First, let’s install the Matplotlib library in your working environment. Execute the following command in your terminal:
pip install matplotlib
Now let’s import the libraries we’re going to need today:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In Matplotlib, pyplot is used to create figures and change their characteristics.
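%matplotlib inline is an IPython magic command, not a Python function. It makes plots appear directly below the cell when using Jupyter Notebook, and it should be left out of a plain Python script.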
Creating a Matplotlib Histogram
Load the dataset
Prior to creating our graphs, let’s check out the dataset:
# Read the dataset
df = pd.read_csv('company.csv')
df.head()
# Check the number of rows and columns (shape is an attribute, not a method)
df.shape
There are 35 columns (or features) in this dataset. Let’s print them all:
# List all column names
print(df.columns)
Based on our requirement, our focus is going to be on the MonthlyIncome column from the dataset.
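The file company.csv is not bundled with this article. If you do not have it, the sketch below builds a small synthetic stand-in so that the later snippets can still be run. The column names MonthlyIncome and Department match those used in this tutorial. The values are random and purely hypothetical, so the resulting plots will not match the ones described here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the data is reproducible
n = 300

# Right-skewed hypothetical incomes, clipped to an arbitrary plausible range
incomes = np.clip(rng.lognormal(mean=8.5, sigma=0.6, size=n), 1000, 20000).round()

# Synthetic stand-in for company.csv (NOT the real dataset)
df = pd.DataFrame({
    "Department": rng.choice(
        ["Sales", "Human Resources", "Research & Development"], size=n
    ),
    "MonthlyIncome": incomes,
})

print(df.shape)   # (rows, columns)
print(df.head())
```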
Plotting the data
Now, let’s plot a Matplotlib histogram using the plt.hist() function:
plt.hist(df['MonthlyIncome'])
Although we can get a general idea of the distribution of employees’ income through the above plot, we cannot extract any relevant information from the histogram just yet.
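Some numbers can still be read directly from the call itself, because plt.hist() returns the counts and bin edges it computed. The sketch below uses a few made-up incomes in place of df['MonthlyIncome'].

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical incomes standing in for df['MonthlyIncome']
incomes = np.array([1500, 2200, 2500, 2700, 3100, 4800, 5200, 7400, 9900, 18000])

# plt.hist returns the per-bin counts, the bin edges, and the drawn bar patches
counts, edges, patches = plt.hist(incomes)
plt.close()  # only the numbers are needed here; drop this line to see the plot

print(counts)        # number of values in each bin
print(edges)         # one more edge than there are bins
print(counts.sum())  # equals the number of data points
```

With the default settings there are 10 bins, so counts has 10 entries and edges has 11.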
Let’s add a few elements here to help us interpret the visualization in a better way.
Adding Elements to Matplotlib Histogram
The plot we have created would not be easily understandable to a third pair of eyes without context, so let’s try to add different elements to make it more readable:
- Use plt.title() for setting a plot title
- Use plt.xlabel() and plt.ylabel() for labeling x and y-axis respectively
- Use plt.legend() for the observation variables
- Use plt.show() for displaying the plot
plt.hist(df['MonthlyIncome'], label='Employees’ Income')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.title('Income Distribution')
plt.legend()
plt.show()
Parameters of Matplotlib Histogram
Firstly, let’s specify the bins in our graph through the bins parameter:
- If bins is an integer, it defines the number of equal-width bins in the range
- If bins is a sequence, it defines the bin edges
plt.hist(df['MonthlyIncome'], label='Employees’ Income', bins=20)
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.title('Income Distribution')
plt.legend()
plt.show()
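The example above passes an integer. For the other case, where bins is a sequence of edges, here is a minimal sketch with made-up incomes and hypothetical income bands. The band boundaries are an assumption chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical incomes standing in for df['MonthlyIncome']
incomes = np.array([1500, 2200, 2500, 2700, 3100, 4800, 5200, 7400, 9900, 18000])

# Explicit, unequal bin edges. Each bin includes its left edge and excludes
# its right edge, except the last bin, which includes both.
income_bands = [0, 2500, 5000, 10000, 20000]

counts, edges, _ = plt.hist(incomes, bins=income_bands, edgecolor='k')
plt.xlabel('Monthly Income')
plt.ylabel('Frequency')
plt.title('Income Distribution (custom bins)')
# plt.show()  # uncomment to display the figure outside a notebook

print(counts)  # one count per band
```

Note that the value 2500 lands in the second band, not the first, because the right edge of a bin is exclusive.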
- The edgecolor parameter is used to highlight the bin edges with the specified color:
plt.hist(df['MonthlyIncome'], label='Employees’ Income', bins=20, edgecolor='r')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.title('Income Distribution')
plt.legend()
plt.show()
The histtype parameter specifies the type of histogram:
- bar (default)
- barstacked
- step
- stepfilled
plt.hist(df['MonthlyIncome'], label='Employees’ Income', bins=20, edgecolor='r', histtype='step')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.title('Income Distribution')
plt.legend()
plt.show()
- The range parameter specifies the lower and upper range of the bins:
Range has no effect if bins is a sequence.
plt.hist(df['MonthlyIncome'], label='Employees’ Income', bins=20, edgecolor='r', range=[1000,25000])
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.title('Income Distribution')
plt.legend()
plt.show()
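To see what range does to the bins, the following sketch uses made-up incomes, two of which lie outside the requested interval. The first and last bin edges match the range exactly, and values outside it are left out of the counts.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical incomes; 800 and 30000 fall outside the range used below
incomes = np.array([800, 1500, 2200, 2500, 2700, 3100,
                    4800, 5200, 7400, 9900, 18000, 30000])

counts, edges, _ = plt.hist(incomes, bins=5, range=(1000, 25000), edgecolor='r')
plt.close()  # only the numbers are needed here; drop this line to see the plot

print(edges)                             # runs from 1000 to 25000
print(int(counts.sum()), len(incomes))   # counted values vs. all values
```

This is worth keeping in mind when reading a ranged histogram: the bars no longer account for every row in the dataset.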
- Histograms are vertical by default, but you can change this using the orientation parameter:
plt.hist(df['MonthlyIncome'], label='Employees’ Income', bins=20, edgecolor='r', orientation='horizontal')
plt.title('Income Distribution')
plt.xlabel('Frequency')
plt.ylabel('Monthly Income')
plt.legend()
plt.show()
Let’s enlarge our graph to view it clearly:
- We’ll specify the figsize parameter in the figure() function to set the dimensions of the figure in inches.
plt.figure(1, figsize=(15,7))
plt.hist(df['MonthlyIncome'], label='Employees’ Income', bins=20, edgecolor='r', orientation='horizontal')
plt.title('Income Distribution')
plt.xlabel('Frequency')
plt.ylabel('Monthly Income')
plt.legend()
plt.show()
From the above plot, we can make some inferences:
- The majority of the employees in the company are earning around $2500 monthly
- Most of the employees have a salary range of $2500-7500
- Few employees command a salary higher than $10,000
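The plots so far show raw counts. Since the stated goal was a percentage distribution, one possible approach is the weights parameter of plt.hist(): giving every value a weight of 100/N makes the bar heights add up to 100. This is a sketch with made-up incomes, assuming each employee should count equally.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical incomes standing in for df['MonthlyIncome']
incomes = np.array([1500, 2200, 2500, 2700, 3100, 4800, 5200, 7400, 9900, 18000])

# Each employee contributes 100/N percent to the bin they fall in
weights = np.full(len(incomes), 100.0 / len(incomes))

shares, edges, _ = plt.hist(incomes, bins=5, weights=weights, edgecolor='k')
plt.xlabel('Monthly Income')
plt.ylabel('Employees (%)')
plt.title('Income Distribution (share of employees)')
# plt.show()  # uncomment to display the figure outside a notebook

print(shares)        # percentage of employees in each bin
print(shares.sum())  # adds up to 100, within floating-point rounding
```

This differs from density=True, which scales the bars so that their total area, not their total height, equals 1.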
Now, what if you want to plot the distribution of monthly incomes department-wise? Let’s try doing that through histograms too!
Let’s group the dataset according to the Department column:
dept = df.groupby('Department')
dept.first()
Do you see the three departments shown above? Let’s group our original dataset based on the three departments separately:
sales = dept.get_group('Sales')
hr = dept.get_group('Human Resources')
rd = dept.get_group('Research & Development')
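If groupby() and get_group() are new to you, this small sketch on a four-row, made-up DataFrame shows what they return. The frame and its values are hypothetical; only the column names follow the tutorial.

```python
import pandas as pd

# Tiny hypothetical frame with the two columns used in this tutorial
toy = pd.DataFrame({
    'Department': ['Sales', 'Human Resources', 'Sales', 'Research & Development'],
    'MonthlyIncome': [5000, 3000, 7000, 4000],
})

by_dept = toy.groupby('Department')
print(list(by_dept.groups))  # the department names found in the data

# get_group returns only the rows that belong to one department
sales_toy = by_dept.get_group('Sales')

# Selecting the same rows with a boolean mask, for comparison
sales_mask = toy[toy['Department'] == 'Sales']

print(sales_toy['MonthlyIncome'].tolist())
```

For a one-off selection the boolean mask is often enough; the grouped object is convenient when you need several departments in turn.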
Let’s plot a histogram for the ‘Sales’ department:
plt.hist(sales['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Sales')
plt.title('Income Distribution')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.legend()
plt.show()
Now, let’s try plotting histograms for all three departments on a single plot:
plt.hist(sales['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Sales')
plt.hist(hr['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Human Resources')
plt.hist(rd['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='R&D')
plt.title('Income Distribution of Departments')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.legend()
plt.show()
Oops! The plots for ‘Sales’ and ‘Human Resources’ are hidden behind the ‘R&D’ histogram, which is drawn last and therefore sits on top. Wouldn’t it be helpful if the histograms were see-through?
- The alpha parameter takes a float between 0 (fully transparent) and 1 (fully opaque) and specifies the transparency of each histogram
plt.hist(sales['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Sales', alpha=0.8)
plt.hist(hr['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Human Resources', alpha=0.5)
plt.hist(rd['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='R&D', alpha=0.2)
plt.title('Income Distribution of Departments')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.legend()
plt.show()
- Let’s change the colors of the histograms using the color parameter:
plt.hist(sales['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Sales', alpha=0.8)
plt.hist(hr['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Human Resources', alpha=0.5, color='r')
plt.hist(rd['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='R&D', alpha=0.2)
plt.title('Income Distribution of Departments')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.legend()
plt.show()
The plot looks attractive as well as informative, doesn’t it now? We can again make some inferences from the above plot:
- We can infer that the R&D department has the highest number of employees within the salary range of $2000-$7000
- Few employees command salaries higher than $10,000 and even then, most of them are from the R&D department
- The Human Resources department has the lowest count of employees
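One caveat when overlaying histograms: each plt.hist() call with bins=10 computes its own bin edges from that group's minimum and maximum, so bars from different departments may cover different income intervals. A possible refinement, sketched here with randomly generated, hypothetical incomes, is to compute one set of edges with np.histogram_bin_edges and pass it to every call.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)

# Hypothetical incomes per department (random, not the tutorial's data)
groups = {
    'Sales': rng.uniform(2000, 12000, size=80),
    'Human Resources': rng.uniform(1500, 8000, size=30),
    'R&D': rng.uniform(1000, 20000, size=120),
}

# One set of 10 equal-width bins spanning all departments together
all_incomes = np.concatenate(list(groups.values()))
shared_edges = np.histogram_bin_edges(all_incomes, bins=10)

# Every department is now counted over exactly the same intervals
for name, incomes in groups.items():
    plt.hist(incomes, bins=shared_edges, alpha=0.5, edgecolor='k', label=name)

plt.title('Income Distribution of Departments (shared bins)')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.legend()
# plt.show()  # uncomment to display the figure outside a notebook
```

With shared edges, comparing bar heights between departments at a given income level becomes meaningful.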
Saving your Matplotlib Histogram
You can save your plot as an image using the savefig() function. Plots can be saved as .png, .jpeg, .pdf, and many other supported formats.
Let’s try saving the ‘Income Distribution of Departments’ plot we have created above:
fig = plt.figure(1, figsize=(15,7))
plt.hist(sales['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Sales', alpha=0.8)
plt.hist(hr['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Human Resources', alpha=0.5, color='r')
plt.hist(rd['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='R&D', alpha=0.2)
plt.title('Income Distribution of Departments')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.legend()
fig.savefig('histogram.png')
To view the saved image, we’ll use the matplotlib.image module, as shown below:
# Displaying the saved image
import matplotlib.image as mpimg
image = mpimg.imread("histogram.png")
plt.imshow(image)
plt.show()
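savefig() picks the output format from the file extension and accepts a few options that are often useful, such as dpi for resolution and bbox_inches='tight' to trim surrounding whitespace. The sketch below uses made-up incomes and writes to a temporary directory; the file names and option values are illustrative choices, not requirements.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical incomes standing in for df['MonthlyIncome']
incomes = np.array([1500, 2200, 2500, 2700, 3100, 4800, 5200, 7400, 9900, 18000])

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(incomes, bins=5, edgecolor='k')
ax.set_xlabel('Monthly Income')
ax.set_ylabel('Frequency')
ax.set_title('Income Distribution')

# Write into a temporary directory so nothing in the working folder is overwritten
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'histogram.png')
pdf_path = os.path.join(out_dir, 'histogram.pdf')

fig.savefig(png_path, dpi=150, bbox_inches='tight')  # raster image
fig.savefig(pdf_path, bbox_inches='tight')           # vector output
plt.close(fig)

# File extensions the current backend can write
print(sorted(fig.canvas.get_supported_filetypes()))
print(os.path.exists(png_path), os.path.exists(pdf_path))
```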
Endnotes
Histograms take continuous numerical values and show how frequently they occur across a dataset, which makes them an efficient tool for data analysis. In fact, the histogram is often called the “Unsung Hero of Problem-solving” because of its underutilization. Matplotlib is one of the oldest Python visualization libraries and provides a wide variety of charts and plots for better analysis.