Data Visualization using Matplotlib – A Beginner’s Guide

Updated on Sep 18, 2022 13:21 IST

Data Visualization is a way of summarizing data visually. Huge amounts of data are being collected all around you at all times of the day – whether it’s through surveys or social media tracking or even the transactions you’re making. The data provides useful insights for businesses and visualizations make it easier to identify trends and patterns in text-based data. 

“Humans are visual creatures. Half of the human brain is directly or indirectly devoted to processing visual information.”

Visualizations are among the easiest ways to absorb and analyze information. Data Visualization also paves the way for higher-level analysis in Exploratory Data Analysis (EDA) and Machine Learning (ML).

In this blog on Data Visualization using Matplotlib, we will be covering the following sections:

Introduction to Matplotlib

The Matplotlib library is used to create static 2D plots, although it also has some support for 3D visualizations. It makes producing both simple and advanced plots straightforward and intuitive. It can be used in Python scripts, Jupyter notebooks, and web application servers.

Let’s understand how to use Matplotlib practically with a fun example. The dataset used in this blog can be found here. It contains information on the global happiness survey in the year 2021. This data describes how measurements of well-being can be used effectively to assess the progress of nations.

Installing and Importing Matplotlib

Let’s start with installing the library in your working environment first. Execute the following command in your terminal:

 
pip install matplotlib

Now let’s import the libraries we’re going to need today:

 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In Matplotlib, pyplot is used to create figures and change their characteristics.

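%matplotlib inline is an IPython magic command rather than a regular Python function. It makes plots render directly below the cell when using Jupyter Notebook, and it should be left out when running the code as a plain Python script.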
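To confirm that the installation worked, a quick check like the following can be run in any Python session. It only reads the installed version and the active backend; the exact values depend on your environment.

```python
import matplotlib

# Version string of the installed library
version = matplotlib.__version__
# Name of the backend Matplotlib will use to render figures
backend = matplotlib.get_backend()

print("Matplotlib version:", version)
print("Active backend:", backend)
```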

The Relation between Matplotlib, PyPlot, and Python

  • Python is a programming language widely used for mathematical and statistical analysis. It runs on most platforms and has multiple libraries for data manipulation, transformation, and visualization.
  • In Python, data visualizations can be done via many libraries. The most popular and widely used of them all, and the one we’re discussing in this blog, is the Matplotlib library. In fact, many other libraries build on Matplotlib to display the plots they generate.
  • PyPlot is a module in Matplotlib which provides a MATLAB-like interface. MATLAB is licensed software, whereas PyPlot is an open-source module that provides similar functionality.

Creating a Simple Plot

Load the dataset

Prior to creating our graphs, let’s check out the dataset:

 
#Read the dataset
df = pd.read_csv('world-happiness-report-2021.csv')
df.head()
 
#List all column names
for col_name in df.columns:
    print(col_name)

Here, Ladder score is basically the happiness score, explained by six factors.

Dystopia is a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors.
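Before plotting, it can help to confirm that the columns used in the rest of this blog are present and complete. The sketch below uses a tiny stand-in DataFrame with made-up numbers so that it runs without the CSV file; with the real data, the same calls would be applied to df. The names sample, needed, missing_columns, and nan_counts are illustrative.

```python
import pandas as pd

# Stand-in rows with made-up values; replace with the df loaded from the CSV
sample = pd.DataFrame({
    "Ladder score": [7.8, 6.1, 4.9],
    "Logged GDP per capita": [10.8, 9.9, 8.5],
    "Healthy life expectancy": [72.0, 68.5, 60.2],
})

# Columns plotted later in this blog
needed = ["Ladder score", "Logged GDP per capita", "Healthy life expectancy"]

# Any name listed here is absent from the DataFrame
missing_columns = [c for c in needed if c not in sample.columns]
# Number of missing (NaN) values in each needed column
nan_counts = sample[needed].isna().sum()

print("Missing columns:", missing_columns)
print(nan_counts)
```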

Now, let’s move ahead with analyzing this dataset through Data Visualization using Matplotlib.

Line Graphs/Plots

These are basically the simplest graphs you can create using Matplotlib. Let’s create one to analyze the relationship between a country’s happiness score and life expectancy.

Here, the data points will be plotted as y (‘Healthy life expectancy’) versus x (‘Ladder score’) using the plot() function:

 
#Create Series
expectancy = df['Healthy life expectancy']
score = df['Ladder score']
plt.plot(score, expectancy)

A basic line plot is generated as shown above. 
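One detail worth knowing: plot() joins the points in the order they are passed, so a line plot only reads cleanly when the x values are ordered. The sketch below uses made-up numbers (not the survey data) and NumPy's argsort() to sort both series by x before plotting.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up, unordered sample values for illustration only
x = np.array([6.2, 4.1, 7.5, 5.0, 3.3])
y = np.array([68.0, 58.5, 72.1, 63.0, 54.2])

# Indices that would sort x in ascending order
order = np.argsort(x)
x_sorted, y_sorted = x[order], y[order]

# plot() returns a list of Line2D objects; unpack the single line
(line,) = plt.plot(x_sorted, y_sorted)
plt.xlabel("x (sorted)")
plt.ylabel("y")
# In a script, call plt.show() here to display the figure.
```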

Adding Elements in a Plot

The plot we have created would not be easily understandable to a third pair of eyes without context, so let’s try to add different elements to interpret it better:

  • Use plt.title() for setting a plot title
  • Use plt.xlabel() and plt.ylabel() for labeling x and y-axis respectively
  • Use plt.legend() for the observation variables
  • Use plt.show() for displaying the plot
 
plt.plot(score, expectancy)
plt.title('Happiness Plot')
plt.xlabel('Happiness Score')
plt.ylabel('Age')
plt.legend(['Healthy Life Expectancy'])
plt.show()

Can you see how a labeled graph vastly improves its readability? One can easily point out how life expectancy is overall increasing with a higher happiness score. 

There are many more elements we can experiment with when creating graphs: 

 
#Add color, style, width to line element
plt.plot(score, expectancy, color = 'green', linestyle = '--', linewidth=1.2)
 
#Add grid using grid() method
plt.grid(True)
plt.plot(score, expectancy)

Plots can be customized through attributes such as the color, linestyle, and linewidth arguments used above.
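A few commonly used line attributes are sketched here with made-up data: color, linestyle, linewidth, marker, alpha, and label are all keyword arguments accepted by plot(). The values chosen are arbitrary.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

(line,) = plt.plot(
    x, y,
    color="green",     # line color
    linestyle="--",    # dashed line
    linewidth=1.2,     # line width in points
    marker="o",        # circle drawn at each data point
    alpha=0.8,         # transparency (0 = invisible, 1 = opaque)
    label="sample",    # text shown by plt.legend()
)
plt.legend()
```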

Making Multiple Plots in One Figure

Let’s compare the GDP and life expectancy of countries against their happiness score. For comparison, we’ll need to plot ‘happiness score vs GDP’ and ‘happiness score vs life expectancy’ in a single figure. Let’s see how to do that:

 
#Create Series for GDP
gdp = df['Logged GDP per capita']
plt.plot(score, expectancy)
plt.plot(score, gdp)
plt.title('Happiness Score vs GDP and Life Expectancy')
plt.xlabel('Happiness Score')
plt.legend(['Life Expectancy','GDP'])
plt.show()

From this graph, we can also visually identify a trend: both GDP per capita and life expectancy have higher values for countries with higher happiness scores.

If you want to display the plots in separate figures, use plt.show() after each plot statement as shown below:

 
plt.plot(score, expectancy)
plt.title('Happiness Score vs Life Expectancy')
plt.xlabel('Happiness Score')
plt.show()
plt.plot(score, gdp, color ='orange')
plt.title('Happiness Score vs GDP')
plt.xlabel('Happiness Score')
plt.show()

Through these separate graphs, we can see that when there is a spike/dip for GDP per capita for a given score, there is also a spike/dip for life expectancy for the same score. 

Creating Subplots

We use pyplot.subplots() to create a figure and a grid of subplots with a single call. For example, for the previous scenario, we could create subplots using the following lines of code:

 
#Creating two subplots
fig, axs = plt.subplots(2)
fig.suptitle('Vertically stacked subplots')
axs[0].plot(score, expectancy)
axs[1].plot(score, gdp, color = 'orange')
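subplots() also accepts a number of rows and columns. As a sketch with made-up data, the following places two plots side by side and shares the x-axis between them; nrows, ncols, sharex, and figsize are parameters of plt.subplots().

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only
x = [3, 4, 5, 6, 7]
y1 = [55, 60, 64, 69, 72]
y2 = [8.0, 8.9, 9.5, 10.2, 10.9]

# One row, two columns; axs is an array holding the two Axes
fig, axs = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(8, 3))
fig.suptitle("Horizontally arranged subplots")
axs[0].plot(x, y1)
axs[0].set_title("First series")
axs[1].plot(x, y2, color="orange")
axs[1].set_title("Second series")
```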

Figure Objects

matplotlib.figure is the Matplotlib module that provides the Figure object, which contains all the plot elements. This module also controls the default spacing of the subplots. The matplotlib.figure.Figure class is the top-level container for the plot elements.

plt.figure() is used to create an empty figure object in Matplotlib. Its optional parameters include:

  • figsize:  Figure dimension (width, height) in inches
  • dpi: Dots per inch
  • facecolor: Figure patch facecolor
  • edgecolor: Figure patch edge color
  • linewidth: Linewidth of the frame

Let’s create a figure object:

 
#Creating a figure object fig
fig=plt.figure(figsize=(10,4), facecolor ='green', edgecolor='r',linewidth=5)
plt.plot(score, expectancy)
plt.show()
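To make figsize and dpi concrete: the pixel dimensions of a rendered figure are the size in inches multiplied by the dots per inch. A small sketch with arbitrarily chosen values (on high-resolution displays some interactive backends may scale the reported dpi):

```python
import matplotlib.pyplot as plt

# 6 x 4 inches at 100 dots per inch
fig = plt.figure(figsize=(6, 4), dpi=100)

width_in, height_in = fig.get_size_inches()
# Pixel size = inches * dpi
width_px = width_in * fig.dpi
height_px = height_in * fig.dpi

print(width_px, height_px)
```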

Axes Objects

The matplotlib.axes module provides the Axes object, which contains most of the figure elements (Axis, Tick, Line2D, Text, Polygon, etc.) and sets the coordinate system.

The matplotlib.axes.Axes class supports callbacks, which are called as func(ax), where ax is the Axes instance.

  • Use add_axes() to add axes to the figure 
  • Use ax.set_title() for setting title
  • Use ax.set_xlabel() and ax.set_ylabel() for setting x and y-label respectively 

Let’s see how to add axes to our figure:

 
#Creating a figure object fig
fig = plt.figure()
#Adding the axes
ax = fig.add_axes([0,0,2,1])
plt.plot(score, expectancy)
plt.plot(score, gdp)
ax.legend(labels = ('Healthy life expectancy', 'GDP per capita'), loc = 'upper left')
ax.set_title("Usage of add_axes function")
ax.set_xlabel('x-axis')
ax.set_ylabel('y-axis')
plt.show()
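The example above mixes pyplot calls (plt.plot) with Axes methods. An alternative sketch, using made-up data, draws through the Axes object directly with ax.plot() and lets plt.subplots() create the figure and a single Axes in one call:

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only
x = [3, 4, 5, 6, 7]
y = [55, 60, 64, 69, 72]

# Creates a Figure and one Axes inside it
fig, ax = plt.subplots()
ax.plot(x, y, label="sample series")
ax.set_title("Drawing through the Axes object")
ax.set_xlabel("x-axis")
ax.set_ylabel("y-axis")
ax.legend(loc="upper left")
```

Keeping every call on ax makes it explicit which Axes is being drawn on, which matters once a figure holds more than one.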

Different Types of Plots

Matplotlib provides a wide variety of plot formats to support various methods of visualization. Let’s go through a few of the most popular methods of Data Visualization using Matplotlib:

Bar Graphs/Plots

These graphs represent data through bars. The bar() function is used to plot a bar graph. Bars can be plotted vertically (the default) or horizontally (using the barh() function).

Bar graphs are typically used with categorical variables. However, they can also be used with numerical variables, as we’ll see in our case. These graphs work with discrete values, so we’ll convert our Ladder score to integer type, which truncates the decimal part.

 
#Converting to int
HappinessScore = score.apply(int)
#Counting the number of times each score occurs – the height of the bars
count = HappinessScore.value_counts()
#Score of each count – X-axis
HapScore = count.index
#Plotting the bar graph
plt.bar(HapScore, count)
plt.title('Happiness Score')
plt.xlabel('Score')
plt.ylabel('Count')
plt.show()
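For categorical data, a horizontal bar chart drawn with barh() is often easier to read because the category labels run along the y-axis. The categories and counts below are made up for illustration:

```python
import matplotlib.pyplot as plt

# Made-up categories and counts for illustration only
regions = ["Region A", "Region B", "Region C"]
counts = [12, 7, 15]

fig, ax = plt.subplots()
# barh(y, width): categories on the y-axis, bar length taken from the values
bars = ax.barh(regions, counts)
ax.set_xlabel("Count")
ax.set_title("Horizontal bar chart")
```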

Histograms

Similar to bar graphs, histograms display the frequency (count) of values falling into intervals called bins. The frequency is shown on the y-axis and the intervals on the x-axis. The hist() function is used to plot histograms, as shown below:

 
plt.hist(score)
plt.title('Happiness Score Distribution')
plt.xlabel('Happiness Score')
plt.ylabel('Frequency')
plt.show()

The happiness score takes a continuous range of values, so the histogram above gives a general idea of how the scores are distributed, although we can’t read exact figures just by looking at it.
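The number of intervals is controlled by the bins parameter of hist() (10 by default). hist() also returns the counts and bin edges, so the exact figures can be read from its return value. A sketch with made-up scores:

```python
import matplotlib.pyplot as plt

# Made-up scores for illustration only
scores = [2.5, 4.4, 4.8, 5.1, 5.3, 5.5, 5.9, 6.0, 6.3, 6.6, 7.1, 7.6]

# Five equal-width bins between the smallest and largest value
counts, edges, patches = plt.hist(scores, bins=5)
plt.xlabel("Score")
plt.ylabel("Frequency")

print(counts)  # number of values falling in each bin
print(edges)   # the 6 boundaries of the 5 bins
```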

Boxplots

A boxplot is used to display data distribution through quartiles. The quartiles (Q1, Q2, Q3) are the three cut points that divide the sorted data into four equal-sized groups. Q2 is the median, which separates the lower half and upper half of the data.

The function used for the boxplot is boxplot(). Boxplots are used to detect outliers in the data and to see how tightly the data is grouped.

Let’s see if our Ladder score has any outlier data:

 
plt.boxplot(df['Ladder score'])
plt.show()

Hardly: there’s just one little circle below the lower whisker (the ‘Minimum’).
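By default, boxplot() draws as individual points any values lying more than 1.5 times the interquartile range (Q3 minus Q1) below Q1 or above Q3. The sketch below, using made-up scores, computes those limits with NumPy and compares the result with the outliers that boxplot() reports. The variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up scores for illustration only
scores = np.array([2.5, 4.8, 5.0, 5.2, 5.5, 5.6, 5.9, 6.1, 6.4, 7.0, 7.8])

# First and third quartiles and the interquartile range
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1

# Values beyond these limits are treated as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = scores[(scores < lower_limit) | (scores > upper_limit)]

fig, ax = plt.subplots()
box = ax.boxplot(scores)
# boxplot() returns a dict; "fliers" holds the outlier markers
fliers = box["fliers"][0].get_ydata()

print(outliers, fliers)
```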

Scatter Plots

Scatter plots often reveal relationships or associations between two numerical variables. The function used for the scatter plot is scatter(), as shown below:

 
plt.scatter(gdp, score)
plt.title('GDP vs Happiness Score')
plt.xlabel('GDP per Capita')
plt.ylabel('Happiness Score')
plt.show()

As expected, the higher a country’s GDP per capita, the higher its happiness score tends to be.
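scatter() can also encode a third variable through marker color. As a sketch with made-up numbers, the c parameter colors each point by a third list of values, and plt.colorbar() adds the matching scale:

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only
x = [8.0, 8.9, 9.5, 10.2, 10.9]
y = [4.1, 5.0, 5.6, 6.4, 7.3]
z = [55, 60, 64, 69, 72]   # third variable mapped to color

# s sets the marker size, alpha sets the transparency
points = plt.scatter(x, y, c=z, s=40, alpha=0.8, cmap="viridis")
plt.colorbar(points, label="third variable")
plt.xlabel("x")
plt.ylabel("y")
```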

Saving Plots

Let’s try saving the scatter plot we have created above:

 
fig = plt.figure()
plt.scatter(gdp, score)
plt.title('GDP vs Happiness Score')
plt.xlabel('GDP per Capita')
plt.ylabel('Happiness Score')
fig.savefig('scatterplot.png')

The image will be saved in the current working directory with the filename ‘scatterplot.png’.
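savefig() accepts optional arguments that are often useful, such as dpi for the output resolution and bbox_inches='tight' to trim surrounding whitespace. The following sketch uses made-up data, writes to a temporary directory (the file name is arbitrary), and confirms that the file exists:

```python
import os
import tempfile
import matplotlib.pyplot as plt

# Made-up values for illustration only
x = [8.0, 8.9, 9.5, 10.2, 10.9]
y = [4.1, 5.0, 5.6, 6.4, 7.3]

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_title("Sample scatter plot")

# Write the PNG into a freshly created temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "sample_scatter.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")

saved = os.path.exists(out_path)
print(out_path, saved)
```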

To view the saved image, we’ll use the matplotlib.image module, as shown below:

 
#Displaying the saved image
import matplotlib.image as mpimg
image = mpimg.imread("scatterplot.png")
plt.imshow(image)
plt.show()

Endnotes

Matplotlib is one of the oldest Python data visualization libraries, and thanks to its wealth of features and ease of use it is still one of the most widely used ones. Matplotlib was first released back in 2003 and has been continuously updated since. We hope this article helped you understand the concepts of Data Visualization using Matplotlib.

