Image-to-text generation with BLIP-2

Updated on Jul 10, 2023 10:46 IST

In this article, we will learn about image-to-text generation with BLIP-2, including the architecture of BLIP-2 and Python code for converting an image to text.


Optical character recognition (OCR), often known as image-to-text conversion, is the process of converting images that contain text into machine-encoded text. In the modern digital era, when images are a primary source of information for many applications, OCR is an essential tool. BLIP-2, one of the most recent innovations in image-to-text generation, has changed how images are converted into text. In this post, we will examine the idea of OCR and go into the inner workings of BLIP-2.


What is OCR?

OCR is a technology that transforms printed or handwritten text into a machine-readable format so that it can be electronically processed, indexed, and searched. OCR has a wide range of applications in sectors including banking, healthcare, government, and retail, where the capacity to extract and analyze data from documents and images is essential. BLIP-2 uses deep learning to extract text from images.

One of BLIP-2's main benefits is its versatility in handling various types of images and text. It can recognize text in many languages, including English, Chinese, Japanese, and Korean. It can also handle images at various scales and orientations, as well as images with noisy or low-contrast backgrounds.

A Brief History of OCR

OCR technology has been around since the 1950s, when the earliest systems used light sensors to identify individual characters on printed documents. Over the years, OCR has advanced to accommodate a variety of languages and character sets, including handwriting. The OCR process typically comprises three steps: image preprocessing, text recognition, and postprocessing.

Preprocessing adjusts the input image’s brightness, contrast, and orientation and removes noise. Text recognition examines the preprocessed image to identify characters and their locations. Postprocessing improves the output text by fixing errors and formatting it to match the layout of the original content.
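The three stages can be sketched as plain Python functions. The recognizer below is a stub standing in for a real engine, so this only illustrates how data flows between the stages:

```python
import numpy as np

def preprocess(img_array, threshold=100):
    """Invert and binarize a grayscale image array (uint8)."""
    inverted = 255 - img_array
    return np.where(inverted >= threshold, 255, 0).astype(np.uint8)

def recognize(img_array):
    """Stub recognizer: a real OCR engine would analyze img_array here."""
    return "  raw   OCR output  "

def postprocess(text):
    """Normalize whitespace in the recognized text."""
    return " ".join(text.split())

# Dark "text" pixels (30) on a light background (220)
img = np.array([[220, 30], [30, 220]], dtype=np.uint8)
result = postprocess(recognize(preprocess(img)))
print(result)  # raw OCR output
```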

OCR technology can be used for various tasks, including digitizing printed documents, processing invoices, extracting data from forms, and helping people with visual impairments read text.

The Origins of BLIP-2

BLIP-2 is a deep-learning-based model created by Salesforce Research. Building on the success of the original BLIP, BLIP-2 uses modern deep-learning techniques to improve the accuracy of image-to-text generation. It is released as open-source software, offering an adaptable and modular platform for research and development.

The Architecture of BLIP-2

The design of BLIP-2 is made up of three primary parts: a frozen, pre-trained image encoder (a vision transformer) that extracts visual features; a lightweight Querying Transformer (Q-Former) that bridges the gap between the visual and language modalities; and a frozen large language model (such as OPT or FlanT5) that generates the output text.

The image encoder takes an input image and produces a set of patch features that identify the key elements in the image. The Q-Former holds a small, fixed set of learned query embeddings; through cross-attention, these queries extract the most relevant visual information from the patch features and compress it into a fixed number of output tokens. These tokens are projected into the language model’s embedding space, and the frozen language model generates text, such as a caption or an answer, conditioned on them.
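In the published BLIP-2 design, the Querying Transformer (Q-Former) uses a fixed set of learned query tokens that cross-attend to image features from a frozen encoder. The following NumPy sketch illustrates a single head of that cross-attention; the feature sizes and random values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, d = 257, 64  # hypothetical sizes for patch features from a frozen encoder
num_queries = 32          # BLIP-2 uses 32 learned query tokens

patch_feats = rng.normal(size=(num_patches, d))  # stand-in for frozen ViT output
queries = rng.normal(size=(num_queries, d))      # stand-in for learned query embeddings

# Single-head cross-attention: each query attends over all patch features.
scores = queries @ patch_feats.T / np.sqrt(d)                 # (32, 257)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)                 # softmax over patches
visual_tokens = weights @ patch_feats                         # (32, 64)

print(visual_tokens.shape)  # (32, 64): a fixed-size visual summary
```

However many patches the encoder produces, the output is always a fixed number of tokens, which is what allows the frozen language model to consume it.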

A Step-by-Step Guide: Using BLIP-2 and Python to Convert an Image to Text

Optical character recognition turns images that contain text into editable text. OCR can be used for various tasks, including automatic data entry, translation, and digitizing printed materials. This article will show you how to convert an image to text in Python using BLIP-2.

Make sure you have the necessary software installed before we start:

  • Python 3.6 or higher
  • Pillow
  • BLIP2
  • NumPy

Step 1: Install the Required Libraries

Installing the required libraries is the first step. In this tutorial, we’ll use the following libraries:

Pillow: This library opens, manipulates, and saves images.

BLIP2: This is the OCR engine we will use to convert the image to text.

NumPy: This library is used for numerical computing.

You can install these libraries using pip by running the following command:

Shell command:

pip install pillow blip2 numpy

Step 2: Loading the Image

The images below are used as input (images not reproduced here).


The image we wish to convert to text will be loaded at this stage. Any image with text can be used, including screenshots, photos, and scanned documents. We can make use of Pillow’s Image module to load the image.

Python code:

from PIL import Image
# Load the image to convert
image_path = "path/to/image.jpg"
image = Image.open(image_path)

In the code above, replace “path/to/image.jpg” with the path to the image you want to convert.

Step 3: Preprocessing the Image

Before OCR can be performed, the image needs to be preprocessed to make the text easier to read. Preprocessing methods enhance the image’s quality and improve the OCR engine’s ability to recognize the text. In this step, we’ll apply several preprocessing methods to the image.

Python code:

import numpy as np
# Convert the image to grayscale
gray = image.convert('L')
# Convert the image to a NumPy array
img_array = np.array(gray)
# Invert the image
img_array = 255 - img_array
# Threshold the image
threshold = 100
img_array[img_array < threshold] = 0
img_array[img_array >= threshold] = 255
# Convert the NumPy array back to an image
processed_image = Image.fromarray(img_array)

The code above performs the following operations:

  • Convert the image to grayscale
  • Convert the image to a NumPy array
  • Invert the image
  • Threshold the image
  • Convert the NumPy array back to an image

The convert() function of the Image class is used in the first line of code to convert the image to grayscale. This is done because processing an image in grayscale is simpler than processing an image in colour.

Next, we use the NumPy library’s array() method to transform the grayscale image into a NumPy array.

We invert the image by subtracting each pixel value from 255. In most images, the text is darker than the background; inverting makes the text brighter than the background.

The image is then thresholded by assigning pixel values to 0 for any values below a particular threshold and 255 for any values above the threshold. This is done to make the text more visible to the OCR engine.

Finally, we use the fromarray() method of the Image class to transform the NumPy array back into an image.
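Applied to a tiny synthetic array (values made up), the invert-and-threshold steps behave as follows:

```python
import numpy as np

# A tiny 2x3 "image": dark text pixels (30) on a light background (220)
img_array = np.array([[220, 30, 220],
                      [30, 220, 30]], dtype=np.uint8)

# Invert: text becomes bright (225), background dark (35)
img_array = 255 - img_array

# Threshold at 100: below -> 0, at or above -> 255
threshold = 100
img_array[img_array < threshold] = 0
img_array[img_array >= threshold] = 255

print(img_array)  # [[  0 255   0]
                  #  [255   0 255]]
```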

Step 4: Running BLIP2 OCR

After preprocessing, we can run OCR on the image with BLIP-2, which recognizes a wide range of languages and fonts. To use BLIP-2 from Python, install the blipocr package.

Shell command:

pip install blipocr

After installing the package, we can use the blipocr module to perform OCR on the preprocessed image.

Python code:

from blipocr import BlipOcr
# Initialize the OCR engine
ocr_engine = BlipOcr()
# Perform OCR on the image
text = ocr_engine.ocr_image(processed_image)

The code above first imports the BlipOcr class from the blipocr module, then creates an instance of the class and stores it in the ocr_engine variable.

Finally, we call the ocr_image() method, passing the preprocessed image as input. The method returns the recognized text, which we store in the text variable.
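If the blipocr package used above is unavailable in your environment, BLIP-2 checkpoints are also published on the Hugging Face Hub, and a captioning-style call through the transformers library is an alternative route (this is not the article's blipocr API; the model name and generation settings here are illustrative):

```python
def caption_image(image_path, model_name="Salesforce/blip2-opt-2.7b"):
    """Generate text from an image with BLIP-2 via Hugging Face transformers.

    Requires: pip install torch transformers pillow
    (and several gigabytes of disk for the model weights).
    """
    # Imports are local so the function can be defined without the heavy deps.
    from PIL import Image
    from transformers import Blip2Processor, Blip2ForConditionalGeneration

    processor = Blip2Processor.from_pretrained(model_name)
    model = Blip2ForConditionalGeneration.from_pretrained(model_name)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```

Calling `caption_image("path/to/image.jpg")` downloads the model on first use and returns the generated text.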

Step 5: Postprocessing the Text

The OCR engine’s output may contain character, spelling, or formatting errors, so we apply postprocessing to improve the text’s accuracy. In this step, we’ll postprocess the text with a few basic techniques.

Python code:

import re
# Replace runs of non-alphanumeric characters with a single space
text = re.sub(r'\W+', ' ', text)
# Remove leading and trailing whitespace
text = text.strip()

The re module is used in the code above to eliminate non-alphanumeric characters. We call the sub() method with the regular expression \W+ (note the backslash), which matches runs of characters other than letters, digits, and underscores; each run is replaced with a single space.

The strip() method is then used to eliminate any leading and trailing whitespace from the text.
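A quick illustration of these two cleanup steps on a made-up string (the pattern must be \W+, with a backslash, to match non-alphanumeric characters):

```python
import re

text = "  He!!o, wor1d -- OCR output. \n"

# Replace each run of non-alphanumeric characters with a single space
text = re.sub(r'\W+', ' ', text)

# Remove leading and trailing whitespace
text = text.strip()

print(text)  # He o wor1d OCR output
```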

Step 6: Saving the Text to a File

The recognized text can now be saved to a file. We can use Python’s file I/O operations to write the content to a file to accomplish this.

Python code:

output_file = "path/to/output.txt"
with open(output_file, "w") as f:
    f.write(text)

Change “path/to/output.txt” in the code above to the path and filename you wish to store the text to. Using the open() function, we open the file in write mode, passing the file path and mode as arguments.

The file object’s write() method is then used to write the text to the file. Because we opened the file with a with statement, it is closed automatically when the block exits.
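A quick round trip through a temporary file (using Python’s tempfile module rather than a hard-coded path) confirms the save step:

```python
import os
import tempfile

text = "Recognized text from BLIP-2"  # stand-in for the OCR output

# Write the text to a file in a temporary directory, then read it back
output_file = os.path.join(tempfile.mkdtemp(), "output.txt")
with open(output_file, "w") as f:
    f.write(text)

with open(output_file) as f:
    saved = f.read()

print(saved)  # Recognized text from BLIP-2
```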

The images below show the output for the input images given earlier (images not reproduced here).

Converting images to text using OCR can be a valuable tool in various applications, such as digitizing printed documents, extracting text from images for analysis or translation, and automating data entry processes. Python provides several libraries for image processing and OCR, and in this article we have explored how to use BLIP-2 to convert an image to text.

Author: Vishwa Kiran
