Extract text from pdf or image in Python

This tutorial shows you how to extract text from a pdf or an image with Tesseract OCR in Python. Tesseract OCR offers a number of methods to extract text from an image, and I cover four of them in this tutorial. I also extract a specific value from an invoice by using a bounding box.

It can be useful to extract text from a pdf or an image when we are working with machine learning. We might use pdf files as our data source, or we might want to extract certain information from a pdf or an image based on model predictions.

You need to install Tesseract OCR and unpack poppler to run the code in this tutorial. You also need to add the paths to poppler and Tesseract OCR to your environment variables. Check out my previous post, Install Python and libraries, if you have difficulties with this.
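A quick way to confirm the setup is to ask Python whether it can find the executables. This is a minimal sketch, assuming the tools are exposed through the PATH variable under the names tesseract and pdftoppm (one of the poppler utilities that pdf2image calls). The helper name find_tools is made up for illustration.

```python
import shutil


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Assumed executable names: 'tesseract' (Tesseract OCR) and 'pdftoppm' (poppler)
tools = find_tools(['tesseract', 'pdftoppm'])
for name, path in tools.items():
    print(name, '->', path if path else 'NOT FOUND on PATH')
```

If a tool is reported as not found, the path was probably not added correctly, or the terminal needs to be restarted to pick up the change.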

Data and libraries

I am using an invoice as the data source in this tutorial (download it). I am going to convert this .pdf to images and extract text from one of the images. You will need the following libraries: pandas, pdf2image and pytesseract.

Convert image to a string

I start by converting the .pdf file to images, one image per page in the file. I do not want the images to be too big, but I need a satisfactory resolution (dpi=200) to be able to extract the data I want. I also set the size of the images, which is useful if you have many pdf files and want all pages to have the same size. I save all the pages to disk and convert page 2 to a string.

# Import libraries
import pandas as pd
import pytesseract as pt
import pdf2image

# Read a pdf file as image pages
# We do not want images to be too big, dpi=200
# All our images should have the same size (depends on dpi), width=1654 and height=2340
pages = pdf2image.convert_from_path(pdf_path='files\\spcs-ob-893.pdf', dpi=200, size=(1654,2340))

# Save all pages as images
for i, page in enumerate(pages):
    page.save('images\\spcs-ob-893_p' + str(i) + '.jpg')

# Convert a page to a string (page 2)
content = pt.image_to_string(pages[1], lang='swe')
print(content)

2(2)

[6.4 A.D Sh Faktura
. op
, se . Fakturanummer / Kundnummer Fakturadatum
Inredning för alla jobb Srena NOA Maden
893 102 2019-10-24

Leveransadress Fakturaadress
Grossisten HB Grossisten HB

Norrvägen 10 Box 45

Söråkersvägen
124 80 SÖRÅKER 124 78 SÖRÅKER

Er referens Anne Karlsson Vår referens Anders Oleby
Ert ordernr Betalningsvillkor 30 dagar netto
Leveransvillkor Fritt kund Förfallodatum 2019-11-23
Leveranssätt Speditör Dröjsmålsränta 10,00 Yo
Leveransdatum 2019-10-24
Ert VAT-nummer SE458956234901
( a
Artnr Benämning Lev ant Enh A-pris Summa
(CC MO MO M— — — — — — — — — — — VM,
116 Uppgrad Beta till Delta 1 st 3 320,00 3 320,00
117 Uppgrad Delta till Gamma 1 st 5 270,00 5 270,00
118 Uppagr Lilla Pers till Pers 1 st 3 320,00 3 320,00
150 Årsuppgrad Lilla Personaladm 3 st 2720,00 8 160,00
151 Årsuppgradering Personaladm 4 st 3 150,00 12 600,00
Netto Frakt Exp avg Exkl moms Moms ” Moms kr Öresavr ATT BETALA
446 050,00 50,00 6,00 446 106,00 25 89 902,50 0,50 536 009,00
) I
( Vår växel är öppen 08.00 - 17.00
|
Adress Tel PlusGiro
Övningsbolaget AB 012-34 56 78 17 60 99-0
Box 1 Fax Bankgiro e-post
Storgatan 1 012-34 56 80 991-2346 echo(Qvismaspes.se
12345 STORSTAD Företagets säte Organisationsnr Momsreg.nr

Storstad 555555-5555 SE555555555501
Godkänd för F-skatt
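Once the page is a plain string, regular expressions are a simple way to pick out values. The sketch below is only an illustration: the sample text is copied from the output above, the function name find_due_date is hypothetical, and the pattern assumes that the due date follows the label Förfallodatum on the same line in YYYY-MM-DD format.

```python
import re

# A few lines copied from the OCR output above
sample = """Ert ordernr Betalningsvillkor 30 dagar netto
Leveransvillkor Fritt kund Förfallodatum 2019-11-23
Leveranssätt Speditör Dröjsmålsränta 10,00 Yo
Leveransdatum 2019-10-24"""


def find_due_date(text):
    """Return the date that follows the label 'Förfallodatum', or None."""
    match = re.search(r'Förfallodatum\s+(\d{4}-\d{2}-\d{2})', text)
    return match.group(1) if match else None


# The due date, and every date-like value in reading order
due_date = find_due_date(sample)
all_dates = re.findall(r'\d{4}-\d{2}-\d{2}', sample)
print(due_date)
print(all_dates)
```

This approach depends on the label being recognized correctly, so it is fragile when the OCR output is noisy.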

Convert image to boxes

This method converts the image into characters and their bounding boxes. This information might be useful in some situations.

# Import libraries
import pandas as pd
import pytesseract as pt
import pdf2image

# Read a pdf file as image pages
pages = pdf2image.convert_from_path(pdf_path='files\\spcs-ob-893.pdf', dpi=200, size=(1654,2340))

# Convert a page to chars with bounding boxes (page 2)
content = pt.image_to_boxes(pages[1], lang='swe', nice=0)
print(content)

6 256 526 266 542 0
0 274 526 285 542 0
5 287 526 297 542 0
0 299 526 310 542 0
, 312 523 315 528 0
0 318 526 328 542 0
0 330 526 341 542 0
5 404 526 415 542 0
0 416 526 427 542 0
, 430 523 432 528 0
...
d 1141 132 1152 148 0
k 1147 132 1159 148 0
ä 1155 132 1164 148 0
n 1165 132 1176 148 0
d 1178 132 1200 148 0
f 1208 132 1215 148 0
ö 1215 132 1226 148 0
r 1228 132 1234 144 0
F 1242 132 1253 148 0
- 1255 137 1261 139 0
s 1262 132 1272 144 0
k 1269 132 1279 148 0
a 1274 132 1283 148 0
t 1285 132 1295 144 0
t 1297 132 1308 147 0
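Each line in this output holds a character, its left, bottom, right and top coordinates, and a page number. The coordinates follow the Tesseract box file convention with the origin in the bottom-left corner of the image, which is why the footer text above has small y-values. The sketch below parses the output and flips the y-axis to the top-left origin that most image libraries use. The function name parse_boxes is made up for illustration, and the image height of 2340 is taken from the size argument used in this tutorial.

```python
def parse_boxes(box_text, image_height):
    """Parse box output into dicts with a top-left origin for the y-axis."""
    boxes = []
    for line in box_text.splitlines():
        # Split from the right so that the character itself is kept intact
        parts = line.rsplit(' ', 5)
        if len(parts) != 6:
            continue  # Skip blank or malformed lines
        char, left, bottom, right, top, page = parts
        boxes.append({
            'char': char,
            'left': int(left),
            'top': image_height - int(top),
            'right': int(right),
            'bottom': image_height - int(bottom),
        })
    return boxes


# Two lines copied from the output above
sample = "F 1242 132 1253 148 0\n- 1255 137 1261 139 0"
boxes = parse_boxes(sample, image_height=2340)
for box in boxes:
    print(box)
```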

Convert image to hOCR

hOCR is an open standard for representing text from optical character recognition (OCR) in XML or XHTML. The output gives information about the layout, classes and bounding boxes.

# Import libraries
import pandas as pd
import pytesseract as pt
import pdf2image

# Read a pdf file as image pages
pages = pdf2image.convert_from_path(pdf_path='files\\spcs-ob-893.pdf', dpi=200, size=(1654,2340))

# Convert a page to hocr (page 2)
content = pt.image_to_pdf_or_hocr(pages[1], lang='swe', nice=0, extension='hocr')

# Write content to a new file (w = overwrite, a = append, b = binary)
# The content is already a bytes object, the with statement closes the file
with open('files\\spcs-ob-893_p1.hocr', 'wb') as f:
    f.write(content)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract v5.0.0-alpha.20191030' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "C:\Users\info\AppData\Local\Temp\tess_gwr_13g3.PPM"; bbox 0 0 827 1170; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 721 48 741 58">
    <p class='ocr_par' id='par_1_1' lang='swe' title="bbox 721 48 741 58">
     <span class='ocr_caption' id='line_1_1' title="bbox 721 48 741 58; baseline 0.1 -2; x_size 20; x_descenders 5; x_ascenders 5">
      <span class='ocrx_word' id='word_1_1' title='bbox 721 48 741 58; x_wconf 68'>202)</span>
     </span>
    </p>
   </div>
   <div class='ocr_carea' id='block_1_2' title="bbox 630 90 753 138">
    <p class='ocr_par' id='par_1_2' lang='swe' title="bbox 630 90 753 138">
     <span class='ocr_line' id='line_1_2' title="bbox 630 90 753 138; baseline 0 1032; x_size 61.25; x_descenders 15.3125; x_ascenders 15.3125">
      <span class='ocrx_word' id='word_1_2' title='bbox 630 90 753 138; x_wconf 95'> </span>
     </span>
    </p>
   </div>
   <div class='ocr_carea' id='block_1_3' title="bbox 77 141 742 142">
    <p class='ocr_par' id='par_1_3' lang='swe' title="bbox 77 141 742 142">
     <span class='ocr_line' id='line_1_3' title="bbox 77 141 742 142; baseline 0 0; x_size 0.5; x_descenders -0.25; x_ascenders 0.25">
      <span class='ocrx_word' id='word_1_3' title='bbox 77 141 742 142; x_wconf 95'> </span>
     </span>
    </p>
   </div>
   ...
   <div class='ocr_carea' id='block_1_19' title="bbox 556 1096 654 1104">
    <p class='ocr_par' id='par_1_19' lang='swe' title="bbox 556 1096 654 1104">
     <span class='ocr_line' id='line_1_42' title="bbox 556 1096 654 1104; baseline 0 0; x_size 20; x_descenders 5; x_ascenders 5">
      <span class='ocrx_word' id='word_1_167' title='bbox 556 1096 600 1104; x_wconf 69'>Godkänd</span>
      <span class='ocrx_word' id='word_1_168' title='bbox 604 1096 617 1104; x_wconf 95'>för</span>
      <span class='ocrx_word' id='word_1_169' title='bbox 621 1096 654 1104; x_wconf 90'>F-skatt</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>
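The hOCR file can be read back to get each word with its bounding box and confidence. The sketch below uses the HTML parser in the Python standard library and assumes that words are marked up as in the output above: span elements with the class ocrx_word and a title that holds bbox and x_wconf. The class name HocrWordParser is made up for illustration.

```python
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect text, bounding box and confidence for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'span' and attrs.get('class') == 'ocrx_word':
            # The title looks like: bbox 556 1096 600 1104; x_wconf 69
            props = {}
            for part in attrs.get('title', '').split(';'):
                key, _, value = part.strip().partition(' ')
                props[key] = value
            self._current = {
                'text': '',
                'bbox': tuple(int(v) for v in props.get('bbox', '').split()),
                'conf': float(props.get('x_wconf', -1)),
            }

    def handle_data(self, data):
        # Only keep text that sits inside a word span
        if self._current is not None:
            self._current['text'] += data

    def handle_endtag(self, tag):
        if tag == 'span' and self._current is not None:
            self.words.append(self._current)
            self._current = None


# The last line of the hOCR output above
sample = """<span class='ocr_line' id='line_1_42' title="bbox 556 1096 654 1104">
 <span class='ocrx_word' id='word_1_167' title='bbox 556 1096 600 1104; x_wconf 69'>Godkänd</span>
 <span class='ocrx_word' id='word_1_168' title='bbox 604 1096 617 1104; x_wconf 95'>för</span>
 <span class='ocrx_word' id='word_1_169' title='bbox 621 1096 654 1104; x_wconf 90'>F-skatt</span>
</span>"""

parser = HocrWordParser()
parser.feed(sample)
words = parser.words
for word in words:
    print(word)
```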

Convert image to data frame

This is my favorite method as I get information about the text, its bounding box and the confidence level. The image is converted to a data frame, and I remove unnecessary columns and sort the data frame. I then declare a bounding box that covers the invoice number and extract the information by looping over the rows in the data frame. Finally, the data frame is saved to a .csv file.

# Import libraries
import pandas as pd
import pytesseract as pt
import pdf2image

# Read a pdf file as image pages
pages = pdf2image.convert_from_path(pdf_path='files\\spcs-ob-893.pdf', dpi=200, size=(1654,2340))

# Convert a page to a data frame (page 2)
ds = pt.image_to_data(pages[1], lang='swe', nice=0, output_type='data.frame')

# Remove unnecessary columns in the data set
ds = ds.drop(columns=['level','page_num', 'block_num', 'par_num', 'line_num', 'word_num'])

# Sort data set top-to-bottom and left-to-right
ds = ds.sort_values(by=['top', 'left'], ascending=True)

# Create a bounding box around invoice number (Fakturanummer)
box = [825,220,990,260]

# Create a string
output = ''

# Loop data set and get contents inside the bounding box
for index, row in ds.iterrows():
    # Rows without recognized text (blocks, paragraphs, lines) hold NaN, skip them
    if not isinstance(row['text'], str):
        continue
    xmin = row['left']
    ymin = row['top']
    xmax = row['left'] + row['width']
    ymax = row['top'] + row['height']
    if(xmin >= box[0] and ymin >= box[1] and xmax <= box[2] and ymax <= box[3]):
        output += row['text']

# Print contents
print('Invoice number is: {0}'.format(output))

# Convert data frame to csv
ds.to_csv('files\\spcs-ob-893_p1.csv', index=False)
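The bounding box lookup can be wrapped in a small reusable function if you want to extract several fields. The sketch below is a pure Python illustration that works on a list of dictionaries, which is what ds.to_dict('records') returns for a data frame. The function name words_in_box is made up, and the sample rows are hypothetical values chosen for the example, not real output from the invoice.

```python
def words_in_box(rows, box):
    """Join the text of all rows that lie completely inside box.

    box is [xmin, ymin, xmax, ymax] and every row has the keys
    left, top, width, height and text.
    """
    xmin, ymin, xmax, ymax = box
    words = []
    for row in rows:
        text = row['text']
        # Skip rows without recognized text (NaN or only whitespace)
        if not isinstance(text, str) or not text.strip():
            continue
        right = row['left'] + row['width']
        bottom = row['top'] + row['height']
        if row['left'] >= xmin and row['top'] >= ymin and right <= xmax and bottom <= ymax:
            words.append(text)
    return ' '.join(words)


# Hypothetical rows, in a real run they would come from ds.to_dict('records')
rows = [
    {'left': 830, 'top': 225, 'width': 100, 'height': 30, 'text': float('nan')},
    {'left': 860, 'top': 228, 'width': 45, 'height': 20, 'text': '893'},
    {'left': 1020, 'top': 228, 'width': 45, 'height': 20, 'text': '102'},
]

# The same bounding box as above, around the invoice number (Fakturanummer)
box = [825, 220, 990, 260]
print('Invoice number is: {0}'.format(words_in_box(rows, box)))
```

Joining the words with a space keeps values that consist of several words readable, for example a name or an address line.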