Popular PDF Parsing Tools for Efficient Document Extraction

Extracting Text From PDFs

Introduction

In today's digital age, it's common for businesses and individuals to be overwhelmed by large collections of documents in various formats, including PDFs. Extracting valuable information from these documents efficiently can be a challenging task. However, with the advancement of document parser tools, this process has become significantly easier. In this blog post, we will examine some popular options and provide a comparison to help you make an informed decision and choose the best tool for your needs.

Sample PDF :

  • This is a sample PDF we are using to compare

1. pypdf :

pypdf is a user-friendly and open-source Python library for manipulating PDFs as well as extracting text from PDF documents. It's perfect for simple tasks and can be easily used on your local machine. Plus, it's free to use and readily available for implementation. With pypdf, you can extract text quickly and efficiently, making it a valuable tool for developers and individuals alike.

LINK:Learn more about PYPDF

Code To Use

import pypdf

def pypdf_extract(pdf_path):
  """Extracts text using pypdf"""
  with open(pdf_path, 'rb') as pdf_file:
    pdf_reader = pypdf.PdfReader(pdf_file)
    text = ''
    for page_num in range(len(pdf_reader.pages)):
      page = pdf_reader.pages[page_num]
      text +="\n"+ page.extract_text(extraction_mode="layout", layout_mode_space_vertically=False)
    return text

Extracted Text From Sample PDF:

                    A)    Different     Languages:
           1.English: The quickbrown  foxjumps over thelazy dog
       2.Spanish:  Elrápido zorromarrón saltasobre elperro perezoso.
                 3.Hindi: तेजभूरालोमड़ी सु�त कु�ेपरकूदताहै।
    4. French: Le rapiderenard brun sautepar-dessus lechien paresseux.
                 5. Arabic: لوﺳﻛﻟابﻠﻛﻟاقوﻓزﻔﻘﯾﻊﯾرﺳﻟاﻲﻧﺑﻟابﻠﻌﺛﻟا.
             6.Bengali: �ত  বাদািমখরেগাশঅলস  ��েরর উপর  লাফায়।
7.Russian:  Быстрая  коричневая  лиса перепрыгивает  через ленивую  собаку.
  8.Portuguese:  A rápidaraposa marrom  saltasobre o cachorro preguiçoso.
                9.Urdu: ۔ﮯﮨ ﺎﺗدوﮐرﭘواﮯﮐﮯﺗﮐتﺳﺳ یڑﻣوﻟیروﮭﺑزﯾﺗ
              10.Mandarin  Chinese:  快 速 的 棕 色 狐 狸 跳 过 懒 狗。
                               B)    Emoji:
😀😃😄😁😆😅😂🤣🥲🥹☺😊😇🙂🙃😉😌😍🥰😘😗😙😚😋😛
😝😜🤪🤨🧐🤓😎🥸🤩🥳󰻷😏😒󰻶😞😔😟😕🙁☹😣😖😫😩🥺
             😢😭󰷹😤😠😡🤬🤯😳🥵🥶😱😨😰😥😓
                                C)   Table:
                            Name     Occupation
                            John      Engineer
                            Emily     Manager
                            Sarah     Designer
        D) Image:
E) Hand   Written (image)

Extraction on Different Formats of Data:

  • Normal English Text ✅

  • Popular Languages text ✅

  • Emoji ✅

  • Tabular Data ✅

  • Image Data ❌

  • Handwritten Text ❌

  • Free ✅

2. LlamaParse :

LlamaParse is a tool that helps computers understand complicated documents, especially PDFs with tables and charts. It's like a super-powered decoder ring that unlocks the information hidden inside these documents. This lets you build systems that can search and answer questions about them, even the tricky ones. You can try out LlamaParse yourself through their public preview to see how well it works.

LINK:Learn more about LlamaParse

Code To Use

import os

import nest_asyncio
from llama_parse import LlamaParse

def llama_parse_extract(pdf_path):
  """Extracts text using LlamaParse."""
  nest_asyncio.apply()  # allows the async parser to run inside a notebook event loop
  parser = LlamaParse(
    # LLAMA_CLOUD_API_KEY: you can get it from the website (it is free now)
    api_key=os.environ["LLAMA_CLOUD_API_KEY"],
    result_type="text",  # "markdown" and "text" are available
  )

  data = parser.load_data(pdf_path)
  return data[0].text

Extracted Text From Sample PDF:

                          A)     Different Languages:


              1.English: The quick brown fox jumps over the lazy dog
          2. Spanish: El rápido zorro marrón salta sobre el perro perezoso.
                     3. Hindi: तेज भूरा लोमड़ी सु▯त कु▯े पर कूदता है।
      4. French: Le rapide renard brun saute par-dessus le chien paresseux.
                      5. Arabic: ولﺳﻛﻟا بﻠﻛﻟا وقﻓ زﻔﻘﯾ ﻊﯾرﺳﻟا ﻲﻧﺑﻟا بﻠﻌﺛﻟا.
                6. Bengali: ▯ত বাদািম খরেগাশ অলস ▯▯েরর উপর লাফায়।
 7. Russian: Быстрая коричневая лиса перепрыгивает через ленивую собаку.
   8. Portuguese: A rápida raposa marrom salta sobre o cachorro preguiçoso.
                    9. Urdu: ۔ےہ ﺎﺗودک رپاو ےک ےﺗک تﺳﺳ ڑیﻣوﻟ وریﮭﺑ زﯾﺗ
                  10. Mandarin Chinese: 快速的棕色狐狸跳过懒狗。


                                      B)     Emoji:


😀😃😄😁😆😅😂🤣🥲🥹☺😊😇🙂🙃😉😌😍🥰😘😗😙😚😋😛
😝😜🤪🤨🧐🤓😎🥸🤩🥳󰻷😏😒󰻶😞😔😟😕🙁☹😣😖😫😩🥺
                 😢😭󰷹😤😠😡🤬🤯😳🥵🥶😱😨😰😥😓


                                       C)     Table:


                                  Name       Occupation


                                   John       Engineer


                                   Emily      Manager


                                  Sarah       Designer
---
                       D) Image:
Al DEMOS                  Blog
                FutureSmart AI                         Socials
                               Submit ToolContact usExplore
 Discover Best Al Tools with
               Video Demos
             E) Hand Written (image)
      Testing PDf to text extaqcting tools
       For df ferent     tondftions-

Extraction on Different Formats of Data:

  • Normal English Text ✅

  • Popular Languages text ✅

  • Emoji ✅

  • Tabular Data ✅

  • Image Data ✅ (Accuracy is low: some data might not be extracted)

  • Handwritten Text ✅ (Accuracy is low: some data might not be extracted)

  • Free ✅ (API key is free for trial. Check the plans for more info)

3. PDFMiner :

PDFMiner is a user-friendly and open-source Python library for extracting text from PDF documents. Unlike pypdf, which also manipulates PDFs, PDFMiner focuses primarily on text extraction. It's free to use, works offline, and is readily available for implementation. With PDFMiner, you can extract text quickly, often with better accuracy, and with more room for customization, which simplifies your document parsing process.

LINK:Learn more about PDFMiner

Code To Use

!pip install pdfminer.six

from pdfminer.high_level import extract_text

def pdfminer_extract(pdf_path):
  """Extracts text using pdfminer.six."""
  with open(pdf_path, 'rb') as pdf_file:
    text = extract_text(pdf_file)
    return text

Extracted Text From Sample PDF:

A) Different Languages:

1.English: The quick brown fox jumps over the lazy dog
2. Spanish: El rápido zorro marrón salta sobre el perro perezoso.
3. Hindi: तेज भूरा लोमड़ी सु त कु  े पर कू दता है।
4. French: Le rapide renard brun saute par-dessus le chien paresseux.

زﻔﻘﯾ

بﻠﻛﻟا

لوﺳﻛﻟا

5. Arabic:

قوﻓ
6. Bengali:  ত বাদািম খরেগাশ অলস   েরর উপর লাফায়।
7. Russian: Быстрая коричневая лиса перепрыгивает через ленивую собаку.
8. Portuguese: A rápida raposa marrom salta sobre o cachorro preguiçoso.
ﮯﮐ
10. Mandarin Chinese: 快速的棕色狐狸跳过懒狗。

9. Urdu:

.
بﻠﻌﺛﻟا

یروﮭﺑ

یڑﻣوﻟ

ﻊﯾرﺳﻟا

تﺳﺳ

ﺎﺗدوﮐ

ﻲﻧﺑﻟا

ﮯﺗﮐ

رﭘوا

۔ﮯﮨ

زﯾﺗ

B) Emoji:

😀😃😄😁😆😅😂🤣🥲🥹☺😊😇🙂🙃😉😌😍🥰😘😗😙😚😋😛
😝😜🤪🤨🧐🤓😎🥸🤩🥳󰻷😏😒󰻶😞😔😟😕🙁☹😣😖😫😩🥺
😢😭󰷹😤😠😡🤬🤯😳🥵🥶😱😨😰😥😓

C) Table:

Name

Occupation

John

Engineer

Emily

Manager

Sarah

Designer

D) Image:

E) Hand Written (image)

Extraction on Different Formats of Data:

  • Normal English Text ✅

  • Popular Languages text ✅

  • Emoji ✅

  • Tabular Data ❌ (Extracts the text but does not keep the row order or format; you may need to customise the extraction)

  • Image Data ❌

  • Handwritten Text ❌

  • Free ✅
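
In the PDFMiner output above, the table cells come out one per line, so the rows are lost. As a rough illustration of the kind of customisation this may need, here is a minimal post-processing sketch. The helper name cells_to_rows is hypothetical (it is not part of pdfminer.six), and it assumes the cells arrive in row-major order and that you already know the number of columns.

```python
def cells_to_rows(cell_text, columns=2):
    """Group one-cell-per-line text back into rows of `columns` cells."""
    # Drop the blank lines that separate the cells in the raw output
    cells = [line.strip() for line in cell_text.splitlines() if line.strip()]
    if len(cells) % columns != 0:
        raise ValueError(f"{len(cells)} cells do not fill rows of {columns}")
    # Slice the flat cell list into consecutive rows
    return [cells[i:i + columns] for i in range(0, len(cells), columns)]
```

For the table part of the output above, cells_to_rows("Name\n\nOccupation\n\nJohn\n\nEngineer", columns=2) returns [['Name', 'Occupation'], ['John', 'Engineer']]. This only works when the cell order is predictable, so treat it as a starting point rather than a general table parser.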

4. pdfplumber :

pdfplumber is a user-friendly, open-source, high-level Python library built on top of pdfminer.six. It offers a simpler interface for extracting text and metadata from PDFs, along with detailed information about each text character, rectangle, and line. It also supports table extraction and visual debugging.

LINK:Learn more about PDFPlumber

Code To Use

import pdfplumber

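def pdfplumber_extract(pdf_path):
  """Extracts text using pdfplumber."""
  with pdfplumber.open(pdf_path) as pdf:
    text = ''
    for page in pdf.pages:
      # extract_text() may return None for a page without text; end each page with a newline
      text += (page.extract_text() or '') + '\n'
    return text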

Extracted Text From Sample PDF:

A) Different Languages:
1.English: The quick brown fox jumps over the lazy dog
2. Spanish: El rápido zorro marrón salta sobre el perro perezoso.
3. Hindi: तजे भरू ा लोमड़ी स ु त कु े पर कूदता है।
4. French: Le rapide renard brun saute par-dessus le chien paresseux.
5. Arabic: لوﺳﻛﻟا بﻠﻛﻟا قوﻓ زﻔﻘﯾ ﻊﯾرﺳﻟا ﻲﻧﺑﻟا بﻠﻌﺛﻟا.
6. Bengali:  ত বাদািম খরেগাশ অলস   েরর উপর লাফায়।
7. Russian: Быстрая коричневая лиса перепрыгивает через ленивую собаку.
8. Portuguese: A rápida raposa marrom salta sobre o cachorro preguiçoso.
9. Urdu: ۔ﮯﮨ ﺎﺗدوﮐ رﭘوا ﮯﮐ ﮯﺗﮐ تﺳﺳ یڑﻣوﻟ یروﮭﺑ زﯾﺗ
10. Mandarin Chinese: 快速的棕色狐狸跳过懒狗。
B) Emoji:
😀😃😄😁😆😅😂🤣🥲🥹☺😊😇🙂🙃😉😌😍🥰😘😗😙😚😋😛
😝😜🤪🤨🧐🤓😎🥸🤩🥳󰻷😏😒󰻶😞😔😟😕🙁☹😣😖😫😩🥺
😢😭󰷹😤😠😡🤬🤯😳🥵🥶😱😨😰😥😓
C) Table:
Name Occupation
John Engineer
Emily Manager
Sarah DesignerD) Image:
E) Hand Written (image)

Extraction on Different Formats of Data:

  • Normal English Text ✅

  • Popular Languages text ✅

  • Emoji ✅

  • Tabular Data ✅ (Extract Data row-wise)

  • Image Data ❌

  • Handwritten Text ❌

  • Free ✅
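
Since pdfplumber also offers table extraction, here is a minimal sketch of how the table could be pulled out as structured rows instead of plain text. It relies on pdfplumber's page.extract_tables(), which returns each table as a list of rows. The helper names table_to_records and pdfplumber_extract_tables are my own, and the sketch assumes the first row of every table is its header.

```python
def table_to_records(table):
    """Turn a table given as a list of rows (first row = header) into dicts."""
    if not table:
        return []
    header, *rows = table
    # Cells can be None when a cell is empty, so normalise them to ''
    header = [(cell or '').strip() for cell in header]
    return [
        dict(zip(header, [(cell or '').strip() for cell in row]))
        for row in rows
    ]


def pdfplumber_extract_tables(pdf_path):
    """Returns every table found in the PDF as a list of row dicts."""
    import pdfplumber  # imported here so table_to_records works without it

    records = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                records.extend(table_to_records(table))
    return records
```

For example, table_to_records([["Name", "Occupation"], ["John", "Engineer"]]) returns [{'Name': 'John', 'Occupation': 'Engineer'}]. Whether extract_tables() finds a table depends on how it is drawn: the default settings look for ruling lines, so a table without borders may need the table_settings argument (for instance the "text" strategies) before anything is detected.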

5. AWS Textract :

AWS Textract is a machine learning service offered by Amazon Web Services (AWS) that automatically extracts text, handwriting, and other data from scanned documents. It not only recognizes text but also understands and extracts specific data from documents such as forms and tables.

Some common uses of AWS Textract:

  • Automating data entry: Extracting data from invoices, receipts, tax forms, and other business documents.

  • Creating intelligent search indexes: Enabling efficient search of scanned documents by indexing the extracted text.

  • Improving document processing workflows: Streamlining tasks like loan processing, insurance claims, and legal document review.

LINK:Learn more about AWS Textract

Code To Use

import os

import boto3
import fitz  # PyMuPDF
from dotenv import load_dotenv

# Load AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from a .env file or the environment
load_dotenv()
AWS_ACCESS_KEY_ID = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET_ACCESS_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]

# Create a boto3 session with access keys
session = boto3.Session(aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
textract_client = session.client('textract', region_name='us-east-1')


def textract_extract(pdf_path):
    """Extracts text using Amazon Textract with access keys."""
    pdf_document = fitz.open(pdf_path)
    pages = ""

    for page_number in range(pdf_document.page_count):
        page = pdf_document.load_page(page_number)
        # Render the page as PNG bytes, because Textract runs OCR on images
        image_data = page.get_pixmap().tobytes("png")

        response = textract_client.detect_document_text(Document={'Bytes': image_data})

        # The response contains other data too. For simplicity only the WORD
        # blocks are joined into a string here; modify this if you want JSON
        # or other fields.
        text = ""
        for block in response['Blocks']:
            if block['BlockType'] == 'WORD':
                text += block['Text'] + ' '
        pages += "\n" + text  # Add newline for readability

    return pages
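
The 'Blocks' list in the response carries more than words: it also has LINE blocks, and every block has a bounding box. Here is a minimal sketch of reading the detected lines in page order. The function names are my own, and the input is assumed to be the dictionary returned by detect_document_text.

```python
def lines_with_positions(response):
    """Collect LINE blocks from a Textract response, top-to-bottom, left-to-right."""
    lines = []
    for block in response.get('Blocks', []):
        if block.get('BlockType') != 'LINE':
            continue
        # BoundingBox values are ratios of the page width/height (0 to 1)
        box = block['Geometry']['BoundingBox']
        lines.append({'text': block['Text'], 'top': box['Top'], 'left': box['Left']})
    # Sort by vertical position first, then horizontal
    lines.sort(key=lambda line: (line['top'], line['left']))
    return lines


def lines_to_text(response):
    """Join the LINE blocks into text with one detected line per output line."""
    return "\n".join(line['text'] for line in lines_with_positions(response))
```

Using LINE blocks instead of WORD blocks keeps the line breaks of the page, and the 'top' and 'left' values can be used if you need to know where each line sits. Sorting by the top edge is a simplification that suits single-column pages; multi-column layouts would need more care.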

Extracted Text From Sample PDF:

NOTE: Textract is an OCR-based service, so the code above converts each PDF page into an image and applies OCR to it. Here I extracted only the text, but the 'Blocks' in the response also hold other data, such as lines and the location of the text, which you can customize according to your needs. Tables require the AnalyzeDocument API rather than the DetectDocumentText call used here. Image quality may affect the extracted data.

/nA) Different Languages: 1.English: The quick brown fox jumps over the lazy dog
 2. Spanish: El rápido zorro marrón salta sobre el perro perezoso.
 3. Hindi: to dist 
4. French: Le rapide renard brun saute par-dessus le chien paresseux. 
5. Arabic Just 6. Bengali: 50 4/411 91201 
7. Russian: nuca 4epe3 cobaky. 
8. Portuguese: A rápida raposa marrom salta sobre O cachorro preguiçoso. 
9. Urdu: 10. Mandarin Chinese: B) Emoji: . 
C) Table: Name Occupation John Engineer Emily Manager Sarah Designer /n
D) Image: Al DEMOS FutureSmart All Blog Submit Tool Contact us Explore Socials Discover Best Al Tools with Video Demos
E) Hand Written (image) Testing PDF to text extracting tools for different conditions.

Extraction on Different Formats of Data:

  • Normal English Text ✅

  • Popular Languages text ✅ (Not all languages)

  • Emoji ❌

  • Tabular Data ✅

  • Image Data ✅ (Accuracy is good)

  • Handwritten Text ✅ (Accuracy is good)

  • Free ❌

Choosing the Right Tool

The best tool for you depends on your specific needs. Here's a quick guide:

  1. Open-source & offline use: pypdf, pdfplumber or pdfminer

  2. For simple tasks: pypdf, pdfplumber

  3. For more complex needs or challenging PDFs: LlamaParse, pdfplumber, AWS Textract

  4. For a cloud-based, scalable solution: LlamaParse, AWS Textract

  5. Time taken (Note: take this only as a reference; your results may differ because many factors affect the time: system performance, API response time, network speed, document size, and document content)

    1. pypdf : 0.113579 sec

    2. pdfplumber : 0.60441 sec

    3. pdfminer : 0.86774 sec

    4. AWS Textract : 3.22305 sec

    5. LlamaParse : 5.23055 sec
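
If you want to reproduce this kind of timing on your own documents, a small harness like the one below is enough. This is only a sketch of one way to measure it, not necessarily how the numbers above were taken: the function name time_extractors and the repeats parameter are my own, and it assumes each extractor is a function that takes a PDF path and returns text.

```python
import time


def time_extractors(extractors, pdf_path, repeats=3):
    """Time each extractor on the same file and return the best run in seconds.

    `extractors` maps a label to a function that takes a PDF path and returns text.
    """
    timings = {}
    for name, extract in extractors.items():
        best = float('inf')
        for _ in range(repeats):
            start = time.perf_counter()
            extract(pdf_path)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    # Return the results ordered from fastest to slowest
    return dict(sorted(timings.items(), key=lambda item: item[1]))
```

A call could look like time_extractors({"pypdf": pypdf_extract, "pdfplumber": pdfplumber_extract}, "sample.pdf"), where "sample.pdf" stands in for your own file. Taking the best of a few runs reduces noise for the local libraries, but for the API-based tools every run is a billed request, so you may prefer repeats=1 there.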

Conclusion

The document parsing tools discussed here offer a range of options for extracting text from PDFs. Whether you need a straightforward, open-source solution like pypdf or pdfplumber for local use, or the more advanced capabilities of LlamaParse or AWS Textract for more intricate documents or cloud-based scalability, there's a tool to suit your needs. Exploring these options can streamline your PDF text extraction workflow and save you time and effort.

Additional Considerations

  • When choosing a tool, consider factors like ease of use, feature set, and pricing (if applicable).

  • Some libraries may require additional dependencies, so be sure to check the documentation before getting started.

  • For complex parsing tasks, you may need to combine multiple tools or write custom code.
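
As an example of combining multiple tools, one simple pattern is to try a fast local library first and fall back to an OCR-based service only when little or no text comes back, which is typical for scanned or image-only PDFs. The sketch below is an assumption about how you might wire this up: extract_with_fallback and the min_chars threshold are hypothetical names and values, and the extractors are expected to be functions that take a PDF path and return text.

```python
def extract_with_fallback(pdf_path, extractors, min_chars=20):
    """Try extractors in order and return (name, text) from the first useful result.

    `extractors` is a list of (name, function) pairs, cheapest first.
    """
    last_error = None
    for name, extract in extractors:
        try:
            text = extract(pdf_path) or ''
        except Exception as error:  # one failing tool should not stop the chain
            last_error = error
            continue
        # Accept the result only if it contains a reasonable amount of text
        if len(text.strip()) >= min_chars:
            return name, text
    raise RuntimeError(f"No extractor returned usable text for {pdf_path}") from last_error
```

With the functions from this post, the list could be [("pypdf", pypdf_extract), ("textract", textract_extract)], so the paid service is called only when the free one comes back nearly empty. The right threshold depends on your documents, so treat min_chars=20 as a placeholder.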

Resources and Code :GITHUB LINK

I hope this brief blog provides a helpful starting point for your exploration of document parsing tools!