Popular PDF Parsing Tools for Efficient Document Extraction
Extracting Text From PDFs
Introduction
In today's digital age, it's common for businesses and individuals to be overwhelmed by large collections of documents in various formats, including PDFs. Extracting valuable information from these documents efficiently can be a challenging task. However, with the advancement of document parser tools, this process has become significantly easier. In this blog post, we will examine some popular options and provide a comparison to help you make an informed decision and choose the best tool for your needs.
Sample PDF:
- This is the sample PDF we use for the comparison
1. pypdf:
pypdf is a user-friendly and open-source Python library for manipulating PDFs as well as extracting text from PDF documents. It's perfect for simple tasks and can be easily used on your local machine. Plus, it's free to use and readily available for implementation. With pypdf, you can extract text quickly and efficiently, making it a valuable tool for developers and individuals alike.
Code To Use
import pypdf

def pypdf_extract(pdf_path):
    """Extracts text using pypdf"""
    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = pypdf.PdfReader(pdf_file)
        text = ''
        for page in pdf_reader.pages:
            text += "\n" + page.extract_text(extraction_mode="layout", layout_mode_space_vertically=False)
    return text
Extracted Text From Sample PDF:
A) Different Languages:
1.English: The quickbrown foxjumps over thelazy dog
2.Spanish: Elrápido zorromarrón saltasobre elperro perezoso.
3.Hindi: तेजभूरालोमड़ी सु�त कु�ेपरकूदताहै।
4. French: Le rapiderenard brun sautepar-dessus lechien paresseux.
5. Arabic: لوﺳﻛﻟابﻠﻛﻟاقوﻓزﻔﻘﯾﻊﯾرﺳﻟاﻲﻧﺑﻟابﻠﻌﺛﻟا.
6.Bengali: �ত বাদািমখরেগাশঅলস ��েরর উপর লাফায়।
7.Russian: Быстрая коричневая лиса перепрыгивает через ленивую собаку.
8.Portuguese: A rápidaraposa marrom saltasobre o cachorro preguiçoso.
9.Urdu: ۔ﮯﮨ ﺎﺗدوﮐرﭘواﮯﮐﮯﺗﮐتﺳﺳ یڑﻣوﻟیروﮭﺑزﯾﺗ
10.Mandarin Chinese: 快 速 的 棕 色 狐 狸 跳 过 懒 狗。
B) Emoji:
😀😃😄😁😆😅😂🤣🥲🥹☺😊😇🙂🙃😉😌😍🥰😘😗😙😚😋😛
😝😜🤪🤨🧐🤓😎🥸🤩🥳😏😒😞😔😟😕🙁☹😣😖😫😩🥺
😢😭😤😠😡🤬🤯😳🥵🥶😱😨😰😥😓
C) Table:
Name Occupation
John Engineer
Emily Manager
Sarah Designer
D) Image:
E) Hand Written (image)
Extraction on Different Formats of Data:
Normal English Text ✅
Popular Languages text ✅
Emoji ✅
Tabular Data ✅
Image Data ❌
Handwritten Text ❌
Free ✅
2. LlamaParse:
LlamaParse is a tool that helps computers understand complicated documents, especially PDFs with tables and charts. It's like a super-powered decoder ring that unlocks the information hidden inside these documents. This lets you build systems that can search and answer questions about them, even the tricky ones. You can try out LlamaParse yourself through their public preview to see how well it works.
LINK:Learn more about LlamaParse
Code To Use
import os

import nest_asyncio
from llama_parse import LlamaParse

def llama_parse_extract(pdf_path):
    """Extracts text using LlamaParse."""
    nest_asyncio.apply()  # lets the parser run inside an already running event loop (e.g. a notebook)
    parser = LlamaParse(
        api_key=os.environ["LLAMA_CLOUD_API_KEY"],  # you can get the key from the website (it is free now)
        result_type="text",  # "markdown" and "text" are available
    )
    data = parser.load_data(pdf_path)
    return data[0].text
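`load_data` returns a list of documents, and the function above returns only the first one, so anything the parser places in later documents is dropped. A minimal sketch that joins them all instead; `join_documents` is a hypothetical helper name, and it assumes only that each returned item has a `.text` attribute, as used above:

```python
def join_documents(documents, separator="\n\n"):
    """Join the text of every parsed document, skipping empty ones."""
    texts = [doc.text for doc in documents if doc.text]
    return separator.join(texts)
```

Inside `llama_parse_extract`, the last line would then become `return join_documents(data)`.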
Extracted Text From Sample PDF:
A) Different Languages:
1.English: The quick brown fox jumps over the lazy dog
2. Spanish: El rápido zorro marrón salta sobre el perro perezoso.
3. Hindi: तेज भूरा लोमड़ी सु▯त कु▯े पर कूदता है।
4. French: Le rapide renard brun saute par-dessus le chien paresseux.
5. Arabic: ولﺳﻛﻟا بﻠﻛﻟا وقﻓ زﻔﻘﯾ ﻊﯾرﺳﻟا ﻲﻧﺑﻟا بﻠﻌﺛﻟا.
6. Bengali: ▯ত বাদািম খরেগাশ অলস ▯▯েরর উপর লাফায়।
7. Russian: Быстрая коричневая лиса перепрыгивает через ленивую собаку.
8. Portuguese: A rápida raposa marrom salta sobre o cachorro preguiçoso.
9. Urdu: ۔ےہ ﺎﺗودک رپاو ےک ےﺗک تﺳﺳ ڑیﻣوﻟ وریﮭﺑ زﯾﺗ
10. Mandarin Chinese: 快速的棕色狐狸跳过懒狗。
B) Emoji:
😀😃😄😁😆😅😂🤣🥲🥹☺😊😇🙂🙃😉😌😍🥰😘😗😙😚😋😛
😝😜🤪🤨🧐🤓😎🥸🤩🥳😏😒😞😔😟😕🙁☹😣😖😫😩🥺
😢😭😤😠😡🤬🤯😳🥵🥶😱😨😰😥😓
C) Table:
Name Occupation
John Engineer
Emily Manager
Sarah Designer
---
D) Image:
Al DEMOS Blog
FutureSmart AI Socials
Submit ToolContact usExplore
Discover Best Al Tools with
Video Demos
E) Hand Written (image)
Testing PDf to text extaqcting tools
For df ferent tondftions-
Extraction on Different Formats of Data:
Normal English Text ✅
Popular Languages text ✅
Emoji ✅
Tabular Data ✅
Image Data ✅ (Accuracy is low: some data might not be extracted)
Handwritten Text ✅ (Accuracy is low: some data might not be extracted)
Free ✅ (API key is free for trial. Check the plans for more info)
3. PDFMiner:
PDFMiner is a user-friendly and open-source Python library for extracting text from PDF documents. Unlike pypdf, which also manipulates PDFs, PDFMiner focuses primarily on text extraction. It's free to use, works offline, and is readily available for implementation. With PDFMiner, you can extract text quickly and customize the extraction to improve accuracy, which simplifies your document parsing process.
LINK:Learn more about PDFMiner
Code To Use
# Install first with: pip install pdfminer.six
from pdfminer.high_level import extract_text

def pdfminer_extract(pdf_path):
    """Extracts text using pdfminer.six."""
    with open(pdf_path, 'rb') as pdf_file:
        text = extract_text(pdf_file)
    return text
Extracted Text From Sample PDF:
A) Different Languages:
1.English: The quick brown fox jumps over the lazy dog
2. Spanish: El rápido zorro marrón salta sobre el perro perezoso.
3. Hindi: तेज भूरा लोमड़ी सु त कु े पर कू दता है।
4. French: Le rapide renard brun saute par-dessus le chien paresseux.
زﻔﻘﯾ
بﻠﻛﻟا
لوﺳﻛﻟا
5. Arabic:
قوﻓ
6. Bengali: ত বাদািম খরেগাশ অলস েরর উপর লাফায়।
7. Russian: Быстрая коричневая лиса перепрыгивает через ленивую собаку.
8. Portuguese: A rápida raposa marrom salta sobre o cachorro preguiçoso.
ﮯﮐ
10. Mandarin Chinese: 快速的棕色狐狸跳过懒狗。
9. Urdu:
.
بﻠﻌﺛﻟا
یروﮭﺑ
یڑﻣوﻟ
ﻊﯾرﺳﻟا
تﺳﺳ
ﺎﺗدوﮐ
ﻲﻧﺑﻟا
ﮯﺗﮐ
رﭘوا
۔ﮯﮨ
زﯾﺗ
B) Emoji:
😀😃😄😁😆😅😂🤣🥲🥹☺😊😇🙂🙃😉😌😍🥰😘😗😙😚😋😛
😝😜🤪🤨🧐🤓😎🥸🤩🥳😏😒😞😔😟😕🙁☹😣😖😫😩🥺
😢😭😤😠😡🤬🤯😳🥵🥶😱😨😰😥😓
C) Table:
Name
Occupation
John
Engineer
Emily
Manager
Sarah
Designer
D) Image:
E) Hand Written (image)
Extraction on Different Formats of Data:
Normal English Text ✅
Popular Languages text ✅
Emoji ✅
Tabular Data ❌ (Extracts the text but not in the proper order or format; may need customization)
Image Data ❌
Handwritten Text ❌
Free ✅
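One possible customization for the table problem, as a sketch rather than a general fix: in the sample output above, PDFMiner emits one table cell per line, in row order. Assuming that holds and the number of columns is known, the cells can be regrouped into rows. `cells_to_rows` is a hypothetical helper, not part of pdfminer.six; the library's own layout parameters (`LAParams`) are another route worth trying.

```python
def cells_to_rows(lines, n_columns):
    """Group a flat list of cell strings into rows of n_columns cells."""
    # Drop blank lines and surrounding whitespace before grouping
    cells = [line.strip() for line in lines if line.strip()]
    return [cells[i:i + n_columns] for i in range(0, len(cells), n_columns)]


# The table cells exactly as they appear in the sample output above
table_lines = ["Name", "Occupation", "John", "Engineer",
               "Emily", "Manager", "Sarah", "Designer"]

for row in cells_to_rows(table_lines, n_columns=2):
    print(" | ".join(row))
```

This breaks down as soon as a cell is empty or wraps onto a second line, so it suits only small, regular tables.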
4. pdfplumber:
pdfplumber is a user-friendly and open-source high-level Python library built on top of pdfminer.six. It offers a simpler interface for extracting text and metadata from PDFs, along with detailed information about each text character, rectangle, and line. It also supports table extraction and visual debugging.
LINK:Learn more about PDFPlumber
Code To Use
import pdfplumber

def pdfplumber_extract(pdf_path):
    """Extracts text using pdfplumber."""
    with pdfplumber.open(pdf_path) as pdf:
        text = ''
        for page in pdf.pages:
            # Pages are concatenated without a separator between them
            text += page.extract_text()
    return text
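The function above uses only plain text extraction. pdfplumber's table extraction can be sketched as follows, using `page.extract_tables()`, which returns each detected table as a list of rows. `table_to_records` and `pdfplumber_extract_tables` are hypothetical names, and treating the first row as the header is an assumption that fits the sample table; whether a table is detected at all depends on how it is drawn in the PDF.

```python
def table_to_records(table):
    """Turn a table (list of rows, first row = header) into a list of dicts."""
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


def pdfplumber_extract_tables(pdf_path):
    """Collects every table pdfplumber detects, each as a list of row dicts."""
    import pdfplumber  # imported here so table_to_records works without it

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                tables.append(table_to_records(table))
    return tables
```

For the sample table, a successful detection would give records such as `{"Name": "John", "Occupation": "Engineer"}`.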
Extracted Text From Sample PDF:
A) Different Languages:
1.English: The quick brown fox jumps over the lazy dog
2. Spanish: El rápido zorro marrón salta sobre el perro perezoso.
3. Hindi: तजे भरू ा लोमड़ी स ु त कु े पर कूदता है।
4. French: Le rapide renard brun saute par-dessus le chien paresseux.
5. Arabic: لوﺳﻛﻟا بﻠﻛﻟا قوﻓ زﻔﻘﯾ ﻊﯾرﺳﻟا ﻲﻧﺑﻟا بﻠﻌﺛﻟا.
6. Bengali: ত বাদািম খরেগাশ অলস েরর উপর লাফায়।
7. Russian: Быстрая коричневая лиса перепрыгивает через ленивую собаку.
8. Portuguese: A rápida raposa marrom salta sobre o cachorro preguiçoso.
9. Urdu: ۔ﮯﮨ ﺎﺗدوﮐ رﭘوا ﮯﮐ ﮯﺗﮐ تﺳﺳ یڑﻣوﻟ یروﮭﺑ زﯾﺗ
10. Mandarin Chinese: 快速的棕色狐狸跳过懒狗。
B) Emoji:
😀😃😄😁😆😅😂🤣🥲🥹☺😊😇🙂🙃😉😌😍🥰😘😗😙😚😋😛
😝😜🤪🤨🧐🤓😎🥸🤩🥳😏😒😞😔😟😕🙁☹😣😖😫😩🥺
😢😭😤😠😡🤬🤯😳🥵🥶😱😨😰😥😓
C) Table:
Name Occupation
John Engineer
Emily Manager
Sarah DesignerD) Image:
E) Hand Written (image)
Extraction on Different Formats of Data:
Normal English Text ✅
Popular Languages text ✅
Emoji ✅
Tabular Data ✅ (Extracts data row-wise)
Image Data ❌
Handwritten Text ❌
Free ✅
5. AWS Textract:
AWS Textract is a machine learning service offered by Amazon Web Services (AWS) that can automatically extract text, handwriting, and other data from scanned documents. It not only recognizes text but also understands and extracts specific data from documents like forms and tables.
Some of the key use cases for AWS Textract:
Automating data entry: Extracting data from invoices, receipts, tax forms, and other business documents.
Creating intelligent search indexes: Enabling efficient search of scanned documents by indexing the extracted text.
Improving document processing workflows: Streamlining tasks like loan processing, insurance claims, and legal document review.
LINK:Learn more about AWS Textract
Code To Use
import os

import boto3
import fitz  # PyMuPDF
from dotenv import load_dotenv

# Load the keys from a .env file or the environment instead of hard-coding them
load_dotenv()
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")  # Give the access key
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")  # Give the secret access key

# Create a boto3 session with access keys
session = boto3.Session(aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
textract_client = session.client('textract', region_name='us-east-1')

def textract_extract(pdf_path):
    """Extracts text using Amazon Textract with access keys."""
    pdf_document = fitz.open(pdf_path)
    pages = ""
    for page_number in range(pdf_document.page_count):
        page = pdf_document.load_page(page_number)
        # Render the page to PNG bytes, which Textract accepts directly
        image_data = page.get_pixmap().tobytes()
        response = textract_client.detect_document_text(Document={'Bytes': image_data})
        # The response holds other data too, but for simplicity I took only the words.
        # "pages" is the output; I needed only text, so it is a string.
        # If you want JSON, modify this accordingly.
        blocks = response['Blocks']
        text = ""
        for block in blocks:
            if block['BlockType'] == 'WORD':
                text += block['Text'] + ' '
        pages += "\n" + text  # Add newline for readability
    return pages
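The function above keeps only WORD blocks, which flattens each page into one long line. The same response also contains LINE blocks, each with a `Text` and a `Confidence` score from 0 to 100, so the line structure can be kept. A minimal sketch, where `lines_from_blocks` and its `min_confidence` parameter are hypothetical names and not part of boto3:

```python
def lines_from_blocks(blocks, min_confidence=0.0):
    """Return the text of LINE blocks, one per line, in response order."""
    lines = [
        block["Text"]
        for block in blocks
        if block["BlockType"] == "LINE"
        and block.get("Confidence", 100.0) >= min_confidence
    ]
    return "\n".join(lines)
```

Calling `lines_from_blocks(response['Blocks'], min_confidence=80)` would drop low-confidence lines, which may help with scripts that Textract reads poorly.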
Extracted Text From Sample PDF:
NOTE: Textract is an OCR-based service, so the code above converts each PDF page into an image and applies OCR to that image. Here I extracted only the words, but the 'Blocks' in the response also carry lines and the location of each piece of text, and Textract can return tables through its AnalyzeDocument API, so you can customize the output to your needs. Image quality may affect the extracted data.
/nA) Different Languages: 1.English: The quick brown fox jumps over the lazy dog
2. Spanish: El rápido zorro marrón salta sobre el perro perezoso.
3. Hindi: to dist
4. French: Le rapide renard brun saute par-dessus le chien paresseux.
5. Arabic Just 6. Bengali: 50 4/411 91201
7. Russian: nuca 4epe3 cobaky.
8. Portuguese: A rápida raposa marrom salta sobre O cachorro preguiçoso.
9. Urdu: 10. Mandarin Chinese: B) Emoji: .
C) Table: Name Occupation John Engineer Emily Manager Sarah Designer /n
D) Image: Al DEMOS FutureSmart All Blog Submit Tool Contact us Explore Socials Discover Best Al Tools with Video Demos
E) Hand Written (image) Testing PDF to text extracting tools for different conditions.
Extraction on Different Formats of Data:
Normal English Text ✅
Popular Languages text ✅ (Not all languages)
Emoji ❌
Tabular Data ✅
Image Data ✅ (Accuracy is good)
Handwritten Text ✅ (Accuracy is good)
Free ❌
Choosing the Right Tool
The best tool for you depends on your specific needs. Here's a quick guide:
Open-source & offline use: pypdf, pdfplumber, or PDFMiner
For simple tasks: pypdf, pdfplumber
For more complex needs or challenging PDFs: LlamaParse, pdfplumber, AWS Textract
For a cloud-based, scalable solution: LlamaParse, AWS Textract
Time Taken: (Note: treat these numbers as a rough reference only. Your results may differ, because many factors affect the timing: system performance, API response time, network speed, document size, and document content.)
pypdf: 0.113579 sec
pdfplumber: 0.60441 sec
PDFMiner: 0.86774 sec
AWS Textract: 3.22305 sec
LlamaParse: 5.23055 sec
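The post does not show how these timings were taken. One way to run a comparison like this on your own documents is sketched below with `time.perf_counter`. `time_extractors` and `repeats` are hypothetical names, and taking the best of several runs is a design choice here, not necessarily how the numbers above were measured.

```python
import time


def time_extractors(extractors, pdf_path, repeats=3):
    """Return the best wall-clock time in seconds for each extractor."""
    timings = {}
    for name, extract in extractors.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            extract(pdf_path)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings
```

A call might look like `time_extractors({"pypdf": pypdf_extract, "pdfplumber": pdfplumber_extract}, "sample.pdf")`, where `sample.pdf` is a placeholder path. For the API-based tools, keep `repeats` low, since every run is a billed or rate-limited request.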
Conclusion
The array of document parsing tools discussed here provides a diverse toolkit for efficiently extracting text from PDFs, enhancing accessibility and usability. Whether you require a straightforward, open-source solution like pypdf or pdfplumber for local use, or seek the advanced capabilities offered by LlamaParse or AWS Textract for more intricate documents or cloud-based scalability, there's a tool to suit your needs. By delving into these options, you can effectively streamline your PDF text extraction workflow, saving time and effort in today's digital landscape.
Additional Considerations
When choosing a tool, consider factors like ease of use, feature set, and pricing (if applicable).
Some libraries may require additional dependencies, so be sure to check the documentation before getting started.
For complex parsing tasks, you may need to combine multiple tools or write custom code.
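Combining tools can be as simple as trying a fast local extractor first and falling back to an OCR-based one when it returns nothing useful, as happens with scanned or image-only pages. A minimal sketch under that assumption; `extract_with_fallback` and `min_chars` are hypothetical names:

```python
def extract_with_fallback(pdf_path, extractors, min_chars=1):
    """Try each (name, function) pair in order; return the first useful result."""
    for name, extract in extractors:
        try:
            text = extract(pdf_path) or ""
        except Exception as error:  # a failing tool should not stop the chain
            print(f"{name} failed: {error}")
            continue
        if len(text.strip()) >= min_chars:
            return name, text
    return None, ""
```

With the functions from this post, `extract_with_fallback(path, [("pypdf", pypdf_extract), ("textract", textract_extract)])` would call Textract only when pypdf finds no text.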