I tried to find a solution for processing PDF files.
The newer Python package "unstructured", which I tested first, was a disaster and a waste of time and resources.
Today I will show you some tests with the Python package pypdf, version 4.1.0.
You can find it on the official page.
Installation is simple with the pip tool, and you can also add the optional crypto extra, which installs the dependency pypdf needs to encrypt and decrypt AES-protected PDFs.
pip install pypdf[crypto]
Collecting pypdf[crypto]
Downloading pypdf-4.1.0-py3-none-any.whl.metadata (7.4 kB)
...
Installing collected packages: pypdf
Successfully installed pypdf-4.1.0
Here is the package information displayed by the pip show command.
python -m pip show pypdf
Name: pypdf
Version: 4.1.0
Summary: A pure-python PDF library capable of splitting, merging, cropping, and transforming PDF files
Home-page:
Author:
Author-email: Mathieu Fenniak <biziqe@mathieu.fenniak.net>
License:
Location: C:\Python312\Lib\site-packages
Requires:
Required-by:
I created a small script for testing:
from pypdf import PdfReader, PdfWriter
# PdfMerger is deprecated and will be removed in pypdf 5.0.0. Use PdfWriter instead.
# from pypdf import PdfMerger

# Read the first invoice and inspect it
pdf_file = PdfReader("invoice-001.pdf")
print("Size in pages : ", len(pdf_file.pages))
print("========")
page = pdf_file.pages[0]
print("Page : ", page)
print("========")
text = page.extract_text()
print("Page text : ", text)
print("========")
print("PDF Metadata : ", pdf_file.metadata)
print("PDF Metadata - Title: ", pdf_file.metadata.title)
print("========")

# Create a second PDF with a single blank A4 page (sizes are in points, 72 per inch)
pdf_writer = PdfWriter()
pdf_writer.add_blank_page(width=8.27 * 72, height=11.7 * 72)
pdf_writer.write("invoice-002.pdf")

# Merge both PDFs into a third one
merger = PdfWriter()
for pdf in ["invoice-001.pdf", "invoice-002.pdf"]:
    merger.append(pdf)
merger.write("invoice-003.pdf")
merger.close()
The result is this:
python test_001.py
Size in pages : 1
========
Page : {'/Type': '/Page', '/Resources': {'/ProcSet': ['/PDF', '/Text', '/ImageB', '/ImageC', '/ImageI'], '/ExtGState':
{'/G3': IndirectObject(3, 0, 2484091272080)}, '/XObject': {'/X4': IndirectObject(4, 0, 2484091272080)}, '/Font': {'/F7':
IndirectObject(7, 0, 2484091272080), '/F8': IndirectObject(8, 0, 2484091272080)}}, '/MediaBox': [0, 0, 612, 792],
'/Contents': IndirectObject(9, 0, 2484091272080), '/StructParents': 0, '/Parent': IndirectObject(10, 0, 2484091272080)}
========
Page text : Dino Store
227 Cobblestone Road
30000 Bedrock, Cobblestone County
+555 7 789-1234
https://dinostore.bed | hello@dinostore.bedPayment details:
ACC:123006705
IBAN:US100000060345
SWIFT:BOA447
Bill to:
Slate Rock and Gravel Company
222 Rocky Way
30000 Bedrock, Cobblestone County
+555 7 123-5555
fred@slaterockgravel.bedInvoice No. 1
Invoice Date: 03.03.2024
Issue Date: 03.03.2024
Due Date: 02.04.2024
INVOICE
Item Quantity Price Discount Tax Linetotal
1 Test 001 1 50,00 € 1% 19% 49,50 €
2 Test 002 2 40,00 € 2% 19% 78,40 €
3 Frozen Brontosaurus Ribs 1 100,00 € 0% 19% 100,00 €
Subtotal: 227,90 €
Tax 19%: 43,30 €
Total: 271,20 €
Terms & Notes
Fred, thank you very much. We really appreciate your business.
Please send payments before the due date.
========
PDF Metadata : {'/Creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0', '/Producer': 'Skia/PDF m122', '/CreationDate':
"D:20240304221509+00'00'", '/ModDate': "D:20240304221509+00'00'"}
PDF Metadata - Title: None
========
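Since extract_text() returns plain text, a few regular expressions may be enough to pull fields out of it. Below is a minimal sketch that is not part of my test script. The helper names parse_amount and extract_invoice_fields are hypothetical, the sample string is copied from the output above, and the patterns assume this exact layout (labels at the start of a line, comma as decimal separator).

```python
import re
from decimal import Decimal

# A few lines copied from the extracted page text shown above
sample_text = """Invoice Date: 03.03.2024
Issue Date: 03.03.2024
Due Date: 02.04.2024
Subtotal: 227,90 €
Tax 19%: 43,30 €
Total: 271,20 €"""


def parse_amount(raw: str) -> Decimal:
    """Convert an amount such as '1.271,20' to Decimal('1271.20')."""
    return Decimal(raw.replace(".", "").replace(",", "."))


def extract_invoice_fields(text: str) -> dict:
    """Pull the due date and the total out of the invoice text, if present."""
    fields = {}
    due = re.search(r"^Due Date:\s*(\d{2}\.\d{2}\.\d{4})", text, re.MULTILINE)
    if due:
        fields["due_date"] = due.group(1)
    # ^ with re.MULTILINE keeps "Subtotal:" from matching here
    total = re.search(r"^Total:\s*([\d.,]+)\s*€", text, re.MULTILINE)
    if total:
        fields["total"] = parse_amount(total.group(1))
    return fields


fields = extract_invoice_fields(sample_text)
print(fields)
```

In a real script the text returned by page.extract_text() would replace sample_text. Other invoices may need different patterns.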
Running the script creates a second, blank PDF named invoice-002.pdf and then merges it with the first one, which results in a PDF named invoice-003.pdf.