Automate PDFs with Python – Here’s How

Have you ever wished you could make working with PDF files easier? Imagine being able to merge multiple PDFs into one, split a large PDF into smaller files, or even encrypt and decrypt PDFs, all with just a few lines of Python code. In this article, we’ll explore how you can automate tasks like these using Python and the PyPDF2 library. Whether you’re a student, a professional, or just someone who deals with PDFs regularly, these automation techniques can save you time and effort.

How To Extract Text From a PDF Using Python

Install Python, create a file with a .py extension in your editor, and paste in the code below.

Note: Don’t forget to install PyMuPDF first (pip install pymupdf). It provides the fitz module that the code imports.

import fitz

# Path to the PDF file
pdf_file_path = 'sample.pdf'

# Open the PDF file
pdf_document = fitz.open(pdf_file_path)

# Initialize an empty string to store extracted text
text = ''

# Iterate through each page of the PDF document
for page_num in range(len(pdf_document)):
    # Get a specific page
    page = pdf_document.load_page(page_num)
    # Extract text from the page
    text += page.get_text()

# Close the PDF file
pdf_document.close()

# Print the extracted text
print(text)

Original PDF

Output After Text Extracted From PDF
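
If you want to keep the extracted text instead of just printing it, one option is to collect the text page by page and write it to a file. The sketch below assumes PyMuPDF is installed; the function names extract_text_by_page and save_text and the "--- Page N ---" marker are made up for this example, not part of PyMuPDF.

```python
from pathlib import Path


def extract_text_by_page(pdf_path):
    """Return a list with the text of each page, in page order."""
    # Imported here so save_text() below works even without PyMuPDF installed.
    import fitz  # PyMuPDF

    # The context manager closes the document when the block ends.
    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def save_text(pages, txt_path):
    """Write the page texts to one file, with a marker line before each page."""
    lines = []
    for number, page_text in enumerate(pages, start=1):
        lines.append(f"--- Page {number} ---")
        lines.append(page_text.strip())
    Path(txt_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

For example, save_text(extract_text_by_page('sample.pdf'), 'sample.txt') would write the text of sample.pdf to sample.txt, one marked block per page.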

Automate PDF Tasks:

You can use PyPDF2 to automate tasks such as merging multiple PDF files into one, splitting a PDF file into multiple files, and rotating pages in a PDF file.

First, install PyPDF2:

pip install PyPDF2

The code below uses the class and method names from PyPDF2 3.x (PdfMerger, PdfReader, PdfWriter). The older names (PdfFileMerger, PdfFileReader, PdfFileWriter) were removed in version 3.0 and raise an error there.

from PyPDF2 import PdfMerger, PdfReader, PdfWriter

# Merge PDF files
merger = PdfMerger()
merger.append('file1.pdf')
merger.append('file2.pdf')
merger.write('merged_files.pdf')
merger.close()

# Split PDF file into one file per page
pdf = PdfReader('original.pdf')
for page_num in range(len(pdf.pages)):
    pdf_writer = PdfWriter()
    pdf_writer.add_page(pdf.pages[page_num])
    output_filename = f'page_{page_num+1}.pdf'
    with open(output_filename, 'wb') as out:
        pdf_writer.write(out)

# Rotate pages in a PDF file
pdf_writer = PdfWriter()
pdf = PdfReader('original.pdf')
for page_num in range(len(pdf.pages)):
    page = pdf.pages[page_num]
    page.rotate(90)  # Rotate clockwise by 90 degrees
    pdf_writer.add_page(page)
with open('rotated_pages.pdf', 'wb') as out:
    pdf_writer.write(out)

Explanation of the above code:

  1. Import Statements:
    • from PyPDF2 import PdfMerger, PdfReader, PdfWriter: This line imports specific classes (PdfMerger, PdfReader, PdfWriter) from the PyPDF2 library, which is used to work with PDF files in Python.
  2. Merge PDF Files:
    • merger = PdfMerger(): Creates a PdfMerger object called merger.
    • merger.append('file1.pdf'): Appends the contents of ‘file1.pdf’ to the merger object.
    • merger.append('file2.pdf'): Appends the contents of ‘file2.pdf’ to the merger object.
    • merger.write('merged_files.pdf'): Writes the merged PDF file to ‘merged_files.pdf’.
    • merger.close(): Closes the merger object and releases the files it opened.
  3. Split PDF File:
    • pdf = PdfReader('original.pdf'): Creates a PdfReader object called pdf for ‘original.pdf’.
    • for page_num in range(len(pdf.pages)):: Iterates over each page index in the PDF file.
      • pdf_writer = PdfWriter(): Creates a fresh PdfWriter object called pdf_writer for this page.
      • pdf_writer.add_page(pdf.pages[page_num]): Adds the current page to the pdf_writer object.
      • output_filename = f'page_{page_num+1}.pdf': Generates a filename for the current page, numbering from 1.
      • with open(output_filename, 'wb') as out:: Opens the file in write-binary mode.
        • pdf_writer.write(out): Writes the current page to the file.
  4. Rotate Pages in a PDF File:
    • pdf_writer = PdfWriter(): Creates a new PdfWriter object called pdf_writer.
    • pdf = PdfReader('original.pdf'): Creates a PdfReader object called pdf for ‘original.pdf’.
    • for page_num in range(len(pdf.pages)):: Iterates over each page index in the PDF file.
      • page = pdf.pages[page_num]: Gets the current page.
      • page.rotate(90): Rotates the page clockwise by 90 degrees.
      • pdf_writer.add_page(page): Adds the rotated page to the pdf_writer object.
    • with open('rotated_pages.pdf', 'wb') as out:: Opens the file in write-binary mode.
      • pdf_writer.write(out): Writes the rotated pages to the file.
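
Splitting does not have to mean one file per page. As a rough sketch, the same idea can be extended to cut a PDF into chunks of several pages each. This assumes the PyPDF2 3.x names (PdfReader, PdfWriter); the helpers chunk_ranges and split_into_chunks and the "chunk" file prefix are hypothetical names chosen for this example.

```python
def chunk_ranges(total_pages, chunk_size):
    """Return (start, stop) page-index pairs that cover total_pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_into_chunks(pdf_path, chunk_size, prefix="chunk"):
    """Write consecutive chunk_size-page PDFs and return their file names."""
    # Imported here so chunk_ranges() stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 3.x naming

    reader = PdfReader(pdf_path)
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    written = []
    for number, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        name = f"{prefix}_{number}.pdf"
        with open(name, "wb") as out:
            writer.write(out)
        written.append(name)
    return written
```

With a 10-page file, split_into_chunks('original.pdf', 4) would be expected to produce three files: two with four pages and a last one with the remaining two.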

Encrypt and Decrypt PDF:

PyPDF2 can also be used to encrypt and decrypt PDF files.

from PyPDF2 import PdfWriter, PdfReader

# Encrypt a PDF file
pdf_writer = PdfWriter()
pdf_writer.append_pages_from_reader(PdfReader('file_to_encrypt.pdf'))
pdf_writer.encrypt('password', 'owner_password', use_128bit=True)
with open('encrypted_file.pdf', 'wb') as out:
    pdf_writer.write(out)

# Decrypt a PDF file
pdf_reader = PdfReader('encrypted_file.pdf')
if pdf_reader.is_encrypted:
    pdf_reader.decrypt('password')
    decrypted_pdf_writer = PdfWriter()
    for page_num in range(len(pdf_reader.pages)):
        decrypted_pdf_writer.add_page(pdf_reader.pages[page_num])
    with open('decrypted_file.pdf', 'wb') as out:
        decrypted_pdf_writer.write(out)

Explanation of the above code:

  1. Import Statements:
    • from PyPDF2 import PdfWriter, PdfReader: This line imports the PdfWriter and PdfReader classes (the PyPDF2 3.x names for PdfFileWriter and PdfFileReader) from the PyPDF2 library, which is used for working with PDF files in Python.
  2. Encrypt a PDF File:
    • pdf_writer = PdfWriter(): Creates a new PdfWriter object called pdf_writer, which is used to write PDF files.
    • pdf_writer.append_pages_from_reader(PdfReader('file_to_encrypt.pdf')): Reads the pages from the ‘file_to_encrypt.pdf’ file and appends them to the pdf_writer object.
    • pdf_writer.encrypt('password', 'owner_password', use_128bit=True): Encrypts the PDF file. The first argument is the user password needed to open the file. The owner_password is used to control permissions like printing or copying text, and use_128bit=True specifies to use 128-bit encryption.
    • with open('encrypted_file.pdf', 'wb') as out:: Opens a new file called ‘encrypted_file.pdf’ in write-binary mode.
      • pdf_writer.write(out): Writes the encrypted PDF content to the ‘encrypted_file.pdf’ file.
  3. Decrypt a PDF File:
    • pdf_reader = PdfReader('encrypted_file.pdf'): Creates a PdfReader object called pdf_reader for the ‘encrypted_file.pdf’ file.
    • if pdf_reader.is_encrypted:: Checks if the PDF file is encrypted.
      • pdf_reader.decrypt('password'): Decrypts the PDF file using the provided password.
      • decrypted_pdf_writer = PdfWriter(): Creates a new PdfWriter object called decrypted_pdf_writer.
      • for page_num in range(len(pdf_reader.pages)):: Iterates over each page index in the decrypted PDF file.
        • decrypted_pdf_writer.add_page(pdf_reader.pages[page_num]): Adds each page to the decrypted_pdf_writer object.
      • with open('decrypted_file.pdf', 'wb') as out:: Opens a new file called ‘decrypted_file.pdf’ in write-binary mode.
        • decrypted_pdf_writer.write(out): Writes the decrypted PDF content to the ‘decrypted_file.pdf’ file.
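
The decrypt step above does not check whether the password actually worked, and the password is written straight into the script. A slightly more careful sketch might look like the following. It assumes the PyPDF2 3.x names and that decrypt() returns 0 when the password does not match. The helper names read_password and decrypt_to_file and the PDF_PASSWORD environment variable are hypothetical, chosen only for this example.

```python
import os


def read_password(env_var="PDF_PASSWORD"):
    """Fetch the password from an environment variable (hypothetical name)."""
    password = os.environ.get(env_var)
    if not password:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return password


def decrypt_to_file(src_path, dst_path, password):
    """Write an unencrypted copy of src_path to dst_path."""
    # Imported here so read_password() works even without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 3.x naming

    reader = PdfReader(src_path)
    # Assumption: decrypt() returns 0 when the password does not match.
    if reader.is_encrypted and reader.decrypt(password) == 0:
        raise ValueError(f"Wrong password for {src_path}")

    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    with open(dst_path, "wb") as out:
        writer.write(out)
```

A call such as decrypt_to_file('encrypted_file.pdf', 'decrypted_file.pdf', read_password()) then fails loudly on a wrong or missing password instead of silently doing nothing.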

Error Handling and Best Practices:

When working with PDF files using PyPDF2, it’s important to handle errors gracefully to ensure your script behaves predictably. Here are some best practices for handling exceptions:

  1. Use try-except blocks: Wrap your PDF processing code in try-except blocks to catch and handle exceptions. This helps prevent your script from crashing if an error occurs.
  2. Handle specific exceptions: PyPDF2 defines its own exception classes, such as PdfReadError (in the PyPDF2.errors module in recent versions), which you can catch separately from built-in errors like FileNotFoundError. For example, you might want to handle a PdfReadError differently from a general Exception.
  3. Log errors: Use the logging module to log errors and messages to a file or console. This can help you diagnose issues and troubleshoot problems with your script.
  4. Graceful exit: When an error occurs, consider exiting the script gracefully to prevent any further processing that could lead to more errors. You can use sys.exit() to exit the script with a specific exit code.
  5. Test with sample files: Before running your script on production files, test it with sample PDF files to ensure it behaves as expected and handles errors correctly.
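
To make these points concrete, here is a minimal sketch that combines a try-except block, a specific PyPDF2 exception, and the logging module. It assumes PyPDF2 3.x, where PdfReadError lives in PyPDF2.errors; the function names count_pages and main and the logger name pdf_tasks are invented for this example.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
log = logging.getLogger("pdf_tasks")  # hypothetical logger name


def count_pages(pdf_path):
    """Return the number of pages, or None if the file cannot be read."""
    # Imported here so the module itself loads without PyPDF2 installed.
    from PyPDF2 import PdfReader
    from PyPDF2.errors import PdfReadError  # location assumed for PyPDF2 3.x

    try:
        return len(PdfReader(pdf_path).pages)
    except FileNotFoundError:
        log.error("File not found: %s", pdf_path)
    except PdfReadError as exc:
        log.error("Not a readable PDF: %s (%s)", pdf_path, exc)
    return None


def main(paths):
    """Process every path and return an exit code: 0 on success, 1 otherwise."""
    failed = []
    for path in paths:
        pages = count_pages(path)
        if pages is None:
            failed.append(path)
        else:
            log.info("%s has %d page(s)", path, pages)
    return 1 if failed else 0
```

In a real script you would typically pass the result to sys.exit, for example sys.exit(main(sys.argv[1:])), so that other tools can tell from the exit code whether any file failed.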

Summary:

As we’ve seen, Python and the PyPDF2 library provide powerful tools for automating tasks related to PDF files. Whether you need to merge, split, encrypt, or decrypt PDFs, Python makes it easy to create scripts that can handle these tasks efficiently. By incorporating these automation techniques into your workflow, you can streamline your PDF-related work and focus on more important tasks. So why not give it a try and see how much time and effort you can save?

Author: Sona
