I want to scrape a PDF file. Where can I find information?

Personally, I have had good experiences with poppler-utils for text scraping. For tables, I used Tabula (actually, I used tabula-py, which is a simple Python wrapper for Tabula).
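As a rough illustration of how these two tools can be driven from Python, here is a minimal sketch, not a definitive recipe. It assumes that the pdftotext program from poppler-utils is on the PATH and that tabula-py (which needs Java) is installed. The helper names pdftotext_command, extract_text and extract_tables, as well as the file name report.pdf, are made up for this example.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None, layout=True):
    """Build the argument list for pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        # -layout tries to preserve the physical layout of the page.
        cmd.append("-layout")
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # A "-" in place of the output file name sends the text to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, **options):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; is poppler-utils installed?")
    result = subprocess.run(
        pdftotext_command(pdf_path, **options),
        check=True,
        capture_output=True,
        encoding="utf-8",
    )
    return result.stdout


def extract_tables(pdf_path, pages="all"):
    """Return the tables that Tabula detects, as pandas DataFrames."""
    # Imported here so the text helpers work even without tabula-py.
    import tabula

    return tabula.read_pdf(pdf_path, pages=pages)


if __name__ == "__main__":
    # "report.pdf" is a placeholder; substitute your own file.
    print(pdftotext_command("report.pdf", first_page=1, last_page=2))
```

Calling extract_text("report.pdf") would then return the text of the whole file, and extract_tables("report.pdf") a list of DataFrames, one per detected table. How well either works depends heavily on how the PDF was produced.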

Rufus Pollock’s tools review article from 2016 is still relevant today.

If you prefer to read a tutorial, here is one: Get Started With Scraping – Extracting Simple Tables from PDF Documents. Note that it was written in 2013, so some of the tools it mentions may no longer be maintained. Another tutorial: How I parse PDF files.

Finally, a paper about the challenges of scraping PDF files: Towards High-Quality Text Stream Extraction from PDF.

Author: Omer Zak

I have been deaf since birth. Since age 12, I played with big computers which eat punched cards and spew out printouts. Ever since they became available, I have worked and played with desktop-size computers which eat keyboard keypresses and spew out display pixels. Among other things, I developed software which helped the deaf in Israel use the telephone network, by means of home computers equipped with modems. Several years later, I developed Hebrew localizations for some cellular phones, which helped the deaf in Israel utilize the cellular phone networks. I am interested in entrepreneurship, science fiction and making the world more accessible to people with disabilities.
