Personally, I have had good experience with poppler-utils for text scraping. For tables, I used Tabula (more precisely, tabula-py, which is a simple Python wrapper for Tabula).
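To make the two tools concrete, here is a minimal sketch of how they could be driven from Python. It assumes that the `pdftotext` binary from poppler-utils is on the PATH and that tabula-py (which needs Java) is installed. The helper names `build_pdftotext_command`, `extract_text` and `extract_tables` are my own illustrations, not part of either tool.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for poppler-utils' pdftotext."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page,
        # which helps when the text is arranged in columns.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_text(pdf_path, keep_layout=True):
    """Run pdftotext on pdf_path and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    # Hypothetical convention: write the text next to the PDF.
    txt_path = Path(pdf_path).with_suffix(".txt")
    subprocess.run(
        build_pdftotext_command(pdf_path, txt_path, keep_layout),
        check=True,
    )
    return txt_path.read_text(encoding="utf-8", errors="replace")


def extract_tables(pdf_path):
    """Return the tables that tabula-py finds in pdf_path."""
    # Imported lazily so the text helpers work without tabula-py.
    import tabula

    # With pages="all", read_pdf scans every page for tables.
    return tabula.read_pdf(str(pdf_path), pages="all")
```

With a file such as `report.pdf` (a placeholder name), `extract_text("report.pdf")` would return the text as one string, and `extract_tables("report.pdf")` would return whatever tables Tabula detects. How well either works depends heavily on how the PDF was produced, so expect to inspect and clean the results.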
Rufus Pollock’s tools review article from 2016 is still relevant today.
If you prefer to read a tutorial, here is one: Get Started With Scraping – Extracting Simple Tables from PDF Documents. Note that it was written in 2013, so some of the tools it mentions may no longer be maintained. Another tutorial: How I parse PDF files.
Finally, a paper about the challenges of scraping PDF files: Towards High-Quality Text Stream Extraction from PDF.
Author: Omer Zak
I have been deaf since birth. Since age 12, I have played with big computers which eat punched cards and spew out printouts. Ever since they became available, I have worked and played with desktop-size computers which eat keyboard keypresses and spew out display pixels.
Among other things, I developed software which helped the deaf in Israel use the telephone network, by means of home computers equipped with modems. Several years later, I developed Hebrew localizations for some cellular phones, which helped the deaf in Israel utilize the cellular phone networks.
I am interested in entrepreneurship, Science Fiction and making the world more accessible to people with disabilities.