I want to scrape a PDF file. Where can I find information?

Personally, I have had good experiences with poppler-utils for text scraping. For tables, I used Tabula (actually, I used tabula-py, which is a simple Python wrapper for Tabula).
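As a rough illustration of both tools, here is a minimal sketch. It assumes that the pdftotext program from poppler-utils is on the PATH and that tabula-py (which needs Java) is installed. The helper names and file names are mine, chosen only for this example.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the pdftotext (poppler-utils) command line for one PDF file."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout,
        # which usually keeps columns readable.
        command.append("-layout")
    command += [pdf_path, txt_path]
    return command


def extract_text(pdf_path, txt_path):
    """Run pdftotext; raises CalledProcessError if the tool reports failure."""
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)


def extract_tables(pdf_path):
    """Return the tables that Tabula finds, as a list of pandas DataFrames."""
    import tabula  # imported lazily: provided by the tabula-py package

    return tabula.read_pdf(pdf_path, pages="all")
```

Calling extract_text("report.pdf", "report.txt") would write the text of a (hypothetical) report.pdf into report.txt. Expect to clean up the results by hand, since no tool gets every PDF right.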

Rufus Pollock’s tools review article from 2016 is still relevant today.

If you prefer to read a tutorial, here is one: Get Started With Scraping – Extracting Simple Tables from PDF Documents. Note that it was written in 2013, so some of the tools may have died by now. Another tutorial: How I parse PDF files.

Finally, a paper about the challenges of scraping PDF files: Towards High-Quality Text Stream Extraction from PDF.

Which videoconferencing software supports closed captions?

If you need to explore alternative videoconferencing services, the following may help you. This information is up to date as of March 01, 2021.

Zoom supports a free tier, integration support, closed captions and an API for 3rd party captioning.

RingCentral supports free tier, integration support and closed captions. No API for 3rd party captioning.

Webex supports free tier and integration support. No closed captions support was found!

Theodor Herzl and the Basic Income plans

One hundred and twenty years ago, Theodor Herzl published a cautionary tale about what we know today as Universal basic income plans. He probably based his concerns upon the experience of the Romans under the Lex Frumentaria plan, under which grain was bought from North Africa and Sicily and distributed to citizens at a low price. See also Gaius Gracchus.

Theodor Herzl wrote the story in German, and it was translated into Hebrew. I remember reading the Hebrew translation of the story in my childhood.

When I wanted to present to English speakers a contrarian point of view about the Universal basic income plans, I found to my surprise that no English translation of the story existed.

Well, the English translation of Herzl’s story is now available for your enjoyment and education.

Automatic localtime management in ESP8266 and other low-memory IoT devices

Justification

During the last several years, personal computers and smartphones became capable of displaying the local time, correctly adjusted for daylight saving time (DST) – and without requiring human intervention beyond selecting the correct timezone.

Nowadays, there are also some IoT devices that need to support local time management, either by displaying the local time or by otherwise making it available.

Timekeeping is performed using the Internet protocol NTP, which provides the correct UTC. When using a PC or a smartphone, the timezone is usually selected by manual user action.

However, some IoT devices may not have the UI needed for convenient timezone selection. Then it is desirable to support automatic timezone selection as a default.

How to implement automatic timezone selection?

Some websites discover your IP address and provide you with the best guess of your timezone.
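To make this step concrete, here is a minimal sketch of such a lookup. The URL below is a placeholder, not a real service, and the "timezone" field name is an assumption. Check the documentation of whichever IP geolocation service you actually use.

```python
import json
from urllib.request import urlopen

# Placeholder endpoint (an assumption): substitute a real IP geolocation service.
LOOKUP_URL = "https://geo.example.com/json"


def extract_timezone(body, field="timezone"):
    """Pull the timezone name (e.g. "Asia/Jerusalem") out of a JSON reply."""
    reply = json.loads(body.decode("utf-8"))
    name = reply.get(field) if isinstance(reply, dict) else None
    if not isinstance(name, str) or not name:
        raise ValueError("reply does not contain a timezone name")
    return name


def guess_timezone(url=LOOKUP_URL, timeout=10.0):
    """Ask the service which timezone it associates with our IP address."""
    with urlopen(url, timeout=timeout) as response:
        return extract_timezone(response.read())
```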

Since those websites usually provide the timezone name rather than the string describing the DST transition dates (the so-called tz_string), the next step is to figure out the DST transition dates from the timezone name.

In devices with plenty of memory this is carried out by means of a timezone database.

For example, in Debian/Ubuntu based systems, this database is stored in the /usr/share/zoneinfo directory and occupies 3.5MB (the relevant package in Ubuntu 18.04 is tzdata and its version, as of Nov. 2019, is 2019c-0ubuntu0.18.04).
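The compiled files in that directory use the TZif format. From version 2 of the format onward, each file ends with a footer that holds a POSIX-style TZ string between two newline characters. The following sketch reads that footer. It illustrates the idea under that assumption and is not necessarily how any particular tool does it. The function names are mine.

```python
from pathlib import Path


def tz_string_from_tzif(data):
    """Return the TZ string stored in the footer of a version 2+ TZif file."""
    if data[:4] != b"TZif":
        raise ValueError("not a TZif file")
    if data[4:5] == b"\x00":
        # Version 1 files carry a NUL version byte and have no footer.
        raise ValueError("version 1 TZif file has no footer")
    if not data.endswith(b"\n"):
        raise ValueError("footer is missing")
    # The footer is: newline, TZ string, newline (at the very end of the file).
    start = data.rfind(b"\n", 0, len(data) - 1)
    if start < 0:
        raise ValueError("footer is missing")
    return data[start + 1:-1].decode("ascii")


def scan_zoneinfo(root="/usr/share/zoneinfo"):
    """Map timezone names (paths relative to root) to their TZ strings."""
    base = Path(root)
    table = {}
    for path in sorted(base.rglob("*")):
        if not path.is_file():
            continue
        try:
            tz_string = tz_string_from_tzif(path.read_bytes())
        except ValueError:
            continue  # not a TZif file, or one without a usable footer
        table[path.relative_to(base).as_posix()] = tz_string
    return table
```

On a typical Debian/Ubuntu machine, scan_zoneinfo() should yield a dictionary that is far smaller than the directory it was built from. The footer may be empty for a zone whose rules cannot be expressed as a TZ string.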

Memory constrained IoT devices

However, IoT devices are typically based upon memory-constrained controllers and cannot afford to store locally the whole timezone database – just to correctly determine the local time for a single timezone.

Therefore, IoT devices need to access an Internet based service to get the correct timezone information, just as they get UTC time updates using NTP. In other words, those IoT devices effectively outsource the timezone database management.
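Once a device has the UTC time and a tz_string, the C library can do the rest. The sketch below shows the idea in Python on a Unix-like system, where time.tzset() is available. The tz_string used in the usage note is the US Eastern rule, picked only as a familiar example, and the function name is mine.

```python
import os
import time


def local_time_for(tz_string, utc_seconds):
    """Convert seconds since the Unix epoch to local time under tz_string."""
    os.environ["TZ"] = tz_string
    time.tzset()  # Unix only: re-read the TZ environment variable
    return time.localtime(utc_seconds)
```

For example, local_time_for("EST5EDT,M3.2.0,M11.1.0", 1593561600) takes midnight UTC on July 1, 2020 and returns 20:00 on June 30 with the DST flag set. The same instant under plain "UTC0" stays at midnight.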

Internet service for providing the timezone information

An Internet service that provides the correct tz_string corresponding to a timezone name needs to keep its timezone database up to date at all times.

I implemented the internet service as follows.

  1. A machine, running an Ubuntu 18.04 installation with a webserver, is used.
  2. The Internet service is implemented as a small WSGI-based website. It uses the database mentioned below.
  3. A script scans the /usr/share/zoneinfo contents and creates a small database for translating timezone names into the corresponding tz_string values.
  4. There is a mechanism for invoking the above script and restarting the web server each time the tzdata package is updated/installed/re-installed.
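To give a feel for how small such a website can be, here is a sketch written as a plain WSGI callable. It deliberately uses only the standard library instead of a framework, so it is not the code of the actual service. The "tz" query parameter and the two sample entries in the table are assumptions made for the example.

```python
from urllib.parse import parse_qs

# Sample entries only; a real deployment would load the table that the
# scanning script generates from /usr/share/zoneinfo.
TZ_STRINGS = {
    "Asia/Jerusalem": "IST-2IDT,M3.4.4/26,M10.5.0",
    "Etc/UTC": "UTC0",
}


def application(environ, start_response):
    """Answer GET /?tz=<timezone name> with the matching tz_string."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = query.get("tz", [""])[0]
    tz_string = TZ_STRINGS.get(name)
    if tz_string is None:
        status, body = "404 Not Found", b"unknown timezone\n"
    else:
        status, body = "200 OK", tz_string.encode("ascii") + b"\n"
    headers = [
        ("Content-Type", "text/plain; charset=us-ascii"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

For a quick local test, wsgiref.simple_server.make_server("", 8000, application).serve_forever() serves it on port 8000. A plain-text reply keeps parsing trivial on a memory-constrained client.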

Show us the code!

The GitHub project tddpirate/tzdata2tzstring includes redacted versions of both an implementation of the above website and a sample ESP8266 client.

Credits

I wish to thank the Python Israel Telegram group members for advice about selecting a Python WSGI framework. I ended up selecting Falcon because benchmarks indicated that it is faster than Flask and Bottle.

The members of the לינוקס Telegram group deserve thanks, too. They helped me find the mechanism for appending my own postprocessing scripts after a Debian/Ubuntu package installation or upgrade.

Another definition of Web 2.0

The Web 2.0 label applies to anything that can be adversely affected by both spam and vandalism.
Spam consists of adding new worthless content.
Vandalism consists of destroying worthy content.

הגדרה נוספת של Web 2.0

התווית Web 2.0 חלה על כל דבר, שניתן להזיק לו על ידי ספאם ועל ידי וונדליזם.

ספאם פרושו הוספת תוכן חסר ערך.

וונדליזם פרושו השחתת תוכן בעל ערך.

Switching the blog’s language

בשנה האחרונה היתה לי התלבטות קשה האם להמשיך לכתוב את הבלוג הזה באנגלית או לעבור לעברית.

נימוקים בעד אנגלית:

  • הנושאים, שבהם הבלוג עוסק, הינם בדרך כלל אוניברסליים ואינם מיוחדים לעולם דוברי העברית.
  • רציתי לנהל דו-שיח עם אנשים מכל העולם ולא רק מישראל.
  • הנחתי, שהאנשים המתעניינים בנושאים שכתבתי עליהם, בדרך כלל יודעים גם אנגלית.

נימוקים בעד עברית:

  • בפועל, לא היה לי כמעט דו-שיח עם דוברי אנגלית מחוץ לארץ.
  • אני כותב גם על נושאים ברומו של עולם לקויי השמיעה, ולקויי השמיעה בישראל יחסית יודעים פחות אנגלית מהאוכלוסיה הכללית.
  • נוכחתי לדעת שבפייסבוק, כשאני כותב בעברית, יש הרבה יותר דו-שיח ביני ובין אחרים, ממה שהיה לי כשכתבתי בבלוג באנגלית.

הפתרון האידיאלי הינו, כמובן, לכתוב את הכתבות בבלוג בשתי השפות, כמו שאני עושה בדרך כלל עם הסטטוסים שלי בפייסבוק.

ההחלטה שלי לכבוד השנה האזרחית החדשה היא לעבור לכתיבת הבלוג בעברית פרט לכתבות שנזקקות למונחים מקצועיים רבים (דהיינו תוכנה), שימשיכו להיכתב באנגלית.  הסיבה היא, שסביר להניח שמי שמתעניין בנושאים שלהן – כבר שולט באנגלית.  כמו כן, במידת האפשר לכתוב את הכתבות בשתי השפות.

During the last year, I agonized over the decision whether to continue to write this blog in English, or to switch to Hebrew.

Arguments in favor of English:

  • The topics, with which this blog deals, are usually universal and not specific to the world of Hebrew speakers.
  • I wanted to engage in dialogue with people from all over the world, not only from Israel.
  • I assumed that the people who are interested in the subjects I write about usually also know English.

Arguments in favor of Hebrew:

  • It turned out that I had almost no dialogue with English speakers from abroad.
  • I also write about subjects related to the world of the hearing impaired, and the hearing impaired in Israel know relatively less English than the general population.
  • I found that when I write in Hebrew on Facebook, there is much more dialogue between me and others than there was when I wrote in this blog in English.

The ideal solution is, of course, to write my blog items in both languages, as I usually do with my Facebook statuses.

My resolution for the new civil year is to switch to writing this blog in Hebrew, except for items which need many technical terms (that is, software topics), which will continue to be written in English.  The reason is that people who are interested in those subjects have probably already mastered English.  Also, whenever feasible, I intend to write the blog items in both languages.