Convert Apache 2.4 configuration and rewrite rules into Nginx configuration

These days I am moving my website from Apache 2.4 to Nginx.

So I googled for “convert apache rewrite to nginx” and found the following articles:

In addition to the aforementioned articles, I found the following SaaS services for converting Apache 2.4 configuration files into Nginx configuration files.

I recommend that you run your conversion through both winginx.com and getpagespeed.com, and then review and edit the resulting Nginx configuration files.
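
To give a flavor of what such a conversion looks like, here is a small hand-written example (the paths are made up, and the converters above may emit somewhat different output): an Apache permanent-redirect rewrite rule and its Nginx equivalent.

# Apache 2.4 (.htaccess)
RewriteEngine On
RewriteRule ^old-page$ /new-page [R=301,L]

# Nginx (inside the relevant server block)
rewrite ^/old-page$ /new-page permanent;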

Cheatsheet for using rsync for archiving

# -avHAX:
# -a = archive (equivalent to -rlptgoD)
#      -r = recurse into directories
#      -l = copy symlinks as symlinks
#      -p = preserve permissions
#      -t = preserve modification times
#      -g = preserve group
#      -o = preserve owner (requires sudo)
#      -D = --devices --specials
#           --devices = preserve device files (requires sudo)
#           --specials = preserve special files
# -v = verbose
# -H = preserve hard links
# -A = preserve ACLs
# -X = preserve xattrs
#
# Before actual work, you may want to use --dry-run.
sudo rsync -avHAX --progress --exclude-from=- $SRCDIR $DESTDIR <<EOF
- /temp/***
- /trash_script.sh
EOF
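
As the comment above suggests, it is a good idea to preview the transfer first. The same command with the -n (--dry-run) flag added lists what would be copied without actually copying anything:

sudo rsync -avHAXn --progress --exclude-from=- $SRCDIR $DESTDIR <<EOF
- /temp/***
- /trash_script.sh
EOF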

Include fragments in static site source files to be processed by Pelican

I use Pelican to generate static websites using markdown files.

I found that there are text fragments that I want to reuse in several of my web pages, but not in all of them. So I needed a way to include those fragments in the markdown pages, using a suitable include directive, rather than adding them to the template files used by Pelican to generate the final HTML files.

I googled for a way to use include files in markdown files to be processed by Pelican, but I was led to unhelpful answers. For example, the mdx_include module: I rejected it because I found no easy way for lazy people to integrate it into Pelican.

Eventually I converged upon the following solution.

Pelican uses the Jinja2 templating engine, which has the {% include 'filename' %} directive.

By default, Pelican subjects only template files to Jinja2 processing. So, to use the include directive in my own markdown files, I need a way to have them go through Jinja2 processing as well.  This is what the jinja2content Pelican plugin does.

To install it:

pip install pelican-jinja2content

To use it, add the following to your `pelicanconf.py` file:

PLUGINS = [
    # ...
    'pelican.plugins.jinja2content',
    # ...
]

# jinja2content configuration

IGNORE_FILES = [
    'includes',
]
JINJA2CONTENT_TEMPLATES = [
    './includes',
]

Then add your include files to the ./content/includes subdirectory. They may have any extension you want, including .md; I recommend .mdinc to tell them apart from your article markdown files. Instead of includes, you may use any subdirectory name by changing the configuration above.
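
For example (file names and contents are illustrative), an article in content/ can pull in the fragment stored at content/includes/disclaimer.mdinc like this:

Title: My article
Date: 2021-03-01

Regular article text written in markdown.

{% include 'disclaimer.mdinc' %}

More article text.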

Before doing sophisticated things with your include files, read the plugin’s documentation and understand the limitations of using Jinja2 tags inside your include and markdown files.

Fun with Python’s “import this”

Everyone knows that when you run import this in your Python script, you get a printout of The Zen of Python, by Tim Peters.

However, what else is there in the this module?

>>> import this
The Zen of Python, by Tim Peters
.
.
.
>>> help(this)
Help on module this:

NAME
    this

MODULE REFERENCE
    https://docs.python.org/3.8/library/this
.
.
.
DATA
    c = 97
    d = {'A': 'N', 'B': 'O', 'C': 'P', 'D': 'Q', 'E': 'R', ...
    i = 25
    s = "Gur Mra bs Clguba, ol Gvz Crgref\n\nOrnhgvshy vf o...
FILE
    ...

It looks like the text of The Zen of Python is stored in the module as this.s, encoded with ROT13. The table for decoding this text is stored in this.d.

What is the most Pythonic way to decode the text in this.s?

  1. Loop over the characters in this.s:
    [something(c) for c in this.s]
  2. Decode c if it is encoded (is in the this.d dictionary):
    [(this.d[c] if c in this.d else c) for c in this.s]
  3. The result of the comprehension is a list of characters. Make a string out of it:
    ''.join([(this.d[c] if c in this.d else c) for c in this.s])
  4. Now, print the result:
    print(''.join([(this.d[c] if c in this.d else c) for c in this.s]))
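
Incidentally (this is not part of the exercise above), the standard library already knows this encoding, so the same text can be recovered with the codecs module in a single call:

import codecs
import this   # importing it prints the Zen once, as shown above

print(codecs.decode(this.s, 'rot_13'))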

That’s all, folks!

Speech to Text in Hebrew

Yesterday evening I experimented with two STT (speech to text) services – Live Transcribe and WebCaptioner.

I operated both of them in a meeting whose language was Hebrew. The meeting included a lecture in a hall and remote connection via Zoom. I elected to connect via Zoom.
Live Transcribe was operated on a tablet, which eavesdropped on my laptop via the laptop’s speakers and the tablet’s microphone, while WebCaptioner was operated in a browser running on my laptop. Zoom ran on my laptop as an application. Both STT services were set up to recognize Hebrew speech.

My finding was that most of the time, the services did not deliver the goods. They emitted Hebrew words without grammar and out of the meeting’s context. However, there were moments in which they worked correctly.
I also noticed that when the services did not work correctly, each of them produced different output. When they worked correctly, the texts they produced were similar to each other.

During the meeting, the lecturers did not use a wireless microphone located near their mouths. They stood at different distances from the microphone. When they stood near the microphone, the services worked better than when the lecturers stood far away from the microphone.
By “worked better” I mean that the services emitted text continuously, rather than producing long periods of no activity interspersed with short text segments.

The above confirms what I found a long time ago: STT services need to receive the same treatment as hard-of-hearing (HOH) people. Just as environmental noise interferes with the ability of the hard of hearing to understand speech, it also interferes with STT services.


Want to forward messages with images in Evolution E-mail client

Problem

I use the Evolution e-mail client, version 3.30.5-1.1, running under Linux (Debian Buster). I also configure it to default to Plain Text format when creating new messages.

I found that when I get an HTML message with embedded images and want to forward it to someone else, the text is forwarded but not the images, even when I set the forwarded message’s format to HTML (instead of Plain Text).

This happens in all four possible ways of forwarding the message (Forward As Attached, Forward As Inline, Forward As Quoted, Redirect).

Solution

The workaround I found is to reconfigure Evolution to default to HTML format before forwarding the message, and to return to Plain Text afterwards. This way, when I click on the Forward button and Evolution initially constructs the message to be forwarded, it includes all the contents of the original message, including the embedded images.

To reconfigure the default message format:

  1. Open the “Evolution Preferences” pop up dialog: Edit / Preferences.
  2. Select the pop up dialog pane: Composer Preferences / General.
  3. Toggle checkmark in Default Behavior / Format messages in HTML.
  4. Click on the Close button at bottom right of the dialog.

After defaulting to HTML format, forwarding an HTML-formatted message with images preserves the images, in all four possible ways of forwarding the message.

(Submitted to gitlab.gnome.org as Evolution issue #1406 and to bugs.debian.org as bug #984599.)

What to reply to a computer science student who asked you to be his accomplice in cheating?

You are probably familiar with the phenomenon of students who pay other people to write the term papers, theses and projects that they submit in order to meet academic requirements.

A few years ago, a computer science student named R. (a pseudonym) approached me and asked me to write a computer program for him and his partner, which they would submit to meet a requirement of a course they were taking.

Instead of taking money from him, I replied to him as follows.

I am approaching your question from the point of view of a mentor, teacher or wise person who needs to advise a young person who is in a difficult situation and is considering a bad solution to his problem. What the young person really needs is not to have someone else do his project for him, but long-term thinking: what are the long-term consequences of this solution, what alternative solutions exist, which obstacles those alternatives present, how to overcome those obstacles, and where to find the courage to change course.

For starters, as far as I am concerned, what you asked for is in the grey area between cheating and an original solution to the problem. This is because certificates are not worth that much in the vocation of software development. Either the developer knows how to program or he doesn’t, no matter what degrees or impressive certificates he has. If he does not know how to program, then within half a year his employer, if the employer has a clue, knows about it and gives him a kick in the ass, reducing the long-term damage. There are also people who take on big projects and hire other people to do the actual work. However, the difference is that they have to provide the project with services such as marketing skills, project management, search and selection of development tools, money handling, etc., instead of (or in addition to) software development skills.

Now to the point. Before proceeding further with what you and your partner are contemplating doing, I highly recommend that both of you read Ayn Rand’s “The Fountainhead” and follow Peter Keating’s career development in the book. He started out relying upon other people, as you are contemplating doing, made an impression on the right people and reached the top of his profession.

But… he didn’t last long and eventually he fell. And the sad truth is that he trained for the wrong vocation. There was a vocation that suited him perfectly, and he could really excel at it, but his mother pressed him to learn the vocation he actually learned (and in which he eventually failed). The saddest thing about his story is that when he realized which vocation is right for him and started engaging in it – it turned out that he started it too late and could not reach a high level of proficiency in that vocation.

If you and your partner decide to pay someone else to do your project, then:

  • Anyone who knows that you have done this will be unable to help you look for a job, because they would have to lie if they vouched for your software development skills.
  • During software development work, there are periods of extreme pressure. Schools plan their course syllabuses so that an average student can handle the resulting pressure (with some sighs and groans). At work, pressure can be unlimited. So if you are unable to cope with pressure in school, it is very unlikely that you can cope with it at work, and you should consider a vocation in which there is no such pressure.
  • You give up the fight to become really good professionals, who know when to accept failure like men (even at work some projects fail, for all kinds of reasons, such as over-optimistic effort estimates, and it is better to admit failure and move on to another project). Instead of accepting failure and its consequences, you are heading toward pretense.

What to do now?

I suggest that you first carefully review the decision-making process that led you to choose a vocation in the software world. If you have taken psychotechnic tests and consulted a specialist in the area of vocational selection, one of the tests was probably as follows:

  1. Go over a very long list of topics and highlight those which interest you.
  2. Group the interesting topics into groups, such that the topics in each group have the same theme from your point of view.
  3. Go over the groups and identify potential vocations related to each group.

Why am I telling you all this? Because if you kept the papers from your evaluation (or you can get them), you might find there a clue for identifying a vocation, which really attracts you and in which you can excel.

The next step is to determine whether you have relatives who are unwilling to accept that your future is not in the lucrative and profitable software world, but in another direction. Then check if, and how, to neutralize their influence upon your choice of the vocation that fits you.

I assume that the computer world appeals to you, so you may want to check out some other vocations in this world besides writing software (I remember that at the Hadassah Institute for Professional Selection Counseling in Jerusalem, where I did my vocational counselling, there was a library with descriptions of thousands of vocations; such a library could help you choose the right vocation for you). Examples: training, installation and configuration, software testing, maybe even administrative project management. Then go on to specialize in the vocation that suits you and in which you can excel.

True, you have already started studying and have already invested two years in your studies, and now I am proposing to write off all this investment and start over? Yes. However, as far as getting a certificate or a degree is concerned, some of the investment will probably be lost. But as I said above, certificates are not that valuable in the software world. Just as a pilot’s license does not turn someone without the aptitude for flying into an ace fighter pilot, a software developer’s certificate does not turn someone unfit to be a software developer into a great software developer. In terms of content, I am sure you have learned things that will help you in any direction you choose for the rest of your life. And as far as the requirements for finishing your studies are concerned, once you know which direction is right for you, you can probably switch to a major which fits your vocational goals. In that case, you will probably be able to use some of the credits of the courses that you have already completed. So what you have already studied is not a total loss.

P.S.:

A student who pays someone else to do his homework, term papers, projects or theses is like a basketball player who pays someone else to attend his team’s practice sessions.

Automatic localtime management in ESP8266 and other low-memory IoT devices

Justification

Over the last several years, personal computers and smartphones have become capable of displaying the local time, correctly adjusted for daylight saving time (DST), without requiring human intervention beyond selecting the correct timezone.

Nowadays there are also some IoT devices which need to support local time management: displaying the local time, or otherwise making it available.

Timekeeping is performed using the Internet protocol NTP, which provides the correct UTC. When using a PC or a smartphone, the timezone is usually selected by manual user action.

However, some IoT devices may not have the UI needed for convenient timezone selection. Then it is desirable to support automatic timezone selection as a default.

How to implement automatic timezone selection?

There are websites which discover your IP address and provide you with a best guess of your timezone.

Since those websites usually provide the timezone name rather than the string describing the DST transition dates (the so-called tz_string), the next step is to figure out the DST transition dates from the timezone name.

In devices with plenty of memory this is carried out by means of a timezone database.

For example, in Debian/Ubuntu based systems, this database is stored in the /usr/share/zoneinfo directory and occupies 3.5MB (the relevant package in Ubuntu 18.04 is tzdata and its version, as of Nov. 2019, is 2019c-0ubuntu0.18.04).

Memory constrained IoT devices

However, IoT devices are typically based upon memory-constrained controllers and cannot afford to store locally the whole timezone database – just to correctly determine the local time for a single timezone.

Therefore, IoT devices need to access an Internet based service to get the correct timezone information, just as they get UTC time updates using NTP. In other words, those IoT devices effectively outsource the timezone database management.

Internet service for providing the timezone information

An Internet service for providing the correct tz_string corresponding to a timezone name needs to keep its timezone database up to date at all times.

I implemented the internet service as follows.

  1. A machine running an Ubuntu 18.04 installation with a web server is used.
  2. The Internet service is implemented as a small WSGI-based website. It uses the database mentioned below.
  3. A script scans the /usr/share/zoneinfo contents and creates a small database for translating timezone names into the corresponding tz_string values (a rough sketch of such a script appears below).
  4. There is a mechanism for invoking the above script and restarting the web server each time the tzdata package is updated/installed/re-installed.
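
The following is a rough sketch (not the code from the repository mentioned below) of what the scanning script of step 3 could look like. It relies on the fact that version 2 and 3 TZif files end with a footer line holding the POSIX TZ string; for example, Asia/Jerusalem maps to a string like IST-2IDT,M3.4.4/26,M10.5.0.

import json
import os

ZONEINFO = "/usr/share/zoneinfo"

def tz_string_from_tzif(path):
    # TZif version 2/3 files (RFC 8536) end with a newline-enclosed footer
    # containing the POSIX TZ string.  Version 1 files have no footer, and
    # some files under zoneinfo are not TZif files at all; return None then.
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(b"TZif"):
        return None
    footer = data.rstrip(b"\n").rsplit(b"\n", 1)[-1]
    try:
        tz_string = footer.decode("ascii")
    except UnicodeDecodeError:
        return None
    return tz_string if tz_string and "\0" not in tz_string else None

def build_mapping(root=ZONEINFO):
    # Walk the whole tree; this also picks up the posix/ and right/ variants,
    # which can be filtered out if they are not wanted.
    mapping = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            zone = os.path.relpath(path, root)   # e.g. "Asia/Jerusalem"
            tz_string = tz_string_from_tzif(path)
            if tz_string is not None:
                mapping[zone] = tz_string
    return mapping

if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2, sort_keys=True))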

Show us the code!

The GitHub project tddpirate/tzdata2tzstring includes redacted versions of both an implementation of the above website and a sample ESP8266 client.
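
For orientation only, here is a minimal sketch of what a Falcon-based endpoint for such a service might look like. It is not the code from the repository above; the route, the query parameter and the response fields are my own illustrative choices, and the hard-coded mapping stands in for the database built by the scanning script.

import falcon

# A tiny hard-coded mapping for illustration; the real service would use the
# database produced from /usr/share/zoneinfo.  The values shown come from a
# recent tzdata release and may change when the rules change.
MAPPING = {
    "UTC": "UTC0",
    "Asia/Jerusalem": "IST-2IDT,M3.4.4/26,M10.5.0",
}

class TzStringResource:
    def __init__(self, mapping):
        self._mapping = mapping

    def on_get(self, req, resp):
        # Timezone names contain slashes, so pass the name as a query
        # parameter:  GET /tzstring?tz=Asia/Jerusalem
        tzname = req.get_param("tz", required=True)
        tz_string = self._mapping.get(tzname)
        if tz_string is None:
            raise falcon.HTTPNotFound()
        resp.media = {"timezone": tzname, "tz_string": tz_string}

# Falcon 3.x; older releases use falcon.API() instead of falcon.App().
app = falcon.App()
app.add_route("/tzstring", TzStringResource(MAPPING))

An ESP8266 client can then fetch the tz_string with a single HTTP GET, once per boot or once per day, and apply it to its local time handling.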

Credits

I wish to thank the Python Israel Telegram group members for advice about selecting a Python WSGI framework. I ended up selecting Falcon because benchmarks indicated that it is faster than Flask and Bottle.

The members of the לינוקס (Linux) Telegram group deserve thanks, too. They helped me find the mechanism for appending my own postprocessing scripts after a Debian/Ubuntu package installation or upgrade.

Get DisplayLink to work on Lenovo Y700 after upgrade to Debian Buster

Hardware

Laptop:
  • Lenovo Ideapad Y700
Graphics cards (in my system, they are configured to work in hybrid mode):
  • GeForce GTX 960M – NVIDIA Corporation GM107M (rev ff)
  • Intel Corporation HD Graphics 530 (rev 06)
Docking station with DisplayLink support:
  • ThinkPad Basic USB 3.0 Dock, Model No. DL3700-ESS
    It is connected to the laptop via a USB 3.0 port and has its own power supply.

Using the NVIDIA graphics card

The following point is probably irrelevant to DisplayLink usage and problems. However, it is part of my environment and I mention it for completeness’ sake.

The laptop is configured, as instructed by https://wiki.debian.org/Bumblebee, to work with the Intel graphics card. The NVIDIA card is used by applications running under optirun. I had to modify /etc/bumblebee/bumblebee.conf to use KernelDriver=nvidia-current rather than KernelDriver=nvidia.

Connecting two additional displays to the laptop

To install the appropriate driver:

  • git clone https://github.com/AdnanHodzic/displaylink-debian.git
  • Follow the instructions in README.md

When everything works properly, three displays are identified by xrandr -q | egrep axis as follows:

  • eDP-1 – laptop’s display
  • HDMI-1 – external display connected via laptop’s HDMI port
  • DVI-I-1-1 – external display connected via DisplayLink on docking port

Note that those displays could have different identifiers (such as DP1 or eDP1) in your system.

After starting the X-Window, configure the displays using:

  1. xrandr --output HDMI-1 --primary
  2. xrandr --output eDP-1 --mode 1360x768 --right-of HDMI-1
  3. sleep 1     # without it, the following display was not properly configured.
  4. xrandr --output DVI-I-1-1 --left-of HDMI-1

You probably want to add those commands to your ~/.xinitrc.
I chose the 1360×768 mode to have the same DPI in all attached displays.

Problems when upgrading from Debian 9 (Stretch) to Debian 10 (Buster)

The above setup worked under Debian 9 (Stretch).
However, after upgrading to Debian 10 (Buster) following the instructions in Release Notes for Debian 10 (buster), 64-bit PC, chapter 4, either the X server did not work or the display connected via the docking station was misconfigured.

I got it to work as follows:

  1. Update your displaylink-debian clone to the most recent commit using
    git pull
  2. If the most recent commit did not work for you, try:
    git checkout fcb6ce5bc36c774af2d7f792842bcd2ede9c7483
    as this commit worked for me after performing the following steps.
  3. Reinstall the driver by running displaylink-debian.sh and following the instructions in README.md.
  4. Finally, replace the contents of the file /etc/X11/xorg.conf.d/20-displaylink.conf, installed by the above instructions, with the following:
    Section "ServerLayout"
        Identifier "layout"
        Screen 0 "Intel Graphics"
        Inactive "nvidia"
    EndSection
    
    Section "Device"
        Identifier "intel"
        Driver "modesetting"
        Option "PageFlip" "false"
        Option "AccelMethod" "None"
    EndSection
    
    Section "Screen"
        Identifier "intel"
        Device "intel"
    EndSection
    
    Section "Device"
        Identifier "nvidia"
        Driver "nvidia"
        Option "ConstrainCursor" "off"
    EndSection
    
    Section "Screen"
        Identifier "nvidia"
        Device "nvidia"
        Option "AllowEmptyInitialConfiguration" "on"
        Option "IgnoreDisplayDevices" "CRT"
    EndSection
    
    Section "Device"
        Identifier "Intel Graphics"
        Driver "modesetting"
        Option "VSync" "false"
    EndSection
    
    Section "Screen"
        Identifier "Intel Graphics"
        Device "Intel Graphics"
    EndSection
  5. You need to restart the X-Server (I restarted the entire laptop to be on the safe side).

See GitHub issue: AdnanHodzic/displaylink-debian, Debian buster #308 for a similar bug report.

Credits

I wish to thank Boris Shtrasman for reviewing a draft of this post and providing a lot of feedback. Of course, any remaining inaccuracies in this post are my sole responsibility.

Anonymizing datasets for machine learning

Preface

All of us are familiar with the idea of anonymizing datasets to get rid of personally-identifiable information, in order to enable data mining while preserving (as much as possible) the privacy of the people whose data was collected. The basic idea is to modify names, ID numbers (Social Security Number in USA), home addresses, birthdays, IP addresses and similar information. Sometimes, one needs to get rid also of information about age/gender/nationality/ethnicity.

This method was subjected to a lot of research and it is easy to find, with the help of search engines, relevant papers and articles. See Bibliography for examples.

However, there is also another transformation of datasets. Unlike anonymization, as described above, this transformation is not about privacy preservation. It is about hiding the nature of the data being processed. Lacking a better term, we’ll use the term anonymization also for this transformation.

One possible application for this kind of anonymization is when one develops a revolutionary model for predicting the future behavior of the stock exchanges of the world by following various economic indicators and other publicly available time-dependent data sources.

In such an endeavor, the developer typically has gathered a lot of data, and wants to use it to train his revolutionary machine learning model. Since he cannot afford to build his own data center, he rents a lot of computing power in one of the cloud providers.

However, he does not want to take the risk of an unscrupulous employee of the cloud provider stealing his secret data or model and using it for his own benefit. He also wants to reduce the damage if a black hat hacker breaks into his rented computers.

Some users might want to process information, which interests governments such as the Chinese government. Those governments have the resources to break into cloud computers.

The classical way to mitigate such risks is to encrypt/recode/scramble (henceforth, I’ll refer to all those operations as encryption) the data being uploaded to the cloud. However, this encryption must be done in such a way that the data is still suitable for training the model. In addition, when running the model for making a prediction, the raw model’s results need to be generated in an encrypted form, for decryption in the developer’s on-premises computer/s (to which I will refer as a workstation henceforth). From this point on, we’ll use the terms anonymization and encryption interchangeably.

When looking for relevant research on the second kind of anonymization, I did not easily find relevant information. It motivated me to write this article.

Glossary

The following symbols are described in order of their appearance in text.

    • M: the transfer function of a machine learning system.
    • A: the argument of M – the data used by a machine learning system to make a prediction.
    • a_j: the j^{th} element of A.
    • P: the value of M(A) i.e. the prediction that the machine learning system makes when presented with data A.
    • p_k: the k^{th} element of P.
    • I: the identity function. For all x, I(x) = x.
    • F^{-1}(x) is the inverse of F(x), for any function F(x): for all relevant x, F^{-1}(F(x)) \equiv x \equiv F(F^{-1}(x)).
    • Functional composition: for all relevant x, (F_1 \circ F_2)(x) \equiv F_1(F_2(x)). For example, F^{-1} \circ F \equiv I \equiv F \circ F^{-1}.
    • E_a(A): a function which encrypts the argument A. Its inverse is denoted by E^{-1}_a(A'), which decrypts A', an encrypted version of the argument A.
    • D_p(P'): a function which decrypts the encrypted prediction P'. Its inverse is denoted by D^{-1}_p(P), which encrypts the prediction P.

Architecture of machine learning systems

A machine learning system is used to approximate a function M, which makes a prediction (or classification or whatever) P, given the m-tuple A, which packs together several argument values:

\displaystyle{}P = M(A)

where:

\displaystyle{}A = (a_1, a_2, \ldots, a_m)

is the argument, and

\displaystyle{}P = (p_1, p_2, \ldots, p_n)

is the prediction.

The values a_j of the argument and p_k of the prediction can be of any data type; they are not limited to scalars. This is why tuple notation is used rather than vector notation.

Examples of machine learning system applications:

  • Picture classification. When presented with a picture of an animal, the system tells how likely the animal is to be a dog, a cat or a horse. The system is trained by presenting it with several pictures, each together with a label identifying the animal shown in the picture.
  • Prediction of the next few values of a time series, such as the numbers which describe the weather at a particular location. The system is trained by using relevant historical information.

Machine learning systems are sometimes implemented using neural networks. Neural networks have the property that a sufficiently large network can be trained to approximate any function which meets certain reasonable conditions.

A machine learning system is trained to implement a good approximation of the function M by processing several 2-tuples (A_i, P_i), which associate each prediction (the desired value of the function, usually an n-tuple) with the corresponding argument value (usually an m-tuple).

The training process is very computationally intensive, so people often resort to cloud computing facilities, as said above.

Architecture of anonymized machine learning systems

When a user does not want to let the cloud provider know what he is doing, one possible approach is to train the model using encrypted data streams, so that the model’s outputs are encrypted as well. The data streams are encrypted on the user’s workstation. The workstation is used also to decrypt the model’s predictions.

The whole system can be described using the following formulae.

Original system:

\displaystyle{}P = M(A)

We add identity functions before and after M:

\displaystyle{}P = I \circ M \circ I(A) = I(M(I(A)))

The next step is to decompose the identity functions into pairs of a function and its inverse. The functions being used perform encryption and decryption.

\displaystyle{}P = (D_p \circ D_p^{-1}) \circ M \circ (E_a^{-1} \circ E_a(A))

where E_a(A) encrypts the argument A and D_p(P') decrypts the prediction P'.

Now we rearrange parentheses as follows:

\displaystyle{}P = D_p \circ (D_p^{-1} \circ M \circ E_a^{-1}) \circ E_a(A)

Now the system can be decomposed into three parts, which perform the following operations:

  1. Encrypt the argument A: \displaystyle{}A' = E_a(A)
  2. Actual encrypted machine learning system: \displaystyle{}P' = D_p^{-1} \circ M \circ E_a^{-1}(A') = M'(A')
  3. Decrypt the encrypted prediction P': \displaystyle{}P = D_p(P')

where A' and P' are the encrypted argument and prediction respectively.

The functions E_a(A) and D_p(P') need to be invertible, as their inverses are part of the function being approximated by the machine learning model M', which is the second part of the system and is the one actually run on the cloud provider’s computers.

The first and third parts are implemented on the user’s workstation. The typical implementation relies upon keys and scrambling formulae.

Two more requirements are:

  • The machine learning model P' = M'(A') has to be implemented using a technology which is sophisticated enough to also embed nonlinear, invertible functions (including in the loss function used to evaluate it).
  • There is sufficient training and validation data to train the model, which now embeds the nonlinear invertible functions as well.

Types of data

When dealing with anonymization of data, one has to consider separately each of the following data types.

  • Variable names
  • Numerical variables
  • Ordinal variables
  • Categorical variables
  • Time based variables

Variable names

Variable names are used for naming the various variables which are part of the argument and prediction of the machine learning model. They are used for inspecting the argument’s data streams and for retrieving relevant parts of the model’s prediction.

Of course, the cloud provider should not be exposed to the true names of the variables.

Variable names can be converted into meaningless strings, for example by using standard password scrambling algorithms such as a salted md5sum.

The user’s workstation would have tables for mapping between the true variable names and the names used by the model and the databases in the cloud.
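
A minimal sketch of such a mapping is shown below; the salt handling, the “v_” prefix and the example variable names are illustrative choices, not a prescribed scheme.

import hashlib
import secrets

# The salt stays on the workstation; it is never sent to the cloud.
SALT = secrets.token_hex(16)

def anonymize_name(name, salt=SALT):
    # Map a meaningful variable name to a meaningless but stable identifier.
    digest = hashlib.md5((salt + name).encode("utf-8")).hexdigest()
    return "v_" + digest[:12]

# Keep both directions of the mapping on the workstation only.
true_names = ["unemployment_rate", "sp500_close", "oil_price"]
forward = {name: anonymize_name(name) for name in true_names}
backward = {anon: name for name, anon in forward.items()}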

Numerical variables

Numerical variables can be transformed using invertible functions.

Also, if the argument A has several numerical elements (including time based elements), one could treat them as a single vector and transform it using an invertible matrix.

Mathematically, it could look as follows:

\vec {A_v'} = E_{av}(\vec {A_v}) = E_{matrix} \vec {A_v}

where:

  • \vec {A_v} is the restriction of A to numerical variables.
  • \vec {A_v'} is the encrypted version of \vec {A_v}.
  • E_{av} is the argument’s encryption function, restricted to numerical elements of the argument A.
  • E_{matrix} is an invertible transformation matrix.

Invertible scalar functions could be applied to A_v’s elements before and after the matrix transformation.
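
A minimal sketch of this idea, using a random invertible matrix as E_{matrix} (the dimensions and values are illustrative):

import numpy as np

rng = np.random.default_rng(seed=42)   # in practice, keep the seed/matrix secret
m = 4                                  # number of numerical elements in A_v

# A random Gaussian matrix is invertible with probability 1; checking the
# condition number guards against a numerically ill-conditioned draw.
E_matrix = rng.normal(size=(m, m))
assert np.linalg.cond(E_matrix) < 1e6
E_inverse = np.linalg.inv(E_matrix)

def encrypt_numeric(a_v):
    # Applied on the workstation before uploading data to the cloud.
    return E_matrix @ a_v

def decrypt_numeric(a_v_encrypted):
    # Only ever needed on the workstation.
    return E_inverse @ a_v_encrypted

a_v = np.array([3.2, 150.0, 0.7, 42.0])
assert np.allclose(decrypt_numeric(encrypt_numeric(a_v)), a_v)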

If the argument has also an element, which is a categorical variable, one could use a different transformation for each value of the categorical variable.

Ordinal variables

The values of the ordinal variables could be permuted. The learning model will implicitly embed the inverse permutation.

Categorical variables

Shuffling categories is not enough, because categories could be identified by their frequencies (like application of Zipf’s law to decrypting substitution ciphers).

The following approach is probably not universally applicable.

Categories could be anonymized by splitting a frequently occurring category into several subcategories. The learning model will give a different prediction for each subcategory. The different predictions will have to be somehow combined in the user’s workstation.

This approach also requires the model to be formulated in such a way that the final prediction can be derived by combining the predictions corresponding to the subcategories of split categories.

Time based variables

When anonymizing time based variables, one needs to transform the argument to hide any dependence it has upon weekly, monthly, seasonal or yearly cycles. One needs also to hide dependencies upon well-known events, such as volcano eruptions or rising CO_2 concentration in air.

Otherwise, it would be possible to identify dates by looking for correlations with well-known timings.

One possible way to hide those dependencies is to apply an ARIMA forecasting model to the argument.

Bibliography

The following articles are about getting rid of personally-identifiable information in order to preserve privacy.

      1. https://en.wikipedia.org/wiki/Data_anonymization
        • Generalization.
        • Perturbation.
      2. http://blog.datasift.com/2015/04/09/techniques-to-anonymize-human-data/
        The methods proposed by this article could interfere with machine learning, except for sufficiently small perturbations.
      3. https://www.elastic.co/blog/anonymize-it-the-general-purpose-tool-for-data-privacy-used-by-the-elastic-machine-learning-team
        • Suppression of fields.
        • Generation of semantically valid artificial data (such as strings). There is a Python module – Faker – which is good for faking names, addresses and random (lorem ipsum) text.
        • The methods, mentioned in this article, cannot anonymize numeric data.
      4. https://docs.splunk.com/Documentation/Splunk/7.2.3/Troubleshooting/AnonymizedatasamplestosendtoSupport
        Anonymization of data such as usernames, IP addresses, domain names.
      5. https://www.oreilly.com/ideas/anonymize-data-limits
        Human data cannot really be anonymized.
      6. https://www.intel.co.kr/content/dam/www/public/us/en/documents/best-practices/enhancing-cloud-security-using-data-anonymization.pdf
        Several methods for anonymizing data such as identifying information of humans, IP addresses, etc:

        • Hiding
        • Hashing
        • Permutation
        • Shift
        • Enumeration
        • Truncation
        • Prefix-preserving
      7. https://ieeexplore.ieee.org/abstract/document/6470603
        Usage of MapReduce to anonymize data.

Addendum

After finishing the first draft of this post, I was informed of the following.

Credits

I wish to thank Boris Shtrasman for reviewing a draft of this post and providing a lot of feedback. Of course, any remaining inaccuracies in this post are my sole responsibility.