## Preface

All of us are familiar with the idea of anonymizing datasets to get rid of personally-identifiable information, in order to enable data mining while preserving (as much as possible) the privacy of the people whose data was collected. The basic idea is to modify names, ID numbers (the Social Security Number in the USA), home addresses, birthdays, IP addresses and similar information. Sometimes one also needs to remove information about age, gender, nationality or ethnicity.

This method has been the subject of a lot of research, and relevant papers and articles are easy to find with the help of search engines. See the Bibliography for examples.

However, there is also another transformation of datasets. Unlike anonymization, as described above, this transformation is not about privacy preservation. It is about hiding the nature of the data being processed. Lacking a better term, we’ll use the term anonymization also for this transformation.

One possible application for this kind of anonymization is when one develops a revolutionary model for predicting the future behavior of the stock exchanges of the world by following various economic indicators and other publicly available time-dependent data sources.

In such an endeavor, the developer typically has gathered a lot of data, and wants to use it to train his revolutionary machine learning model. Since he cannot afford to build his own data center, he rents a lot of computing power in one of the cloud providers.

However, he does not want to take the risk of an unscrupulous employee of the cloud provider stealing his secret data or model and using it for his own benefit. He also wants to reduce the damage if a black hat hacker breaks into his rented computers.

Some users might want to process information that is of interest to governments, such as the Chinese government. Such governments have the resources to break into cloud computers.

The classical way to mitigate such risks is to encrypt/recode/scramble (henceforth, I’ll refer to all those operations as encryption) the data being uploaded to the cloud. However, this encryption must be done in such a way that the data is still suitable for training the model. In addition, when running the model for making a prediction, the raw model’s results need to be generated in an encrypted form, for decryption in the developer’s on-premises computer/s (to which I will refer as a workstation henceforth). From this point on, we’ll use the terms anonymization and encryption interchangeably.

When looking for relevant research on the second kind of anonymization, I did not easily find relevant information. It motivated me to write this article.

## Glossary

The following symbols are described in order of their appearance in text.

• $M$: the transfer function of a machine learning system.
• $A$: the argument of $M$ – the data used by a machine learning system to make a prediction.
• $a_j$: the $j^{th}$ element of $A$.
• $P$: the value of $M(A)$ i.e. the prediction that the machine learning system makes when presented with data $A$.
• $p_k$: the $k^{th}$ element of $P$.
• $I$: the identity function. For all $x$, $I(x) = x$.
• $F^{-1}(x)$ is the inverse of $F(x)$, for any function $F(x)$: for all relevant $x$, $F^{-1}(F(x)) \equiv x \equiv F(F^{-1}(x))$.
• Functional composition: for all relevant $x$, $(F_1 \circ F_2)(x) \equiv F_1(F_2(x))$. For example, $F^{-1} \circ F \equiv I \equiv F \circ F^{-1}$.
• $E_a(A)$: a function which encrypts the argument $A$. Its inverse is denoted by $E^{-1}_a(A')$, which decrypts $A'$, an encrypted version of the argument $A$.
• $D_p(P')$: a function which decrypts the encrypted prediction $P'$. Its inverse is denoted by $D^{-1}_p(P)$, which encrypts the prediction $P$.

## Architecture of machine learning systems

A machine learning system is used to approximate a function $M$, which makes a prediction (or classification or whatever) $P$, given the tuple $A$ which packs together several argument values:

$\displaystyle{}P = M(A)$

where:

$\displaystyle{}A = (a_1, a_2, \ldots, a_m)$

is the argument, and

$\displaystyle{}P = (p_1, p_2, \ldots, p_n)$

is the prediction.

The values $a_j$ of the argument and $p_k$ of the prediction can be of any data type and they are not limited to scalars. This is why tuple notation is used rather than vector notation.

Examples of machine learning system applications:

• Picture classification. When presented with a picture of an animal, the system would tell how likely the animal is to be a dog, a cat or a horse. The system is trained by presenting it with several pictures, each together with a label identifying the animal shown in the picture.
• Prediction of the next few values of a time series, such as the numbers which describe the weather at a particular location. The system is trained by using relevant historical information.

Machine learning systems are sometimes implemented using neural networks. Neural networks have the property that a sufficiently large neural network can be trained to approximate any function that meets certain reasonable conditions.

A machine learning system is trained to implement a good approximation of the function $M$ by processing several 2-tuples $(A_i, P_i)$, each of which associates a prediction, i.e. the desired value of the function (usually a tuple), with the corresponding argument value (also usually a tuple).

The training process is very computationally intensive, so people often resort to cloud computing facilities, as mentioned above.

## Architecture of anonymized machine learning systems

When a user does not want to let the cloud provider know what he is doing, one possible approach is to train the model using encrypted data streams, so that the model’s outputs are encrypted as well. The data streams are encrypted on the user’s workstation. The workstation is also used to decrypt the model’s predictions.

The whole system can be described using the following formulae.

Original system:

$\displaystyle{}P = M(A)$

We add identity functions before and after $M$:

$\displaystyle{}P = (I \circ M \circ I)(A) = I(M(I(A)))$

The next step is to decompose the identity functions into pairs of a function and its inverse. The functions being used perform encryption and decryption.

$\displaystyle{}P = \left((D_p \circ D_p^{-1}) \circ M \circ (E_a^{-1} \circ E_a)\right)(A)$

where $E_a(A)$ encrypts the argument $A$ and $D_p(P')$ decrypts the prediction $P'$.

Now we rearrange parentheses as follows:

$\displaystyle{}P = \left(D_p \circ (D_p^{-1} \circ M \circ E_a^{-1}) \circ E_a\right)(A)$

Now the system can be decomposed into three parts, which perform the following operations:

1. Encrypt the argument $A$: $\displaystyle{}A' = E_a(A)$
2. Actual encrypted machine learning system: $\displaystyle{}P' = (D_p^{-1} \circ M \circ E_a^{-1})(A') = M'(A')$
3. Decrypt the encrypted prediction $P'$: $\displaystyle{}P = D_p(P')$

where $A'$ and $P'$ are the encrypted argument and prediction respectively.
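The three-part decomposition can be checked on a toy example. The following is only a sketch: the model `M` and the affine scrambling functions are hypothetical stand-ins chosen for readability, not realistic encryption. It also shows how the training pairs would be encrypted on the workstation before upload, assuming the plain pairs $(A_i, P_i)$ are available there.

```python
from fractions import Fraction

def M(a):
    """Hypothetical plain model: maps an argument tuple to a prediction tuple."""
    x, y = a
    return (x + y, x * y)

def E_a(a):
    """Encrypt the argument (toy invertible affine scramble, assumed key)."""
    x, y = a
    return (3 * x + 7, 5 * y - 2)

def E_a_inv(a_enc):
    """Inverse of E_a."""
    x, y = a_enc
    return (Fraction(x - 7, 3), Fraction(y + 2, 5))

def D_p_inv(p):
    """Encrypt the prediction (inverse of the decryption function D_p)."""
    s, t = p
    return (2 * s - 1, 4 * t + 9)

def D_p(p_enc):
    """Decrypt the encrypted prediction."""
    s, t = p_enc
    return (Fraction(s + 1, 2), Fraction(t - 9, 4))

def M_prime(a_enc):
    """The function that the cloud-side model has to approximate."""
    return D_p_inv(M(E_a_inv(a_enc)))

# Workstation side: encrypt the training pairs before uploading them.
training_args = [(2, 4), (1, 3), (0, 5)]
encrypted_pairs = [(E_a(a), D_p_inv(M(a))) for a in training_args]

# Each encrypted pair is a sample of M', and decrypting M' recovers M.
for a, (a_enc, p_enc) in zip(training_args, encrypted_pairs):
    assert M_prime(a_enc) == p_enc
    assert D_p(M_prime(a_enc)) == M(a)
```

In a real system `M_prime` would not be computed from `M`; it would be learned in the cloud from `encrypted_pairs` alone.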

The functions $E_a(A)$ and $D_p(P')$ need to be invertible, as their inverses are part of the function being approximated by the machine learning model $M'$, which is the second part of the system, and is the one actually run on the cloud provider’s computers.

The first and third parts are implemented on the user’s workstation. The typical implementation relies upon keys and scrambling formulae.

Two more requirements are:

• The machine learning model $P' = M'(A')$ must be implemented using a technology that is sophisticated enough to also embed nonlinear invertible functions in the loss function used to evaluate it.
• There must be sufficient training and validation data to train a model that also embeds nonlinear invertible functions.

## Types of data

When dealing with anonymization of data, one has to consider separately each of the following data types.

• Variable names
• Numerical variables
• Ordinal variables
• Categorical variables
• Time based variables

### Variable names

Variable names are used for naming the various variables which are part of the argument and prediction of the machine learning model. They are used for inspecting the argument’s data streams and for retrieving relevant parts of the model’s prediction.

Of course, the cloud provider should not be exposed to the true names of the variables.

Variable names can be converted into meaningless strings, for example by using standard password scrambling algorithms such as salt+md5sum.

The user’s workstation would have tables for mapping among the true variable names and the names used by the model and databases in the cloud.
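A minimal sketch of such a mapping follows. It uses a keyed SHA-256 hash (HMAC) from the Python standard library instead of salt+md5sum, since MD5 is generally considered weak; this swap, the function names and the sample variable names are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets

def scramble_name(name: str, salt: bytes) -> str:
    """Return a meaningless but stable identifier for a variable name."""
    digest = hmac.new(salt, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return "v_" + digest

def build_name_tables(names, salt):
    """Build the two lookup tables that stay on the workstation."""
    to_cloud = {name: scramble_name(name, salt) for name in names}
    from_cloud = {scrambled: name for name, scrambled in to_cloud.items()}
    return to_cloud, from_cloud

salt = secrets.token_bytes(16)  # secret key; never uploaded to the cloud
to_cloud, from_cloud = build_name_tables(["usd_eur_rate", "oil_price"], salt)

cloud_name = to_cloud["oil_price"]   # name used in the cloud-side data
true_name = from_cloud[cloud_name]   # recovered on the workstation
```

Only the scrambled names are uploaded; the salt and both tables remain on the workstation.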

### Numerical variables

Numerical variables can be transformed using invertible functions.

Also, if the argument $A$ has several numerical elements (including time based elements), one could treat them as a single vector and transform it using an invertible matrix.

Mathematically, it could look as follows:

$\vec {A_v'} = E_{av}(\vec {A_v}) = E_{matrix} \vec {A_v}$

where:

• $\vec {A_v}$ is the restriction of $A$ to numerical variables.
• $\vec {A_v'}$ is the encrypted version of $\vec {A_v}$.
• $E_{av}$ is the argument’s encryption function, restricted to numerical elements of the argument $A$.
• $E_{matrix}$ is an invertible transformation matrix.

Invertible scalar functions could be applied to the elements of $\vec {A_v}$ before and after the matrix transformation.

If the argument also has an element that is a categorical variable, one could use a different transformation for each value of the categorical variable.
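The matrix transformation of the numerical part of the argument can be sketched in plain Python. The 3×3 matrix below is a hypothetical key, and exact rational arithmetic is used only to keep the round trip exact in this small example; with real floating-point data one would presumably use a numerical library and a well-conditioned matrix.

```python
from fractions import Fraction

def invert(matrix):
    """Invert a square matrix exactly by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    rows = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
            for i, row in enumerate(matrix)]
    for col in range(n):
        # Find a row with a non-zero pivot; fail if the matrix is singular.
        pivot = next((r for r in range(col, n) if rows[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is not invertible")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [row[n:] for row in rows]

def apply(matrix, vector):
    """Matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

E_matrix = [[2, 1, 0],
            [1, 3, 1],
            [0, 1, 4]]                 # hypothetical secret key (determinant 18)
E_matrix_inv = invert(E_matrix)        # kept on the workstation

A_v = [10, 20, 30]                     # numerical part of the argument
A_v_enc = apply(E_matrix, A_v)         # uploaded to the cloud
A_v_dec = apply(E_matrix_inv, A_v_enc) # recovers the original values
```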

### Ordinal variables

The values of the ordinal variables could be permuted. The learning model will implicitly embed the inverse permutation.
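As a sketch, a secret permutation of the ordinal levels could be generated and applied as follows. The rating scale and the seed are hypothetical, and Python's `random` module is used only for illustration; a real key would have to come from a cryptographically secure source.

```python
import random

def make_permutation(levels, seed):
    """Return (encrypt, decrypt) lookup tables for the given ordinal levels."""
    shuffled = list(levels)
    random.Random(seed).shuffle(shuffled)
    encrypt = dict(zip(levels, shuffled))
    decrypt = {code: level for level, code in encrypt.items()}
    return encrypt, decrypt

levels = [1, 2, 3, 4, 5]                 # e.g. a hypothetical 1-5 rating scale
encrypt, decrypt = make_permutation(levels, seed=2024)

column = [3, 5, 1, 1, 4]
column_enc = [encrypt[v] for v in column]      # uploaded to the cloud
column_dec = [decrypt[v] for v in column_enc]  # recovered on the workstation
```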

### Categorical variables

Shuffling categories is not enough, because categories could be identified by their frequencies (like application of Zipf’s law to decrypting substitution ciphers).

The following approach is probably not universally applicable.

Categories could be anonymized by splitting a frequently occurring category into several subcategories. The learning model will give a different prediction for each subcategory. The different predictions will have to be somehow combined in the user’s workstation.

This approach also requires the model to be formulated in such a way that the final prediction can be derived by combining the predictions corresponding to the subcategories of split categories.
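The splitting step might look like the sketch below. Each category receives a number of opaque integer codes roughly proportional to its frequency, so that the code frequencies are flatter than the category frequencies. The `target` parameter, the sample data and the seed are assumptions, and combining the per-subcategory predictions is deliberately not shown.

```python
import random
from collections import Counter

def build_codes(counts, target, rng):
    """Assign opaque integer codes; frequent categories get several codes."""
    pairs = [(cat, k) for cat, n in counts.items()
             for k in range(max(1, -(-n // target)))]  # ceil(n / target) codes
    rng.shuffle(pairs)  # so that code numbers do not reveal the grouping
    to_codes, from_code = {}, {}
    for code, (cat, _) in enumerate(pairs):
        to_codes.setdefault(cat, []).append(code)
        from_code[code] = cat
    return to_codes, from_code

def encode(values, to_codes, rng):
    """Replace each value by one of its subcategory codes, chosen at random."""
    return [rng.choice(to_codes[v]) for v in values]

values = ["a"] * 90 + ["b"] * 30 + ["c"] * 10
rng = random.Random(7)                       # hypothetical secret seed
to_codes, from_code = build_codes(Counter(values), target=10, rng=rng)

encoded = encode(values, to_codes, rng)      # uploaded to the cloud
decoded = [from_code[c] for c in encoded]    # recovered on the workstation
```

With these numbers, category `a` is split into nine codes, `b` into three, and `c` keeps a single code, so each code is expected to occur about ten times.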

### Time based variables

When anonymizing time based variables, one needs to transform the argument to hide any dependence it has upon weekly, monthly, seasonal or yearly cycles. One also needs to hide dependencies upon well-known events, such as volcanic eruptions or the rising $CO_2$ concentration in the air.

Otherwise, it would be possible to identify dates by looking for correlations with well-known timings.

One possible way to hide those dependencies is to apply an ARIMA forecasting model to the argument.
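The sketch below does not fit an ARIMA model. It shows only seasonal differencing, which is the kind of transformation behind the "integrated" part of (seasonal) ARIMA models, as one invertible way to remove a fixed-period cycle. Whether this matches the intended use of ARIMA here is an assumption, and the weekly pattern in the example is invented.

```python
def seasonal_difference(series, period):
    """Return (head, diffs): the first `period` values and the lag-`period` differences."""
    head = list(series[:period])
    diffs = [series[i] - series[i - period] for i in range(period, len(series))]
    return head, diffs

def undo_seasonal_difference(head, diffs, period):
    """Rebuild the original series from the head and the differences."""
    series = list(head)
    for d in diffs:
        series.append(series[-period] + d)
    return series

# Hypothetical daily series: a weekly pattern on top of a linear trend.
weekly = [5, 1, 1, 2, 3, 8, 9]
series = [weekly[i % 7] + i for i in range(21)]

head, diffs = seasonal_difference(series, period=7)   # diffs go to the cloud
restored = undo_seasonal_difference(head, diffs, period=7)
```

In this example the weekly pattern disappears from `diffs`, which contains only the constant weekly increase of the trend; `head` stays on the workstation so that the transformation can be inverted.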

## Bibliography

The following articles are about getting rid of personally-identifiable information in order to preserve privacy.

1. https://en.wikipedia.org/wiki/Data_anonymization
• Generalization.
• Perturbation.
2. http://blog.datasift.com/2015/04/09/techniques-to-anonymize-human-data/
The methods proposed by this article could interfere with machine learning, except for sufficiently small perturbations.
3. https://www.elastic.co/blog/anonymize-it-the-general-purpose-tool-for-data-privacy-used-by-the-elastic-machine-learning-team
• Suppression of fields.
• Generation of semantically valid artificial data (such as strings). There is a Python module – Faker – which is good for faking names, addresses and random (lorem ipsum) text.
4. https://docs.splunk.com/Documentation/Splunk/7.2.3/Troubleshooting/AnonymizedatasamplestosendtoSupport
5. https://www.oreilly.com/ideas/anonymize-data-limits
Human data cannot really be anonymized.
6. https://www.intel.co.kr/content/dam/www/public/us/en/documents/best-practices/enhancing-cloud-security-using-data-anonymization.pdf
Several methods for anonymizing data such as identifying information of humans, IP addresses, etc:

• Hiding
• Hashing
• Permutation
• Shift
• Enumeration
• Truncation
• Prefix-preserving
7. https://ieeexplore.ieee.org/abstract/document/6470603
Usage of MapReduce to anonymize data.

After finishing the first draft of this post, I was informed of the following.

## Credits

I wish to thank Boris Shtrasman for reviewing a draft of this post and providing a lot of feedback. Of course, any remaining inaccuracies in this post are my sole responsibility.

## Security and Obscurity

If you do not know the password but know how to use the password to gain access to something that was secured using this password, then this is security by obscurity.

On the other hand, if you know the password but do not know how to use the password, then this is obscurity by security.

(Sources of inspiration: The Butterfly Dream; Category Theory’s reversal of arrows.)

## Android unit testing and Mazer Rackham

Mazer Rackham (“Ender’s Game”) said: There is no teacher but the enemy.
I was reminded of this during my war of building unit tests for an Android application in the testing environment of API 24 and later.

## How to manage money before registration of the Hamakor nonprofit

Hamakor, an Israeli nonprofit for Free Software and Open Code, was founded in 2003, and one of its goals is to help deal with money when organizing various activities around Free Software.

Before founding Hamakor, people resorted to other solutions. In one exhibition, we wanted to distribute Linux installation CD-ROMs to the crowd. However, we wanted those CD-ROMs to reach only people who have a serious intent to install Linux. A way to ensure it was to require them to donate money in exchange for their CD-ROM. However, if we collect money, we must issue receipts and do bookkeeping.

Truly Yours had the brilliant idea of inviting other nonprofits to send a representative to collect the donations and issue receipts. It was a win-win-win solution. We get someone to issue receipts and sweat over the bookkeeping. The nonprofits get money for their operations. The people, who paid for their CD-ROMs, treat them seriously.

One of the nonprofits was the Association of the Deaf in Israel.

## The 2010’s John Galt project – Artificial Photosynthesis

The 2010’s John Galt project – the most important technology to be developed this decade:

Artificial photosynthesis.

Technology for using solar energy to bind carbon dioxide with hydrogen from water, and convert them into hydrocarbon fuels.

The technology will have the following benefits:

1. Solve the problem of global warming, thanks to removal of carbon dioxide from the atmosphere.
2. Make it possible to power everything using solar energy: directly during daytime, and indirectly (by burning the hydrocarbon fuels) at night and in bad weather.

The biggest drawback: a drastic change in the economic power structure worldwide, because countries and companies with coal, oil or gas deposits would no longer have power over their clients.

## A proposed new language construct: do/undo

There are several cases in which a program has to do operations A, B and C. If an error occurs while doing A, the program needs to undo A. If there is an error while doing B, the program needs to undo B and then A. Similarly, an error in C requires undoing C, B and A, in this sequence.
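In existing languages, this behavior can be approximated with library support. The following sketch uses Python's `contextlib.ExitStack`; it is not the proposed language construct itself, and the function name `run_with_undo` is invented for illustration.

```python
from contextlib import ExitStack

def run_with_undo(steps):
    """Run (do, undo) pairs in order; on failure, undo in reverse order."""
    with ExitStack() as stack:
        for do, undo in steps:
            stack.callback(undo)  # registered first, so a failing step is undone too
            do()
        stack.pop_all()           # success: discard the pending undo actions

log = []

def fail():
    raise RuntimeError("C failed")

steps = [
    (lambda: log.append("do A"), lambda: log.append("undo A")),
    (lambda: log.append("do B"), lambda: log.append("undo B")),
    (fail, lambda: log.append("undo C")),
]

try:
    run_with_undo(steps)
except RuntimeError:
    pass

# log now holds: do A, do B, undo C, undo B, undo A
```

`ExitStack` runs its registered callbacks in last-in, first-out order when the `with` block is left, which gives the reverse undo sequence; `pop_all()` moves the callbacks out of the stack so that nothing is undone after a fully successful run.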

## First Contact Speculations

Given the announcements about finding more and more Earth-like planets around other suns, and potentially life-friendly regions on the satellites of Jupiter and Saturn, I would like to offer the following predictions.

• Actual First Contact has already happened. People do not discuss the distance between Roswell and Alamogordo (which is near the site of the first A-bomb explosion); it happens to be only 145 km by air.
• Since First Contact between two cultures that differ significantly in their technological sophistication is usually harmful to the less technologically advanced culture, it was decided not to publicize this fact without first preparing the Earth culture for the news.

Furthermore, it was decided that the Earth people are to be the ones to “discover” the alien culture, rather than vice versa.

• Several SciFi stories assumed that alien cultures would discover the existence of a technological Earth culture by the radiation from its first A-bomb explosions. Arthur C. Clarke’s “2001: A Space Odyssey” assumed that this would happen when a certain slab is excavated on the Moon. The Star Trek world assumed that the discovery would be made by detecting the radiation emitted by Zefram Cochrane’s first FTL flight using a warp drive system.

However, I believe that the real “Zefram Cochrane” moment would be when the Earth culture proves its ability to defend life on Earth by deflecting or destroying killer asteroids.

Hence, my prediction is that once a dangerous asteroid has been detected and successfully neutralized, it will be announced that some Earth probe ship has at last found signs of alien life.

Now I am reading the book “Visual Thinking for Design” by Colin Ware (ISBN 978-0-12-370896-0).  The book starts by describing the way the brain processes visual information.  Essentially, the brain processes it chunk by chunk, where the chunks are separated by saccades (rapid eye movements).

Since my V2V project requires finding a way to translate auditory information into visual information for processing by the brain, I’m looking also for the corresponding information about the way a brain processes auditory information.

One question which arose in my mind as I was reading the aforementioned book is as follows.

Assume a hearing (or hard of hearing) lipreader, who follows a speech by listening and uses lipreading as an auxiliary aid to filter out environmental noises and other speakers.  Given that the lipreader’s eyes perform saccades as usual, are the saccades synchronized to times at which the speaker produces vowels rather than consonants?

And would the saccades still be synchronized to vowel production periods for a deaf lipreader as well?

## How to exploit a revolutionary physical discovery?

Suppose you have the good fortune to make a stunning physical discovery.  However, you are a practical person and would like to put your discovery to a practical use.  Where would you look for problems, which your new discovery can solve?

1. Measure quantities more accurately than was possible before, or measure previously unmeasurable quantities (example: a sample/hold circuit based upon Josephson junctions).
2. Non-destructive tests.
3. Transport energy over long distances (for example, using superconductive materials).
4. Transportation, including space transportation (for example, using various levitation forces).
5. Improved electronic computers.
6. Manufacturing much stronger materials.
7. New energy sources (such as neutrino flows, if a good neutrino trap is ever developed).
8. Synthesis of novel chemical materials.
9. Separation of materials (to make very clean materials or simply to desalinate water).
10. “Molecular engineering”: construction of 3D integrated circuits (or simply, manufacturing materials having a layered structure).
11. More efficient cooling (closer to Absolute Zero and/or easier to achieve a target temperature).

## What really prevents us from making 20 thousand NIS a month?

These days there is a commotion in Israel due to a speech by Yair Lapid, the new Finance Minister, about Riki Cohen, who earns 20 thousand NIS a month but cannot make ends meet due to various expenses of living in Israel.

Many Israelis do not reach an income of 20 thousand NIS a month. Some of them nevertheless manage to make ends meet, but their monthly savings are not enough to buy an apartment within a reasonable time. Others do not succeed in making ends meet.

I propose that, instead of being angry at Yair Lapid for his failure to know the situation of the lower socioeconomic classes in Israel, each of us should think, check and inquire what really prevents him from making 20 thousand NIS a month.

Usually the reason is a low-paying profession, and the blocking factor is the inability to switch to another, better-paying profession. But there are several other blocking factors, and it would be a swell idea if someone ran a survey so that we’ll know why most Israelis cannot make 20 thousand NIS a month.

Several possible reasons:

• No funds or time for a vocational retraining course, or the course is held in a place which requires a 3-hour commute from home to the course and back each day.
• During high school years, I could not study, and after army service, while I can study, there is no time or appropriate organization.
• The rabbi or ADMO”R forbade me from studying “secular studies” lest I sin and deviate from the way of the Holy Torah.
• I am screwed because I am an Arab.
• There is a good job in my profession but there is no reasonable public transportation from my home to the workplace, and I cannot move to live near the workplace.
• My vocation is important for the community but due to some reason has a low pay (examples: primary school teacher, social worker, nurse, caregiver).
• I have children who need to be taken care of but the regular work hours in my vocation are crazy so I cannot work in my well-paying profession.
• I have a disability which causes employers not to want to give me a job even though I can do it well.
• I have a medical problem which prevents me from working.

All this is under the assumption that we are speaking about industrious people, who want to honorably support themselves, rather than be parasites living at the public’s expense. This assumption is correct for most people, but we still need to filter out the parasites.