Anonymizing datasets for machine learning


All of us are familiar with the idea of anonymizing datasets to get rid of personally identifiable information, in order to enable data mining while preserving (as much as possible) the privacy of the people whose data was collected. The basic idea is to modify names, ID numbers (such as the Social Security Number in the USA), home addresses, birthdays, IP addresses and similar information. Sometimes, one also needs to get rid of information about age, gender, nationality or ethnicity.

This method has been the subject of a lot of research, and relevant papers and articles are easy to find with the help of search engines. See Bibliography for examples.

However, there is also another transformation of datasets. Unlike anonymization, as described above, this transformation is not about privacy preservation. It is about hiding the nature of the data being processed. Lacking a better term, we’ll use the term anonymization also for this transformation.

One possible application for this kind of anonymization is when one develops a revolutionary model for predicting the future behavior of the stock exchanges of the world by following various economic indicators and other publicly available time-dependent data sources.

In such an endeavor, the developer typically has gathered a lot of data, and wants to use it to train his revolutionary machine learning model. Since he cannot afford to build his own data center, he rents a lot of computing power in one of the cloud providers.

However, he does not want to take the risk of an unscrupulous employee of the cloud provider stealing his secret data or model and using it for his own benefit. He also wants to reduce the damage if a black hat hacker breaks into his rented computers.

Some users might want to process information that interests governments, such as the Chinese government. Those governments have the resources to break into cloud computers.

The classical way to mitigate such risks is to encrypt/recode/scramble (henceforth, I’ll refer to all those operations as encryption) the data being uploaded to the cloud. However, this encryption must be done in such a way that the data is still suitable for training the model. In addition, when running the model for making a prediction, the raw model’s results need to be generated in an encrypted form, for decryption in the developer’s on-premises computer/s (to which I will refer as a workstation henceforth). From this point on, we’ll use the terms anonymization and encryption interchangeably.

When looking for relevant research on the second kind of anonymization, I did not easily find relevant information. It motivated me to write this article.


The following symbols are described in order of their appearance in text.

    • M: the transfer function of a machine learning system.
    • A: the argument of M – the data used by a machine learning system to make a prediction.
    • a_j: the j^{th} element of A.
    • P: the value of M(A) i.e. the prediction that the machine learning system makes when presented with data A.
    • p_k: the k^{th} element of P.
    • I: the identity function. For all x, I(x) = x.
    • F^{-1}(x) is the inverse of F(x), for any function F(x): for all relevant x, F^{-1}(F(x)) \equiv x \equiv F(F^{-1}(x)).
    • Functional composition: for all relevant x, (F_1 \circ F_2)(x) \equiv F_1(F_2(x)). For example, F^{-1} \circ F \equiv I \equiv F \circ F^{-1}.
    • E_a(A): a function which encrypts the argument A. Its inverse is denoted by E^{-1}_a(A'), which decrypts A', an encrypted version of the argument A.
    • D_p(P'): a function which decrypts the encrypted prediction P'. Its inverse is denoted by D^{-1}_p(P), which encrypts the prediction P.

Architecture of machine learning systems

A machine learning system is used to approximate a function M, which makes a prediction (or classification or whatever) P, given the tuple A which packs together several argument values:

\displaystyle{}P = M(A)

where

\displaystyle{}A = (a_1, a_2, \ldots, a_m)

is the argument, and

\displaystyle{}P = (p_1, p_2, \ldots, p_n)

is the prediction.

The values a_j of the argument and p_k of the prediction can be of any data type and they are not limited to scalars. This is why a tuple notation is used rather than a vector notation.

Examples of machine learning system applications:

  • Picture classification. When presented with a picture of an animal, the system would tell how likely the animal is to be a dog, a cat or a horse. The system is trained by presenting it with several pictures, each together with a label identifying the animal shown in the picture.
  • Prediction of the next few values of a time series, such as the numbers which describe the weather at a particular location. The system is trained by using relevant historical information.

Machine learning systems are sometimes implemented using neural networks. Neural networks have the property that a sufficiently large neural network can be trained to approximate any function that meets certain reasonable conditions.

A machine learning system is trained to implement a good approximation of the function M by processing several 2-tuples (A_i, P_i), each of which associates a prediction, i.e. the desired value of the function (usually an n-tuple), with the corresponding argument value (usually an m-tuple).

The training process is very computationally intensive, so people often resort to cloud computing facilities, as said above.

Architecture of anonymized machine learning systems

When a user does not want to let the cloud provider know what he is doing, one possible approach is to train the model using encrypted data streams, so that the model’s outputs are encrypted as well. The data streams are encrypted on the user’s workstation. The workstation is also used to decrypt the model’s predictions.

The whole system can be described using the following formulae.

Original system:

\displaystyle{}P = M(A)

We add identity functions before and after M:

\displaystyle{}P = I \circ M \circ I(A) = I(M(I(A)))

The next step is to decompose the identity functions into pairs of a function and its inverse. The functions being used perform encryption and decryption.

\displaystyle{}P = (D_p \circ D_p^{-1}) \circ M \circ (E_a^{-1} \circ E_a(A))

where E_a(A) encrypts the argument A and D_p(P') decrypts the prediction P'.

Now we rearrange parentheses as follows:

\displaystyle{}P = D_p \circ (D_p^{-1} \circ M \circ E_a^{-1}) \circ E_a(A)

Now the system can be decomposed into three parts, which perform the following operations:

  1. Encrypt the argument A: \displaystyle{}A' = E_a(A)
  2. Actual encrypted machine learning system: \displaystyle{}P' = D_p^{-1} \circ M \circ E_a^{-1}(A') = M'(A')
  3. Decrypt the encrypted prediction P': \displaystyle{}P = D_p(P')

where A' and P' are the encrypted argument and prediction respectively.
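
The three-part decomposition above can be illustrated with a toy example. In the sketch below, the functions M, E_a and D_p are hypothetical stand-ins, chosen only because their inverses are easy to write down. In a real system M' would be learned from encrypted training pairs rather than computed from M, so this only checks that the algebra composes as claimed.

```python
# Toy illustration of P = D_p(M'(E_a(A))) with M' = D_p^{-1} o M o E_a^{-1}.
# All functions here are hypothetical stand-ins, not part of any real system.

def M(a):
    """The 'true' model: maps a 2-tuple argument to a 1-tuple prediction."""
    return (3 * a[0] + a[1],)

def E_a(a):
    """Encrypt the argument (runs on the workstation)."""
    return (2 * a[0] + 5, a[1] - 7)

def E_a_inv(a_enc):
    """Inverse of E_a."""
    return ((a_enc[0] - 5) / 2, a_enc[1] + 7)

def D_p(p_enc):
    """Decrypt the prediction (runs on the workstation)."""
    return ((p_enc[0] - 1) / 4,)

def D_p_inv(p):
    """Inverse of D_p, i.e. encrypt a prediction."""
    return (4 * p[0] + 1,)

def M_prime(a_enc):
    """The function that the cloud-side model has to approximate."""
    return D_p_inv(M(E_a_inv(a_enc)))

A = (10.0, 4.0)
A_enc = E_a(A)            # step 1, on the workstation
P_enc = M_prime(A_enc)    # step 2, on the cloud
P = D_p(P_enc)            # step 3, on the workstation
```

The cloud side only ever sees `A_enc` and `P_enc`, while the workstation recovers the same `P` that `M(A)` would have produced.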

The functions E_a(A) and D_p(P') need to be invertible, as their inverses are part of the function being approximated by the machine learning model M', which is the second part of the system, and is the one actually run on the cloud provider’s computers.

The first and third parts are implemented on the user’s workstation. The typical implementation relies upon keys and scrambling formulae.

Two more requirements are:

  • The machine learning model P' = M'(A') is to be implemented using a technology that is sufficiently sophisticated to also embed nonlinear and invertible functions in the loss function used to evaluate it.
  • There is sufficient training and validation data to train the model, which embeds nonlinear invertible functions.

Types of data

When dealing with anonymization of data, one has to consider separately each of the following data types.

  • Variable names
  • Numerical variables
  • Ordinal variables
  • Categorical variables
  • Time based variables

Variable names

Variable names are used for naming the various variables which are part of the argument and prediction of the machine learning model. They are used for inspecting the argument’s data streams and for retrieving relevant parts of the model’s prediction.

Of course, the cloud provider should not be exposed to the true names of the variables.

Variable names can be converted into meaningless strings, for example by using standard password scrambling algorithms, such as salt+md5sum.

The user’s workstation would have tables for mapping between the true variable names and the names used by the model and databases in the cloud.
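
A minimal sketch of such a mapping table follows. It swaps in a salted SHA-256 digest from Python's `hashlib` in place of the salt+md5sum mentioned above. The variable names, the `v_` prefix and the 16-character truncation are assumptions made purely for illustration.

```python
import hashlib
import secrets

def scramble_names(names, salt):
    """Return {true_name: opaque_name} using a salted SHA-256 digest."""
    table = {}
    for name in names:
        digest = hashlib.sha256(salt + name.encode("utf-8")).hexdigest()
        # Truncated for readability; the prefix keeps the result a valid identifier.
        table[name] = "v_" + digest[:16]
    return table

# The salt and both tables stay on the workstation.
salt = secrets.token_bytes(16)
true_names = ["closing_price", "trade_volume", "interest_rate"]  # hypothetical
to_cloud = scramble_names(true_names, salt)
from_cloud = {opaque: true for true, opaque in to_cloud.items()}
```

Only the values of `to_cloud` would be used as column names in the cloud; `from_cloud` lets the workstation translate the model's output back.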

Numerical variables

Numerical variables can be transformed using invertible functions.

Also, if the argument A has several numerical elements (including time based elements), one could treat them as a single vector and transform it using an invertible matrix.

Mathematically, it could look as follows:

\vec {A_v'} = E_{av}(\vec {A_v}) = E_{matrix} \vec {A_v}

where

  • \vec {A_v} is the restriction of A to numerical variables.
  • \vec {A_v'} is the encrypted version of \vec {A_v}.
  • E_{av} is the argument’s encryption function, restricted to numerical elements of the argument A.
  • E_{matrix} is an invertible transformation matrix.

Invertible scalar functions could be applied to the elements of \vec {A_v} before and after the matrix transformation.

If the argument also has an element that is a categorical variable, one could use a different transformation for each value of the categorical variable.
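
The matrix transformation described in this section, together with a scalar transformation applied before it, might look like the following sketch. The 2x2 matrix, the choice of a logarithm as the scalar function and the sample values are all assumptions for illustration. The sketch makes no claim about how hard the transformation is to reverse without the key.

```python
import math

# Hypothetical 2x2 key matrix; any matrix with a nonzero determinant works.
E_MATRIX = ((2.0, 1.0),
            (1.0, 1.0))

def invert_2x2(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    return ((d / det, -b / det),
            (-c / det, a / det))

def mat_vec(m, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1])

def encrypt(v):
    """Scalar pre-transform (log, so inputs must be positive), then the matrix."""
    return mat_vec(E_MATRIX, tuple(math.log(x) for x in v))

def decrypt(v_enc):
    """Undo the matrix transform, then the scalar transform (exp)."""
    return tuple(math.exp(x) for x in mat_vec(invert_2x2(E_MATRIX), v_enc))

A_v = (120.5, 3.2)           # hypothetical positive-valued numerical variables
A_v_enc = encrypt(A_v)       # what would be uploaded to the cloud
A_v_back = decrypt(A_v_enc)  # round trip, on the workstation
```

The round trip recovers the original values up to floating-point error, which is why the comparison should use a tolerance rather than exact equality.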

Ordinal variables

The values of the ordinal variables could be permuted. The learning model will implicitly embed the inverse permutation.
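
As a small sketch of this idea, the ordinal codes can be remapped through a secret permutation table kept on the workstation. The scale, the sample column and the use of a seeded `random.Random` are assumptions for illustration. A real deployment would want a cryptographically strong source of randomness and would store the table itself.

```python
import random

levels = ["low", "medium", "high", "very_high"]   # hypothetical ordinal scale
codes = list(range(len(levels)))                   # natural codes 0..3

rng = random.Random(42)          # stands in for a secret key (not secure)
permuted = codes[:]
rng.shuffle(permuted)

encode = dict(zip(codes, permuted))                # natural code -> cloud code
decode = {v: k for k, v in encode.items()}         # cloud code -> natural code

column = [0, 2, 3, 1, 2]                           # data in natural codes
column_enc = [encode[c] for c in column]           # uploaded to the cloud
column_back = [decode[c] for c in column_enc]      # recovered on the workstation
```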

Categorical variables

Shuffling categories is not enough, because categories could be identified by their frequencies (like application of Zipf’s law to decrypting substitution ciphers).

The following approach is probably not universally applicable.

Categories could be anonymized by splitting a frequently occurring category into several subcategories. The learning model will give a different prediction for each subcategory. The different predictions will have to be somehow combined in the user’s workstation.

This approach also requires the model to be formulated in such a way that the final prediction can be derived by combining the predictions corresponding to the subcategories of split categories.
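
The splitting step could be sketched as follows. The column contents, the `#` separator used to name subcategories and the uniform random assignment are assumptions for illustration. The sketch covers only the splitting of the input data and the mapping back to parent categories, not the combining of predictions, which depends on how the model is formulated.

```python
import random
from collections import Counter

def split_category(values, category, parts, rng):
    """Replace each occurrence of `category` with one of `parts`
    subcategories, chosen uniformly at random."""
    subs = [f"{category}#{i}" for i in range(parts)]
    return [rng.choice(subs) if v == category else v for v in values]

def merge_category(values):
    """Map subcategories back to their parent (workstation side)."""
    return [v.split("#", 1)[0] for v in values]

rng = random.Random(0)
# Hypothetical skewed column: "A" is far more frequent than "B" and "C".
column = ["A"] * 60 + ["B"] * 20 + ["C"] * 20
split = split_category(column, "A", 3, rng)

before = Counter(column)   # frequencies visible without splitting
after = Counter(split)     # frequencies visible to the cloud after splitting
```

After the split, the subcategory labels would still be replaced by meaningless names before upload, as with variable names.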

Time based variables

When anonymizing time based variables, one needs to transform the argument to hide any dependence it has upon weekly, monthly, seasonal or yearly cycles. One also needs to hide dependencies upon well-known events, such as volcanic eruptions or rising CO_2 concentration in the air.

Otherwise, it would be possible to identify dates by looking for correlations with well-known timings.

One possible way to hide those dependencies is to apply an ARIMA forecasting model to the argument.
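
The sketch below does not fit an ARIMA model. It shows only seasonal differencing, the step that seasonal ARIMA variants use to remove a fixed-period cycle, as a much simpler stand-in for the idea. The series, the period of 7 and the function names are assumptions for illustration.

```python
def seasonal_difference(series, period):
    """Return (head, diffs): the first `period` values, kept on the
    workstation, and the differences x[t] - x[t - period] for the rest."""
    head = list(series[:period])
    diffs = [series[t] - series[t - period] for t in range(period, len(series))]
    return head, diffs

def seasonal_restore(head, diffs, period):
    """Invert seasonal_difference."""
    series = list(head)
    for d in diffs:
        # series[-period] is the value one full period before the new one.
        series.append(series[-period] + d)
    return series

# Hypothetical series: a weekly pattern (period 7) plus a slow upward trend.
weekly = [5, 3, 2, 2, 4, 8, 9]
series = [weekly[t % 7] + t // 7 for t in range(28)]

head, diffs = seasonal_difference(series, 7)   # diffs would go to the cloud
restored = seasonal_restore(head, diffs, 7)    # recovered on the workstation
```

In this toy series the weekly pattern cancels out completely and only the constant week-over-week increase remains in `diffs`. Real data would leave residual structure that a fitted model would have to handle.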


The following articles are about getting rid of personally-identifiable information in order to preserve privacy.

        • Generalization.
        • Perturbation.
        The methods proposed by this article could interfere with machine learning, except for sufficiently small perturbations.
        • Suppression of fields.
        • Generation of semantically valid artificial data (such as strings). There is a Python module – Faker – which is good for faking names, addresses and random (lorem ipsum) text.
        • The methods, mentioned in this article, cannot anonymize numeric data.
        Anonymization of data such as usernames, IP addresses, domain names.
        Human data cannot really be anonymized.
        Several methods for anonymizing data such as identifying information of humans, IP addresses, etc:

        • Hiding
        • Hashing
        • Permutation
        • Shift
        • Enumeration
        • Truncation
        • Prefix-preserving
        Usage of MapReduce to anonymize data.


      After finishing the first draft of this post, I was informed of the following.


      I wish to thank Boris Shtrasman for reviewing a draft of this post and providing a lot of feedback. Of course, any remaining inaccuracies in this post are my sole responsibility.

Author: Omer Zak

I am deaf since birth. I played with big computers which eat punched cards and spew out printouts since age 12. Ever since they became available, I work and play with desktop size computers which eat keyboard keypresses and spew out display pixels. Among other things, I developed software which helped the deaf in Israel use the telephone network, by means of home computers equipped with modems. Several years later, I developed Hebrew localizations for some cellular phones, which helped the deaf in Israel utilize the cellular phone networks. I am interested in entrepreneurship, Science Fiction and making the world more accessible to people with disabilities.

One thought on “Anonymizing datasets for machine learning”

  1. First, I would like to thank you for the kind reference. I would like to point out that I have not verified the correctness of your statements (mathematically speaking). Nor do I agree with all the aspects raised in this article. Some of my objections have also been raised by the authors in your bibliography.

    Regarding the term anonymization, I strongly believe that we as an industry should not use phrases and terms such as anonymization (or de-identification). I say that because I believe that if we can extract information from the dataset, or have a symmetric relation to a degree, then the data was not anonymized or de-identified.

    As a counterexample to the success of anonymization, we can take a shuffling method used during the 20th century to check for STDs among GIs (which may be a myth), where swabs and samples from a group of GIs were mixed together to see whether anyone in the group was sick. If positive results were found, they would start treatment. This is similar to what we do in categorization, where we select some features from a dataset. But in both cases we have a way to come close enough to the subject or the group.

    In more recent cases, such as this, a large dataset has been re-identified. That resource has an important quote, “Data can be either useful or perfectly anonymous but never both.”, which I strongly agree with.

    When we provide a substitution for a feature that does not have enough errors, we have not made the data more anonymized; we just use a different ID. If I replace the name “Chesla Manning” by 6c90d243392db6b57ff4ad352fe099d0, it does not become any less intrusive. In many cases when de-identification is applied, the subjects and policy makers believe that the identifying information has been removed and that the data does not hold private information, but in reality there is still a way to get that data and perform re-identification to some degree.

    A different subject is the use of a threat model that is mostly against illicit actors. In these days of data retention practices, I think we should also consider actors that have full rights to the data or computation facility, for instance a service provider which may reuse that data, or a subcontractor who gets user X does a study and got this network. Sometimes we do not even consider metadata as a revealing dataset, but in reality that data is the new oil. This is said without considering the Ghosts model, which may be implemented in the future by some computing service providers.
