[Note: This is a translation of an article originally in French]
Definition
Anonymization is an operation that removes or replaces information (personal data or otherwise), whatever its nature and origin, before it is transmitted to third parties (as with the Open Data movement, outsourcing, etc.). Put simply, the information is replaced by values that no longer carry any information and from which the original information cannot be recovered.
Anonymization is the most common generic term used in French. Depending on the motivation or the context, it is sometimes also called:
- De-identification or depersonalization (especially for personal data),
- In English, the main terms are: data masking, anonymization, data cloaking, data masquerading.
But why should we anonymize?
Reason for anonymization
Various motivations can lie behind an anonymization operation:
- (The most common) Protection of personal data (French “Informatique et Libertés” law of January 6, 1978, which led to the creation of the CNIL, the French GDPR supervisory authority),
- Protection of professional and ethical secrecy,
- Protection of financial and economic information,
- Protection of commercial and/or contractual information,
- Protection of strategic information (with high competitive value).
We will not go into further detail on the applicable texts and laws on privacy protection, which range from the Charter of Fundamental Rights of the European Union to the French “Informatique et Libertés” law, and also include the ISO 27001 standard as well as various conventions and directives (worldwide, European and French).
When speaking of personal data, two main families of data must be distinguished:
- Data with high identification value. For example: unique numbers (IBAN, NIR, identity card, driver’s license, license plate, phone), biometric data (DNA, fingerprints), identity photographs, etc., make it possible to identify a person uniquely.
- Data with low identification value. In general, they do not allow a person to be uniquely identified on their own. However, cross-referencing them with other data of low identification value may make this possible. Example: a first name (or a surname) alone has only a low identification value (although it does carry information). However, if associated with a small enough locality, it may be sufficient to uniquely identify a person. The Data Protection Act (article 2) furthermore clearly states that any data that makes it possible to directly or indirectly identify a natural person is regarded as personal data.
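To illustrate the cross-referencing risk described above, here is a minimal sketch that counts how many records share a given combination of values. The sample records and the function name `unique_combinations` are invented for the illustration; a real assessment would run on the actual data set and on every plausible combination of fields.

```python
from collections import Counter

# Hypothetical sample records: (first_name, locality). Invented for illustration.
records = [
    ("Marie", "Paris"),
    ("Marie", "Paris"),
    ("Jean", "Paris"),
    ("Jean", "Paris"),
    ("Marie", "Saint-Véran"),
]

def unique_combinations(rows):
    """Return the combinations of values that occur exactly once in rows."""
    counts = Counter(rows)
    return [combo for combo, n in counts.items() if n == 1]

# A first name alone singles nobody out in this sample...
first_names_only = [(first_name,) for first_name, _ in records]
print(unique_combinations(first_names_only))  # []

# ...but combined with a small locality, one record becomes unique.
print(unique_combinations(records))  # [('Marie', 'Saint-Véran')]
```

Any combination returned by the second call points to a single person, even though neither value is identifying on its own.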
However, two important questions remain:
- What should be anonymized?
- And how do we go about it?
Here are some answers.
What should be anonymized?
The answer to this first question is (apparently) simple: all information which makes it possible to directly or indirectly identify a natural person, that is to say including information obtained by cross-checking or cross-referencing with other information of the same source or from other (even external) sources. Classic examples: first name, last name, street, locality, age, date of birth, telephone numbers, email address, etc.
On a specific IT project, it is therefore necessary to carefully analyze and designate in a precise and unambiguous manner all the information that will have to be anonymized. This task, which is the responsibility of the project data manager (usually the project management or the client), is crucial because the slightest oversight can have unfortunate consequences: once the data has been outsourced, it will be too late.
How do we anonymize?
You will probably be disappointed by the answer to this second question: unfortunately, there is no miracle solution (and it is unlikely that there ever will be). This is an IT project like any other, which can of course rely on software solutions. Such solutions must however be chosen with caution, and one must be aware of their limitations. The essential steps in an anonymization project are as follows:
- Start by identifying which data sources must be anonymized (database(s), incoming and/or outgoing flow(s), document model(s), document(s), etc.),
- Choose for each source whether the anonymization should be performed on the fly or in bulk. The first (on the fly) is complex, the second (in bulk) is the most common case at the database level,
- Then identify and precisely locate (for each source) all the data to be anonymized,
- Define, for each business type of data (date, surname/first name, address, telephone number, municipality, etc.), the anonymization technique to be used (see below) and the business and technical constraints (the guarantees to be respected). There are many of them, so we will mention only the most common and relevant ones in the next section,
- Once this is done, and if the technical solution has not yet been selected, the specifications resulting from the previous points make it possible to choose that solution judiciously (that is, one able to implement all the rules defined),
- Implement the anonymization process (development of a specific tool, or configuration and development using a software solution from a vendor),
- Carry out an in-depth operational acceptance with data resembling the data found in production. This is not a usual acceptance for a classic IT project: the slightest omission of data (not anonymized), or poor quality anonymization (reversible process for example), will be irreversible once in production. Indeed, in production, the data can be consulted, copied and used for an attempt at de-anonymization, even if the process is corrected afterwards. The acceptance must therefore be carried out in a specific environment and independently of production.
This development process can obviously be done in an iterative and agile way.
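As an illustration of the bulk approach and of the acceptance step, here is a minimal sketch in Python. Every name in it (`RULES`, `anonymize_rows`, `find_leaks`, the field names and the sample row) is hypothetical, and the check is deliberately naive: it only detects a sensitive value that survived unchanged in the same field.

```python
# Hypothetical rules: one anonymization function per field to anonymize.
def nullify(_value):
    return ""

def fixed_date(_value):
    return "01/01/1900"

RULES = {"first_name": nullify, "birth_date": fixed_date}

# Fields designated as sensitive during the analysis phase (assumed list).
SENSITIVE_FIELDS = ["first_name", "last_name", "birth_date"]

def anonymize_rows(rows, rules):
    """Bulk mode: return anonymized copies, leaving other fields untouched."""
    return [
        {field: rules[field](value) if field in rules else value
         for field, value in row.items()}
        for row in rows
    ]

def find_leaks(original_rows, anonymized_rows, sensitive_fields):
    """Acceptance check: list (row index, field) where the original value survived."""
    leaks = []
    for i, (before, after) in enumerate(zip(original_rows, anonymized_rows)):
        for field in sensitive_fields:
            if before[field] and before[field] == after[field]:
                leaks.append((i, field))
    return leaks

rows = [{"first_name": "Marie", "last_name": "Durand",
         "birth_date": "12/03/1984", "order_total": 42}]
anonymized = anonymize_rows(rows, RULES)

# "last_name" was designated as sensitive but has no rule: the check reports it.
print(find_leaks(rows, anonymized, SENSITIVE_FIELDS))  # [(0, 'last_name')]
```

The forgotten rule for `last_name` is exactly the kind of oversight that the acceptance phase is meant to catch before the data leaves the organization.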
Anonymization techniques
The elementary anonymization techniques (which can be combined with each other under certain conditions), and most particularly the guarantees they can (or cannot) provide, must be well understood:
- Let’s start with the techniques that provide few (and in some cases, no) guarantees regarding anonymized data:
- Mixing of data, also called diffusion, permutation, misalignment, or even shuffling: this involves shuffling the positions of the data (shuffling the first and last names of a table containing information on customers for example). The original data is however still present in the source, only its position is shuffled,
- Dilution, also called drowning, scrambling, or obfuscation: this involves drowning the data within new data with no real meaning. Once again, the original data is still present.
- Aging (date aging) of data: this involves replacing the value of a piece of data at a given time with an older value of the same data. This is only of interest for data with a short lifecycle (and it requires keeping the history of values), which is not the case for personal data anyway.
- Next, the techniques that are generally safe (if they are properly implemented and used, of course):
- Deletion, overwriting or nullification: this involves deleting the information, or replacing it with a fixed value. Example: all first names are replaced by an empty string, and dates of birth by “01/01/1900”,
- Random replacement or randomization: this involves replacing each value with a random value (of a compatible type) unrelated to the original data. This point is important: the original data must not be used to initialize the random generator, because under certain conditions this could allow the original value to be recovered (in computing, such generators are generally not truly random but pseudo-random),
- Combination, concatenation or composition: this involves combining several values to form others. If these combinations are performed on anonymized values, it is a safe operation (example: combination of anonymized first name and last name to form an anonymized email address); otherwise, it should be avoided,
- Masking, redaction, blackening or truncating: this describes the case where certain portions of a document are blacked out. In the case of a digital document (PDF, Word or others), you have to be extremely careful: the blacked-out areas can often still be read thanks to layers (PDF, images), revision marks (Office documents), etc.,
- Hashing: this involves using a standard cryptographic hash function (SHA-2 or SHA-3), which performs a complex mathematical transformation, irreversible by design,
- Ciphering or encryption: a variation on the hashing technique that uses a standard symmetric encryption function (AES-256 minimum). It performs a complex mathematical transformation that cannot be reversed without knowledge of a secret key (which must obviously remain secret),
- Tables of substitutions, translations or correspondences: this involves replacing each value with another, unrelated one of the same nature (a name by a name, a date by a date, an amount by an amount, etc.). These tables can also establish correspondences between indices (0, 1, 2, …) and business values. Care should be taken to define specialized tables: one for first names, one for surnames, one for dates, one for municipalities, etc.
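The difference between a technique that offers no guarantee and the safer ones listed above can be sketched in a few lines of Python, using only the standard library. The sample surnames and the helper names `random_token` and `keyed_hash` are invented for the illustration. One assumption is ours and is not part of the list above: the hash is keyed with a secret (HMAC), because a plain hash of values drawn from a small set, such as surnames, could be matched simply by hashing candidate values.

```python
import hashlib
import hmac
import random
import secrets

surnames = ["Durand", "Martin", "Bernard", "Petit"]  # invented sample values

# Shuffling: positions change, but every original value is still present.
shuffled = surnames[:]
random.shuffle(shuffled)
print(sorted(shuffled) == sorted(surnames))  # True

# Random replacement: values come from the operating system's entropy source
# and are never derived from, or seeded with, the original data.
def random_token(length=8):
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return "".join(secrets.choice(alphabet) for _ in range(length))

randomized = [random_token() for _ in surnames]

# Hashing with SHA-256 (SHA-2 family), keyed with a secret via HMAC.
# Assumption: in practice the key would be stored securely, not generated here.
SECRET_KEY = secrets.token_bytes(32)

def keyed_hash(value, key=SECRET_KEY):
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

hashed = [keyed_hash(surname) for surname in surnames]
```

With the same key, the same input always gives the same hash, which is convenient when identical values must stay identical after anonymization, but it also means the key has to be protected as carefully as an encryption key.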
Of course, several elementary techniques can be combined to anonymize each type of data, although some combinations are pointless. Example: hashing followed by nullification serves no purpose, since the nullification discards the result of the hashing.
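As an example of a useful combination, the sketch below uses specialized substitution tables for the first name and the surname, then composes an email address from the already anonymized values only. The tables, the field names and the `example.com` domain are assumptions made for the illustration.

```python
import secrets

# Hypothetical specialized substitution tables: one per kind of data.
FIRST_NAMES = ["Alice", "Bruno", "Chloe", "David"]
SURNAMES = ["Lemoine", "Garnier", "Roussel", "Faure"]

def substitute(table):
    """Pick a replacement of the same nature, independently of the original value."""
    return secrets.choice(table)

def anonymize_customer(customer):
    """Return a copy of the customer with identifying fields replaced."""
    first_name = substitute(FIRST_NAMES)
    surname = substitute(SURNAMES)
    # Composition uses the anonymized values only, never the original ones.
    email = f"{first_name}.{surname}@example.com".lower()
    return {**customer, "first_name": first_name, "surname": surname, "email": email}

customer = {"first_name": "Marie", "surname": "Durand",
            "email": "marie.durand@corp.example", "order_total": 42}
print(anonymize_customer(customer))
```

Because each replacement is drawn independently of the original, the same original name can receive different replacements from one record to the next. Whether that is acceptable depends on the business constraints defined for the project.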
In the second part of our article on anonymization, we will discuss the desired guarantees, the mechanisms of anonymization, the business and technical constraints as well as a concrete example of anonymization and our conclusions on this principle.
Date
August 11, 2015
Author
Arnaud Witschger
IT Director