A poor person’s guide to Open Sciencing GDPR-compliant data management

Open Science
Workflow
Sci-Ops

September 14, 2019

In this brief post I’ll outline a simple data management strategy that is consistent with both GDPR and Open Science principles. For readers unfamiliar with these: the EU’s General Data Protection Regulation (GDPR) somewhat sternly encourages treating personal data decently, and Open Science principles promote sharing data as much as possible.

When are data personal?

We deal with this in more detail in Crutzen, Peters & Mondschein (2019), but basically, personal data are data about a person. So: I am 1 meter and 75 centimeters tall. Because you know that this is about me, it is personal data. However, if the average height of people in my city is 1.75 meters, that number is not personal data; it is a fact about the world. Of course, if I were the only person living in my city, it would be excessively easy to figure out my height, which would again render the city average personal data.

The crux, therefore, is identifiability. This means that the sampling frame is very important. If a sample is truly random, e.g. drawn from millions of people, single variables such as age in years or religion are not personal data. However, if enough such columns are combined, identification becomes possible, which then renders all columns personal data - after all, they are then all about identifiable persons.
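To make this concrete, here is a minimal Python sketch with made-up records and hypothetical quasi-identifier columns. Each column on its own leaves everyone in a group of several lookalikes, but combining columns shrinks the smallest group to one person, at which point the rows become identifiable:

```python
from collections import Counter

# Toy records: (age bracket, religion, postcode area).
# Entirely made-up data, for illustration only.
records = [
    ("30-39", "none",     "6211"),
    ("30-39", "catholic", "6211"),
    ("30-39", "none",     "6221"),
    ("40-49", "catholic", "6221"),
    ("40-49", "none",     "6211"),
    ("40-49", "catholic", "6211"),
]

def smallest_group(rows, columns):
    """Size of the smallest group of rows sharing the same values on `columns`."""
    counts = Counter(tuple(row[i] for i in columns) for row in rows)
    return min(counts.values())

print(smallest_group(records, [0]))     # age bracket alone: 3 lookalikes
print(smallest_group(records, [1]))     # religion alone: 3 lookalikes
print(smallest_group(records, [0, 1]))  # combined: a group of 1 -> identifiable
```

Once the smallest group size drops to one, any further column in those rows describes an identifiable person, so the whole row counts as personal data.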

What do you have to do if you handle personal data?

Personal data are always owned by the respective persons. Others can only process the data for them. The GDPR holds that this processing by definition always has a temporary nature, and must abide by some rules. These include the responsibility to properly log all processing that is done; make it easy for people to view, change or remove their personal data; maintain a list of everybody who has access to the data; and obviously, prevent data leaks.

Open Sciencing: publishing all data

Open Science, on the other hand, encourages researchers to make all data public as soon as possible. There are very many good reasons for this, but of course it’s hard to combine with the GDPR requirements. Or is it? The solution lies in anonymization. The GDPR only deals with personal data. Once data are no longer identifiable, they can be safely shared.

So, the question that plagues the conscientious psychological scientist is: how do I satisfy both the GDPR and Open Science principles simultaneously?

The Answer: encryption

The answer lies in encryption. The point of the GDPR is that you’re not allowed to leak data - that is, you may not give others access to the data. This doesn’t mean you can’t send them the datafile.

Somewhat counterintuitively, it is fully consistent with the GDPR to send datafiles with personal data to anybody you want.

The crux is that as long as they are unable to access the data, it’s as if you didn’t share them.

And as long as a file is encrypted with an algorithm that is virtually uncrackable, they inevitably require the password to access the data. Without the password, the file is useless and can safely be made public - by making the encrypted datafile public, you are not making the data contained therein public.
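As a toy illustration of this principle - using a one-time-pad XOR rather than the AES-256 recommended below, purely because Python's standard library has no AES - the ciphertext can safely be published, and only whoever holds the key can recover the data:

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of `data` with the corresponding byte of `key`."""
    return bytes(d ^ k for d, k in zip(data, key))

plaintext = b"participant 17 is 1.75 m tall"      # hypothetical personal data
key = secrets.token_bytes(len(plaintext))         # the 'password' equivalent

ciphertext = xor_bytes(plaintext, key)            # safe to make public
recovered = xor_bytes(ciphertext, key)            # key holders get the data back
print(recovered == plaintext)                     # True
```

The ciphertext alone carries no usable information; sharing it shares nothing. In the actual workflow, 7-Zip's AES-256 plays the role of the XOR here, with a password instead of a random byte key.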

A workflow

The simplest workflow is, of course, to avoid collecting personal data. In such cases, simply publish your data as soon as you have them, using e.g. GitLab, GitHub, and/or the OSF. And in most cases, it is possible to redesign a study such that collecting personal data can be avoided, for example using the Research Code (https://researchcode.eu).

However, in some cases, such redesign is impossible. In those cases, I recommend the following workflow. I assume you already have secure protocols for the data collection itself and for handling the data up to the point where all data are merged into one big file. However, the workflow below is easily extended to other scenarios.

  1. Make sure only a limited number of people have access to the raw, identifiable dataset. Avoid common cloud services such as Dropbox and Google Drive; instead opt for Zero Knowledge solutions such as Sync.com.
  2. If you use a version control system such as Git, make sure the raw, identifiable file is not synced to the server, for example by including “[PRIVATE]” in the filename and adding a line containing “*\[PRIVATE\]*” (without the double quotes) to your .gitignore file. The backslashes escape the square brackets, which Git would otherwise interpret as a character class, and the asterisks match any surrounding characters, so this excludes all files and directories with [PRIVATE] in their name (for more about a Git-based research workflow, see this post).
  3. Once the data are complete, run a script that anonymizes the data and writes the anonymized version to a different filename; this publishable file can then be synced to GitLab or GitHub (and then synced with the OSF).
  4. After you have run the script, archive the raw, identifiable dataset using the Free/Libre and Open Source Software (FLOSS) 7-Zip (see https://www.7-zip.org/ and download the version for your operating system).
  5. When archiving, choose the .7z format (better compression) or the .zip format (native support in many operating systems) and choose AES-256 encryption (see e.g. here, here, and here).
  6. Make sure the password used to encrypt the file is very strong, and store it in a password manager such as the excellent KeePass2 (see the website and this EU project that offers money for finding bugs). Make sure all researchers with access to the raw data (see step 1) use such a password manager to store the password.
  7. Send the password to those people using a secure messaging app such as Signal (see https://signal.org/; conveniently, it also has a desktop client). Do not send passwords over insecure channels such as email, and avoid potentially unsafe channels such as WhatsApp.
  8. Delete the raw, identifiable dataset. If you want to be entirely sure it can never be recovered, use a program such as File Shredder to make sure it gets permanently deleted.
  9. You can now make both the encrypted dataset and the anonymized dataset public. Both are GDPR-compliant; the encrypted version is only accessible to people with the password, and the anonymized version does not contain personal data. If anybody else should be granted access to the data, you only need to refer them to the repository where you publish your resources (e.g. OSF, GitLab, GitHub, etc), and then (securely!) send them the password.
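To sketch what the script in step 3 might look like: a minimal Python version copies a CSV file while dropping columns that directly identify participants. The column names and file names below are hypothetical placeholders; adapt them to your own codebook.

```python
import csv

# Hypothetical direct identifiers; replace with your own codebook's columns.
DIRECT_IDENTIFIERS = {"name", "email"}

def anonymize(raw_path, public_path):
    """Copy a CSV file, dropping the columns that directly identify people."""
    with open(raw_path, newline="") as src, open(public_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        kept = [c for c in reader.fieldnames if c not in DIRECT_IDENTIFIERS]
        writer = csv.DictWriter(dst, fieldnames=kept)
        writer.writeheader()
        for row in reader:
            writer.writerow({c: row[c] for c in kept})

# Tiny demonstration file standing in for the raw, identifiable dataset.
with open("raw [PRIVATE].csv", "w", newline="") as f:
    f.write("name,email,age,score\nAda,ada@example.org,36,7\n")

anonymize("raw [PRIVATE].csv", "data-anonymized.csv")
print(open("data-anonymized.csv").read().splitlines()[0])  # age,score
```

A real script would also need to handle indirect identifiers (see the identifiability discussion above): dropping direct identifiers alone does not guarantee anonymity if combinations of the remaining columns still single people out.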
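Steps 4 to 6 can be sketched as follows. The password generator uses Python’s standard secrets module (a cryptographically secure random source); the 7-Zip invocation shown in the comment uses 7-Zip’s -p (password) and -mhe=on (encrypt file names as well; .7z format only) command-line switches. The file names are placeholders.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits

def strong_password(length=32):
    """Generate a random password from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

password = strong_password()
print(len(password))  # 32

# The archive itself is then created with 7-Zip on the command line, e.g.:
#   7z a "raw.7z" "raw [PRIVATE].csv" -p<password> -mhe=on
# -mhe=on additionally encrypts the file names inside the archive,
# which the .7z format supports but the .zip format does not.
```

Store the generated password straight into your password manager (step 6) rather than writing it down or emailing it; the Signal step then distributes it to the other researchers.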

Conclusion

I hope this suggested workflow helps people dealing with personal data. If this was useful to you: I outline another workflow that I found useful in “A reproducible research workflow: GitLab, R Markdown, and Open Science Framework”.