Data Privacy and Distributed Compute

Hugo Bowne-Anderson
July 30, 2020

On our #ScienceThursday live stream, we recently caught up with Katharine Jarmul, Head of Product at Cape Privacy, about

  • data privacy-enhancing techniques and when to use them;
  • how to write policy for privacy-enhancing techniques and apply them to a pandas DataFrame;
  • when transformations might be important during distributed data processing and how distributed computing in machine learning could be a harbinger for advanced privacy techniques, such as federated learning.

You can check out the YouTube video below and, in this post, we’ll summarize the key takeaways from the bullets above.

After brief introductions to Katharine and Cape Privacy, and some motivation for why privacy-preserving data transformations are important, we jumped into coding with the open-source Cape Python package on pandas DataFrames.

Privacy-preserving transformations for pandas DataFrames

You can use Cape Python to add some straightforward privacy techniques to the data science workflows you’re already using. The techniques currently focus on the exploratory data analysis (EDA) and pre-processing steps of the data science workflow, and more are coming soon for machine learning and inference.

Katharine took us through some mock health and biometric data (IoT, app, or Fitbit) and reasoned with us about privacy challenges inherent in the data:

  • There are timestamps in the dataset, and timestamps are a privacy concern. Take the Netflix Prize example: each rating carried a timestamp, which was used to link records to reviews on IMDb and hence de-anonymize users. De-anonymization and re-identification approaches are sophisticated and we need to be aware of them when reasoning through the privacy implications of datasets (check out Narayanan & Shmatikov’s paper How To Break Anonymity of the Netflix Prize Dataset).
  • We also saw in the name count histogram that most names appear only a few times, while a few names in the tail appear many times. This means that some users are very active while many are not. What are the privacy implications? Privacy is already unevenly distributed in this dataset!
Hugo Bowne-Anderson commenting on Katharine Jarmul's Cape Python with PySpark demo on Coiled's Science Thursday live stream.
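To make the point about uneven activity concrete, here is a minimal sketch in plain Python. The names and record counts are invented for illustration and are not the data from the live stream. It counts records per user and then checks how much of the dataset comes from the single most active user.

```python
from collections import Counter

# Hypothetical records: one entry per measurement, keyed by user name.
records = ["ana", "ana", "ana", "ana", "ana", "ana", "ben", "chloe", "dev", "eli"]

# Number of records contributed by each user.
counts = Counter(records)

# Histogram of activity: how many users contributed exactly k records.
histogram = Counter(counts.values())

# Share of the whole dataset that comes from the most active user.
top_user, top_count = counts.most_common(1)[0]
top_share = top_count / len(records)

print(dict(histogram))
print(top_user, top_share)
```

A user who contributes most of the rows is exposed in most analyses of the data, which is one way to read "privacy is unevenly distributed".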

With respect to quantifiable privacy, users in the tail have a lot more potential for personal privacy loss when we analyze their data. Perhaps this is okay, but these are questions we need to be asking, particularly knowing that, in many places, marginalized and structurally oppressed communities are over-represented in many datasets, such as those covering people receiving government aid. We encourage you to check out Automating Inequality by Virginia Eubanks for more on this.

Datetimes and privacy methods

Let’s say we’re interested in temperature and heart rate over time but we don’t need the exact time and are happy with truncation, which allows us to preserve some privacy. Katharine showed us how we can use pandas_transformations.DateTruncation to do so. This is your first Cape Python transformation! You can see the input and output in the screenshot below:

Katharine Jarmul coding in a Jupyter Notebook for a Cape Python with PySpark demo on Coiled's Science Thursday.
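As a rough illustration of what date truncation does, here is a minimal sketch in plain Python. It does not use Cape Python's API: the truncate_datetime helper, its frequency argument, and the sample readings are all assumptions made for this example.

```python
from datetime import datetime

def truncate_datetime(ts: datetime, frequency: str) -> datetime:
    """Zero out every component of `ts` that is finer than `frequency`."""
    if frequency == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    if frequency == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if frequency == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if frequency == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported frequency: {frequency!r}")

# Hypothetical heart-rate readings with exact timestamps.
readings = [
    {"timestamp": datetime(2020, 7, 30, 14, 23, 51), "heart_rate": 72},
    {"timestamp": datetime(2020, 7, 30, 14, 48, 7), "heart_rate": 75},
]

# Keep the measurement, but only reveal the hour in which it was taken.
truncated = [
    {**row, "timestamp": truncate_datetime(row["timestamp"], "hour")}
    for row in readings
]

for row in truncated:
    print(row["timestamp"].isoformat(), row["heart_rate"])
```

Both readings now share the same timestamp, so an exact time can no longer be used to link a row to an outside dataset, while hourly trends remain available for analysis.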

Names and privacy methods

The next task was to reason about preserving privacy in names. The approach Katharine took in this demo was tokenizing names, which replaces each name with a token derived by hashing.

Word of warning: the relevant function, pandas_transformations.Tokenizer, has the kwarg key and, for a given key, the same input will always yield the same output. This makes the transformation deterministic, and deterministic transformations are prone to linkage attacks: given enough data, perhaps from several sources, an individual identified in one dataset can be matched to their token in another, which could result in a privacy violation.

The table generated from these transformations is the one that we now use for regular data science and that we would give more people in an organization access to. Note that it is not totally anonymized, but it does preserve some privacy.
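The determinism described above can be sketched with Python's standard hmac module. This is an illustration under stated assumptions, not Cape Python's implementation: the tokenize helper, the key, and both toy datasets are hypothetical.

```python
import hashlib
import hmac

def tokenize(value: str, key: bytes, max_token_len: int = 16) -> str:
    """Return a keyed, deterministic token for `value` (hypothetical helper)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:max_token_len]

key = b"demo-secret"  # assumption: a single key shared across both datasets

# Two hypothetical datasets that were tokenized with the same key.
dataset_a = [("Alice Example", 72), ("Bob Example", 80)]  # name, heart rate
dataset_b = [("Alice Example", "Berlin")]                 # name, city

tokens_a = {tokenize(name, key): heart_rate for name, heart_rate in dataset_a}
tokens_b = {tokenize(name, key): city for name, city in dataset_b}

# Same input and same key give the same token, so rows can be joined on it.
linked = {
    token: (tokens_a[token], tokens_b[token])
    for token in tokens_a.keys() & tokens_b.keys()
}
print(linked)
```

The join succeeds without anyone ever seeing a name. If an attacker can identify the person behind one token in either dataset, everything linked to that token in the other dataset is exposed too.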

Distributed Privacy Preservation

We then moved to a distributed setting: Katharine took us through the NYC taxi dataset, which is now somewhat infamous because its identifiers were hashed in a way that was easily reversible. This made it straightforward to recover the taxi medallion for each row, which raised all kinds of privacy concerns.
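Why would a hash be easy to reverse? One common reason is that the identifiers come from a small, predictable space, so every candidate can be hashed ahead of time. The sketch below assumes an unsalted hash and a made-up identifier format (digit, letter, digit, digit); it is not the actual format or code from the taxi data release.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits

def unsalted_hash(identifier: str) -> str:
    """Hash an identifier with no salt or secret key."""
    return hashlib.sha256(identifier.encode("ascii")).hexdigest()

# Enumerate every identifier of the hypothetical form digit-letter-digit-digit
# and remember which identifier produced which hash.
lookup = {}
for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
    candidate = f"{d1}{letter}{d2}{d3}"
    lookup[unsalted_hash(candidate)] = candidate

# A "pseudonymized" value as it might appear in a published dataset.
published = unsalted_hash("7B42")

# Reversing it is a dictionary lookup.
recovered = lookup[published]
print(len(lookup), recovered)
```

The strength of the hash function does not help here: with only 26,000 possible inputs, the whole table is built in a fraction of a second.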

We saw that, instead of writing your data transformations in an ad hoc, EDA style in notebooks, you can actually write your policies as .yaml files and apply them (we also saw the limitations of Spark in notebooks). Check out the screenshot below and notice that Katharine is applying transformations to passenger count, along with several latitudes and longitudes.

Katharine Jarmul coding in a Jupyter Notebook for a Cape Python with PySpark demo on Coiled's Science Thursday.
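The idea of keeping a policy as data, separate from the code that applies it, can be sketched in plain Python. Everything here is an assumption for illustration: the policy schema, the transformation names, and the sample trip do not reflect Cape Python's policy format, and the dictionary stands in for what would live in a .yaml file.

```python
from typing import Any, Callable, Dict, List

# Registry of available transformations (hypothetical names).
TRANSFORMS: Dict[str, Callable[..., Any]] = {
    "round": lambda value, precision: round(value, precision),
    "clip": lambda value, low, high: max(low, min(high, value)),
}

# Declarative policy: which transformation applies to which column.
POLICY = {
    "rules": [
        {"column": "passenger_count", "transform": "clip", "args": {"low": 1, "high": 4}},
        {"column": "pickup_latitude", "transform": "round", "args": {"precision": 2}},
        {"column": "pickup_longitude", "transform": "round", "args": {"precision": 2}},
    ]
}

def apply_policy(rows: List[Dict[str, Any]], policy: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return transformed copies of `rows`; columns without a rule pass through."""
    result = []
    for row in rows:
        new_row = dict(row)
        for rule in policy["rules"]:
            column = rule["column"]
            if column in new_row:
                transform = TRANSFORMS[rule["transform"]]
                new_row[column] = transform(new_row[column], **rule["args"])
        result.append(new_row)
    return result

# One hypothetical taxi trip.
trips = [
    {"passenger_count": 6, "pickup_latitude": 40.748817,
     "pickup_longitude": -73.985428, "fare": 12.5},
]
protected = apply_policy(trips, POLICY)
print(protected[0])
```

Because the policy is data rather than notebook code, it can be reviewed, versioned, and reused across jobs without touching the processing logic.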

Another highlight of this section was Matt jumping into the Cape Python source code and seeing how it could be made Dask-dataframe compatible.

Privacy-preserving machine learning and distributed compute

We wrapped up with a general conversation about preserving privacy in machine learning applications and how distributed computing and privacy are not only compatible but also support one another and can be mutually beneficial, especially when considering today’s data science and machine learning problems.

Katharine posed the question “can we move to a place where we can actually reason about data without always taking it from someone and storing it somewhere else?” This dovetailed into a conversation about federated learning and federated analytics, which essentially means training a machine learning model, or computing over data, across a distributed series of devices or compute clusters, wherever the data is.

Hugo gave the federated learning example that Google has worked on with respect to cell phone data (note: Google was thinking about this for reasons other than privacy, such as lower latency and power consumption):

“Your device downloads the current model, improves it by learning from data on your phone, and then summarizes the changes as a small focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other user updates to improve the shared model. All the training data remains on your device, and no individual updates are stored in the cloud.”
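The loop described in the quote can be sketched as a toy simulation. This is a minimal illustration under strong assumptions: a one-parameter model y ≈ weight * x, made-up device data, and a plain unweighted average. It leaves out encryption, secure aggregation, and everything else a production system needs.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]

def local_update(weight: float, data: Dataset, lr: float = 0.01, epochs: int = 5) -> float:
    """Improve the model with a few gradient steps on one device's private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error for y ≈ weight * x.
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight

def federated_round(global_weight: float, clients: List[Dataset]) -> float:
    """Each client trains locally; the server only sees and averages the results."""
    updates = [local_update(global_weight, data) for data in clients]
    return sum(updates) / len(updates)

# Hypothetical private datasets, one per device, all following y = 3 * x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
    [(0.5, 1.5), (1.5, 4.5)],
]

weight = 0.0
for _ in range(50):
    weight = federated_round(weight, clients)

print(round(weight, 4))
```

The raw (x, y) pairs never leave local_update. Only the updated weight is returned to the server, which is the property the quote highlights.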

If you enjoyed this recap, check out the video and sign up to our newsletter below for updates on future #ScienceThursday live streams!
