On our #ScienceThursday live stream, we recently caught up with Katharine Jarmul, Head of Product at Cape Privacy, about privacy-preserving data science with Cape Python.
You can check out the YouTube video below and, in this post, we’ll summarize the key takeaways from the conversation.
After brief introductions to Katharine and Cape Privacy, and the motivation for why privacy-preserving data transformations matter, we jumped into coding with the open-source Cape Python package on pandas DataFrames.
You can use Cape Python to add some straightforward privacy techniques to any data science workflow you’re already running. The package currently focuses on the exploratory data analysis (EDA) and pre-processing steps of the data science workflow, with more support for machine learning and inference coming soon.
Katharine took us through some mock health and biometric data (from IoT devices, apps, or a Fitbit) and reasoned with us about the privacy challenges inherent in the data.
With respect to quantifiable privacy, individuals in the tail of a distribution stand to lose far more privacy when we analyze their data. Perhaps this is acceptable, but these are questions we need to be asking, particularly knowing that, in many places, marginalized and structurally oppressed communities are over-represented in many datasets, such as datasets of people receiving government aid. We encourage you to check out Automating Inequality by Virginia Eubanks for more on this.
Let’s say we’re interested in temperature and heart rate over time, but we don’t need the exact timestamps and are happy to truncate them, which allows us to preserve some privacy. Katharine showed us how to do this with pandas_transformations.DateTruncation, your first Cape Python transformation! You can see the input and output in the screenshot below:
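If you want a feel for what date truncation does without installing anything, here is a minimal sketch using plain pandas (the column names and day-level granularity are our choices here, not Cape Python’s API):

```python
import pandas as pd

# Mock sensor readings with precise timestamps (illustrative data).
df = pd.DataFrame({
    "time": pd.to_datetime(["2020-05-01 14:23:11", "2020-05-02 09:05:42"]),
    "heartrate": [68, 74],
})

# Truncate timestamps to day granularity, discarding hours/minutes/seconds.
# This mirrors the idea behind a DateTruncation-style transformation: keep
# the coarse information we need, drop the precise detail that aids
# re-identification.
df["time"] = df["time"].dt.floor("D")

print(df)
```

Cape Python’s actual transformation handles this declaratively and supports other frequencies; see its documentation for the real API.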
The next task was to reason about preserving privacy in names and the approach Katharine took in this demo was tokenizing names, which is a hashing mechanism.
Word of warning: the relevant tokenization transformation has the kwarg key and, given the same key and the same input, will always yield the same output. This makes it a deterministic scheme, and deterministic schemes are prone to linkage attacks: given enough data, possibly from different sources, tokens can be joined across datasets, and if an individual can be identified from that other data, this can become a privacy violation.
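To see why determinism matters, here is a sketch of keyed tokenization using HMAC. This is not Cape Python’s implementation, just the general mechanism, and the names and data below are purely illustrative:

```python
import hmac
import hashlib

def tokenize(value: str, key: bytes) -> str:
    """Deterministically map a value to a token using a keyed hash (HMAC-SHA256)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

key = b"secret-key"  # in practice, a securely stored secret

# The same input under the same key always yields the same token...
assert tokenize("Alice", key) == tokenize("Alice", key)

# ...so the same person is linkable across datasets tokenized with the same key.
dataset_a = {tokenize("Alice", key): {"heartrate": 68}}
dataset_b = {tokenize("Alice", key): {"zip": "10001"}}
shared = dataset_a.keys() & dataset_b.keys()  # joining on tokens: a linkage attack
print(shared)
```

Note that using a different key per dataset breaks this linkage, which is one reason key management matters for tokenization.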
The table generated by these transformations is the one we would now use for day-to-day data science, and the one we would give more people in an organization access to. Note that it is not fully anonymized; rather, some identifying information has been obfuscated.
We then moved to a distributed setting: Katharine took us through the NYC taxi dataset, now somewhat infamous because it was hashed in a way that was easily reversible. This meant it was straightforward to recover the taxi medallion for each row, which raised all kinds of privacy concerns.
We saw that, instead of writing your data transformations in an ad hoc, EDA style in notebooks, you can write your policies as .yaml files and apply them (we also saw some limitations of Spark in notebooks). Check out the screenshot below and notice that Katharine applies transformations to passenger count, along with several latitude and longitude columns.
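To give a flavor of the policy-as-code idea, a policy file might look roughly like the sketch below. The field names, transformation types, and parameters here are illustrative only; consult Cape Python’s documentation for the exact schema:

```yaml
# Illustrative sketch of a Cape-style policy file; the exact schema may differ.
label: taxi_policy
version: 1
rules:
  - match:
      name: passenger_count
    actions:
      - transform:
          type: numeric-rounding
          precision: 0
  - match:
      name: pickup_latitude
    actions:
      - transform:
          type: numeric-perturbation
          min: -0.05
          max: 0.05
```

The appeal of this approach is that privacy decisions live in a reviewable, versionable artifact rather than being scattered across notebook cells.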
Another highlight of this section was Matt jumping into the Cape Python source code and seeing how it could be made Dask-dataframe compatible.
We wrapped up with a general conversation about preserving privacy in machine learning applications, and about how distributed computing and privacy are not only compatible but mutually beneficial, especially for today’s data science and machine learning problems.
Katharine posed the question “can we move to a place where we can actually reason about data without always taking it from someone and storing it somewhere else?” This dovetailed into a conversation about federated learning and federated analytics: essentially, the ability to train a machine learning model (or compute analytics) over data that stays on a distributed set of devices or compute clusters, wherever the data lives.
Hugo gave the federated learning example that Google has worked on with respect to cell phone data (note: Google was also motivated by reasons beyond privacy, such as lower latency and power consumption):
“Your device downloads the current model, improves it by learning from data on your phone, and then summarizes the changes as a small focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other user updates to improve the shared model. All the training data remains on your device, and no individual updates are stored in the cloud.”
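The update-averaging step in that quote can be sketched in a few lines of NumPy. This is purely illustrative (real federated averaging weights updates by local dataset size and typically adds secure aggregation), but it shows the core idea: only updates, never raw data, reach the server:

```python
import numpy as np

# Suppose three devices each compute a local model update (e.g. a weight
# delta) from their own data, which never leaves the device.
local_updates = [
    np.array([0.1, -0.2, 0.05]),
    np.array([0.3, -0.1, 0.00]),
    np.array([0.2, -0.3, 0.10]),
]

# The server only ever sees the (ideally encrypted and aggregated) updates,
# and averages them into a single update for the shared model.
global_update = np.mean(local_updates, axis=0)

# The shared model improves without any raw data being centralized.
shared_weights = np.zeros(3) + global_update
print(shared_weights)
```

In production systems the averaging is weighted and combined with techniques like secure aggregation or differential privacy, but the data-stays-local principle is the same.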
If you enjoyed this recap, check out the video and sign up to our newsletter below for updates on future #ScienceThursday live streams!