From a privacy point of view, whether a data sample (image, text, audio, etc.) was used to train a (supervised) model is, in many cases, sensitive information. Inferring or extracting this specific piece of information is what we refer to as Membership Inference.
Membership inference can be used both to support or to undermine privacy.
On one hand, it can violate privacy by revealing private information. For example, inferring that a patient’s medical data is in the training set of a disease-prediction model may disclose the patient’s medical condition.
On the other hand, people can use membership inference to check whether their private information was used for training without their permission, thus improving privacy by enabling audits and ensuring accountability in data usage.
A Membership Inference Attack is one of the most popular privacy attacks on AI models: it tries to determine whether an input sample was part of the model’s training dataset or not.
The authors consider the worst-case scenario from an attacker’s perspective in order to demonstrate how realistic the attack is.
Assumptions
A data sample of choice
(image, text, audio, etc.)
Black-box access to a target model
(Only the model’s predictions are accessible, not its internal parameters)
Objective
Attacking black-box models is more complex than attacking white-box models, whose structure and parameters are known to the adversary. The authors evaluated their techniques against popular MLaaS (Machine Learning as a Service) platforms, which provide black-box access to machine learning models via API calls.
The key observation that guided the authors was that machine learning models often behave differently on the data that they were trained on versus the data that they “see” for the first time.
So, they used machine learning itself to detect this difference in the target model’s behavior, thereby converting the membership inference problem into a binary classification problem (an adversarial use of machine learning).
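To make the intuition concrete, here is a minimal sketch (my own illustrative setup, not the authors’ code) that measures the confidence gap between training (“member”) records and unseen (“non-member”) records, using a scikit-learn classifier as a stand-in for the black-box target:

```python
# Hypothetical setup: a scikit-learn MLP stands in for the black-box target;
# only its prediction probabilities are used, as in the black-box setting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_classes=4,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

target = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                       random_state=0).fit(X_train, y_train)

conf_members = target.predict_proba(X_train).max(axis=1)      # seen during training
conf_non_members = target.predict_proba(X_test).max(axis=1)   # never seen

# A model that overfits is typically more confident on members than on
# non-members; this gap is the signal the attack model learns to exploit.
print("mean confidence on members:    ", conf_members.mean())
print("mean confidence on non-members:", conf_non_members.mean())
```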
Goal
Construct an Attack Model that can recognize differences in the target model’s predictions on the inputs it was trained on versus the inputs it was not trained on.
Challenge
The adversary has no information about the internal parameters of the target model and only limited query access (black-box access) to it through a public API.
Observation (Experimental)
Similar models trained on relatively similar data records using the same service behave in a similar way.
So, the authors introduced a novel Shadow Training technique that enables training the attack model on proxy targets for which the training dataset is known.
Create multiple Shadow Models to imitate the behavior of the Target Model
Generate training data for the Shadow Models
An adversary can generate synthetic training data using the Target Model (Model-based Synthesis) or statistics about the underlying population (Statistics-based Synthesis). Alternatively, the adversary might have access to a potentially noisy version of the Target Model’s training dataset (Real-world Data). A simplified sketch of Model-based Synthesis is shown below.
In this work, the authors evaluated the worst-case scenario where the shadow training dataset is disjoint from the target model’s training dataset. In the real world, the two datasets would likely overlap, since the adversary has some idea of the data population, and this overlap may result in even better attack performance.
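As a rough illustration of Model-based Synthesis, the sketch below performs a simplified hill-climbing search over a binary feature space, guided only by the target model’s output confidence. The `query_target` function is a hypothetical black-box wrapper around the target model’s prediction API; the paper’s actual algorithm additionally adapts the number of flipped features and samples accepted records probabilistically.

```python
# Simplified sketch of model-based synthesis: hill-climbing in the input space
# guided by the target model's confidence. Assumes binary features and a
# hypothetical query_target(x) wrapper returning the probability vector.
import numpy as np

def synthesize_record(query_target, target_class, n_features,
                      k=4, max_iters=1000, conf_threshold=0.9, rng=None):
    rng = rng or np.random.default_rng()
    x = rng.integers(0, 2, size=n_features)            # random starting record
    best_conf = 0.0
    for _ in range(max_iters):
        candidate = x.copy()
        flip = rng.choice(n_features, size=k, replace=False)
        candidate[flip] = 1 - candidate[flip]           # randomize k features
        probs = query_target(candidate)
        conf = probs[target_class]
        if conf > best_conf:                            # keep improving changes
            x, best_conf = candidate, conf
            # Accept once the target model confidently assigns the target class;
            # the record is then used as shadow training data for that class.
            if best_conf > conf_threshold and int(np.argmax(probs)) == target_class:
                return x
    return None                                         # search budget exhausted
```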
Train (Supervised) the Shadow Models using the generated data
The shadow models must be trained in a way similar to the target model. This is feasible when the target model’s training algorithm and model structure are known. In the case of MLaaS (Machine Learning as a Service), the attacker can use exactly the same service (e.g., Google Prediction API) to train the shadow models as was used to train the target model.
Train (Supervised) the Attack Model on the labeled inputs and outputs of the shadow models
To recognize differences in shadow models’ behavior when these models operate on inputs from their own training datasets versus inputs they did not encounter during training.
This binary classification task can be performed using any state-of-the-art machine learning framework or service.
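Putting the steps together, here is a minimal end-to-end sketch of shadow training (an illustrative setup, not the authors’ implementation). It assumes the adversary holds real-world data from the same population, split into disjoint shadow datasets; for brevity it trains a single attack model with the class label as a feature, whereas the paper trains one attack model per output class.

```python
# Minimal sketch of shadow training: train shadow models on data with known
# membership, then train a binary attack model on their prediction vectors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

N_SHADOW = 5
X, y = make_classification(n_samples=10000, n_features=20, n_classes=4,
                           n_informative=10, random_state=1)

attack_X, attack_y = [], []
rng = np.random.default_rng(1)
shadow_splits = np.array_split(rng.permutation(len(X)), N_SHADOW)

for idx in shadow_splits:
    # Half of each shadow dataset is used for training ("in"), half is held out ("out").
    X_in, X_out, y_in, y_out = train_test_split(X[idx], y[idx], test_size=0.5,
                                                random_state=1)
    shadow = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                           random_state=1).fit(X_in, y_in)
    # Attack features: the shadow model's prediction vector plus the true label;
    # the binary target is member (1) vs. non-member (0).
    for X_part, y_part, member in ((X_in, y_in, 1), (X_out, y_out, 0)):
        feats = np.column_stack([shadow.predict_proba(X_part), y_part])
        attack_X.append(feats)
        attack_y.append(np.full(len(feats), member))

attack_model = RandomForestClassifier(n_estimators=100, random_state=1)
attack_model.fit(np.vstack(attack_X), np.concatenate(attack_y))

# At attack time: query the target model on a candidate record, build the same
# feature vector, and let attack_model predict membership.
```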
The attacks were evaluated on the following datasets: CIFAR-10, CIFAR-100, the Purchase Dataset (based on Kaggle’s Acquire Valued Shoppers challenge), the Location Dataset (from a publicly available set of mobile users’ location “check-ins”), the Texas Hospital Stays Dataset, MNIST, and UCI Adult (Census Income).
The target models were trained using the Google Prediction API, Amazon ML, and a locally implemented neural network.
The authors discuss the following mitigation strategies:
Restrict the prediction vector to the top k classes
The output is still useful to users, but exposes less information to attackers.
Round the classification probabilities in the prediction vector
Increase entropy of the prediction vector
Modify (or add) the softmax layer and increase its normalizing temperature. For a very large temperature, the output becomes almost uniform and nearly independent of the input, thus leaking very little information. So there is a tradeoff between utility and information leakage (see the sketch after this list).
Use regularization to overcome overfitting
From their evaluation, the authors found that overfitting is an important factor in the information leakage of machine learning models, although it is not the only one.
In the realm of Machine Learning, overfitting in itself is a canonical problem that limits the predictive power of the model.
So, if we can mitigate overfitting, we not only increase the accuracy of the model but also tackle the information leakage problem. Thus, instead of trading off privacy against utility, we can improve both.
Differential Privacy
By definition, differentially private training bounds the success of membership inference attacks, but it may significantly reduce the model’s accuracy.
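As an illustration of the prediction-vector mitigations above (top-k filtering, coarse rounding, and softmax temperature scaling), here is a small sketch; the function names are mine, not the paper’s:

```python
# Illustrative sketches of the prediction-vector mitigations (function names
# are hypothetical, not from the paper).
import numpy as np

def restrict_top_k(probs, k=3):
    """Return the prediction vector with only the top-k probabilities kept."""
    out = np.zeros_like(probs)
    top = np.argsort(probs)[-k:]
    out[top] = probs[top]
    return out

def round_probs(probs, digits=2):
    """Round probabilities to a few digits, coarsening the leaked signal."""
    return np.round(probs, digits)

def softmax_with_temperature(logits, t=1.0):
    """Softmax with normalizing temperature t: exp(z_i / t) / sum_j exp(z_j / t).
    A large t pushes the output toward uniform, leaking less about the input."""
    z = np.asarray(logits, dtype=float) / t
    z -= z.max()                       # for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([4.0, 1.5, 0.5, -1.0])
print(softmax_with_temperature(logits, t=1.0))    # peaked, higher leakage
print(softmax_with_temperature(logits, t=20.0))   # nearly uniform, low utility
```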
This work proposed a novel technique, Shadow Training, for attacking machine learning models to infer membership.
Through extensive experiments and evaluations, it provided a good understanding of how machine learning models leak information about their training datasets and how different factors contribute to that leakage.
Moreover, the authors suggested, implemented, and evaluated several mitigation strategies. Although these strategies reduced the attack’s effectiveness, they were not enough to fully prevent membership inference attacks, which in turn reinforced the privacy implications addressed in this work.