Suggestions for Vulnerabilities and Protection of Machine Learning Models

Original author: Patrick Hall

Recently, experts have increasingly been addressing the security of machine learning models and proposing various defenses. It is time to study potential vulnerabilities and defenses in detail in the context of popular traditional modeling systems, such as linear and tree-based models trained on static datasets. Although the author of this article is not a security expert, he closely follows topics such as debugging, explanations, fairness, interpretability, and privacy in machine learning.

In this article, we present several likely attack vectors against a typical machine learning system in a typical organization, offer tentative defenses, and consider some common problems and the most promising practices.

1. Data corruption attacks

Data corruption, often called data poisoning, means that someone systematically changes your training data to manipulate your model's predictions (such attacks are also called “causative” attacks). To corrupt data, an attacker must have access to some or all of your training data. In many companies that lack proper controls, various employees, consultants, and contractors may have such access. An attacker outside the security perimeter can also gain unauthorized access to some or all of the training data.

A direct data corruption attack may involve changing dataset labels. Whatever the commercial use of your model, an attacker can then steer its predictions, for example by changing labels so that your model learns to grant large loans, large discounts, or low insurance premiums to people like the attacker. Forcing a model to make false predictions in the attacker's interest is sometimes called a violation of the model's “integrity”.

An attacker can also use data corruption to train your model to deliberately discriminate against a group of people, depriving them of the large loans, large discounts, or low insurance premiums they are entitled to. At its core, this attack is similar to a denial-of-service (DoS) attack. Forcing a model to make false predictions in order to harm others is sometimes called a violation of the model’s “availability”.

Although it may be simplest to think of data corruption as changing values in existing rows of a dataset, you can also corrupt data by adding seemingly harmless or superfluous columns to the dataset. Altered values in these columns can then trigger altered model predictions.

Now let's look at some possible defensive and forensic solutions for data corruption:

  • Disparate impact analysis. Many banks already conduct disparate impact analysis for fair lending purposes to determine whether their model treats different categories of people in a discriminatory way. Many other organizations, however, have not come this far yet. There are several excellent open source tools for detecting discrimination and conducting disparate impact analysis, for example Aequitas, Themis, and AIF360.
  • Fair or private models. Models such as learning fair representations (LFR) and private aggregation of teacher ensembles (PATE) try to focus less on individual demographic traits when generating predictions. These models may also be less susceptible to discriminatory data corruption attacks.
  • Reject on Negative Impact (RONI). RONI is a technique that removes rows of data from the training dataset if they decrease prediction accuracy. For more information about RONI, see “The Security of Machine Learning” in section 8.
  • Residual analysis. Look for strange, prominent patterns in the residuals of your model's predictions, especially for employees, consultants, or contractors.
  • Self-reflection. Score your models on your employees, consultants, and contractors and look for anomalously favorable predictions.

Disparate impact analysis, residual analysis, and self-reflection can be conducted at training time and as part of real-time model monitoring.
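As a rough illustration of the impact analysis described in this section, the sketch below compares the rate of favorable decisions across groups. It is a minimal example under stated assumptions, not a replacement for tools such as Aequitas or AIF360: the function name adverse_impact_ratio, the group labels, the toy data, and the 0.8 threshold (the commonly cited four-fifths rule) are all illustrative choices that do not come from the original article.

```python
from collections import defaultdict


def adverse_impact_ratio(groups, decisions, reference_group):
    """Return {group: favorable rate of group / favorable rate of reference group}."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for group, decision in zip(groups, decisions):
        totals[group] += 1
        favorable[group] += int(decision == 1)
    # Assumes the reference group has at least one favorable decision.
    reference_rate = favorable[reference_group] / totals[reference_group]
    return {
        group: (favorable[group] / totals[group]) / reference_rate
        for group in totals
    }


# Toy scored data: 1 = favorable decision (e.g. loan approved), 0 = unfavorable.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
decisions = [1, 1, 1, 0, 1, 0, 0, 0]

ratios = adverse_impact_ratio(groups, decisions, reference_group="a")
# Groups whose ratio falls below the assumed 0.8 threshold deserve a closer look.
flagged = [group for group, ratio in ratios.items() if ratio < 0.8]
print(ratios, flagged)
```

Running the same check on every retraining and on recent production decisions gives a simple baseline: a sudden drop in one group's ratio could be a sign that labels or rows were tampered with, although it can also have innocent causes.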

2. Watermark attacks

Watermarking is a term borrowed from the deep learning security literature, where it often refers to adding special pixels to an image to trigger a desired outcome from your model. It is entirely possible to do the same with customer or transaction data.

Consider a scenario in which an employee, consultant, contractor, or outside attacker has access to your model's production code that makes real-time predictions. Such a person could change the code so that it recognizes a strange or unlikely combination of input variable values and returns a desired prediction. Like data corruption, watermark attacks can be used to violate the integrity or availability of your model. For example, to violate integrity, an attacker could insert a “payload” into the model's production scoring code so that it recognizes the combination of an age of 0 and 99 years at an address, which triggers some prediction favorable to the attacker.

Defensive and forensic approaches to watermark attacks can include:

  • Anomaly detection. Autoencoders are a fraud detection model that can identify input data that is complex and strange, or unlike other input data. Potentially, autoencoders could catch any watermarks used to trigger malicious mechanisms.
  • Data integrity constraints. Many databases do not allow strange or unrealistic combinations of input variables, which could potentially thwart watermark attacks. Applying integrity constraints to data streams received in real time could have the same benefit.
  • Disparate impact analysis: see section 1.
  • Version control. The production scoring code for the model should be managed and version-controlled, like any other mission-critical software asset.

Anomaly detection, data integrity constraints, and disparate impact analysis can be used at training time and as part of real-time model monitoring.
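To make the idea of integrity checks on incoming rows more concrete, here is a minimal sketch that rejects unrealistic input combinations before they reach the scoring code. The column names, the allowed ranges, and the cross-column rule are hypothetical and would have to be replaced by the constraints that actually hold for your data.

```python
# Hypothetical constraints: (column, minimum, maximum) for single columns.
RANGE_CONSTRAINTS = [
    ("age", 18, 120),
    ("years_at_address", 0, 100),
]


def integrity_violations(row):
    """Return a list of human-readable constraint violations for one input row."""
    violations = []
    for column, low, high in RANGE_CONSTRAINTS:
        value = row.get(column)
        if value is None or not (low <= value <= high):
            violations.append(f"{column}={value!r} outside [{low}, {high}]")
    # Cross-column rule: nobody can have lived at an address longer than their age.
    age, years = row.get("age"), row.get("years_at_address")
    if age is not None and years is not None and years > age:
        violations.append("years_at_address exceeds age")
    return violations


normal_row = {"age": 42, "years_at_address": 7}
watermarked_row = {"age": 0, "years_at_address": 99}  # the combination from the text

print(integrity_violations(normal_row))
print(integrity_violations(watermarked_row))
```

A row with any violation could be refused or sent for manual review instead of being scored. Such checks only catch watermarks that rely on implausible values, so they complement version control of the scoring code rather than replace it.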

3. Inversion by surrogate models

Inversion usually refers to getting unauthorized information out of a model, as opposed to putting information into it. Inversion can also be an example of an “exploratory reverse-engineering attack”. If an attacker can receive many predictions from your model's API or another endpoint (website, app, etc.), they can train their own surrogate model. Simply put, this is a simulation of your predictive model. In theory, an attacker could train a surrogate model between the inputs they used to generate the received predictions and the predictions themselves. Depending on the number of predictions they can receive, the surrogate model could become quite an accurate simulation of your model. Once the surrogate model is trained, the attacker has a “sandbox” in which to plan an impersonation (i.e., “mimicry”) attack or an adversarial example attack against your model's integrity, or potentially to start recovering aspects of your sensitive training data. Surrogate models can also be trained using external data sources that can somehow be matched to your predictions, as ProPublica did with the proprietary COMPAS recidivism model.

To protect your model against inversion by a surrogate model, you can rely on the following approaches:

  • Authorized access. Require additional authentication (for example, two-factor) to receive a prediction.
  • Throttle predictions. Restrict high numbers of rapid predictions from single users; consider artificially increasing prediction latency.
  • White-hat surrogate models. As a white-hat hacking exercise, try the following: train your own surrogate models between your inputs and the predictions of your production model, and carefully observe the following:
    • the accuracy bounds of different types of white-hat surrogate models; try to understand the extent to which a surrogate model can really be used to learn unfavorable knowledge about your model.
    • the types of data trends that can be learned from your white-hat surrogate model, for example linear trends represented by linear model coefficients.
    • the types of segments or demographic distributions that can be learned by analyzing the number of individuals assigned to certain nodes of a white-hat surrogate decision tree.
    • the rules that can be learned from a white-hat surrogate decision tree, for example how to reliably impersonate an individual who would receive a beneficial prediction.
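The prediction throttling suggested in the list above can be sketched as a per-user sliding-window limiter. This is only one possible design under stated assumptions: the class name PredictionThrottle, the limit of three requests per 60 seconds, and the user identifiers are made up for illustration, and timestamps are passed in explicitly so the example is deterministic (a real service would more likely read a monotonic clock).

```python
from collections import defaultdict, deque


class PredictionThrottle:
    """Allow at most max_requests per user within a sliding window of window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # user id -> timestamps of allowed requests

    def allow(self, user_id, now):
        """Return True if the request may be scored, False if it should be rejected."""
        timestamps = self._history[user_id]
        # Drop timestamps that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False
        timestamps.append(now)
        return True


throttle = PredictionThrottle(max_requests=3, window_seconds=60)
# One user sends five requests within four seconds, then one more after a minute.
results = [throttle.allow("user-1", now=t) for t in (0, 1, 2, 3, 4, 61)]
print(results)
```

Limiting the number of predictions per user slows down the collection of input and prediction pairs that a surrogate model needs, which is why throttling and authentication are usually combined.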

4. Adversarial example attacks

In theory, a motivated attacker could learn, say by trial and error (i.e., “exploration” or “sensitivity analysis”), surrogate model inversion, or social engineering, how to game your model to receive a desired prediction or avoid an undesired one. Attempting to achieve these goals with a specially engineered row of data is called an adversarial example attack (sometimes also an exploratory integrity attack). An attacker could use an adversarial example attack to obtain a large loan or a low insurance premium, or to avoid denial of parole based on a high criminal risk score. Some people call using adversarial examples to avoid an undesirable prediction outcome “evasion”.

Try the methods described below to defend against or detect an adversarial example attack:

  • Activation analysis. Activation analysis requires benchmarking internal mechanisms of your predictive models, such as the average activation of neurons in your neural network or the proportion of observations assigned to each leaf node in your random forest. You then compare that information against the model's behavior on real incoming data streams. As one of my colleagues put it: “It's like seeing one leaf node in a random forest that corresponds to 0.1% of the training data but is hit for 75% of the scoring rows in an hour.”
  • Anomaly detection: see section 2.
  • Authorized access: see section 3.
  • Benchmark models. When scoring new data, use a highly transparent benchmark model in addition to your more complex model. Interpretable models are harder to hack because their mechanisms are transparent. When scoring new data, compare the new model against a trusted transparent model, or against a model trained on trusted data with a trusted process. If the difference between the more complex, opaque model and the interpretable (or trusted) one is too large, fall back to the predictions of the conservative model or process the row of data manually. Record the incident; it could be an adversarial example attack.
  • Throttle predictions: see section 3.
  • White-hat sensitivity analysis. Use sensitivity analysis to conduct your own exploratory attacks and understand which variable values (or combinations of them) can cause large swings in predictions. Screen for these values or combinations of values when scoring new data. To conduct a white-hat exploratory analysis, you can use the open source package cleverhans.
  • White-hat surrogate models: see section 3.

Activation analysis and benchmark models can be used at training time and as part of real-time model monitoring.
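The activation analysis described in this section can be illustrated for tree models by comparing how often each leaf node is hit in training data and in recent scoring data. The sketch below assumes you already have the leaf id for every row (for scikit-learn tree models, for instance, the apply method returns such ids); the function names, the ratio threshold of 10, and the toy data that mirrors the 0.1% versus 75% quote are illustrative assumptions.

```python
from collections import Counter


def leaf_shares(leaf_ids):
    """Map each leaf node id to the share of rows that landed in it."""
    counts = Counter(leaf_ids)
    total = len(leaf_ids)
    return {leaf: count / total for leaf, count in counts.items()}


def suspicious_leaves(training_leaf_ids, scoring_leaf_ids, max_ratio=10.0, floor=0.001):
    """Return leaves whose scoring share exceeds max_ratio times their training share."""
    train = leaf_shares(training_leaf_ids)
    score = leaf_shares(scoring_leaf_ids)
    flagged = {}
    for leaf, share in score.items():
        baseline = max(train.get(leaf, 0.0), floor)  # floor avoids division by zero
        if share / baseline > max_ratio:
            flagged[leaf] = (train.get(leaf, 0.0), share)
    return flagged


# Toy data: leaf 7 holds 0.1% of training rows but 75% of the last hour's scoring rows.
training_leaf_ids = [7] * 1 + [1] * 599 + [2] * 400
scoring_leaf_ids = [7] * 75 + [1] * 15 + [2] * 10

flagged_leaves = suspicious_leaves(training_leaf_ids, scoring_leaf_ids)
print(flagged_leaves)
```

A flagged leaf is not proof of an attack, since legitimate shifts in the incoming population produce the same signal, but it tells you which rows to inspect first.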

5. Impersonation

A motivated attacker can learn, again through trial and error, surrogate model inversion, or social engineering, which input data or which specific individuals receive a desired prediction outcome. The attacker can then impersonate that input or individual to receive the desired prediction. Impersonation attacks are sometimes called “mimicry” attacks and, from the model's point of view, resemble identity theft. As in an adversarial example attack, an impersonation attack artificially changes the input data values sent to your model. But unlike an adversarial example attack, in which a potentially random-looking combination of values can be used to trick your model, impersonation uses the information associated with another modeled object (e.g., convict, customer, employee, financial transaction, patient, product, etc.) to receive the prediction associated with that type of object. Suppose an attacker learns which characteristics your model relies on to grant large discounts or benefits. They can then falsify the information you use in order to receive such a discount. The attacker can also share the strategy with others, which can lead to large losses for your company.

If you are using a two-stage model, beware of an “allergy” attack: an attacker can impersonate a normal row of input data for the first stage of your model in order to attack its second stage.

Defensive and forensic approaches to impersonation attacks may include:

  • Activation analysis: see section 4.
  • Authorized access: see section 3.
  • Duplicate checks. At scoring time, track the number of similar records your model is exposed to. This can be done in a reduced-dimensional space using autoencoders, multidimensional scaling (MDS), or similar dimension reduction techniques. If too many similar rows are encountered in a given time period, take corrective action.
  • Threat notification features. Keep a feature in your pipeline, say num_similar_queries, that may be useless when your model is first trained or deployed but can be populated at scoring time (or during future retraining) to alert the model or pipeline to threats. For example, if num_similar_queries is greater than zero at scoring time, the scoring request could be sent for manual review. Later, when you retrain the model, you could teach it to give negative prediction outcomes to input rows with high num_similar_queries values.

Activation analysis, duplicate checks, and threat notification features can be used at training time and as part of real-time model monitoring.
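One way the num_similar_queries feature could be populated at scoring time is sketched below. It is deliberately simplified: instead of the reduced-dimensional space mentioned above, it compares raw numeric features directly, and the class name, the one-hour window, and the tolerance of 0.05 are assumptions made for illustration. Features are assumed to be numeric and already scaled to comparable ranges.

```python
from collections import deque


class SimilarQueryCounter:
    """Count recent scoring rows that are close to the current row."""

    def __init__(self, window_seconds, tolerance):
        self.window_seconds = window_seconds
        self.tolerance = tolerance
        self._recent = deque()  # (timestamp, feature tuple), oldest first

    def num_similar_queries(self, features, now):
        """Return how many rows seen within the window are similar, then record this row."""
        # Forget rows that are older than the window.
        while self._recent and now - self._recent[0][0] > self.window_seconds:
            self._recent.popleft()
        similar = sum(
            1
            for _, seen in self._recent
            if all(abs(a - b) <= self.tolerance for a, b in zip(features, seen))
        )
        self._recent.append((now, tuple(features)))
        return similar


counter = SimilarQueryCounter(window_seconds=3600, tolerance=0.05)
rows = [(0.20, 0.90), (0.21, 0.91), (0.80, 0.10), (0.19, 0.92)]
counts = [counter.num_similar_queries(row, now=i) for i, row in enumerate(rows)]
# A count above zero could route the request to manual review, as the text suggests.
needs_review = [count > 0 for count in counts]
print(counts, needs_review)
```

The linear scan over recent rows is fine for a sketch but would not scale to heavy traffic, where an approximate nearest-neighbor index or hashing of the reduced representation would be a more realistic choice.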

6. Common problems

Some common machine learning usage patterns also raise more general security concerns.

Black boxes and unnecessary complexity. Although recent advances in interpretable models and model explanations make it possible to use accurate and transparent nonlinear classifiers and regressors, many machine learning workflows remain centered on black-box models. Such black-box models are just one type of often unnecessary complexity in a typical commercial machine learning workflow. Other examples of potentially harmful complexity include overly exotic feature specifications or a large number of package dependencies. This can be a problem for at least two reasons:

  1. A persistent and motivated attacker can, over time, learn more about your overly complex black-box modeling system than you or your team know about it (especially in today's overheated, high-turnover data “science” market). To do so, the attacker can use many newer model-agnostic explanation techniques and classic sensitivity analysis, along with many other more common hacking tools. This knowledge imbalance can potentially be exploited to carry out the attacks described in sections 1-5, or other types of attacks that are still unknown.
  2. Machine learning in research and development environments depends heavily on a diverse ecosystem of open source software packages. Some of these packages have many contributors and users; others are highly specialized and matter only to a small circle of researchers and practitioners. It is well understood that many packages are maintained by brilliant statisticians and machine learning researchers whose primary focus is mathematics or algorithms, not software engineering and certainly not security. It is not uncommon for a machine learning pipeline to depend on dozens or even hundreds of external packages, any one of which could be hacked to conceal a malicious “payload”.

Distributed systems and models. For better or worse, we live in the age of big data. Many organizations now use distributed data processing and machine learning systems. Distributed computing can present a broad attack surface for attacks from inside or outside. Data could be corrupted on only one or a few worker nodes of a large distributed data storage or processing system. A back door for watermarks could be coded into just one model of a large ensemble. Instead of debugging one simple dataset or model, practitioners must now examine data or models spread across large computing clusters.

Distributed denial-of-service (DDoS) attacks. If a predictive modeling service is key to your organization's operations, make sure you have at least considered the more conventional DDoS attacks, in which attackers hit a prediction service with an incredibly high volume of requests in order to delay or stop predictions for legitimate users.

7. General solutions

Several older and newer general best practices can be used to reduce security vulnerabilities and to increase fairness, accountability, transparency, and trust in machine learning systems.

Authorized access and prediction throttling. Standard safeguards, such as additional authentication and prediction throttling, can be very effective at blocking a number of the attack vectors described in sections 1-5.

Benchmark models. An older, trusted modeling pipeline or another highly transparent, interpretable predictor can serve as a benchmark model against which to measure whether a prediction has been manipulated. Manipulation includes data corruption, watermark attacks, and adversarial example attacks. If the difference between your trusted model's prediction and your more complex, opaque model's prediction is too large, record these cases. Refer them to human analysts or take other appropriate forensic or remediation steps. Serious precautions must be taken to ensure that your benchmark model and pipeline remain secure and unchanged from their original, trusted state.
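The comparison against a trusted model described above can be sketched in a few lines. The function name compare_to_benchmark, the allowed gap of 0.2, and the toy scores are assumptions for illustration; a suitable threshold depends on how closely your two models normally agree.

```python
def compare_to_benchmark(complex_score, benchmark_score, max_gap=0.2):
    """Return (score to use, incident flag) for one scored row."""
    gap = abs(complex_score - benchmark_score)
    if gap > max_gap:
        # Too far apart: fall back to the trusted benchmark and flag for review.
        return benchmark_score, True
    return complex_score, False


incident_log = []
scored_rows = [
    {"row_id": 1, "complex": 0.62, "benchmark": 0.58},
    {"row_id": 2, "complex": 0.97, "benchmark": 0.31},
]
final_scores = {}
for row in scored_rows:
    score, is_incident = compare_to_benchmark(row["complex"], row["benchmark"])
    final_scores[row["row_id"]] = score
    if is_incident:
        incident_log.append(row["row_id"])  # to be handed to human analysts

print(final_scores, incident_log)
```

Falling back to the benchmark score is one possible policy; sending the row for manual processing, as the text also suggests, would replace the first return statement with a call into your review queue.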

Interpretable, fair, or private models. Techniques now exist (e.g., monotonic GBMs (M-GBM), scalable Bayesian rule lists (SBRL), explainable neural networks (XNN)) that provide both accuracy and interpretability. These accurate and interpretable models are easier to document and debug than classic machine learning black boxes. Newer types of fair and private models (e.g., LFR, PATE) can also be trained to downplay outwardly visible demographic characteristics that can be observed, socially engineered into an adversarial example attack, or impersonated. Are you considering creating a new machine learning workflow in the future? Consider basing it on lower-risk interpretable, private, or fair models. They are easier to debug and potentially robust to changes in the characteristics of individual entities.

Debugging a model for security. The newer field of model debugging focuses on discovering errors in the mechanisms and predictions of machine learning models and fixing them. Debugging tools such as surrogate models, residual analysis, and sensitivity analysis can be used in white-hat exercises to identify your own vulnerabilities, or in forensic exercises to find any potential attacks that may have occurred or may be occurring.

Model documentation and explanation techniques. Model documentation is a risk-mitigation strategy that has been used in banking for decades. It allows knowledge about complex modeling systems to be preserved and transferred as teams of model owners change over time. Documentation has traditionally been applied to highly transparent linear models. But with the advent of powerful, accurate explanation tools (such as tree SHAP and derivative-based local feature attributions for neural networks), pre-existing black-box model workflows can be at least somewhat explained, debugged, and documented. Documentation should obviously now include all security goals, including any known, fixed, or anticipated security vulnerabilities.

Monitor and manage models explicitly for security. Serious practitioners understand that most models are trained on static “snapshots” of reality represented by datasets, and that their prediction accuracy degrades in real time as the present drifts away from the information collected earlier. Today, most model monitoring is aimed at detecting this drift in the distributions of input variables, which eventually leads to a decline in accuracy. Model monitoring should also be designed to watch for the attacks described in sections 1-5 and any other potential threats that come to light when debugging your model. Although not always directly related to security, models should also be evaluated for disparate impact in real time. Along with model documentation, all modeling artifacts,

Threat notification features. Features, rules, and pre- or post-processing steps can be included in your models or pipelines to make them aware of possible threats: for example, the number of similar rows seen by the model; whether the current row represents an employee, contractor, or consultant; or whether the values in the current row resemble those found in white-hat adversarial example attacks. These features may or may not be needed when the model is first trained. But reserving room for them could one day be very useful when scoring new data or when retraining the model.

System anomaly detection. Train an autoencoder-based anomaly detection metamodel on the operating statistics of your entire predictive modeling system (the number of predictions in a given time period, latency, CPU, memory, and disk load, the number of concurrent users, etc.), and then closely monitor this metamodel for anomalies. An anomaly could tip you off that something is going wrong. Follow-up investigation or specific mechanisms would be needed to trace down the exact cause of the problem.
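The sketch below illustrates the monitoring idea with a much simpler stand-in than the autoencoder the text recommends: a per-metric z-score against a baseline of normal operating statistics. It is not an autoencoder and cannot capture interactions between metrics; the metric names, the toy history, and the threshold of 3 standard deviations are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def fit_baseline(history):
    """Compute per-metric mean and standard deviation from normal operating data."""
    metrics = history[0].keys()
    return {
        name: (mean(row[name] for row in history), stdev(row[name] for row in history))
        for name in metrics
    }


def anomalous_metrics(baseline, observation, max_z=3.0):
    """Return metrics whose z-score against the baseline exceeds max_z."""
    flagged = {}
    for name, (mu, sigma) in baseline.items():
        # A metric that never varied in the baseline (sigma == 0) is never flagged here.
        z = abs(observation[name] - mu) / sigma if sigma > 0 else 0.0
        if z > max_z:
            flagged[name] = round(z, 1)
    return flagged


history = [
    {"predictions_per_min": 100, "latency_ms": 20},
    {"predictions_per_min": 110, "latency_ms": 22},
    {"predictions_per_min": 90, "latency_ms": 18},
    {"predictions_per_min": 105, "latency_ms": 21},
    {"predictions_per_min": 95, "latency_ms": 19},
]
baseline = fit_baseline(history)
burst = {"predictions_per_min": 900, "latency_ms": 21}
print(anomalous_metrics(baseline, burst))
```

A burst of predictions like the one flagged here could indicate an attempt to harvest predictions for a surrogate model or a denial-of-service attempt, but as the text notes, follow-up investigation is needed to find the actual cause.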

8. References and information for further reading

A large amount of the modern academic literature on machine learning security focuses on adaptive learning, deep learning, and encryption. However, the author does not yet know of practitioners who are actually doing these things. Therefore, in addition to recently published articles and posts, we list papers from the 1990s and early 2000s on network intrusion, virus detection, spam filtering, and related topics, which were also useful resources. If you want to learn more about the fascinating subject of securing machine learning models, here are the main references, past and present, that were used to write this post.


Those who care about the science and practice of machine learning are concerned that the threat of machine learning hacks, coupled with growing threats of privacy violations and algorithmic discrimination, could increase the growing public and political skepticism about machine learning and artificial intelligence. We should all remember the difficult times for AI in the not-so-distant past. Security vulnerabilities, privacy violations, and algorithmic discrimination could potentially combine to lead to decreased funding for machine learning research, or to draconian over-regulation of the field. Let us continue discussing and addressing these important problems in order to prevent a crisis, rather than having to react to its aftermath.
