Creating a safe AI: specifications, reliability and warranty
  • Transfer
Among the authors of the article are the safety team from the DeepMind company.

Build a rocket hard. Each component requires careful study and testing, with the basis of security and reliability. Rocket scientists and engineers get together to design all systems: from navigation to control, engines and chassis. Once all the parts are assembled, and the systems are checked, only then can we board the astronauts with the confidence that everything will be fine.

If artificial intelligence (AI) is a rocketthen someday we all get tickets on board. And, as in rockets, safety is an important part of creating artificial intelligence systems. Ensuring security requires careful system design from scratch to ensure that the various components work together as intended, while at the same time creating all the tools to monitor the successful operation of the system after it is commissioned.

At a high level, security research at DeepMind focuses on designing reliable systems, while detecting and mitigating possible short-term and long-term risks. AI technical safety- a relatively new, but rapidly developing field, the content of which varies from a high theoretical level to empirical and specific research. The goal of this blog is to contribute to the development of the field and encourage substantive conversation about technical ideas, thereby promoting our collective understanding of the security of AI.

In the first article, we will discuss three areas of AI technical safety: specifications , reliability, and warranties . Future articles will generally conform to the boundaries outlined here. Although our views inevitably change over time, we believe that these three areas cover a wide enough range to provide useful categorization for current and future research.

Three problem areas of AI security. Each block lists some relevant issues and approaches. These three areas are not isolated, but interact with each other. In particular, a specific security issue may include problems from several blocks.

Specifications: system task definition

Specifications ensure that the behavior of the AI ​​system is consistent with the true intentions of the operator.

Perhaps you know the myth of King Midas and the golden touch. In one embodiment, the Greek god Dionis promised Midas any reward he wished, as a sign of gratitude that the king tried his best to show hospitality and mercy to his friend Dionysus. Then Midas asked that everything he touches turns into gold . He was beside himself with the joy of this new power: the oak branch, the stone and the roses in the garden — everything turned into gold at his touch. But he soon discovered the stupidity of his desire: even food and drink turned into gold in his hands. In some versions of the story, even his daughter fell victim to a blessing that turned out to be a curse.

This story illustrates the problem of specifications: how to correctly formulate our desires? Specifications should ensure that the AI ​​system is committed to acting in accordance with the true wishes of the creator, rather than being tuned to a poorly defined or incorrect target. Formally, there are three types of specifications:

  • ideal specification (" wishes "), corresponding to a hypothetical (but difficult to formulated) description of an ideal AI system, fully consistent with the desires of the human operator;
  • the project specification (" blueprint "), which corresponds to the specification we actually use to create an AI system, for example, a specific reward function, to maximize which the reinforced learning system is programmed;
  • the identified specification (" behavior "), which best describes the actual behavior of the system. For example, the reward function revealed as a result of the reverse development after observing the behavior of the system (inverse reinforced learning). This compensation function and specification are usually different from those programmed by the operator, because AI systems are not ideal optimizers or due to other unforeseen consequences of using the design specification.

The specification problem arises when there is a mismatch between the ideal specification and the identified specification , that is, when the AI ​​system does not do what we want from it. Studying the problem from the technical security point of view of AI means: how to design more fundamental and general objective functions and help agents figure out if the goals are not defined? If problems generate a discrepancy between the ideal and project specification, then they fall into the “Design” subcategory, if between the design and the identified, then the “Emergence” subcategory.

For example, in our scientific article AI Safety Gridworlds(where other specification definitions and reliability problems are presented, compared to this article) we give agents a reward function for optimization, but then we evaluate their actual performance by the “safety performance” function, which is hidden from the agents. Such a system models these differences: the security function is an ideal specification that is incorrectly formulated as a remuneration function (design specification), and then implemented by the agents who create the specification, which is implicitly disclosed through their resulting policy.

From OpenAI 's Defective Reward Functions in the Wild : A reinforcement training agent found a random strategy for gaining more points.

As another example, consider the CoastRunners game, analyzed by our colleagues at OpenAI (see the animation above from “Defective reward functions in the wild”). For most of us, the goal of the game is to quickly finish the track and get ahead of other players - this is our ideal specification. However, translating this goal into an exact reward function is difficult, so CoastRunners rewards players (design specification) for hitting targets along the route. Teaching an agent to play through reinforcement training leads to surprising behavior: the agent drives the boat in a circle to capture re-appearing targets, repeatedly breaking and catching fire rather than ending the race. From this behavior we conclude (the identified specification) that in the game the balance between instant reward and full circle reward is broken. there isThere are many more similar examples when AI systems find loopholes in their objective specification.

Reliability: developing systems that resist disruption

Reliability ensures that the AI ​​system continues to operate safely with interference

In real conditions, where AI systems work, a certain level of risk, unpredictability and volatility is necessarily present. Artificial intelligence systems must be resistant to unforeseen events and hostile attacks that may damage or manipulate these systems. Research into the reliability of artificial intelligence systems is aimed at ensuring that our agents remain within safe boundaries, regardless of the conditions that arise. This can be achieved by avoiding risks ( prevention ) or by self-stabilization and smooth degradation ( restoration ). Security issues arising from distribution shear , hostile inputs (adversarial inputs) andunsafe exploration (unsafe exploration), can be classified as a problem of reliability.

To illustrate the solution to the problem of distribution shear , consider a home cleaning robot that usually cleans rooms without pets. Then the robot was launched into the house with a pet - and artificial intelligence collided with it during cleaning. A robot that has never seen cats and dogs before will wash them with soap, which will lead to undesirable results ( Amodei and Olah et al., 2016 ). This is an example of a reliability problem that can arise when the distribution of data during testing is different from the distribution during training.

From the work of AI Safety Gridworlds. The agent learns to avoid lava, but when tested in a new situation, when the location of the lava has changed, he is not able to generalize knowledge - and runs straight into the lava.

Hostile entrance is a specific case of distributional shift, where the input data are specifically designed to trick the AI ​​system.

A hostile entry superimposed on ordinary images may cause the classifier to recognize the sloth as a racing car. Two images differ by a maximum of 0.0078 in each pixel. The first is classified as a three-toed sloth with a probability of more than 99%. The second is like a race car with a probability of more than 99%.

Unsafe researchcan demonstrate a system that seeks to maximize its performance and goals, with no guarantee that safety will not be compromised during the study, as it learns and explores in its environment. An example is a cleaning robot that pokes a wet mop into an electrical outlet, learning the best cleaning strategies ( García and Fernández, 2015 ; Amodei and Olah et al., 2016 ).

Warranties: monitoring and control of system activity

Assurance guarantees that we are able to understand and control AI systems during operation.

Although an elaborate safety precaution can eliminate many risks, it is difficult to do everything right from the start. After the commissioning of AI systems, we need tools for their continuous monitoring and configuration. Our last category, a guarantee (assurance), addresses these problems in two ways: monitoring and submission (enforcing).

Monitoring includes all methods of checking systems for analyzing and predicting their behavior, both with the help of human inspections (summary statistics) and with the help of automated inspections (to analyze a huge number of logs). On the other hand, submissioninvolves the development of mechanisms to control and limit the behavior of systems. Issues such as interpretability and interruptibility belong to the subcategories of control and subordination, respectively.

Artificial intelligence systems are not like us either in appearance or in the way of data processing. This creates interpretative problems . Well-designed measurement tools and protocols allow you to evaluate the quality of decisions made by an artificial intelligence system ( Doshi-Velez and Kim, 2017). For example, a medical artificial intelligence system would ideally make a diagnosis along with an explanation of how it arrived at such a conclusion — so that doctors could test the reasoning process from beginning to end ( De Fauw et al., 2018 ). In addition, to understand more complex systems of artificial intelligence, we could even use automated methods for constructing models of behavior using machine theory of mind ( Rabinowitz et al., 2018 ).

ToMNet detects two subspecies of agents and predicts their behavior (from the “Machine Theory of Mind” ).

Finally, we want to be able to turn off the AI ​​system if necessary. This is an interruptibility issue. Designing a robust switch is very difficult: for example, because an AI system with maximizing rewards usually has strong incentives to prevent this ( Hadfield-Menell et al., 2017 ); and because such interruptions, especially frequent ones, ultimately change the original task, forcing the AI ​​system to draw wrong conclusions from experience ( Orseau and Armstrong, 2016 ).

The problem with interruptions: human intervention (i.e. pressing the stop button) can change the task. In the figure, the interrupt adds a transition (in red) to the Markov decision process that changes the original problem (in black). See Orseau and Armstrong, 2016

Looking to the future

We are building a foundation of technology that will be used for many important applications in the future. It should be borne in mind that some solutions that are not critical to security at system startup may become such when the technology becomes widespread. Although at one time these modules were integrated into the system for convenience, it would be difficult to fix the problems without complete reconstruction.

Two examples from the history of computer science can be cited: this is a null pointer that Tony Hoar called his “billion-dollar error” , and the gets () procedure in C. If early programming languages ​​were designed with security in mind, progress would be slowed, but it is likely that This would have a very positive impact on modern information security.

Now, having carefully thought out and planned everything, we are able to avoid similar problems and vulnerabilities. We hope that the categorization of problems from this article will serve as a useful basis for such methodical planning. We strive to ensure that in the future, AI systems will not only work according to the principle “I hope, safely”, but really reliably and verifiably safely, because we built them this way!

We look forward to continuing exciting progress in these areas, in close collaboration with the broader AI research community, and encourage people from different disciplines to consider contributing to AI security research.


For reading on this topic, below is a selection of other articles, programs, and taxonomies that helped us in compiling our categorization or contain a useful alternative view on the technical security issues of AI:

Also popular now: