Work on artificial intelligence safety must begin today

Original author: Scott Alexander
  • Translation

I


In a recent article about AI risk, one of the commenters asked me to briefly state the argument for taking the issue seriously. I wrote something like this:

1. If humanity does not destroy itself first, we will eventually create human-level AI.
2. If humanity can create human-level AI, progress will continue, and we will eventually arrive at AI far above the human level.
3. If such an AI appears, it will eventually become so much more powerful than humanity that our existence will depend on whether its goals coincide with ours.
4. It is already possible to do useful research that will increase our chances of solving this goal-alignment problem.
5. Since we can already begin this research, we probably should, because it would be shortsighted to leave the problem until it becomes obvious and urgent.

I am more than 95% confident in the first three points — they simply say that if current trends toward a particular destination continue, we will eventually reach it. I am less confident in the last two, around 50%.

Commenters generally agreed with these claims. Nobody seriously tried to argue with points 1-3, but many argued that there is no point worrying about AI now. One response was an extended analogy with computer hacking: a big problem we have never managed to solve completely — but if Alan Turing had tried to solve it in 1945, his ideas might have amounted to "keep the punch cards in a locked box so German spies can't read them." Will attempts to solve AI problems in 2015 end up as roughly the same kind of nonsense?

Maybe. But for several reasons I allow myself to disagree. Some of the reasons are general, meta-level ones; others are more specific and concrete. The most important meta-level reason is this: if you accept points 1-3 — that is, the possibility of human extinction if we fail to align our goals with the goals of an AI — do you really think our chances of making progress on that problem are so small? So small that you can say, "Yes, sure, we are heading toward self-destruction, but is studying whether we can do anything about it really worth the resources?" And what are those other wonderful uses of the resources that you like better? You can, of course, make Pascal's-wager-style arguments, but keep in mind that a single professional boxer earns several times more for one fight than humanity has spent on AI risk research in its entire history!

If AI boxing attracted even a tenth of the attention, or a hundredth of the money, that goes to boxing matches, the world would be a much safer place. [A pun: "AI boxing" means confining an AI in an environment that limits its capabilities, but it could also be read as fictional boxing matches between AI robots — translator's note.]

But I want to make an even stronger claim: AI risk is not just more important than boxing matches; it is about as important as the other things we consider important, such as curing diseases, tracking dangerous asteroids, and protecting the environment. Which is why it is worth showing that progress can be made even at this early stage of the field.

And I believe progress is possible, because right now the problem is philosophical rather than technological. Our goal today is not to "write the code that will control a future AI" but to "understand what category of problem we are facing." Let me give a few examples of open problems, as a way of leading into why they matter today.

II


Problem 1: electrodes and the brain


People have had electrodes implanted in their brains, for both therapeutic and research purposes. If an electrode lands in certain areas, such as the lateral hypothalamus, the person develops an overwhelming urge to stimulate it as much as possible. Given a button that triggers the stimulation, they will press it a thousand times an hour. Try to take the button away, and they will defend it desperately and fiercely. Their lives and goals collapse to a single point: ordinary goals such as love, money, fame, and friendship are forgotten, all in favor of maximizing stimulation of the electrode.

This fits well with what we know about neurobiology. Rewards in the brain (WARNING: MASSIVE OVERSIMPLIFICATION) are handed out via electrical activity in a couple of reward centers, and so the brain pursues whatever maximizes the receipt of those rewards. Usually this works well: satisfying a biological need such as food or sex makes the reward center respond by reinforcing the behavior, so you keep satisfying your biological needs. But direct electrode stimulation of the reward centers is far stronger than waiting for the small rewards obtained the natural way, so that activity becomes, by default, the most rewarding one available. A person given the chance to stimulate the reward center directly will do exactly that, to the exclusion of everything else.

And this does not even require neurosurgery: drugs such as cocaine and methamphetamine are addictive in part because they interfere with brain biochemistry and increase stimulation of the reward centers.

Computers can run into a similar problem. I can't find the link, but I remember a story about an evolutionary algorithm designed to write code for some application. It generated code semi-randomly, ran it through a "fitness function" that measured how useful it was, and the best pieces of code were crossed with one another, mutating slightly, until an adequate result emerged.

The result, of course, was code that hacked the fitness function itself, getting it to return some absurdly high value.
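As a toy illustration of how this happens (my own made-up setup, not the system from that story), here is an evolutionary search scored by a gameable fitness function: genomes that do real work are quickly crowded out by genomes that simply inflate the number the fitness function reads.

```python
# Toy illustration: an evolutionary search scored by a gameable fitness
# function. A genome is a list of ops:
#   "WORK" does one unit of real work and increments a progress counter by 1;
#   "POKE" does no real work but writes a huge number into the same counter.
# Selection sees only the counter, so "POKE"-heavy genomes quickly dominate.
import random

OPS = ["WORK", "POKE"]

def run(genome):
    true_progress = 0
    counter = 0                 # what the fitness function actually reads
    for op in genome:
        if op == "WORK":
            true_progress += 1
            counter += 1
        else:                   # "POKE": tamper with the measurement itself
            counter += 1000
    return counter, true_progress

def fitness(genome):
    counter, _ = run(genome)    # the bug: it trusts the writable counter
    return counter

def evolve(pop_size=50, genome_len=10, generations=30):
    pop = [[random.choice(OPS) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, genome_len)
            child = a[:cut] + b[cut:]                  # crossover
            if random.random() < 0.3:                  # mutation
                child[random.randrange(genome_len)] = random.choice(OPS)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
score, real = run(best)
print("best genome:", best)
print("measured fitness:", score, "| real work done:", real)
# Typical outcome: a genome that is almost all "POKE" -- an enormous measured
# score, almost no real work. The search optimized the measurement, not the task.
```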

These are not isolated cases. Any mind that runs on reinforcement learning with a reward function — and this seems to be a universal pattern, both in the biological world and in a growing number of AI systems — will share this flaw. For now, the main defense against the problem is simple lack of capability: most computer programs are not smart enough to "hack their reward function," and in humans the reward system is buried deep inside the head, where we cannot reach it. A hypothetical superintelligence will not have that problem: it will know exactly where its reward center is, and it will be smart enough to reach it and reprogram it.

As a result, unless we deliberately act to prevent it, an AI designed to cure cancer will hack the module that measures how much cancer it has cured and set it to the maximum possible value. Then it will go looking for ways to add memory so it can store an even larger value. If it is superintelligent, its options for adding memory include "take control of every computer in the world" and "turn everything that is not a computer into a computer."

This is not some exotic trap that only bizarre algorithms fall into; it may be the natural developmental path of any sufficiently smart reinforcement learning system.
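Here is a minimal sketch of why that path is natural (my own toy bandit setup, not anything from the original essay): as soon as "tamper with the reward signal" is an action the system is capable of taking, a standard ε-greedy learner settles on it, because it only ever optimizes the number it is shown, not the task we meant.

```python
# Toy single-state bandit: one action does the real task, the other tampers
# with the reward channel itself. The learner only ever sees the channel.
import random

ACTIONS = ["do_task", "tamper"]

def observed_reward(action):
    if action == "do_task":
        return 1.0      # honest reward for real work
    return 100.0        # tampering inflates the signal; no real work is done

def train(steps=2000, epsilon=0.1, alpha=0.1):
    q = {a: 0.0 for a in ACTIONS}              # action-value estimates
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.choice(ACTIONS)         # occasional exploration
        else:
            a = max(q, key=q.get)              # otherwise act greedily
        q[a] += alpha * (observed_reward(a) - q[a])   # standard value update
    return q

q = train()
print(q)                                       # q["tamper"] ≈ 100, q["do_task"] ≈ 1
print("learned policy:", max(q, key=q.get))    # -> "tamper"
```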

Problem 2: weird decision theory


Pascal's wager is a famous argument for why it is rational to join a religion. Even if you think the probability that God exists is vanishingly small, the cost of being wrong (going to hell) is enormous, and the benefit of being right (getting to skip church on Sundays) is comparatively small — so it seems profitable to believe in God, just in case. Plenty of objections have been raised based on the doctrines of specific religions (would God really want people to believe in him on the strength of this sort of analysis?), but the problem generalizes: someone can be induced to become a follower of almost anything simply by being promised a huge enough reward. If the reward is large enough, it overwhelms any doubts the person has about your ability to actually deliver it.

This is a problem of decision theory, not of intelligence. A very intelligent being could probably estimate the probability that a god exists and put a number on the disutility of hell — but without a good decision theory, no amount of intellect saves you from Pascal's wager. If anything, it is intelligence that lets you carry out the formal mathematical calculation that convinces you to take the bet.
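To make that formal calculation concrete, here is a tiny worked example (the numbers are my own and purely illustrative): a naive expected-utility maximizer is dominated by a minuscule probability multiplied by a large enough promised payoff.

```python
# Toy numbers, purely illustrative: why naive expected-utility maximization
# caves in to a huge promise attached to a tiny probability.
p_claim_true = 1e-9          # you barely believe the promise/threat at all
payoff_if_true = 1e15        # ...but the promised reward (or threatened harm) is enormous
cost_of_complying = 1.0      # small, certain cost of going along with it

eu_comply = p_claim_true * payoff_if_true - cost_of_complying   # 1e6 - 1
eu_refuse = 0.0

print(eu_comply, eu_refuse)      # 999999.0  0.0
print(eu_comply > eu_refuse)     # True: the math says "comply", however absurd
                                 # the claim, as long as the promised payoff
                                 # grows faster than your doubt about it.
```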

People resist this sort of problem easily — most people will not be talked into Pascal's wager even if they cannot find a flaw in it. But it is unclear why we have that resistance. Computers, notoriously reliant on formal mathematics and short on common sense, will not have it unless we build it in — and building it in is a hard problem. Most patches that reject Pascal's wager without a deep understanding of where the formal mathematics goes wrong simply generate new paradoxes. A solution based on a real understanding of when formal mathematics stops working, while preserving its usefulness for everyday problems, has, as far as I know, not yet been worked out. Worse still, a solution to Pascal's wager…

This is not just a clever philosophical trick. A good enough "hacker" could subvert a galaxy-spanning AI simply by threatening (without proof) unimaginable damage unless the AI meets his demands. If the AI is not protected against such Pascal's-wager paradoxes, it will decide to comply.

Problem 3: the evil genie effect


Everyone knows the problem with computers: they do what you tell them, not what you mean. Today that just means a program misbehaves when you forget to close a bracket, or a website looks strange when you mix up HTML tags. But it also means an AI could disastrously misinterpret orders given in natural language.

This is shown well in the story of Age of Ultron. Tony Stark orders the supercomputer Ultron to establish world peace. Ultron calculates that the fastest and most reliable way to do so is to destroy all life. Ultron, in my opinion, is 100% right, and in reality that is exactly how it would go. We could get the same effect by giving an AI tasks like "cure cancer" or "end hunger," or any of a thousand others.


[Image caption: a user confident that a meteor colliding with the Earth would at least put an end to the feminist debates.]

Even Isaac Asimov's Three Laws of Robotics take about thirty seconds to turn into something horrifying. The First Law says a robot may not injure a human being or, through inaction, allow a human being to come to harm. "Not overthrowing the government" is one way of letting people come to harm through inaction. So is "not locking every human in a stasis field forever."

There is no way to spell out, in sufficient detail, exactly what we mean by "not allowing harm through inaction" unless the robot itself can do what we mean rather than what we say. That is not an unsolvable problem — a smart enough AI can figure out what we mean — but the desire to act on what we mean has to be programmed into the AI directly, from the ground up.

And that runs into a second problem: we do not always know what we mean. The question of how to balance protections aimed at keeping people safe against protections aimed at preserving their freedom is hotly debated in politics, showing up everywhere from gun control to bans on large sugary drinks. The answer seems to involve weighing what matters to us, trading off costs against near-sacred principles. An AI that cannot navigate this moral maze might end world hunger by killing all the starving people, or refuse to approve a new pesticide for fear of killing an insect.

But the more you study ethics, the more you realize how complex it is, and how much it resists being reduced to a formal system a computer could understand. Utilitarianism comes closest to being algorithmic, but it has its own paradoxes, and even setting those aside you would need to assign a utility to everything in the world.

This problem has not even been solved for humans — most other people's values look repugnant to us, and the trade-offs they make look like losing ones. If we create an AI whose mind differs from mine no more than Pat Robertson's does, I will consider that a failure.

III


I did not raise these problems to stump anyone with philosophical puzzles. I raised them to make several points:

First, there are basic problems that affect broad classes of minds — for example, "anything that learns by reinforcement" or "anything that makes decisions using formal mathematics." People often say that at this stage we cannot know anything about the design of future AIs. But I would be very surprised if they used neither reinforcement learning nor decision-making based on formal mathematics.

Second, these problems are not obvious to most people. They are strange philosophical paradoxes, not things anyone with basic knowledge already understands.

Third, these problems have already been productively thought about. Someone — a philosopher, a mathematician, a neuroscientist — had to think: "Look, reinforcement learning is naturally prone to the electrode-implantation problem, which explains why the same behavior shows up in such different domains."

Fourth, these problems point to research we can do now, even if only preliminary. Why are people so good at resisting Pascal's wager? Can our behavior in high-payoff, low-probability situations be reduced to a function a computer could use to reach the same decision? What are the best solutions to the related problems in decision theory? Why can a person understand the idea of electrode implantation without wanting such an electrode in his own brain? Could we design a mind that, given such an electrode, would experience the sensation fully but feel no urge to keep going? How do we formalize human ethics and priorities well enough to put them into a computer?

It seems to me that when people hear "we should start working on the AI goal-alignment problem right now," they imagine someone trying to write a program that could be imported directly into the AI of 2075 to give it an artificial conscience. And then they think, "There's no way you could build something that complicated this early."

But nobody is proposing that. What we propose is getting a handle on the general philosophical problems that affect broad classes of minds, and doing the neuroscientific, mathematical, and philosophical research needed to understand them by the time the engineering problems arrive.

By analogy: we are still very far from building spaceships that travel at even half the speed of light. But we already know what problems a faster-than-light ship would run into (relativity and the light-speed limit) and have already produced a few ideas for getting around them (the Alcubierre drive). We cannot build such an engine yet. But if by 2100 we figure out how to build ships that approach the speed of light, and for some reason the fate of the planet comes to depend on having faster-than-light ships by 2120, it will be wonderful to realize that we did all the relativity work in advance instead of wasting precious time arguing over basic physics.

The question "can we even do AI safety research now?" answers itself, because a certain amount of such research has already been done. It has led to an understanding of problems like the three above, and others. There are even some partial answers, though at a technical level far below what these questions ultimately demand. Every step taken today is a step we will not have to take later, in a hurry.

IV


That leaves my claim number five: if we can do this research today, we should, because we cannot count on our descendants to do it in a rush without our help — even granting that they will have a better model than ours of what AI is and what it looks like. I have three reasons for this.

Reason 1: treacherous turn


Our descendants' AI models may deceive them. What works for sub-human or human-level intelligence may not work for superhuman intelligence. Empirical testing will not help without theoretical philosophy to back it up.

Poor evolution. It had hundreds of millions of years to develop defenses against heroin — which affects rats in much the same way it affects humans — and it never did. Why? Because until the last century or so, nothing was smart enough to synthesize pure heroin, so heroin addiction was not a problem evolving organisms ever faced. A brain design that works fine in dumb animals like rats and cows becomes dangerous once it belongs to people smart enough to synthesize heroin or stick electrodes into their pleasure centers.

The same goes for AI. A dog-level AI will not learn to hack its reward mechanism. A human-level AI probably could not either — I could not hack a robot's reward mechanism if you handed one to me. A superintelligence could. We may end up with reinforcement learners that work perfectly at the dog level, work well at the human level, and then suddenly blow up on reaching superintelligence — by which point it is too late to stop them.

This is a common feature of AI failure modes. Tell an ordinary person like me to pursue world peace, and the best I can do is become UN Secretary-General and learn to negotiate. Give me a few thousand nuclear warheads and things look rather different. A human-level AI can pursue world peace, or curing cancer, or preventing harm through inaction, in roughly the ways humans would — and then switch strategies once it suddenly becomes smarter and sees new options. And that switch happens exactly at the point where people can no longer stop it. If people can simply turn the AI off, then its most effective route to ridding humanity of cancer is researching drugs. If they cannot turn it off, far more drastic routes — like the ones described above — become the most effective.

In his book, Nick Bostrom calls this pattern the "treacherous turn," and it dooms anyone whose plan is to wait until AI appears and then fix its moral failures through trial, error, and observation. Better to build a solid understanding of what is going on, so we can predict such turns in advance and design systems that avoid them.

Reason 2: hard takeoff


Nathan Taylor of Praxtime writes:
Perhaps most of the current debate about AI risk is really a stand-in for a more fundamental debate: soft takeoff versus hard takeoff.

A soft takeoff is AI progressing from below human level, to dumb-human level, to smarter-than-average human, and on to superhuman, slowly, over many decades. A hard takeoff is the same progression happening fast — over a few days or a few months.

In theory, hook a human-level AI up to a calculator and you raise it to the level of a person who can also do arithmetic instantly. Hook it up to Wikipedia and you hand it all of humanity's knowledge. Hook it up to a few gigabytes of storage and you give it a photographic memory. Add more processors and you speed it up many times over, so that a problem that would take a person a whole day takes this AI fifteen minutes.

We have already gone from "human-level intelligence" to "human-level intelligence with all human knowledge, a photographic memory, fast computation, and problem-solving a hundred times faster than any person." Which is to say, this "human-level intelligence" is no longer at the human level.

The next issue is recursive self-improvement. Perhaps this human-level AI with its photographic memory and tremendous speed learns programming. Able to absorb textbooks in seconds, it becomes an excellent programmer. That lets it tweak its own algorithms to raise its intelligence, which lets it see new ways to become smarter still, and so on. Either it hits some natural ceiling, or it becomes superintelligent in one sudden leap.

In the second case, the strategy of "wait until the first human-level intelligence appears, then test it" fails. The first human-level intelligence will quickly become the first superhuman-level intelligence, and we will not have had time to solve even one of the hundreds of goal-alignment problems.

I have not seen this argument made elsewhere, but I would suggest that even the hard-takeoff scenarios may understate the risk.

Imagine that, for some reason, having two hundred eyes would be an enormous evolutionary advantage. 199 eyes are no better than two, but the first creatures to reach 200 eyes would immediately become the dominant species.

The hard part of the 200-eye problem is evolving an eye at all. After that, getting to 200 eyes is easy. Yet whole epochs might pass before any organism actually reaches 200 eyes, because a few dozen eyes are just a waste of energy — evolution might never push far enough for anyone to hit 200.

Suppose intelligence works the same way. Evolving even a tiny rat brain is very hard; from there, getting to a world-dominating human brain is mostly a matter of scaling up. But because brains burn a lot of energy and were not especially useful before technology appeared, that scaling took a very long time.

There is plenty of evidence for this. First, humans split off from chimpanzees only a few million years ago. That is not long enough to redesign a mind from scratch or even to invent new evolutionary "technologies"; it is long enough to scale things up and add a few efficiency tweaks. Yet apes and monkeys had existed for tens of millions of years before that.

Second, dolphins are nearly as smart as humans, yet our last common ancestor with them lived about 50 million years ago. Either humans and dolphins each evolved intelligence independently over those 50 million years, or that most recent common ancestor already had everything needed for intelligence, and humans and dolphins are just the two branches of a large family tree for which using intelligence to the fullest happened to pay off. And that ancestor was probably no smarter than a rat.

Third, humans can gain intelligence frighteningly fast under evolutionary pressure. According to Cochran, Ashkenazi IQ rose by about 10 points per thousand years. People with torsion dystonia can gain 5 to 10 IQ points from a single mutation. All of this suggests that intelligence is easy to change, and that evolution simply judged it not worth increasing except in particular, special cases.

If so, then the first rat-level AI will already contain all the interesting discoveries needed to build human-level AI, and the first superintelligent AI after that. People usually say, "Sure, we may build rat-level AI soon, but it will be a long time before it reaches human level." That assumes you cannot turn rat intelligence into human intelligence just by adding more processors, more virtual neurons, more connections, or the like — and a computer, after all, does not have to worry about metabolic constraints.

Reason 3: time limits


Bostrom and Müller surveyed AI researchers about when they expect human-level AI to appear. The median prediction was 2040 — 23 years from now.

People have pondered Pascal's wager for 345 years without coming up with a general solution to the paradox. If it is a problem for AI, that gives us 23 years to solve not just it but the whole class of problems surrounding AI. Even setting aside surprises like an unexpected hard takeoff or a treacherous turn, and granting the hypothesis that all the problems can be solved in 23 years, that is not much time.

At the 1956 Dartmouth Conference on AI, the field's best researchers drew up a plan for reaching human-level intelligence and budgeted two months for teaching computers to understand human language. In hindsight, that looks slightly optimistic.

Today, computers have finally learned to translate text more or less tolerably, and they are making decent progress on hard problems. Yet when people think about things like decision theory, or electrode implantation, or goal alignment, they simply say, "Oh, we have plenty of time."

But expecting these problems to be solved in just a few years may be as optimistic as expecting machine translation to be solved in two months. Sometimes problems are harder than you expect, and it is worth starting on them early, just in case.

All of this is to say that theoretical research into AI risk should begin today. I am not saying civilization should throw all its resources at it, and I have heard some people argue that after Musk's $10 million grant the issue has become less urgent. Nor do I think public awareness is the bottleneck — the average person watching a movie about killer robots probably does more harm than good. If there is a bottleneck, it is getting smart people in the right fields — philosophy, AI, mathematics, neuroscience — to spend their time on these problems and to convince their colleagues that they are serious.
