MIT course "Computer Systems Security". Lecture 10: "Symbolic execution", part 1
Massachusetts Institute of Technology. Lecture course # 6.858. "Security of computer systems." Nikolai Zeldovich, James Mykens. year 2014
Computer Systems Security is a course on the development and implementation of secure computer systems. Lectures cover threat models, attacks that compromise security, and security methods based on the latest scientific work. Topics include operating system (OS) security, capabilities, information flow control, language security, network protocols, hardware protection and security in web applications.
Lecture 1: “Introduction: threat models” Part 1 / Part 2 / Part 3
Lecture 2: “Control of hacker attacks” Part 1 / Part 2 / Part 3
Lecture 3: “Buffer overflow: exploits and protection” Part 1 /Part 2 / Part 3
Lecture 4: “Privilege Separation” Part 1 / Part 2 / Part 3
Lecture 5: “Where Security System Errors Come From” Part 1 / Part 2
Lecture 6: “Capabilities” Part 1 / Part 2 / Part 3
Lecture 7: “Native Client Sandbox” Part 1 / Part 2 / Part 3
Lecture 8: “Network Security Model” Part 1 / Part 2 / Part 3
Lecture 9: “Web Application Security” Part 1 / Part 2/ Part 3
Lecture 10: “Symbolic Execution” Part 1 / Part 2 / Part 3
Good morning to everyone, I am Armando Solar-Lesama and today I will read a lecture on symbolic execution. Who among those present here is familiar with this term or have you heard about it before? I just want to get an idea of our audience. So, let's begin. I dropped my laptop several times, so it takes a long time to load.
Symbolic execution is the workhorse of modern program analysis. This is one of the methods that grew out of research and then began to be used in many applications. For example, today Microsoft has a system called SAGE, which works with a large number of important Microsoft software, ranging from Power Point to Windows itself, in order to really find security problems and vulnerabilities.
There are many academic projects that have had a major impact on the real world, for example, the detection of important bugs in open source software using symbolic execution. And the beauty of symbolic execution as a technique is that, compared with testing, it gives you the opportunity to imagine how your program will behave with a potentially infinite set of possible input data. This allows you to explore arrays of input data, which would be completely inappropriate and impractical to investigate, say, by random testing, even with a very large number of testers. On the other hand, compared with more traditional methods of static analysis, it has the following advantage. When investigating a problem, symbolic execution can create input and trace, a runtime path that can be run in a real program and execute this program based on this input. And after that, we can identify the real bug and start fixing it using traditional debugging mechanisms. And this is especially valuable when you're in an industrial development environment where you probably don't have time to take care of every little problem in your code.
For example, you really want to be able to distinguish real problems from false positives. How does this work?
To really understand how this works, it’s helpful to start with a regular implementation. If we think of symbolic execution as a generalization of traditional, simple execution, it makes sense to know what it looks like. Therefore, I am going to use this very simple program as an illustration for many things that we will talk about today.
Here we have an excerpt of a very simple code from several branches and the statement that if under some condition the value t <x, then this is a false statement. We want to find out if this statement can ever be triggered. Is it possible Are there any inputs that will make this statement fail?
One of the things that I can do is to check the execution of this program using specific input values as an example. Suppose we use input data for which X = 4 and Y = 4. The value of T is zero, as declared at the beginning of the program.
So, before we get down to normal execution, let's find out what is the important point here? We need to have some idea of the state of the program, right? Whether we do normal execution or whether we do symbolic execution, we need to have some way to characterize the state of the program. In this case, it is such a simple program that it does not use a bunch, does not use a stack, and there are no function calls here.
Thus, the state can be fully characterized by these three variables along with the knowledge of where I am in the program. Therefore, if I start execution with 4, 4 and 0 and get to the end of the branch, then I check the expression: 4 is greater than 4? Obviously not.
Now I am going to run the program at T = Y, that is, T is no longer equal to 0, but has a value of 4. This is the current state of my program, and now I can evaluate this branch.
Is it correct that t <x? Not. We dodged a bullet, asserting false didn't work. There were no problems with this private execution.
But this does not tell us anything about any other implementation. We know that for X = 4 and Y = 4, the program will not fail. But this tells us nothing about what will happen if the input values are 2 and 1.
With these input values, execution will follow a different path. This time we see that T = X, and after executing this line, T will take a value equal to 2. Are there any problems with this implementation? Will there be an assertion error with such input data?
Well, let's see. So, if T is 2 and X is 2, then T is no less than X. It looks like we dodged a bullet again. Right? So, here we have two specific input values at which the program works without errors. But in reality, this does not tell us anything about any other input values.
So, the idea of symbolic execution is that we want to go beyond the execution of a program with one set of input data. We want to be able to actually argue about the behavior of the program when using a very large data set, in some cases an infinite set of possible input values. The main idea of this is as follows.
For a program like this, its state is determined by the value of these three different variables: X, Y, and T, and knowing where in the program I am at the moment. But now instead of specific values for X and Y, I will have a symbolic value, just a variable. A variable that allows me to name this value that the user uses as input. This means that the state of my program is no longer characterized by matching variable names to specific values. Now this is a mapping of variable names to these character values.
Character value can be considered as a formula. In this case, the formula for X is equal to X and the formula for Y is simply Y, and for T it is actually equal to 0. We know that for each input value it does not matter what you are doing. The value of T after the first statement will be 0.
This is where it becomes interesting now. We have reached this branch, which says that if X is greater than Y, we will go in one direction. If X is less than or equal to Y, we will go in a different direction.
Do we know anything about X and Y? What do we know about them? At least, we know their type, we know that they will vary in the range from min int to max int, but that’s all we know about them. It turns out that the information we know about them is insufficient to say in which direction this branch can go. It can go in any direction.
There are many things we can do, but what can we do at the moment? Try to make the wildest assumption.
Audience: we can follow the program on both branches.
Professor: yes, we can trace the execution along both branches. Toss a coin, and depending on how it falls, choose one branch or another.
So, if we want to follow both branches, we must follow one and then the other, right? Suppose we start with this branch - T = X. We know that if we get to this place, then T will have the same meaning as X. We do not know what this value is, but we have a name for it — it is an X script.
If we take the opposite branch, then will happen? The value of T will be equal to something else, right? In this branch, the value of T will be the symbolic value of Y.
So what does this value of T mean when we get to this point in the program? Maybe it's X, maybe it's Y. We don't know exactly what that value is, but why don't we give it a name? Let's call it t 0 . And what do we know about t 0 ? When does t 0 be equal to X?
Essentially, we know that if X is greater than Y, then the variable is equal to X, and if X is less than or equal to Y, then the variable is equal to Y. Therefore, we have a value that we have defined, let's call it t 0 , and it has these logical properties.
So, at this point in the program, we have a name for the value of T, this is t 0 . What did we do here? We took both branches of the if statement, and then we calculated the symbolic value, looking at the conditions under which one branch of the program would be executed, and under which conditions the other.
Now it comes to the fact that we have to ask whether T can be less than X. Now the value of T is t 0 , and we want to know if it is possible that t 0would be less than X? Remember the first branch that we looked at - we asked the question about X and Y and did not know anything about them, except that they had type int.
But having t 0 , we really know a lot about it. We know that in some cases it will be equal to X, and in some cases it will be equal to Y. So now it gives us a set of equations that we can solve. So we can say, is it possible that t 0 is less than X, knowing that t 0 satisfies all these conditions? Thus, it can be expressed as a restriction, indicating whether it is possible that t 0 is less than X. And if X is greater than Y, then t 0 is X, and if X is less than or equal to Y, this means that t 0 = Y.
So, we have an equation. If it has a solution, if you can find the value of t 0 , the value of X and the value of Y, which satisfy this equation, then we will know these values, and when we enter them into the program, then when executed, it will go along this branch if t <x and " will blow up "when it gets into assert false.
So what have we done here? We ran the program, but instead of matching the variable names to specific values, we gave these variable names symbolic values. In fact, they gave them other variable names. And in this case, our other variable names are the X script, the Y script, t 0 , and in addition, we have a set of equations that show how these values are related. We have an equation that tells us how t 0associated with X and Y in this case.
The solution of this equation allows to answer the question whether this branch can be executed or not. Let's look at the equation - is it possible to take this branch or not? It seems that there is not, because we are looking for cases where t 0 is less than X, but if in the first condition t 0 = X, then the expression t 0 <X is not true.
Thus, this means that when X> Y, this cannot happen, because t 0 = X and it cannot be equal to and smaller than X at the same time.
And what happens if t 0 = Y? Can t 0 be less than X in this case?
No, definitely not, because we know that X <Y. So if t 0will be less than X, then it will also be less than Y. But we know that in this case t 0 = Y. And therefore, again, this condition cannot be satisfied. So here we have an equation that has no solution, and it doesn’t matter which values you include in this equation.
You cannot solve it, and this tells us that no matter what input data X and Y we pass to the program, it will not go down the if t <x branch.
Now notice that, while creating this argument here, I basically hinted at your intuition about integers, about mathematical integers. In practice, we know that machine int behave differently than mathematical int. There are cases when the laws applicable to mathematical integer data types are not applicable to software int.
Therefore, we must be very careful when we solve these equations, because we need to remember that these are not the integers that we were told in elementary school. These are 32-bit integers used by the machine. And there are many cases of errors that have arisen because programmers thought about their code from the point of view of mathematical integers, not realizing that there are such things as overflows that can cause different program behavior for mathematical input data.
Another thing that I described here is a purely intuitive argument. I take you through the process, showing you how to do it manually, but this is by no means an algorithm. The beauty of this idea of symbolic execution, however, lies in the fact that it can be encoded into an algorithm. And you can solve it mechanically, which allows you to do this not just for a program of 10 lines, but for millions of programs. This allows you to use the same intuitive reasoning that we used in this case, and talk about what happens when we run this program with different values of input. And these reasonings can be scaled and extended to very large programs.
Audience: what if the program does not support entering certain types of variables?
Professor:This is a very good question! Suppose we have the same program, but instead of t = x we will have t = (x-1). Then, intuitively, we can imagine that now this program can “explode”, isn't it?
Because when the program goes this way, t really will be less than x. What happens with such a program? What would our character state look like? What is t 0 when x is greater than y? We correct the lines in our equations in accordance with a different value when t = (x-1). Now the program may fail, and you go to the developer and say to him: “hey, this function may explode when x is greater than y”!
The developer looks at it and says: “Oh, I forgot to tell you - in fact, this function will never be called with parameters where x is greater than y. I just wrote it for some historical reasons, so don’t worry, I wouldn’t even remember if you hadn’t told me. ”
Suppose we have an assumption that x will be less than or equal to y.
This is a prerequisite or agreement for our function. The function promises to do something, but only if the value satisfies this assumption. But if it is not satisfied, the function says: “I don’t care what happens. I promise that there will be no error only when this assumption is fulfilled. ”
So, how do we encode this constraint when solving equations? Essentially, we have a set of constraints that tell us whether this branch is feasible. And besides the limitations, we must also make sure that the precondition, or assumption, is fulfilled.
The question is, can I find x and y that satisfy all these constraints and at the same time have the required properties? You can see that this restriction X ≤ Y represents the difference between the case when this restriction is satisfied and the case when it is not satisfied.
This is a very important issue when working with analysis, especially when you want to do it simultaneously at the level of individual functions. It is advisable to know what the programmer meant when writing this function. Because if you have no idea about these assumptions, you might think that there are some input data for which the program will fail.
How to make it mechanically? There are two aspects to this problem. Aspect number one - how did you actually come up with these formulas?
In this case, it is intuitively clear how we arrived at these formulas, we simply compiled them manually. But how to create these formulas mechanically?
And the second aspect - how do you solve these formulas after you have them? Is it possible to actually solve these formulas that describe whether your program crashes or not?
Let's start with the second question. We can reduce our problem with these formulas, which include integer arguments and bit vectors. Creating programs, you care about arrays, about functions and as a result receive huge formulas. Is it possible to solve them mechanically?
The many technologies we are talking about today are practical tools related to the tremendous advances in the development of solvers for logical questions. In particular, there is a very important class of solvers, called SMT, or "solver of the solvability of modular theories." An SMT solver is a problem of solvability of logical formulas, taking into account the theories underlying them.
Many people claim that this name is not particularly good, but it stuck as the most commonly used.
An SMT solver is an algorithm by which a given logical formula at the output will give you one of two options: either it satisfies its purpose, or it does not. The second case means that in this formula it is impossible to use the values of variables that meet the requirements of the restrictions you set.
In practice, this sounds a bit scary and a little magical. Most of the problems that SMT must solve are NP-complete problems, that is, problems with the answer “yes” or “no”.
Can we have a system that relies, as its main building block, on solving NP-complete problems? Or do we need something else that works in practice? The fact is that many SMT solvers also have a third possible answer: “I don't know.”
And therefore, part of the beauty of these solvers is that, even solving very large and complex practical problems, they are capable of more than just answering: “I don't know.” They are able to give you a guarantee that this set of restrictions is unsatisfactory or fully satisfies its purpose, that is, they give you a completely accurate answer.
The course MIT "Security of computer systems." Lecture 10: "Symbolic execution", part 2
Full version of the course is available here .
Thank you for staying with us. Do you like our articles? Want to see more interesting materials? Support us by placing an order or recommending to friends, 30% discount for Habr's users on a unique analogue of the entry-level servers that we invented for you: The whole truth about VPS (KVM) E5-2650 v4 (6 Cores) 10GB DDR4 240GB SSD 1Gbps from $ 20 or how to share the server? (Options are available with RAID1 and RAID10, up to 24 cores and up to 40GB DDR4).
VPS (KVM) E5-2650 v4 (6 Cores) 10GB DDR4 240GB SSD 1Gbps until December for free if you pay for a period of six months, you can order here .
Dell R730xd 2 times cheaper? Only here2 x Intel Dodeca-Core Xeon E5-2650v4 128GB DDR4 6x480GB SSD 1Gbps 100 TV from $ 249 in the Netherlands and the USA! Read about How to build an infrastructure building. class c using servers Dell R730xd E5-2650 v4 worth 9000 euros for a penny?