MIT course "Computer Systems Security". Lecture 21: "Tracking data", part 2
Massachusetts Institute of Technology. Course 6.858, "Computer Systems Security." Nickolai Zeldovich, James Mickens. 2014.
Computer Systems Security is a course on the development and implementation of secure computer systems. Lectures cover threat models, attacks that compromise security, and security methods based on the latest scientific work. Topics include operating system (OS) security, capabilities, information flow control, language security, network protocols, hardware protection and security in web applications.
Student: Why not just scan the code automatically rather than check it manually?
Professor: In practice, that is what happens. The developers arrange that whenever the interpreter does this kind of work, then on return a special piece of code automatically assigns taint to the return value of an operation such as System.arraycopy(), according to the policy associated with it.
Student: Right, but then what is the manual part of the work?
Professor: The manual part is basically figuring out what the taint policy should be. In other words, standard TaintDroid or standard Android will do some things for you, but they cannot automatically assign taint in the correct way. So someone has to specify the tracking policy by hand.
This does not seem to be a big problem in practice. But if the number of applications that use native methods keeps growing, it could become a problem.
Another kind of data to worry about when assigning taint is the IPC message. IPC messages are essentially treated like arrays: each message carries a single taint tag, which is the union of the taints of all its components.
This keeps the system efficient, because only one taint tag has to be stored per message. At worst, taint is overestimated, but that never weakens security. The worst outcome is that some data is kept off the network even though sending it would not have harmed confidentiality.
So when you build an IPC message, it receives the combined taint. When you read from a received message, the extracted data inherits the taint of the message itself, which makes sense. That is how IPC messages are handled.
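The message-level policy just described can be sketched in a few lines of Java. This is only an illustration, not TaintDroid's code: the class names, the Tagged wrapper, and the bit values chosen for the taint flags are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one taint tag per IPC message.
final class IpcTaintSketch {
    // Illustrative taint bits; the names and values are assumptions.
    static final int TAINT_NONE = 0;
    static final int TAINT_GPS = 1 << 0;
    static final int TAINT_CONTACTS = 1 << 1;

    // A value together with its taint tag.
    static final class Tagged {
        final Object value;
        final int taint;
        Tagged(Object value, int taint) { this.value = value; this.taint = taint; }
    }

    // A message that stores a single tag for all of its contents.
    static final class Message {
        private final List<Object> parts = new ArrayList<>();
        private int taint = TAINT_NONE;

        void write(Tagged t) {
            parts.add(t.value);
            taint |= t.taint;              // union of the taints of all components
        }

        Tagged read(int index) {
            // Whatever is read back inherits the message-wide tag.
            return new Tagged(parts.get(index), taint);
        }

        int taint() { return taint; }
    }

    public static void main(String[] args) {
        Message m = new Message();
        m.write(new Tagged("hello", TAINT_NONE));
        m.write(new Tagged("42.36,-71.09", TAINT_GPS));
        // "hello" was written untainted, but it now carries the GPS bit.
        System.out.println((m.read(0).taint & TAINT_GPS) != 0); // prints "true"
    }
}
```

Note how the untainted string comes back tainted: that is the overestimation the lecture mentions, traded for storing a single tag per message.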
Files are handled in a similar way. Each file gets a single taint tag, which is stored with the file in its metadata on persistent storage, such as an SD card. This is the same conservative approach to taint as in the previous cases. The basic idea is that an application gets access to some sensitive data, say the GPS location, and writes it to a file, so TaintDroid updates the file's taint tag with the GPS flag. Then the application exits. Later, some other application may start up and read that file.
When the data enters the virtual machine, that is, the application, TaintDroid sees that the file has this flag, so any data obtained by reading the file also carries the GPS flag. I think that is pretty simple.
So what can carry taint from Java's point of view? Basically, there are five kinds of Java objects that need taint flags. First, there are the local variables used in a method. Going back to the earlier examples, char c would be such a variable.
So we have to assign flags to these. The second kind is method arguments, which also need taint flags. Both of these live on the stack, so TaintDroid has to keep track of taint assignments for them.
We also need taint flags for object instance fields. Imagine an object c that is a circle, and I want to know its radius. Then there is a field c.radius, and we need to associate taint information with each such field.
The fourth kind is static class fields, which also need taint information. This could be something like circle.property, some property of circles in general to which we assign taint information.
The fifth kind is arrays, which we talked about earlier; a single piece of taint information is assigned to the whole array.
The basic idea for storing taint flags for these kinds of Java objects is to keep the taint flags for a variable right next to the variable itself.
Say we have an integer variable and we want to attach some taint to it. We want to keep that state as close to the variable as possible, mainly so that the processor cache works efficiently. If the taint were stored far away from the variable, that could hurt, because right after the interpreter reads the memory holding the actual Java variable, it will want to read its taint information as quickly as possible.
If you look at the move-op operation, you can see that wherever the interpreter looks at the values of dst and src, it also looks at the corresponding taint tags.
So by placing these things as close together as possible, you try to make better use of the cache. It is done quite simply. For method arguments and local variables, which live on the stack, the developers essentially allocate the taint flags right next to where the variables are stored.
Take our favorite thing in these lectures, the stack diagram, which you will probably soon come to hate for how often it shows up. Say local variable 0 sits on the stack; TaintDroid then stores the taint tag for that variable right below it in memory. If there is another variable, its tag also sits directly below it, and so on. It is quite simple. All of these will sit on the same cache line, which makes the memory accesses cheaper.
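To make the layout concrete, here is a small Java sketch of a frame in which each value is stored next to its tag, together with two propagation rules in the spirit of move-op (copy the tag) and a binary operation (take the union of the operand tags). The class, the method names, and the exact slot order are assumptions made for illustration; the real interpreter works on raw stack memory rather than a Java array.

```java
// Hypothetical sketch: values and taint tags interleaved in one frame.
final class InterleavedFrameSketch {
    static final int TAINT_GPS = 1 << 0;   // illustrative taint bit

    // Layout (assumed): [v0, tag0, v1, tag1, ...], so a value and its tag
    // are adjacent in memory and tend to share a cache line.
    private final int[] slots;

    InterleavedFrameSketch(int registers) { slots = new int[2 * registers]; }

    int value(int r) { return slots[2 * r]; }
    int taint(int r) { return slots[2 * r + 1]; }

    void set(int r, int value, int taint) {
        slots[2 * r] = value;
        slots[2 * r + 1] = taint;
    }

    // move-op dst, src: the destination takes both the value and the tag.
    void move(int dst, int src) { set(dst, value(src), taint(src)); }

    // Binary operation: the result's tag is the union of the operand tags.
    void add(int dst, int a, int b) {
        set(dst, value(a) + value(b), taint(a) | taint(b));
    }

    public static void main(String[] args) {
        InterleavedFrameSketch f = new InterleavedFrameSketch(3);
        f.set(0, 7, TAINT_GPS);   // v0 holds tainted data
        f.set(1, 5, 0);           // v1 is clean
        f.move(2, 0);             // v2 = v0, tag copied
        f.add(1, 1, 2);           // v1 = v1 + v2, tags merged
        System.out.println(f.value(1) + " " + f.taint(1)); // prints "12 1"
    }
}
```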
Student: I wonder how you can have one flag for a whole array but a separate flag for each property of an object. What if one of the object's methods can access the data stored in its properties? That would be... see what I mean?
Professor: Are you asking about the reason for that policy?
Student: Yes, about the reason for that policy.
Professor: I think it is done for the sake of an efficient implementation. There are other rules like this. For example, the length of an array is not tainted: even though it could leak information, they do not propagate taint to it. So I think some of these decisions are made purely for efficiency. In principle, nothing would stop you from tracking each array element separately, so that the left-hand side receives taint only from the specific elements that were read.
However, it is not clear that this would be correct, because if you place something into an array, that thing arguably reveals something about the array. So I think the developers use a combination of both policies. By being overly conservative, you never leak data you want to protect; at the same time, to access an array you need to know something about it. And when learning one thing requires knowing another, that usually means taint is involved.
So that is the basic scheme they use to keep all this information next to each other. You can imagine the same being done for class fields and object fields. When you declare a class, there is a memory slot for a particular variable, and right next to that slot is the taint information for that variable. So I think this is all fairly reasonable.
That is how TaintDroid works. At system initialization, or at other times while the system is running, TaintDroid looks at all the sources of potentially sensitive information and assigns a flag to each one: the GPS sensor, the camera, and so on. As the program runs, it pulls sensitive information out of these sources, and the interpreter handles each kind of operation according to the table in the paper to work out how taint propagates through the system.
The most interesting part is when data tries to leave the system. TaintDroid can monitor the network interfaces and look at everything that tries to go out through them. It checks for taint tags, and if data headed for the network carries one or more of these flags, it is not allowed onto the network. What happens at that point actually depends on the application.
For example, TaintDroid might show the user a warning saying that someone is trying to send their location off the device. TaintDroid might also have built-in policies that let the application use the network but zero out any sensitive data it tries to send, and so on. The paper does not describe this mechanism in much detail, because the authors were mainly concerned with data leaking onto the network.
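Since the mechanism at the sink is not spelled out, the following Java sketch is speculative. It shows only the two reactions mentioned above, refusing the send or zeroing out the payload, under the assumption that the sink receives the payload together with its taint tag. The names Policy and filterOutgoing are invented for the example.

```java
// Hypothetical sketch of a check at a network sink.
final class NetworkSinkSketch {
    static final int TAINT_GPS = 1 << 0;            // illustrative taint bits
    static final int TAINT_PHONE_NUMBER = 1 << 1;

    enum Policy { BLOCK, SCRUB }

    // Returns the bytes that may go out, or null if the send is refused.
    static byte[] filterOutgoing(byte[] payload, int taint, Policy policy) {
        if (taint == 0) {
            return payload;                         // nothing sensitive: pass through
        }
        if (policy == Policy.SCRUB) {
            return new byte[payload.length];        // same length, all zeros
        }
        return null;                                // BLOCK: refuse the send
    }

    public static void main(String[] args) {
        byte[] location = {4, 2, 3, 6};
        byte[] out = filterOutgoing(location, TAINT_GPS, Policy.SCRUB);
        System.out.println(out[0]);                                              // prints "0"
        System.out.println(filterOutgoing(location, TAINT_GPS, Policy.BLOCK) == null); // prints "true"
    }
}
```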
In the paper's "Evaluation" section, the authors discuss some of the things they found while studying the system. They found that Android applications try to exfiltrate data in ways the user cannot see. For example, they use your location for advertising, send your phone number to a remote server, and so on. It is important to note that these applications generally do not "break" the Android security model, in the sense that the user has to grant them network access or access to the contact list. However, the applications do not disclose in the EULA that they are going to send your phone number to some Silk Road 8 server or the like. In effect, this misleads users about the true intentions of the application.
Student: Presumably, even if they had put this in the license agreement, it would not help, because people usually do not read the EULA.
Professor: That is a reasonable assumption, because even computer scientists do not always read the license agreement. Still, that kind of honesty in the EULA would be useful, because some people really do read it. But you are absolutely right that users will not read pages of fine print; they will just click "agree" and install the application.
So I think the rules for information flow through the system are fairly simple: as we said, taint just moves from the right-hand side to the left-hand side. Sometimes, though, these information flow rules give somewhat counterintuitive results.
Imagine that an application implements its own linked list class. We have a simple class called ListNode with a field Object data and a field ListNode next that points to the next node in the list.
Suppose the application assigns tainted data to the Object data field, sensitive information from the GPS sensor or something else. The question is: when we compute the length of this list, should the length be tainted? It may surprise you that the answer is "no", which follows from the way TaintDroid and some similar systems define information flow. Let's look at what it means to add a node to a linked list.
Adding a node takes three steps. First, you allocate a new list node to hold the data you want to add: Alloc ListNode. Second, you assign the data field of the new node. Third, you patch up the ListNode next pointers to splice the node into the list: the "next" ptr.
Interestingly, the third step does not involve the data field at all; it only looks at the next values. Even once Object data is tainted, computing the length of the list means starting at some head node, following all the "next" ptr pointers, and simply counting how many were traversed. So the counting algorithm never touches the tainted data.
So if you have a linked list full of tainted data and then compute its length, the result is not a tainted value. That may seem a little counterintuitive, although when we discussed arrays we already said that the length of an array carries no taint either. The reason here is the same. Toward the end of the lecture we will discuss a language that lets you, the programmer, define your own taint types, so that you can design your own policy for things like this.
The nice thing about TaintDroid is that you, as a developer, do not have to label anything; TaintDroid does it for you. It marks all the sensitive things that can be sources of information and all the things that can be sinks, so as a developer you are ready to go. But if you want control over cases such as adding nodes, you may have to write some policies yourself.
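The linked list argument can be replayed in a short Java sketch. Taint is modeled by hand here with explicit int fields next to each reference, which is an assumption made for illustration; in TaintDroid the interpreter maintains the tags, not the application. The names dataTaint and nextTaint are invented.

```java
// Hypothetical sketch: tainted list contents, untainted list length.
final class TaintedListSketch {
    static final int TAINT_GPS = 1 << 0;    // illustrative taint bit

    static final class ListNode {
        Object data;   int dataTaint;       // tag for the data field
        ListNode next; int nextTaint;       // tag for the next field
    }

    private ListNode head;

    void add(Object data, int taint) {
        ListNode n = new ListNode();        // step 1: allocate the node
        n.data = data;                      // step 2: assign the data field...
        n.dataTaint = taint;                //         ...together with its taint
        n.next = head;                      // step 3: patch the next pointer;
        n.nextTaint = 0;                    //         no tainted value is involved
        head = n;
    }

    // Returns {length, taint of the length}.
    int[] length() {
        int len = 0;
        int taint = 0;
        for (ListNode n = head; n != null; n = n.next) {
            len++;
            taint |= n.nextTaint;           // only next pointers are read
        }
        return new int[] { len, taint };
    }

    public static void main(String[] args) {
        TaintedListSketch list = new TaintedListSketch();
        list.add("42.36,-71.09", TAINT_GPS);
        list.add("42.37,-71.10", TAINT_GPS);
        int[] r = list.length();
        System.out.println(r[0] + " " + r[1]); // prints "2 0"
    }
}
```

Every node holds GPS-tainted data, yet the computed length carries tag 0, because the traversal reads only the next fields.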
How does TaintDroid affect performance? The overhead actually seems fairly reasonable. The memory overhead comes from storing all the taint tags. The CPU overhead comes mainly from assigning, propagating, and checking taint, which is extra work for the Dalvik virtual machine. So when it looks at a source operand, it also looks at the 32-bit taint information and performs the operations we have already discussed. That is the computational overhead.
These overheads seem fairly moderate. The authors report that storing taint tags takes 3% to 5% more memory, which is not bad. The CPU overhead is somewhat higher, ranging from 3% to 29%. That is because on every pass through the interpreter loop, the interpreter has to look at these tags and perform the corresponding operations. They are just bitwise operations, but they have to be done all the time. Even 29% is not so bad, because Silicon Valley developers keep telling us that modern phones need quad-core processors. The only real concern is the battery: even if you have spare cores, you probably do not want a hot phone in your pocket as it grinds through these computations. But if the battery does not suffer much, it is not so bad.
So that was an overview of TaintDroid. Any questions?
Student: Are tags kept only for things that are actually tainted? Or is every variable tagged?
Professor: Everything is tagged. In theory, nothing stops you from not allocating taint information for things that carry no taint at all. But I think that as soon as something picks up even one bit of taint, you would need some kind of dynamic layout change. If some local variable on the stack becomes tainted, you would have to reallocate the whole stack frame, which is now marked as tainted. Or if tainted data from the heap is written onto the stack, or someone calls into your code, you would have to redo the layout yet again. So in practice the typical approach resembles shadow memory, where every byte in the application is backed by some additional byte of information. In TaintDroid's case, that shadow state actually lives right next to the variable itself.
Now consider this question: can we track taint at the level of x86 or ARM instructions? That would be useful because we could then see how information flows through arbitrary applications, not only those that run inside a virtual machine, which requires you to run Java and so on. So why not track taint at that level?
It turns out that it can be done; there are low-level taint tracking projects that we have already looked at. The upside is broader coverage without having to work out heuristics for how, say, Java code interacts with native methods. Everything eventually comes down to executing x86 instructions, which spares you, the developer, the manual work of figuring out the semantics of each native method for taint tracking purposes.
But the problem is that tracking taint at such a low level can be very expensive, and you can get a lot of false positives. It is also hard to get such methods right.
As you know, x86 is a complex instruction set. It lets you do absolutely crazy things. I do not know whether you have ever seen the x86 manuals, but they are huge. One volume is this thick and covers just the entries from the letter M to the letter P, and to get the whole picture you need an entire series of such volumes. I am not kidding!
So it is actually quite hard to reason about tracking taint at the level of x86 instructions, because even seemingly simple instructions, such as an ADD, set all kinds of processor registers, flags, and the like.
So, first, the semantics are very hard to describe, which makes it easy to get the tracking wrong. Second, even if you could do it, it would be very expensive. You are looking at things at a very, very low level, so the amount of state you have to track can grow quickly.
It can also be a very fragile process, prone to false positives and serious problems if it is done incorrectly, for example if you wrongly taint kernel data. If your infrastructure tries to be ultraconservative and does not want to miss anything, on the theory that it is better to err on the side of security, it will end up tainting some kernel data structure. After that you get what the authors of the paper call, in a wonderful turn of phrase, a taint explosion.
This means that at some point tainted items take part in so many computations that they taint the entire program. It is like in Dungeons and Dragons, when you touch something dangerous and death eventually spreads through your whole body.
This is very bad, because if you cannot tightly bound how taint flows through the system, then after a while the system will refuse to do anything at all. You will not be able to send anything over the network or display anything on the screen, because everything in the system will be marked as tainted sensitive information, even when it is not.
This can happen if the stack pointer esp or the base pointer ebp somehow becomes tainted. If that happens, you are in big trouble.
Consider that every x86 instruction that accesses the stack goes through esp. So if the stack pointer register gets tainted, that is bad. Likewise, accesses to local variables are often expressed relative to ebp, and if that is tainted, the game is over. The lecture materials link to a paper that analyzes all of this and says that you have to be very careful when tracking taint at such a low level.
This can happen very quickly, especially when certain Linux kernel optimizations are used to speed up code execution. If they inadvertently cause taint to reach the stack pointer or the base pointer, that pointer becomes tainted. And once that happens, you can no longer do anything useful with the taint tracking system.
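The following toy model, written in Java rather than at the machine level, illustrates why a tainted stack pointer is so damaging. It assumes one particular ultraconservative rule, namely that anything stored or loaded through a tainted address also becomes tainted. That rule, the class, and its field names are assumptions for illustration, not taken from any specific tracking system.

```java
// Hypothetical toy machine: taint on the stack pointer spreads to every stack access.
final class TaintExplosionSketch {
    static final int TAINT_GPS = 1 << 0;    // illustrative taint bit

    final int[] mem = new int[16];
    final int[] memTaint = new int[16];     // shadow tag for each memory word
    int esp = 16;                           // the stack grows downward
    int espTaint = 0;                       // tag of the stack pointer itself

    void push(int value, int taint) {
        esp--;
        mem[esp] = value;
        // Assumed conservative rule: the address taint leaks into the stored word.
        memTaint[esp] = taint | espTaint;
    }

    // Pops one word and returns the taint of the popped value.
    int popTaint() {
        int t = memTaint[esp] | espTaint;   // the address taint leaks into the load too
        esp++;
        return t;
    }

    public static void main(String[] args) {
        TaintExplosionSketch m = new TaintExplosionSketch();
        m.push(1, 0);
        System.out.println(m.popTaint());   // prints "0": clean data stays clean
        m.espTaint = TAINT_GPS;             // the stack pointer picks up taint
        m.push(2, 0);
        System.out.println(m.popTaint());   // prints "1": clean data is now tainted
    }
}
```

Once espTaint is set, every later push and pop yields tainted data regardless of what was stored, which is the runaway behavior described above.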
Student: How is this done in software? It seems we would need to track all of these registers.
Professor: All of those registers matter for getting the tracking right. If you do not know the x86 architecture well, there will be things you are bound to miss. I will tell you what I know. There is a thing called Bochs, a program that emulates IBM PC hardware, including x86 processors. There is also something called TaintBochs, which tracks taint flow at the x86 level across the whole system and works as an interpreter. It takes your entire OS and all of your applications and examines every x86 instruction, emulating what the corresponding hardware would do. As you can imagine, this is very, very slow. The good thing about it is that no physical hardware support is required, and