Why don't people use formal methods?

Original author: Hillel Wayne
  • Transfer
On the Software Engineering Stack Exchange, I saw the following question : “What prevents the widespread introduction of formal methods?” The question was closed as biased, and most of the answers were comments like “Too expensive !!!” or “The site is not an airplane !!!” In a sense, this is true, but it explains very little. I wrote this article to give a broader historical picture of formal methods (FM), why they are not really used and what we are doing to correct the situation.

Before you begin, you need to formulate some conditions. In fact, there are not many formal methods: just a few tiny groups . This means that different groups use the terms differently. In a broad sense, there are two groups of formal methods:the formal specification studies the writing of accurate, unambiguous specifications, and the formal verification examines the  methods of proof. This includes both code and abstract systems. Not only do we use different terms for code and systems, we often use different tools to verify them. To confuse everything even more, if someone says that he creates a formal specification, this usually means design verification. And if someone says that he does a formal verification, it usually refers to the verification of the code.

For clarity, we divide the verification for code verification (CV) and design verification (DV) and in the same way we divide the specifications for CS and DS. Such terms are not commonly used in the broad FM community. Let's start with CS and CV, then move on to DS and DV.

In addition, partial verification is possible , where only a subset of the specification is checked, or complete verification . This may be the difference between the proofs of the statements “the system never fails and does not accept the wrong password” or “the system never fails and blocks the account if you enter the wrong password three times”. Basically, we will assume that we are doing a full check.

You should also clarify the type of software that we formalize. Most people implicitly distinguish highly reliable programs, such as medical devices and airplanes. People assume that for them formal methods are widely used, but for the rest they are not needed. This is an overly optimistic view: most highly reliable software does not use formal methods. Instead, we will focus on “ordinary” software.

Finally, a disclaimer: I am not a professional historian, and although I tried to carefully check the information, there may be errors in the article. In addition, I specialize in formal specification (DS and DV), so there is more chance of error where I talk about code verification. If you see, write to me, I will correct (and again: I earn money at seminars on TLA + and Alloy, so I am very biased about these languages; I try to be as objective as possible, but you know: bias is bias).

Formal programming

BOM specification

Before you prove the correctness of the code, you need to get the standard of truth. This means some specification of what the code should do. We must know for sure that the result conforms to the specification. It is not enough to say that the list is “sorted”: we do not know what we are sorting, what criteria we use and even what we mean by “sorting”. Instead, you can say: "The list of integers is lsorted in ascending order for any two indices i and j, if i < j, then l[i] <= l[j]."

Code specifications are divided into three main types:

  1. The first is to write code-independent operators. We write our sort function, and in a separate file we write the theorem “this returns sorted lists”. This is the oldest form of the specification, but Isabelle and ACL2 still work (ML was invented specifically to help write such specifications).
  2. The second implements the specification in the code in the form of pre- and postconditions, statements and invariants. A postcondition can be added to the function that “the return value is a sorted list”. Assertion based specifications were originally formalized as Hoare’s logic . They first appeared in the Euclid programming language in the early 1970s (it is unclear who first started using them: Euclid or SPV , but as far as I know, Euclid was previously presented to the public). This style is also called contract programming - the most popular form of verification in the modern industry (here contracts are used as code specifications).
  3. Finally, there is a type system. According to the Curry-Howard correspondence, any mathematical theorem or proof can be encoded as a dependent type. We define the “sorted lists” type and declare a type for the function [Int] -> Sorted [Int].

On Let's Prove Leftpad you can see how it looks. HOL4 and Isabelle are good examples of the specifications of the “independent theorem”, SPARK and Dafny - of the specifications of the “nested statement”, and Coq and Agda are of the “dependent type”.

If you take a closer look, these three forms of code specification are compared with the three main areas of automatic validation: tests, contractsand types. This is no coincidence. Correctness is a wide range, and formal verification is one of its extremes. As the severity (and effort) of verification decreases, we obtain simpler and narrower checks, be it a restriction of the state space under study, the use of weaker types, or a forced check at run time. Then any means of complete specification becomes a means of partial specification, and vice versa: many consider Cleanroom a formal verification technique, where the code review goes far beyond human capabilities.

Which specifications are correct?

Verification verifies that the code conforms to the specification. The question arises: how do we know that we have the correct specification? Finding the right specification is one of the biggest problems in formal methods. This is also one of the main objections against them. But skeptics mean here not exactly what the experts on formal methods mean.

When outsiders ask, “How do you get the right specs?”, They usually think about validationthat is, specifications that do not meet customer requirements. If you formally prove that your code sorts the list, and the client actually wants Uber for soups (tm), you just spent a lot of time. They say that only fast iterations and short feedback cycles can confirm your requirements.

It is true that code verification does not validate it. But there are two problems with this argument. The first is that the stage of applying formal methods is simply transferred, but does not disappear completely. After all these quick iterations, you probably already have an idea of ​​what the client wants. And then start the verification code. Secondly, although we do not know what exactly the client wants, but we can assume that he definitely doesn’twants. For example, to software accidentally fell. They do not need security holes. Everyone agrees with this: after all, no one says that you need to skip the unit tests during iterations. So at least make sure that your version control system does not delete random user data (note: do not think that I am embittered or something like that).

The problem with finding the right specification is more fundamental: we often don’t know what to write there . We think of our requirements in human, not mathematical, terms. If I say: “The program should distinguish trees from birds”, then what is this about? I can explain to a person by showing a bunch of photos of trees and birds, but these are only specific examples, not a description of the idea.. In fact, to translate this into a formal specification, you need to formalize human concepts, and this is a serious problem.

Do not misunderstand me. Here you can determine the relevant specifications, and experts do it all the time. But writing the relevant specifications is a skill that needs to be developed, as well as programming skills. That is why many of the recent successes of code verification are explained by a clear mapping of what we want into what we can express. For example, CompCert  is a formally verified C compiler. Specification for it: “Avoid compilation errors”.

But all this has nothing to do with verification. When you have a specification, you still need to prove that the code corresponds to it.

Proof of specification

The first ever code verification tool is a Dijkstra-style method “think about why this is true,” which is primarily intended for ALGOL. For example, I can “prove” the correctness of sorting by the method of insertion as follows:

  1. Basic option : if one item is added to the empty list, it will be the only item, so it will be sorted.
  2. If we have a sorted list with k elements and add one element, then we insert the element so that it is after all the smaller numbers and in front of all the big numbers. This means that the list is still sorted.
  3. By induction, sorting by insert method will sort the entire list.

Obviously, in reality, the evidence will look more strict, but this is a general idea. Dijkstra and others used this style to prove the correctness of many algorithms, including many of the basics of parallelism. This is also the style with which Knut’s words are related : “Beware of errors in this code; I only proved that it is correct, but did not run it. ” You can easily spoil the mathematical proof so that no one will notice. By some estimates , approximately 20% of the published mathematical evidence contains errors. At Peter Guttmann have an excellent essay on evidence of efficiency ridiculous program, where tons of "trusted" code immediately fall if they are to run.

At the same time, we studied the methods of automatic proof of mathematical theorems. The first program to prove the theorem was published in 1967 . In the early 1970s, such programs began to be used to check Pascal code, and in the middle of the decade special formal languages ​​appeared. The programmer formulates certain properties of the code, and then creates a testable proof that the code has these properties. The first programs for proving theorems simply helped people to verify evidence, and more complex tools could independently prove parts of the theorem.

Which leads to the next problem.

Evidence is hard to get

The evidence is hard and this is a very unpleasant job. It's hard to “quit programming and go to the circus.” Surprisingly, formal proofs of code are often more rigorous than proofs that most mathematicians write! Mathematics is a very creative activity, where the answer to the theorem is valid only if you show it. Creativity does not fit well with formalism and computers.

Take the same example with the sorting method of the insert, where we applied induction. Any mathematician will immediately understand what induction is, how it works in general, and how it works in this case. But in the program to prove the theorem, everything must be strictly formalized. The same with proof by contradiction, proof through contraposition, etc. In addition, it is necessary to formalize all assumptions, even those where most mathematicians do not bother themselves with proofs. Like whata + (b + c) = (a + b) + c. The program for testing theorems does not know a priori that this is true. You either have to prove it (difficult), or declare it as truth according to the associative law of addition (dangerous), or buy a library of theorems from someone who has already been able to prove it (expensively). Early programs for proving theorems competed in terms of the number of embedded proof tactics and the associated libraries of theorems. One of the first widespread SPADE programs presented the full arithmetic library as the main advantage.

Next, you need to get the proof itself. You can assign it to the program or write it yourself. Usually the automatic determination of evidence is not solvable. For extremely narrow cases, such as propositional logic or HM type checking, it is “only” NP-complete. In fact, we ourselves write most of the evidence, and the computer verifies its correctness. This means that you need to know well:

  • mathematics;
  • computer science;
  • the area in which you work: compilers, hardware, etc .;
  • nuances of your program and specialization;
  • nuances of the program for proving theorems that you use, which in itself is a whole specialty.

To make matters worse, computer-specific inserts put sticks in the wheels. Remember, I said that it is dangerous to assume the associative law of addition? Some languages ​​do not comply with it. For example, in C ++ INT_MAX. ((-1) + INT_MAX) + 1 it INT_MAX. -1 + (INT_MAX + 1)is what is undetectable. Assuming associative addition in C ++, then your proof will be wrong, and the code will be broken. You need to either avoid this statement, or prove that there is never an overflow for your particular fragment.

You can say that indefinite addition is a mistake, but you need to use a language with unrelated integers. But most languages ​​have specific functions that interfere with evidence. Take the following code:

a = true;
b = false;
assert a;

It's always like this? Is not a fact. Maybe fchange a. Maybe it will change the parallel flow. It is possible that an balias has been assigned a, so changing it will also change a(note: aliases hinder writing proofs so much that John C. Reynolds had to create a completely new separation logic to cope with this problem). If something is possible in your language, then you should explicitly prove that this is not happening here. This will help clean code, in another case, it can destroy the evidence, because it forces the use of recursion and functions of a higher order. By the way, the one and the other is the basis for writing good functional programs. What is good for programming is bad for proof! (Note: inIn this lecture, Edmund Clark listed some difficult-to-test properties: “floating commas, strings, user-defined types, procedures, concurrency, generic templates, storage, libraries ...”).

Formal verifiers have a dilemma: the more expressive the language, the harder it is to prove something on it. But the less expressive the language, the more difficult it is to write on it. The first working formal languages ​​were very limited subsets of more expressive languages: ACL2 was a subset of Lisp, Euclid was a subset of Pascal, etc. And nothing that we have discussed so far really proves real programs, these are only attempts to approach to writing evidence.

The evidence is hard. But they become easier. Researchers in this area add new heuristics, libraries of theorems, pre-tested components, etc. Technical progress also helps: the faster the computers, the faster the search.

SMT revolution

One of the innovations in the mid-2000s was the inclusion of SMT solvers in programs to prove theorems. In a broad sense, an SMT solver can turn (some) mathematical theorems into constraint observance problems. This turns a creative task into a computational one. You may still need to supply it with intermediate problems (lemmas) as steps in a theorem, but this is better than proving everything yourself. The first SMT solvers appeared around 2004, first as academic projects. A couple of years later, Microsoft Research released Z3, a ready-made SMT solver. The big advantage of the Z3 was that it became much more comfortable to use than other SMTs, which, frankly, said almost nothing. Microsoft Research used it internally to help prove the properties of the Windows kernel

The SMT struck the FM community because it suddenly made many simple proofs trivial and allowed to tackle very difficult issues. Thus, people could work in more expressive languages, since now the problems of expressive statements began to be solved. Incredible progress is evident in the IronFleet project : using the best SMT solvers and advanced verification language, Microsoft was able to write 5000 lines of verified Dafny code in just 3.7 person-years! This is an incredibly fast pace: as many as four point line in the day (Note: The previous record belonged to SEL4 at , developers who have written on two lines in the day at C .

The evidence is difficult.

Why do you need it?

It's time to step back and ask: "What is the point?" We are trying to prove that some program meets a certain specification. Correctness is a range. There are two parts of verification: how objectively your program is “correct” and how thoroughly have you checked the correctness. Obviously, the more verified, the better, but checking costs time and money. If we have several limitations (performance, time to market, cost, etc.), a complete validation check is not necessarily the best option. Then the question arises, what is the minimum check we need and what it costs. In most cases, you need, for example, 90% or 95% or 99% correctness. Maybe you should spend time improving the interface, rather than checking the remaining 1%?

Then the question: "Is the verification 90/95/99% significantly cheaper than 100%?" The answer is "yes." It is quite comfortable to say that the code base, which we have tested and typed well, is basically correct except for a few fixes in production, and we even write more than four lines of code per day. In fact, the vast majority of failures in the operation of distributed systems could have been prevented by a little more complete testing. And this is just an extension of tests, not to mention fuzzing, property-based testing, or model testing. You can get a truly outstanding result with these simple tricks without needing to get full proof.

What if typing and tests do not provide sufficient verification? It's still much easier to go from 90% to 99% than from 99% to 100%. As mentioned earlier, Cleanroom is a developer practice that includes comprehensive documentation, thorough flow analysis and extensive code review. No evidence, no formal verification, not even unit testing. But a properly organized Cleanroom reduces the density of defects to less than 1 bug per 1000 lines of code in production (note: figures from Stevely's research in Toward Zero-Defect Programming > but always better be skeptical and check the original sources). Programming Cleanroom does not slow down the pace of development, and certainly goes faster than 4 lines per day. And Cleanroom itself is just one of the many methods for developing highly reliable software that lies in the middle between normal development and code verification. You do not need complete verification to write good software or even almost perfect. There are cases when it is necessary, but for most industries it is a waste of money.

However, this does not mean that formal methods are generally uneconomical. Many of the aforementioned highly reliable methods are based on writing code specifications that you do not formally prove. Regarding verification, there are two general ways in the industry to benefit. First, design verification instead of code, which we will look at later. Secondly, partial verification of the code, which we will consider now.

Partial code verification

For everyday tasks, it is too expensive to do a full scan. How about partial? After all, you can benefit from the proof of some properties of individual code fragments. Instead of proving that my function always correctly sorts numbers, I can at least prove that it does not loop forever and never goes out of range. This is also very useful information. So, even the simplest evidence for C programs is a great way to eliminate a huge part of indefinite behavior . Availability

problem. Most languages ​​are designed either for full verification, or do not support it at all. In the first case, you lack many good functions from more expressive languages, and in the second case, you need a way to prove things in a language that is hostile to the concept itself. For this reason, most partial verification studies focus on several high-priority languages, such as C and Java. Many work with limited subsets of languages. For example, SPARK is a limited subset of Ada, so you can write critical SPARK code and interact with less critical Ada code. But most of these languages ​​are pretty niche.

More often, certain types of verification are embedded in the basic structure of languages. Typing systems used in production are a common type: you may not know that tail always returns tail, but you know for sure what its type is [a] -> [a]. There are also such examples as Rust with proven memory security and Pony with proof of security on exceptions. They differ slightly from SPARK and Frama-C in that they are capable of performing only partial checks. And they are usually developed by experts in programming languages, rather than experts in formal methods: there is much in common between the two disciplines, but they are not identical. Perhaps that is why languages ​​such as Rust and Haskell are really suitable for use in practice.

Design Specification

So far, we have only talked about code verification. However, formal methods have another side, which is more abstract and verifies the design itself, the project architecture. This is such an in-depth analysis that it is synonymous with formal specification : if someone says that he fulfills a formal specification, it most likely means that he defines and verifies the design.

As we have said, to prove all the lines of the code is very, very difficult. But many problems in production do not arise because of the code, but because of the interaction of the system components. If we dismiss implementation details, for example, taking for granted that the system is able to recognize birds, then it will be easier for us to analyze how trees and birds fit into the overall design as high-level modules. On such a scale, it becomes much easier to describe things that you would not be able to implement, for example, a runtime environment, human interactions, or a ruthless flow of time . On this scale, we formalize our intentions using a common system, not lines of code. This is much closer to the human level, where projects and requirements can be much more ambiguous than at the code level.

For example, let's take a procedure with a coarse specification “if it is called, it makes a system call to save data to the repository and handles system errors”. Properties are difficult to verify, but it is quite clear how to do this. Is the data serialized correctly? Are our guarantees violated due to incorrect input? Do we handle all possible ways to fail the system call? Now compare the high level logging system with the “all messages are logged” specification and answer the following questions:

  • Are all messages or all messages that enter the system recorded ? Messages are recorded once or guaranteed once?
  • How are the messages sent? Is it a turn? Does the channel deliver them only once? Is everything ok with delivery?
  • By logging, do we mean writing forever? Can I put a message in the log, and then delete it from there? Can it “hang out” between the recorded and unrecorded states before it is recorded permanently?
  • What should I do if the server explodes in the middle of recording a message? Is journaling necessary?
  • Are there important carrier properties? Does the fact that the “carrier lost data” goes beyond our requirements or not?
  • What about the GDPR?
  • and so on and so forth.

Without formal design it is more difficult to express the really necessary requirements for the system. If you cannot express them, then you have no idea whether the design really meets the requirements or just looks like them, but can lead to unpredictable consequences. Expressing our intentions and design more formally, we can easily verify that we are actually developing what we need.

Just as we use programming languages ​​to represent code, we use specification languages to represent projects. Specification languages ​​are usually focused on design specifications, rather than code specifications, although some languages ​​can be used for both cases (note: the process of matching design specifications with code specifications is called refining). In the future, I will call the specification languages ​​the design languages ​​(DL) to minimize confusion (again, this is not common terminology; most people use the “specification language”, but I want to clearly distinguish between code specifications and design specifications).

Probably the first complete DL was the VDM around 1972. Since then, we have seen a huge variety of different language specifications. The DL space is much more diverse and fragmented than with code verification languages ​​(CVL). Roughly speaking, people invented DL as a means of achieving a goal, and CVL as a goal itself. Since they are strongly influenced by specific problem areas, all sorts of ideas and semantics are implemented in DL. Here is a very, very brief overview of some of the first DLs:

TongueModeling areaMeans
Zbusiness requirementsrelational algebra
Harel state diagramscontrollersautomatons
Decision Tablessolutionstables

Since DLs are usually created to solve specific problems, most of them have at least two or three real applications. As a rule, the results are very positive. Practitioners say that DL provides insight into the problems and makes finding solutions easier. For a long time, the main champion was Praxis (now Altran), who used “fixes-by-design” —a  combination of Z-designs and SPARK code — to create critical security systems. Specifications save time and money because you will not find design errors at a later stage of the project.

Pamela Zave experimented with Alloy and discovered a fundamental bug in Chord, one of the main distributed hash tables. AWS began to find 35-step critical errorsby writing a TLA + specification. In my experience, when people try to write specifications, they very soon become big fans of this business.

But fans of formal methods and outsiders have a completely different assessment of the value of writing specifications. For fans, the biggest advantage is that the design process itself makes you understand what you are writing. When you need to formally express what your system is doing, then a host of implicit errors suddenly become painfully obvious. Outsiders cannot understand this at all. If you want to give someone a try on the DL, the person must have a way to verify that the design really has the properties he wants.

Fortunately, this is also extremely important for many qualifiers, so design verification is a large area of ​​research.

Model checking

As with the code, we can check the design by writing theorems. Fortunately, we have another trick: you can use the model checker (model checker). Instead of drawing up evidence that the design is correct, we just brute force the space of possible states and see if there is an incorrect state in it. If we don’t find anything, then everything is in order (note: model validation programs are also used to verify code, like JMBC, but model verification is much more often used in design verification than model verification).

There are many advantages to checking models. First, do not need to write evidence that saves a lot of time and effort. Secondly, do not need to learnwrite evidence, so the entry barrier is much lower. Third, if the design is broken, checking the model will give an explicit counterexample. It is much, much easier to correct the error, especially if you need 35 steps to play it. Try to find it yourself.

There are a couple of drawbacks. First, these tools are not as powerful. In particular, you may encounter an infinite(unbounded) model, which has an infinite number of different states. For example, you define a message queue handler: it normally works with a list of ten messages. But if you need to believe it on any list ... well, there are an infinite number of them, so there is an infinite number of states in the model. Most model validation tools have different techniques for such situations, such as determining equivalence classes or symmetries, but each case is different.

Another big drawback - the explosion in the space of states (state-space explosion). Imagine that you have three processes, each of which consists of four consecutive steps, and they can alternate steps in any way. If they do not affect each other's behavior, it turns out(4*3)! / (4!)^3 = 34 650possible performances (behaviors). If each process has one of the five initial states, then the total number of behaviors increases to 4,300,000. And the verification of the models must make sure that all of them behave well. And it provided that they do not interact with each other! If they start doing this, the state space will grow even faster. Combinatorial explosion is considered as the main problem for model checking, and a lot of work has to be done to solve this problem.

But at the same time, there is another way to cope with the explosion of the space of states: throw more equipment at it. The biggest problem for checking the model is “only” the performance problem, and we solve the performance problems very well. Most (but not all) model checks are easily parallelized. After optimizing the model and testing it with small parameters, you can deploy an AWS cluster and run it with large parameters.

In practice, many qualifiers use model checking, and then, as necessary, move to programs to prove theorems (note: keep in mind that “many specifiers” are about ten people). Even more specialists in the preparation of specifications start checking models, and when it has reached the limit of its capabilities, they switch to less intensive forms of verification.

Problem with design specifications

Thus, design verification is simpler and faster than code verification, and has shown many impressive successes. So why don't people use it? The problem with DV is much more insidious. Code verification is a technical problem, and design verification is a social problem: people just do not see the point.

This is largely due to the fact that design is not code . In most DLs, there is no automatic way to generate code, and there is no way to take existing code and check it for design (note: code generation from specifications is called synthesis ; for guidance, see Nadya Polykarpova’s work ; proof that the code conforms to the specification (or another specification) is calledclarification : in both directions are active research).

Programmers tend not to trust software artifacts that are not code or are not in sync with code forcibly. For the same reason, documentation, comments, charts, wikis and commits are often ignored.

It seems that programmers simply do not believe that there is any benefit from the specifications. In my experience, they assume that current tools (pseudocode, diagrams, TDD) are more than enough for proper design. I do not know how common this opinion is and I can not explain it except for general conservatism.

But there are exactly such complaints in every community of the methodology I know: TDD supporters complain that programmers do not want to try TDD, Haskell fans complain that programmers do not think about static typing, and so on.

I heard the argument that Agile does not accept a pre-designed design, so no one wants to make a formal specification. May be. But many of those whom I met reject both Agile and FM. Another argument is that historically formal methods are constantly revalued and did not fulfill the promise. This is also possible, but most people have not even heard of formal methods, much less about their history.

It’s just very difficult to get people to worry about what they don’t do, even if they recognize the benefits.


Code verification is a difficult task. More people are getting involved in this, although theorem-proving programs and SMT solvers are becoming more and more complex. But still in the foreseeable future, it will probably remain a lot of specialists.

Design verification is much easier, but using it requires overcoming the cultural barrier. I think the situation can be changed. Twenty years ago, automated testing and code review were rather exotic and niche topics, but eventually became mainstream. Again, contract programming was niche and remains so.

I hope this article explains a little better why formal methods are so rarely used. At least, this is a better explanation than the usual “web is not a plane” argument. Feel free to shout if you see any obvious mistakes.

Also popular now: