Fuzzing like it's 1989

Original author: Artem Dinaburg
With 2019 upon us, it is worth remembering the past and thinking about the future. Let's look back 30 years and reflect on the first scientific papers on fuzzing: "An Empirical Study of the Reliability of UNIX Utilities" and its 1995 follow-up "Fuzz Revisited", both by Barton Miller.

In this article we will try to find bugs in modern versions of Ubuntu Linux using the same tools as in the original fuzzing papers. You should read the original papers not only for context but for their insight: they turned out to be remarkably prophetic about the vulnerabilities and exploits of the decades that followed. Attentive readers may notice the publication date of the original paper: 1990. Even more attentive readers will notice the copyright date in the source code comments: 1989.

A brief overview


For those who have not read the papers (though you really should), this section gives a brief summary and some choice quotes.

The fuzzing program generates random streams of characters, with an option to emit only printable or only non-printable characters. It uses an initial seed value to provide reproducible results, something modern fuzzers often lack. A set of scripts runs the programs under test and checks for core dumps. Hangs are detected manually. Adapters provide random input to interactive programs (the 1990 paper), network services (1995), and graphical X applications (1995).
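As a rough sketch of the idea (my own illustration, not the original fuzz source, whose names and options differ), such a generator might look like this:

/* Minimal sketch of a 1989-style fuzzer: NOT the original fuzz tool,
 * just an illustration. It emits a fixed number of random bytes from
 * a seeded PRNG, optionally restricted to printable characters, so a
 * given seed always reproduces the same stream. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    unsigned seed = (argc > 1) ? (unsigned)atoi(argv[1]) : 1;
    int printable_only = (argc > 2);    /* any extra arg: printable output */

    srand(seed);                        /* fixed seed => reproducible runs */
    for (int i = 0; i < 100000; i++) {
        int c = printable_only ? ' ' + rand() % 95  /* 0x20..0x7e */
                               : rand() % 256;      /* any byte value */
        putchar(c);
    }
    return 0;
}

Piping its output into a target (for example, ./fuzz 42 | bc) yields the same byte stream on every run with seed 42, so any crash it provokes can be replayed.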

The 1990 paper tests four processor architectures (i386, CVAX, Sparc, 68020) and five operating systems (4.3BSD, SunOS, AIX, Xenix, Dynix). The 1995 paper has a similar selection of platforms. The first paper manages to crash 25-33% of utilities, depending on the platform. In the follow-up paper the figures range from 9% to 33%, with GNU (on SunOS) and Linux having the lowest failure rates.

The 1990 paper concludes that 1) programmers do not check array bounds or error codes, 2) macros make code hard to read and debug, and 3) C is very unsafe. The extremely unsafe gets function and C's type system receive special mention. During testing, the authors found format string vulnerabilities years before their mass exploitation. The paper ends with a user survey asking how often users fix or report bugs. It turned out that reporting bugs was difficult and there was little interest in fixing them.
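To make these conclusions concrete, here is a contrived program of my own (not from the papers) that combines both classic bug classes that random input quickly triggers:

/* Contrived example (mine, not from the papers) of two classic C
 * hazards. gets() performs no bounds check, so input longer than 63
 * bytes overflows buf; echoing user text as a format string lets
 * stray %s or %n directives read or write memory. */
#include <stdio.h>

int main(void)
{
    char buf[64];
    gets(buf);      /* unbounded read: removed from C11 for this reason */
    printf(buf);    /* format string bug: should be printf("%s", buf) */
    return 0;
}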

The 1995 paper mentions open source software and discusses why it has fewer bugs. Quote:

When we investigated the causes of these failures, an alarming phenomenon emerged: many of the bugs (about 40%) reported in 1990 were still present in their exact form in 1995. ... The techniques used here are simple and mostly automatic. It is hard to understand why developers would not use this easy and free source of reliability improvement.

It would take another 15-20 years before fuzzing became standard practice at large vendors.

This statement from 1990 also seems to me to foresee future events:

Often the brevity of the C programming style is taken to an extreme, with form favored over correct function. The ability to overflow an input buffer is a potential security hole, as shown by the recent Internet worm.

Testing Methodology


Fortunately, 30 years later, Dr. Miller still provides the full source code, scripts, and data to reproduce his results: a good example for other researchers to follow. The scripts work without problems, and the fuzzing tool required only minor changes to compile and run.

For these tests we used the scripts and input data from the fuzz-1995-basic repository, because it has the most recent list of tested applications. According to the README, these are the same random inputs as in the original study. The results below for modern Linux were obtained with exactly the same fuzzing code and input data as in the original papers. Only the list of utilities to test has changed.
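The driver scripts essentially run each utility on a fuzzed input file and record which ones dump core or hang. A condensed C sketch of that loop (my own illustration; the actual scripts are shell-based and differ in detail):

/* Sketch of what the driver scripts do: run one target on one fuzzed
 * input file and classify the outcome. This is an illustration, not
 * the original shell scripts. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <program> <input-file>\n", argv[0]);
        return 2;
    }
    pid_t pid = fork();
    if (pid == 0) {                        /* child: stdin <- fuzzed file */
        if (!freopen(argv[2], "r", stdin))
            _exit(127);
        execlp(argv[1], argv[1], (char *)NULL);
        _exit(127);                        /* exec failed */
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))               /* killed by a signal, e.g. SIGSEGV */
        printf("%s: CRASH, signal %d\n", argv[1], WTERMSIG(status));
    else
        printf("%s: exit %d\n", argv[1], WEXITSTATUS(status));
    return 0;
}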

How utilities have changed in 30 years


Obviously, Linux software packages have changed over the past 30 years, although quite a few battle-tested utilities have kept their lineage going for decades. Where possible, we took modern versions of the same programs from the 1995 paper. Some programs are no longer available, and we replaced them. The rationale for each replacement:

  • cfe → cc1: Equivalent of the C preprocessor from the 1995 paper.
  • dbx → gdb: Equivalent of the 1995 debugger.
  • ditroff → groff: ditroff is no longer available.
  • dtbl → gtbl: The GNU Troff equivalent of the old dtbl utility.
  • lisp → clisp: A standard lisp implementation.
  • more → less: Less is more!
  • prolog → swipl: There are two prolog options: SWI Prolog and GNU Prolog. SWI Prolog is preferable because it is the older and more complete implementation.
  • awk → gawk: The GNU version of awk.
  • cc → gcc: The standard C compiler.
  • compress → gzip: GZip is the ideological heir of the old Unix compress utility.
  • lint → splint: A lint rewritten under the GPL license.
  • /bin/mail → /usr/bin/mail: The equivalent utility at a different path.
  • f77 → fort77: There are two Fortran77 compiler options: GNU Fortran and Fort77. The former is recommended for Fortran 90, the latter for Fortran77 support. The underlying f2c program is actively maintained, with a changelog going back to 1989.

Results


The fuzzing technique of 1989 still finds bugs in 2018. There has, however, been some progress.

To measure progress, we need a baseline. Fortunately, one exists for Linux utilities. Linux did not yet exist at the time of the original 1990 paper, but the 1995 retest ran the same fuzzing code against utilities from the Slackware 2.1.0 distribution of 1995. The corresponding results appear in Table 3 of the 1995 paper (pp. 7-9). Compared to its commercial rivals, GNU/Linux looked very good:

The failure rate of the utilities on the freely distributed Linux version of UNIX was the second lowest, at 9%.

So let's compare the Linux utilities of 1995 and 2018 using the fuzzing tools of 1989:

Distribution             Crashes                     Hangs              Tested  Crash/hang rate
Ubuntu 18.10 (2018)      1 (f77)                     1 (spell)          81      2%
Ubuntu 18.04 (2018)      1 (f77)                     1 (spell)          81      2%
Ubuntu 16.04 (2016)      2 (f77, ul)                 1 (spell)          81      4%
Ubuntu 14.04 (2014)      2 (swipl, f77)              2 (spell, units)   81      5%
Slackware 2.1.0 (1995)   4 (ul, flex, indent, gdb)   1 (ctags)          55      9%

Surprisingly, the number of crashes and hangs on Linux is still non-zero, even on the latest Ubuntu: f77 causes f2c to crash with a segmentation fault, and spell hangs on two variants of the test input.

What are the bugs?


I was able to manually track down the root cause of some of the bugs. Some results, such as the bug in glibc, were unexpected; others, such as an sprintf into a fixed-size buffer, were predictable.

The ul crash


The error in ul is actually a bug in glibc. In particular, it was reported here and here (someone else found it via ul) in 2016. According to the bug tracker, it is still unfixed. Since the bug cannot be reproduced on Ubuntu 18.04 and newer, it has apparently been patched at the distribution level. Judging by the comments in the bug tracker, the underlying problem may be quite serious.

The f77 crash


The f77 program comes in the fort77 package and is itself a wrapper script around f2c, a translator from Fortran77 source code to C. Debugging f2c shows that the crash occurs when the errstr function prints an overly long error message. The f2c source code shows that it uses sprintf to write a variable-length string into a fixed-size buffer:

void
errstr(const char *s, const char *t)
{
  char buff[100];
  /* s is a printf-style format and t its argument; there is no length
     check, so a long message overflows the 100-byte buffer */
  sprintf(buff, s, t);
  err(buff);
}

This code appears to have been in f2c since its inception: the program's changelog goes back to at least 1989. The 1995 retest did not fuzz a Fortran77 compiler, otherwise the problem might have been found earlier.
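A minimal fix sketch, assuming callers can tolerate a truncated message, is to bound the write:

/* Hypothetical fix sketch (not from f2c): snprintf bounds the write,
 * so an overlong message is truncated instead of smashing the stack. */
void
errstr(const char *s, const char *t)
{
  char buff[100];
  snprintf(buff, sizeof buff, s, t);
  err(buff);
}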

The spell hang


A great example of a classic deadlock. The spell program delegates spell checking to the ispell program via a pipe. spell reads text line by line and issues a blocking write of the whole line to ispell. ispell, however, reads at most BUFSIZ/2 bytes at a time (4096 bytes on my system) and issues a blocking write to make sure the client has received the spelling data processed so far. Two different test inputs caused spell to write a line of more than 4096 characters to ispell, leading to deadlock: spell waits for ispell to read the whole line, while ispell waits for spell to read the initial spelling corrections.
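The pattern is easy to reproduce in miniature. In this sketch of mine (not the spell/ispell source), both processes perform a blocking write larger than the pipe buffer before either reads, so both hang forever:

/* Minimal demonstration of the deadlock pattern, not spell/ispell
 * code: each side does a blocking write larger than the pipe
 * capacity while the other side is also blocked writing, so neither
 * ever reads. Run it and both processes hang. */
#include <string.h>
#include <unistd.h>

#define BIG (1 << 20)  /* far larger than a typical 64 KiB pipe buffer */

int main(void)
{
    int to_child[2], to_parent[2];
    static char blob[BIG];
    memset(blob, 'a', sizeof blob);

    pipe(to_child);
    pipe(to_parent);

    if (fork() == 0) {
        /* child ("ispell"): replies before draining its input */
        write(to_parent[1], blob, sizeof blob);  /* blocks: parent isn't reading */
        read(to_child[0], blob, sizeof blob);
        _exit(0);
    }
    /* parent ("spell"): sends a huge line before reading the reply */
    write(to_child[1], blob, sizeof blob);       /* blocks: child isn't reading */
    read(to_parent[0], blob, sizeof blob);
    return 0;
}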

The units hang


At first glance, this looks like an infinite loop condition. The hang appears to be in libreadline rather than in units, although newer versions of units do not suffer from this bug. The changelog indicates that input filtering was added, which may have accidentally fixed the problem. A thorough investigation of the root cause is beyond the scope of this post, though it may still be possible to hang libreadline.

The swipl crash


For completeness, I want to mention the swipl crash, although I did not study it carefully, since the bug was fixed long ago and the fix seems solid. The crash is actually a failed assertion (that is, a check for something that should never happen) triggered during character conversion. An abnormal termination is always bad, but here at least the program can report an error, failing early and loudly:

[Thread 1] pl-fli.c:2495: codeToAtom: Assertion failed: chrcode >= 0
C-stack trace labeled "crash":
[0] __assert_fail+0x41
[1] PL_put_term+0x18e
[2] PL_unify_text+0x1c4
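
In C, this fail-early pattern is what assert provides; a generic illustration of mine (not SWI-Prolog's code):

/* Generic fail-early illustration, not SWI-Prolog code: assert()
 * documents an invariant and aborts with file/line info the moment
 * it is violated, instead of continuing with corrupted state. */
#include <assert.h>

static int code_to_atom(int chrcode)
{
    assert(chrcode >= 0);    /* "should never happen" */
    return chrcode;          /* placeholder for the real conversion */
}

int main(void)
{
    return code_to_atom(-1); /* aborts: Assertion `chrcode >= 0' failed */
}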




Conclusion


For the past 30 years, fuzzing has remained a simple and reliable way to find bugs. Although active research continues in this area, even a 30-year-old fuzzer successfully finds bugs in modern Linux utilities. The author of the original papers predicted the security problems that C would cause in the decades to come. He argues convincingly that it is too easy to write unsafe code in C and that C should be avoided where possible. In particular, the papers demonstrate that bugs appear even under the simplest fuzzing, and that such testing should be part of standard software development practice. Unfortunately, this advice went unheeded for decades.



I hope you enjoyed this 30-year retrospective. Look out for the next article, Fuzzing in the Year 2000, where we examine how robust Windows 10 applications are compared to their Windows NT/2000 equivalents when tested with a fuzzer. I think the answer is predictable.
