Fuzzing Like It’s 1989

December 31, 2018

Page content

With 2019 a day away, let’s reflect on the past to see how we can improve. Yes, let’s take a long look back 30 years and reflect on the original fuzzing paper, An Empirical Study of the Reliability of UNIX Utilities, and its 1995 follow-up, Fuzz Revisited, by Barton P. Miller.

In this blog post, we are going to find bugs in modern versions of Ubuntu Linux using the exact same tools as described in the original fuzzing papers. You should read the original papers not only for context, but for their insight. They proved to be very prescient about the vulnerabilities and exploits that would plague code over the decade following their publication. Astute readers may notice the publication date for the original paper is 1990. Even more perceptive readers will observe the copyright date of the source code comments: 1989.

A Quick Review

For those of you who didn’t read the papers (you really should), this section provides a quick summary and some choice quotes.

The fuzz program works by generating random character streams, with the option to generate only printable, control, or non-printable characters. The program uses a seed to generate reproducible results, which is a useful feature modern fuzzers often lack. A set of scripts execute target programs and check for core dumps. Program hangs are detected manually. Adapters provide random input to interactive programs (1990 paper), network services (1995 paper), and graphical X programs (1995 paper).

The 1990 paper tests four different processor architectures (i386, CVAX, Sparc, 68020) and five operating systems (4.3BSD, SunOS, AIX, Xenix, Dynix). The 1995 paper has similar platform diversity. In the first paper, 25-33% of utilities fail, depending on the platform. In the 1995 follow-on, the numbers range from 9%-33%, with GNU (on SunOS) and Linux being by far the least likely to crash.

The 1990 paper concludes that (1) programmers do not check array bounds or error codes, (2) macros make code hard to read and debug, and (3) C is very unsafe. The extremely unsafe gets function and C’s type system receive special mention. During testing, the authors discover format string vulnerabilities years before their widespread exploitation (see page 15). The paper concludes with a user survey asking about how often users fix or report bugs. Turns out reporting bugs was hard and there was little interest in fixing them.

The 1995 paper mentions open source software and includes a discussion of why it may have fewer bugs. It also contains this choice quote:

When we examined the bugs that caused the failures, a distressing phenomenon emerged: many of the bugs discovered (approximately 40%) and reported in 1990 are still present in their exact form in 1995. …
The techniques used in this study are simple and mostly automatic. It is difficult to understand why a vendor would not partake of a free and easy source of reliability improvements.

It would take another 15-20 years for fuzz testing to become standard practice at large software development shops.

I also found this statement, written in 1990 to be prescient of things to come:

Often the terseness of the C programming style is carried to extremes; form is emphasized over correct function. The ability to overflow an input buffer is also a potential security hole, as shown by the recent Internet worm.

Testing Methodology

Thankfully, after 30 years, Dr. Barton still provides full source code, scripts, and data to reproduce his results, which is a commendable goal that more researchers should emulate. The scripts and fuzzing code have aged surprisingly well. The scripts work as is, and the fuzz tool required only minor changes to compile and run.

For these tests, we used the scripts and data found in the fuzz-1995-basic repository, because it includes the most modern list of applications to test. As per the top-level README, these are the same random inputs used for the original fuzzing tests. The results presented below for modern Linux used the exact same code and data as the original papers. The only thing changed is the master command list to reflect modern Linux utilities.

Updates for 30 Years of New Software

Obviously there have been some changes in Linux software packages in the past 30 years, although quite a few tested utilities still trace their lineage back several decades. Modern versions of the same software audited in the 1995 paper were tested, where possible. Some software was no longer available and had to be replaced. The justification for each replacement is as follows:

cfe ⇨ cc1: This is a C preprocessor and equivalent to the one used in the 1995 paper.
dbx ⇨ gdb: This is a debugger, an equivalence to that used in the 1995 paper.
ditroff ⇨ groff: ditroff is no longer available.
dtbl ⇨ gtbl: A GNU Troff equivalent of the old dtbl utility.
lisp ⇨ clisp: A common lisp implementation.
more ⇨ less: Less is more!
prolog ⇨ swipl: There were two choices for prolog: SWI Prolog and GNU Prolog. SWI Prolog won out because it is an older and a more comprehensive implementation.
awk ⇨ gawk: The GNU version of awk.
cc ⇨ gcc: The default C compiler.
compress ⇨ gzip: GZip is the spiritual successor of old Unix compress.
lint ⇨ splint: A GPL-licensed rewrite of lint.
/bin/mail ⇨ /usr/bin/mail: This should be an equivalent utility at a different path.
f77 ⇨ fort77: There were two possible choices for a Fortan77 compiler: GNU Fortran and Fort77. GNU Fortran is recommended for Fortran 90, while Fort77 is recommended for Fortran77 support. The f2c program is actively maintained and the changelog records entries date back to 1989.

Results

The fuzzing methods of 1989 still find bugs in 2018. There has, however, been progress.

Measuring progress requires a baseline, and fortunately, there is a baseline for Linux utilities. While the original fuzzing paper from 1990 predates Linux, the 1995 re-test uses the same code to fuzz Linux utilities on the 1995 Slackware 2.1.0 distribution. The relevant results appear on Table 3 of the 1995 paper (pages 7-9). GNU/Linux held up very well against commercial competitors:

The failure rate of the utilities on the freely-distributed Linux version of UNIX was second-lowest at 9%.

Let’s examine how the Linux utilities of 2018 compare to the Linux utilities of 1995 using the fuzzing tools of 1989:

	Ubuntu 18.10 (2018)	Ubuntu 18.04 (2018)	Ubuntu 16.04 (2016)	Ubuntu 14.04 (2014)	Slackware 2.1.0 (1995)
Crashes	1 (f77)	1 (f77)	2 (f77, ul)	2 (swipl, f77)	4 (ul, flex, indent, gdb)
Hangs	1 (spell)	1 (spell)	1 (spell)	2 (spell, units)	1 (ctags)
Total Tested	81	81	81	81	55
Crash/Hang %	2%	2%	4%	5%	9%

Amazingly, the Linux crash and hang count is still not zero, even for the latest Ubuntu release. The f2c program called by f77 triggers a segmentation fault, and the spell program hangs on two of the test inputs.

What Are The Bugs?

There are few enough bugs that I could manually investigate the root cause of some issues. Some results, like a bug in glibc, were surprising while others, like an sprintf into a fixed-sized buffer, were predictable.

The ul crash

The bug in ul is actually a bug in glibc. Specifically, it is an issue reported in this glibc bug report and this debian-glibc mailing list post (another person triggered it in ul) in 2016. According to the bug tracker it is still unfixed. Since the issue cannot be triggered on Ubuntu 18.04 and newer, the bug has been fixed at the distribution level. From the bug tracker comments, the core issue could be very serious.

f77 crash

The f77 program is provided by the fort77 package, which itself is a wrapper script around f2c, a Fortran77-to-C source translator. Debugging f2c reveals the crash is in the errstr function when printing an overly long error message. The f2c source reveals that it uses sprintf to write a variable length string into a fixed sized buffer:

errstr(const char *s, const char *t)
#endif
{
  char buff[100];
  sprintf(buff, s, t);
  err(buff);
}

This issue looks like it’s been a part of f2c since inception. The f2c program has existed since at least 1989, per the changelog. A Fortran77 compiler was not tested on Linux in the 1995 fuzzing re-test, but had it been, this issue would have been found earlier.

The spell Hang

This is a great example of a classical deadlock. The spell program delegates spell checking to the ispell program via a pipe. The spell program reads text line by line and issues a blocking write of line size to ispell. The ispell program, however, will read at most BUFSIZ/2 bytes at a time (4096 bytes on my system) and issue a blocking write to ensure the client received spelling data processed thus far. Two different test inputs cause spell to write a line of more than 4096 characters to ispell, causing a deadlock: spell waits for ispell to read the whole line, while ispell waits for spell to acknowledge that it read the initial corrections.

The units Hang

Upon initial examination this appears to be an infinite loop condition. The hang looks to be in libreadline and not units, although newer versions of units do not suffer from the bug. The changelog indicates some input filtering was added, which may have inadvertently fixed this issue. While a thorough investigation of the cause and correction was out of scope for this blog post, there may still be a way to supply hanging input to libreadline.

The swipl Crash

For completeness I wanted to include the swipl crash. However, I did not investigate it thoroughly, as the crash has been long-fixed and looks fairly benign. The crash is actually an assertion (i.e. a thing that should never occur has happened) triggered during character conversion:

[Thread 1] pl-fli.c:2495: codeToAtom: Assertion failed: chrcode >= 0
C-stack trace labeled "crash":
  [0] __assert_fail+0x41
  [1] PL_put_term+0x18e
  [2] PL_unify_text+0x1c4
…

It is never good when an application crashes, but at least in this case the program can tell something is amiss, and it fails early and loudly.

Conclusion

Fuzzing has been a simple and reliable way to find bugs in programs for the last 30 years. While fuzzing research is advancing rapidly, even the simplest attempts that reuse 30-year-old code are successful at identifying bugs in modern Linux utilities.

The original fuzzing papers do a great job at foretelling the dangers of C and the security issues it would cause for decades. They argue convincingly that C makes it too easy to write unsafe code and should be avoided if possible. More directly, the papers show that even naive fuzz testing still exposes bugs, and such testing should be incorporated as a standard software development practice. Sadly, this advice was not followed for decades.

I hope you have enjoyed this 30-year retrospective. Be on the lookout for the next installment of this series: Fuzzing In The Year 2000, which will investigate how Windows 10 applications compare against their Windows NT/2000 equivalents when faced with a Windows message fuzzer. I think that you can already guess the answer.