Reliable software is vital to society. Much effort has been spent on ensuring the robustness and reliability of software, including unit testing, model checking, and static analysis. However, these approaches do not scale well.
Greybox fuzzing can test software with little or no human intervention. A greybox fuzzer uses a mutator to automatically generate inputs that test the program. Unlike a random input generator, a greybox fuzzer also monitors the program's behavior to determine whether a generated input triggers new behavior. Inputs that trigger new behaviors are saved for future mutation. This monitoring is simple yet effective in practice. As a result, much work has focused on different parts of the fuzzer to improve its overall performance and broaden its applications.
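As a rough sketch of this loop (not the exact design of any fuzzer discussed in this thesis), the following Python pseudocode keeps an input only when it exercises behavior not seen before; `run_with_coverage` and `mutate` are hypothetical callables standing in for the instrumented target and the mutator.

```python
import random

def greybox_fuzz(seeds, run_with_coverage, mutate, iterations=100_000):
    """Minimal coverage-guided fuzzing loop (illustrative sketch).

    run_with_coverage(data) is assumed to execute the target and return
    the set of behaviors (e.g., branches) the input exercised;
    mutate(data) is assumed to return a randomly mutated copy.
    """
    corpus = list(seeds)                      # inputs worth mutating again
    seen = set()                              # behaviors observed so far
    for data in corpus:                       # record what the seeds cover
        seen |= run_with_coverage(data)

    for _ in range(iterations):
        parent = random.choice(corpus)        # pick a saved input
        child = mutate(parent)                # derive a new input from it
        behaviors = run_with_coverage(child)  # lightweight monitoring
        if behaviors - seen:                  # did it trigger new behavior?
            corpus.append(child)              # keep it for future mutation
            seen |= behaviors
    return corpus
```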
Despite its popularity, some aspects of greybox fuzzing and its applications have not been thoroughly studied. In this thesis, we cover three such aspects. First, many fuzzers aim to increase branch coverage. However, high branch coverage is a necessary but not a sufficient condition for triggering bugs. We revisit several design choices in the fuzzing process to increase the likelihood of finding bugs. We first design a tool called Integrity. Integrity sanitizes integer operations within the program, whose errors are harder to spot than memory errors. Integrity has discovered eight new integer errors in open-source programs. While randomized fuzzers excel at increasing branch coverage, they struggle to solve the predicates that Integrity introduces. To trigger bugs more effectively, we propose a deterministic fuzzer, Valkyrie. Valkyrie uses principled approaches, such as gradient descent and compressed branch coverage, to eliminate the randomness in fuzzers while increasing throughput. Our evaluation shows that Valkyrie can find bugs faster than the state of the art in many cases.
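To illustrate why such bugs are hard for randomized fuzzers, consider the kind of check an integer sanitizer conceptually inserts before a signed 32-bit addition. The function below is only an illustrative sketch, not Integrity's actual instrumentation.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def checked_add_i32(a: int, b: int) -> int:
    """Report signed 32-bit overflow for a + b (illustrative sketch).

    This mirrors the kind of check an integer sanitizer conceptually
    inserts before a 32-bit addition in the target program.
    """
    result = a + b                            # exact result in Python
    if result < INT32_MIN or result > INT32_MAX:
        # The bug fires only for operands in a narrow range, e.g.
        # a = 2**31 - 1 and b = 1 -- a predicate that random mutation
        # rarely satisfies by chance.
        raise OverflowError(f"signed 32-bit overflow: {a} + {b} = {result}")
    return result
```

Predicates of this shape are exactly what a deterministic, gradient-guided search such as Valkyrie's aims to satisfy more reliably than random mutation.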
Second, generic fuzzing is often less effective than specialized fuzzing. By incorporating expert knowledge into the fuzzer, a specialized fuzzer can reach deeply nested code more quickly. We select the LLVM backend as a test bed to see whether a specialized strategy can find bugs in compilers. We develop IRFuzzer, whose mutation and monitoring methods are customized for the LLVM backend. We model the LLVM intermediate representation (IR) so that IRFuzzer is guaranteed to generate valid inputs for the LLVM backend. IRFuzzer monitors matcher table coverage to track “behavior” in a more fine-grained manner with little overhead. IRFuzzer has found 78 new bugs in upstream LLVM, 57 of which have been fixed and five of which have been backported to LLVM 15. These findings demonstrate that specialized fuzzing provides useful, actionable insights to LLVM developers.
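IRFuzzer itself is implemented against LLVM's C++ APIs; purely to illustrate why constructing IR through a typed API (rather than mutating raw bytes or text) guarantees structural validity, the sketch below uses the llvmlite Python bindings, which are not part of IRFuzzer, to build a small well-formed function.

```python
from llvmlite import ir   # llvmlite is used here only for illustration

# Building IR through a typed API keeps every generated case well formed:
# each value has a type and each basic block ends with a terminator, so
# the backend never rejects the input as syntactically invalid.
module = ir.Module(name="fuzz_case")
i32 = ir.IntType(32)
fn = ir.Function(module, ir.FunctionType(i32, (i32, i32)), name="candidate")
builder = ir.IRBuilder(fn.append_basic_block(name="entry"))

a, b = fn.args
total = builder.add(a, b, name="sum")
product = builder.mul(total, b, name="prod")
builder.ret(product)

print(module)   # textual IR that can be handed to the backend (e.g., llc)
```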
Finally, fuzzers generate large quantities of inputs as a byproduct, and these inputs are often discarded after fuzzing completes. These inputs trigger different behaviors of the program. We observe that these behaviors can be valuable for training large language models (LLMs). With this observation, we propose using source code coupled with its test cases for LLM training, where each test case consists of a fuzzer-generated input and its corresponding output. We first build a dataset on top of an existing one by pairing programs with test cases. Then, we develop methods to fine-tune a trained model and to pretrain a new model on this dataset. With this new training scheme, we contribute a new code understanding model, FuzzPretrain. Our evaluation shows that FuzzPretrain yields more than 6% and 19% mean average precision (mAP) improvements on code search over baselines trained with only source code or abstract syntax trees (ASTs), respectively.
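A minimal sketch of the pairing step might look like the following, where `run_program` is a hypothetical helper that executes the program on one fuzzer-generated input, and the record layout is illustrative rather than FuzzPretrain's exact format.

```python
def build_training_record(source_code, fuzz_inputs, run_program):
    """Pair a program with its fuzzer-generated test cases (illustrative).

    run_program(source_code, data) is assumed to compile/execute the
    program on one input and return the observed output; the real
    FuzzPretrain record layout may differ.
    """
    test_cases = [
        {"input": data, "output": run_program(source_code, data)}
        for data in fuzz_inputs
    ]
    # Each record can then be serialized (e.g., one JSON object per line)
    # and mixed into the fine-tuning or pretraining corpus.
    return {"code": source_code, "test_cases": test_cases}
```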