Exploring the execution space is essential to many program analysis tasks such as finding vulnerabilities in the program under test. Starting from a corpus of initial inputs (a.k.a. seeds), automated test generation aims to find more and more inputs (or testcases) that exercise new program states, and hopefully, some inputs will reach interesting states, e.g., triggering a vulnerability. Based on the observation that under-tested code is more likely to have bugs, coverage-guided testing, which tries to maximize the code coverage, works very well in practice.
Notably, coverage-guided testing usually involves three stages: seed selection, seed schedule, and new input generation.This dissertation addresses three core problems in each stage and advances the state-of-the-art on automated test generation.
First, we conduct the first systematic study on the impact of coverage metrics. In particular, we formally define and discuss the concept of sensitivity to compare different coverage metrics. We show that certain program states (e.g., vulnerabilities) cannot be reached without enough sensitivity. We then selectively present several metrics with different sensitivities, and evaluate them on a large set of programs. Results show that each metric has its unique merit in terms of vulnerability finding. However, there is no grand slam one that defeats all the others when the computational resources are limited. Specifically, a high sensitive coverage metric can select too many seeds that overwhelm the current scheduling algorithms and lead to poor performance.
Second, we aim to address the seed explosion problem caused by a sensitive coverage metric. To this end, we model this problem as a trade-off between exploration and exploitation. Then, we design a novel multi-level coverage metric that incorporates sensitive coverage metrics in a novel way. Combined with a reinforcement-learning-based hierarchical seed scheduler, our approach not only can trigger more bugs and achieve higher code coverage, but also can achieve the same coverage faster than existing approaches. However, we also discover that with a large amount of seeds, the input generation stage will become a bottleneck. That is, the likelihood of a (randomly generated) input being selected as a new seed decreases when the input size and the corpus size increase.
One way to improve the efficiency of input generation is to replace random mutation with path constraints solving, e.g., by leveraging concolic execution (CE). Ideally, each input generated by a concolic execution engine should visit a new state and be selected as a new seed. However, our study finds that a considerable amount (as high as 50\%) inputs actually failed to reach new states thus are not selected, especially when handling format inputs. To address this problem, in the last piece of this dissertation, we propose format-aware solving that leverages path constraints to infer input format information, and leverages inferred format information to guide the constructing and solving of path constraints. Evaluation shows that our approach can negate significantly more branches, lead to deeper new paths, and cover more code.