Identifying Program Entropy Characteristics with Symbolic Execution
The security infrastructure underpinning our society relies on encryption, which relies on the correct generation and use of pseudorandom data. Unfortunately, random data is deceptively hard to generate. Implementation problems in PRNGs and the incorrect usage of generated random data in cryptographic algorithms have led to many issues, including the infamous Debian OpenSSL bug, which exposed millions of systems on the internet to potential compromise due to a mistake that limited the source of randomness during key generation to have 2^15 different seeds (i.e. 15 bits of entropy).
It is important to automatically identify if a given program applies a certain cryptographic algorithm or uses its random data correctly.
This paper tackles the very first step of this problem by extracting an understanding of how a binary program generates or uses randomness. Specifically, we set the following problem: given a program (or a specific function), can we estimate bounds on the amount of randomness present in the program or function's output by determining bounds on the entropy of this output data? Our technique estimates upper bounds on the entropy of program output through a process of expression reinterpretation and stochastic probability estimation, related to abstract interpretation and model counting.