Learning Program-Wide Code Representations for Binary Diffing
- Author(s): Li, Xuezixiang
- Advisor(s): Yin, Heng
- et al.
Binary diffing analysis quantitatively measures the differences between two given binaries and produces fine-grained basic block matching. It has been widely used to enable different kinds of critical security analysis. However, all existing program analysis and learning-based techniques suffer from low accuracy, poor scalability, or coarse granularity, especially on COTS binaries that do not contain complete debug information. In addition, some learning-based approaches require extensive labeled training data to function, so a precisely labeled and representative dataset is needed to obtain good results. To overcome these limitations, in this paper we propose a novel learning-based code representation generation approach to solve the binary diffing problem. We rely only on code semantic information and program-wide control flow structural information to generate block embeddings, without the support of any debug information. Furthermore, we propose a k-hop greedy matching algorithm to find the optimal diffing results using the generated block representations. We implement a prototype called DeepBinDiff and evaluate its effectiveness and efficiency with a large number of binaries and real-world vulnerabilities. The results show that our tool outperforms the state-of-the-art binary diffing tools by a large margin for both cross-version and cross-optimization-level diffing. A case study on OpenSSL using real-world vulnerabilities further demonstrates the usefulness of our system.
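To make the idea of matching blocks from their embeddings more concrete, the following is a minimal sketch of one way a k-hop greedy matcher could work. It is not the DeepBinDiff implementation: the function names, the use of cosine similarity, the `threshold` parameter, the seed pairs, and the restriction to successor edges are all assumptions made for illustration. Starting from already-matched block pairs, it looks at unmatched blocks within k hops of each pair in both control flow graphs and greedily accepts the most similar candidates first.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def k_hop_neighbors(graph, start, k):
    """Blocks reachable from `start` within k successor edges (excluding `start`)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(start)
    return seen


def k_hop_greedy_match(graph_a, graph_b, emb_a, emb_b, seeds, k=2, threshold=0.6):
    """Grow a block matching outward from seed pairs (hypothetical helper).

    graph_*: dict mapping block id -> list of successor block ids.
    emb_*:   dict mapping block id -> embedding vector (list of floats).
    seeds:   list of (block_a, block_b) pairs assumed to be matched already.
    """
    matched_a = {a: b for a, b in seeds}
    matched_b = set(matched_a.values())
    worklist = deque(seeds)
    while worklist:
        a, b = worklist.popleft()
        cand_a = [n for n in k_hop_neighbors(graph_a, a, k) if n not in matched_a]
        cand_b = [n for n in k_hop_neighbors(graph_b, b, k) if n not in matched_b]
        # Score every candidate pair, then accept greedily, best first.
        scored = sorted(
            ((cosine(emb_a[x], emb_b[y]), x, y) for x in cand_a for y in cand_b),
            reverse=True,
        )
        for score, x, y in scored:
            if score < threshold:
                break  # remaining pairs are even less similar
            if x in matched_a or y in matched_b:
                continue  # one side was taken by a better pair
            matched_a[x] = y
            matched_b.add(y)
            worklist.append((x, y))
    return matched_a
```

Because each accepted pair is pushed back onto the worklist, the matching spreads through both graphs from the seeds. A greedy choice of this kind is cheap but is not guaranteed to be globally optimal.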