Legacy binaries for which source code is unavailable remain a vital part of our software ecosystem. Lifting and recompiling such binaries enables a wide range of late program transformations, such as security hardening, deobfuscation, and reoptimization, even without the source code. Existing binary lifting approaches are based on static binary disassembly, which has several limitations. Distinguishing code from data statically, for example, is undecidable in the general case. In the absence of dynamic information and high-level language semantics, static disassembly must rely on heuristics and assumptions. Consequently, it cannot reliably handle indirect jumps, inline assembly, and obfuscated code. Lifting approaches that rely on static disassembly, therefore, often produce unsound binaries.
Dynamic disassembly of binaries can circumvent the limitations of static disassembly, and it can additionally handle obfuscated and encrypted binary code. In this dissertation, we present BinRec, a new approach to heuristic-free binary recompilation that lifts dynamic traces of binaries to a compiler-level intermediate representation (IR); the lifted IR is then lowered to a "recovered" binary by taking advantage of the existing compiler toolchain. Our approach allows rich program transformations, such as compiler-based hardening and optimization passes, to be applied on top of the recovered representation. We identify and address a number of challenges in binary lifting, including unique challenges posed by our dynamic approach.
In contrast to existing frameworks, our dynamic front end can accurately disassemble and lift binaries without heuristics, and we can successfully recover all SPEC INT 2006 benchmarks, including C++ applications. We evaluate our approach in four application domains: i) binary reoptimization, ii) deobfuscation (by recovering partial program semantics from virtualization-obfuscated code), iii) binary hardening (by applying existing compiler-level passes such as AddressSanitizer and SafeStack on binary code), and iv) attack surface reduction in the recovered binary (by removing unused program paths).