MalwareHunt: semantics-based malware diffing speedup by normalized basic block memoization

2017 
The key challenge of software reverse engineering is that the source code of the program under investigation is typically not available. Identifying differences between two executable binaries (binary diffing) can reveal valuable information in the absence of source code, such as vulnerability patches, software plagiarism evidence, and malware variant relations. Recently, a new binary diffing method based on symbolic execution and constraint solving has been proposed to look for the code pairs with the same semantics, even though they are ostensibly different in syntactics. Such semantics-based method captures intrinsic differences/similarities of binary code, making it a compelling choice to analyze highly-obfuscated malicious programs. However, due to the nature of symbolic execution, semantics-based binary diffing suffers from significant performance slowdown, hindering it from analyzing large numbers of malware samples. In this paper, we attempt to mitigate the high overhead of semantics-based binary diffing with application to malware lineage inference. We first study the key obstacles that contribute to the performance bottleneck. Then we propose normalized basic block memoization to speed up semantics-based binary diffing. We introduce an union-find set structure that records semantically equivalent basic blocks. Managing the union-find structure during successive comparisons allows direct reuse of previously computed results. Moreover, we utilize a set of enhanced optimization methods to further cut down the invocation numbers of constraint solver. We have implemented our technique, called MalwareHunt, on top of a trace-oriented binary diffing tool and evaluated it on 15 polymorphic and metamorphic malware families. We perform intra-family comparisons for the purpose of malware lineage inference. Our experimental results show that MalwareHuntcan accelerate symbolic execution from 2.8X to 5.3X (with an average 4.1X), and reduce constraint solver invocation by a factor of 3.0X to 6.0X (with an average 4.5X).
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    35
    References
    2
    Citations
    NaN
    KQI
    []