NLabel: An Accurate Familial Clustering Framework for Large-scale Weakly-labeled Malware

2020 
Automatic family labeling for malware is in demand, especially for today's malware scale. While business Anti-Virus engines provide an efficient family labeling method, the raw labels tend to be inconsistent. Prior works mitigate such inconsistency by detecting the aliases and majority voting to obtain the final family label. However, these methods solve the inconsistency in a coarse-grained and vulnerable manner, and the obtained family label is inaccurate sometimes. In this work, we propose NLabel to conduct familial clustering based on AV engines' raw labels. On the one hand, NLabel uses word embedding techniques to capture the similarity among raw labels, transform the inconsistent labels of the same family into similar semantic representations, and mitigate the inconsistency at finer granularity. On the other hand, we propose a hierarchical family clustering method to boost the performance of large-scale data sets. Experimental results show that our method outperforms the SOTA.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    0
    References
    0
    Citations
    NaN
    KQI
    []