An investigation of byte n-gram features for malware classification

Edward Raff, Richard Zak, Russell Cox, Jared Sylvester, Paul Yacci, Rebecca Ward, Anna Tracy, Mark McLean, Charles K. Nicholas

2018

Malware classification using machine learning algorithms is a difficult task, in part due to the absence of strong natural features in raw executable binary files. Byte n-grams have previously been used as features, but little work has been done to explain their performance or to understand what concepts are actually being learned. In contrast to other work using n-gram features, in this work we use orders of magnitude more data, and we perform feature selection during model building using Elastic-Net regularized Logistic Regression. We compute a regularization path and analyze novel multi-byte identifiers. Through this process, we discover significant, previously unreported issues with byte n-gram features that cause their benefits and practicality to be overestimated. Three primary issues emerged from our work. First, we discovered a flaw in how previous corpora were created that leads to an overestimation of classification accuracy. Second, we discovered that most of the information contained in n-grams stems from string features that could be obtained in simpler ways. Finally, we demonstrate that n-gram features promote overfitting, even with linear models and extreme regularization.
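
For readers unfamiliar with the feature type discussed in the abstract, the following is a minimal sketch of how byte n-grams can be extracted from a raw binary and turned into presence features for a linear model. It is not the authors' pipeline: the function names `byte_ngrams` and `presence_features`, the choice of `n`, the sample bytes, and the tiny vocabulary are all assumptions made for illustration.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Count every overlapping byte n-gram in `data`.

    The default n is an arbitrary illustrative value, not one
    taken from the paper.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of width n over the byte string, one byte at a time.
    # If the input is shorter than n, the range is empty and so is the result.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def presence_features(data: bytes, vocabulary: list, n: int = 4) -> list:
    """Return a 0/1 vector: 1 if the vocabulary n-gram occurs in `data`."""
    grams = byte_ngrams(data, n)
    return [1 if gram in grams else 0 for gram in vocabulary]


# Hypothetical input: a few header-like bytes followed by a familiar string.
sample = b"MZ\x90\x00This program cannot be run in DOS mode"
counts = byte_ngrams(sample, n=4)

# Hypothetical vocabulary; in practice it would be selected from a corpus.
vocab = [b"prog", b"DOS ", b"\xde\xad\xbe\xef"]
features = presence_features(sample, vocab, n=4)
```

Note that two of the three vocabulary entries here are fragments of a printable string, which is the kind of n-gram the abstract says carries most of the information and could be obtained more simply by extracting strings directly.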
