SAMPLES: Self Adaptive Mining of Persistent LExical Snippets for Classifying Mobile Application Traffic

2015 
We present SAMPLES (Self Adaptive Mining of Persistent LExical Snippets), a systematic framework for classifying network traffic generated by mobile applications. SAMPLES constructs conjunctive rules automatically, through a supervised methodology applied to a set of labeled flows (the training set). Each conjunctive rule corresponds to the lexical context associated with an application identifier found in a snippet of the HTTP header, and is defined by: (a) the identifier type, (b) the HTTP header field it occurs in, and (c) the prefix/suffix surrounding its occurrence. These conjunctive rules then undergo an aggregate-and-validate step that improves accuracy and determines a priority order. The refined rule set is loaded into an application-identification engine, which operates at per-flow granularity in an extract-and-lookup paradigm to identify the application responsible for a given flow. SAMPLES can thus facilitate important network measurement and management tasks (e.g., behavioral profiling [29] and application-level firewalls [21,22]) that require a more detailed view of the underlying traffic than traditional protocol/port-based methods afford. We evaluate SAMPLES on a test set of approximately 15 million flows generated by over 700K applications from the Android, iOS and Nokia marketplaces. SAMPLES successfully identifies over 90% of these applications with 99% accuracy on average, even though fewer than 2% of the applications are required during the training phase for each of the three marketplaces. This is a testament to the universality and scalability of our approach. We therefore expect SAMPLES to work with reasonable coverage and accuracy on other mobile platforms as well, e.g., BlackBerry and Windows Mobile.
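The extract-and-lookup idea described above can be illustrated with a minimal sketch. This is not the SAMPLES implementation: the `Rule` class, the `identify` function, the identifier-type labels, the sample headers, and the application names are all hypothetical, and the assumption that a lower `priority` value means "tried first" is ours. Each rule names an identifier type, an HTTP header field, and the literal prefix/suffix around the identifier; the engine tries rules in priority order, extracts a candidate token, and looks it up in a table of known identifiers.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical conjunctive rule: all three conditions must hold."""
    id_type: str   # kind of identifier (illustrative label, e.g. "app_name")
    field: str     # HTTP header field the identifier occurs in
    prefix: str    # literal text immediately before the identifier
    suffix: str    # literal text immediately after it ("" = end of value)
    priority: int  # assumed ordering: lower value is tried first

    def extract(self, headers: Dict[str, str]) -> Optional[str]:
        value = headers.get(self.field)
        if value is None:
            return None
        # An empty suffix means the identifier runs to the end of the value.
        tail = re.escape(self.suffix) if self.suffix else r"$"
        match = re.search(re.escape(self.prefix) + r"(.+?)" + tail, value)
        return match.group(1) if match else None


def identify(headers: Dict[str, str],
             rules: List[Rule],
             lookup: Dict[Tuple[str, str], str]) -> Optional[str]:
    """Return the application for one flow, or None if no rule matches."""
    for rule in sorted(rules, key=lambda r: r.priority):
        token = rule.extract(headers)
        if token is None:
            continue
        app = lookup.get((rule.id_type, token))
        if app is not None:
            return app
    return None


# Illustrative data only; none of these values come from the paper.
rules = [
    Rule("app_id", "Referer", "app_id=", "&", priority=0),
    Rule("app_name", "User-Agent", "", "/", priority=1),
]
lookup = {
    ("app_id", "12345"): "com.example.game",
    ("app_name", "WeatherNow"): "com.example.weathernow",
}

flow_a = {"User-Agent": "WeatherNow/2.1 CFNetwork/711.1.12 Darwin/14.0.0"}
flow_b = {"Referer": "http://ads.example.com/?app_id=12345&x=1"}
flow_c = {"User-Agent": "UnknownApp/1.0"}

print(identify(flow_a, rules, lookup))
print(identify(flow_b, rules, lookup))
print(identify(flow_c, rules, lookup))
```

In this sketch a flow that matches no rule, or whose extracted token is not in the lookup table, is left unclassified rather than guessed, which is one plausible way to favor accuracy over coverage.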